SK/MS/Nexus Down

Incident Report for Poolplayers

Postmortem

Summary

On Tuesday evening, September 30th, several of our services, including Scorekeeper, Member Services, and Nexus, were unavailable from 8:33 PM to 9:19 PM CDT.

What Happened?

The outage occurred because a core part of our system ran out of disk space. This component uses a technique called idempotency, a safeguard that ensures every action taken in Scorekeeper, such as a Turn Over, is applied exactly once, even if the request is sent multiple times.
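
For readers curious what this looks like in practice, here is a minimal sketch of the general pattern, assuming a hypothetical key store and handler; the names below are illustrative, not our actual code. Each action carries a unique key, and the action is only applied if that key has not been recorded before.

    // Illustrative sketch only -- names and the storage interface are hypothetical.
    // Each Scorekeeper action (for example, a Turn Over) is sent with a unique
    // idempotency key; retries reuse the same key, so the action is applied once.

    interface ActionRequest {
      idempotencyKey: string; // unique per user action, reused on retries
      payload: unknown;       // the action itself, e.g. a Turn Over for a match
    }

    interface IdempotencyStore {
      // Resolves true if the key was newly recorded, false if it already existed.
      recordIfNew(key: string): Promise<boolean>;
    }

    async function applyActionOnce(
      store: IdempotencyStore,
      request: ActionRequest,
      apply: (payload: unknown) => Promise<void>,
    ): Promise<"applied" | "duplicate"> {
      const isNew = await store.recordIfNew(request.idempotencyKey);
      if (!isNew) {
        // A retried or double-submitted request: safe to ignore.
        return "duplicate";
      }
      await apply(request.payload);
      return "applied";
    }

Every recorded key takes up storage until it is cleaned up, which is why this table can grow quickly under heavy use.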

Due to the rapid growth in Scorekeeper usage from new leagues and Tri-Annual tournament testing, this system generated data faster than we anticipated, and the storage filled up. Once the disk was full, no new information could be processed, and our applications became unavailable. A monitoring system that should have warned us about the low disk space failed to send an alert.

How We Fixed It

Our team received the first notification at 8:33 PM and immediately began investigating. The underlying storage issue was resolved by 9:00 PM. It then took an additional 19 minutes to restart our applications and fully restore service to all users.

Preventing This in the Future

  1. Improve Our Warning Systems: We are conducting a full review of our monitoring systems to find out why the alert failed and to ensure we are appropriately notified of potential issues well before they can cause an outage.
  2. Increase Service Resiliency: Our applications should recover much faster. We are making significant improvements so that if a temporary database issue occurs, our applications can bounce back almost instantly once the database is back online, rather than requiring a lengthy manual restart.

We sincerely apologize for the disruption this caused to our users and the impact it had on our operations. We are committed to learning from this incident and improving the reliability of our platform.

Full Technical Details

Event Summary

  • Outage Duration: 9/30/25 from 8:33 PM CDT to 9:19 PM CDT (46 minutes).
  • Impacted Services: Scorekeeper, Member Services, Nexus.
  • Root Cause: The disk volume hosting the production database's idempotency table reached 100% capacity. This prevented SQL Server from executing a file growth operation, causing database write requests to time out and leading to a cascading failure in the GraphQL (API) services, including our authentication service.

Timeline of Events

  • 8:33 PM CDT: First automated notification received; monitoring systems detect service unavailability. The outage begins.
  • 8:40 PM CDT: The engineering team is engaged and begins active investigation.
  • 9:00 PM CDT: The disk space issue on the production database is resolved by adding capacity. The database server returns to a healthy state.
  • 9:19 PM CDT: GraphQL servers are successfully restarted and resume serving traffic. Service is fully restored.

Root Cause Analysis

The immediate cause was a full disk volume, but two contributing factors led to the outage:

  1. Accelerated Data Growth: Our standard 14-day retention policy for the idempotency table became unsustainable due to a recent, rapid increase in Scorekeeper volume from wider league adoption and new tournament usage (see the cleanup sketch after this list).
  2. Monitoring Failure: Configured alerts for low disk space on this volume failed to trigger. This lack of proactive notification turned a manageable capacity risk into a full-blown outage.
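
As a rough illustration of the retention cleanup behind item 1 (and behind the remediation below, where the window was reduced from 14 days to 4), a purge job for an idempotency table generally looks like the sketch that follows. The table name, column names, client interface, and batch size are assumptions for the example, not our production job.

    // Illustrative sketch only -- table/column names and the client interface are assumed.

    interface SqlClient {
      // Executes a statement and resolves with the number of rows affected.
      execute(sql: string, params: Record<string, unknown>): Promise<number>;
    }

    // Deletes idempotency rows older than `retentionDays`, in batches so a large
    // backlog does not hold long locks or bloat the transaction log.
    async function purgeIdempotencyRows(
      db: SqlClient,
      retentionDays: number, // 14 before the incident, 4 after remediation
      batchSize = 10_000,
    ): Promise<number> {
      let totalDeleted = 0;
      while (true) {
        const deleted = await db.execute(
          `DELETE TOP (@batchSize) FROM dbo.IdempotencyKeys
             WHERE CreatedAt < DATEADD(DAY, -@retentionDays, SYSUTCDATETIME());`,
          { batchSize, retentionDays },
        );
        totalDeleted += deleted;
        if (deleted < batchSize) break; // nothing (or little) left to purge
      }
      return totalDeleted;
    }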

Extended Recovery Time

The 19-minute recovery time (from 9:00 PM to 9:19 PM) after the database was healthy is unacceptable. The GraphQL servers were not resilient to a prolonged database outage and failed to recover gracefully, necessitating a manual restart.
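
The pattern we are moving toward is sketched below: when the database becomes unreachable, a service marks itself degraded and keeps probing with exponential backoff, then resumes on its own once connectivity returns. The interfaces and timings are assumptions for illustration, not our exact implementation.

    // Illustrative sketch of automatic recovery -- interfaces and timings are assumed.

    interface DatabasePool {
      ping(): Promise<void>; // throws while the database is unavailable
    }

    type ServiceState = "healthy" | "degraded";

    // Instead of failing hard when the database goes away, the service reports a
    // degraded state and retries with exponential backoff until the database returns.
    async function waitForDatabase(
      pool: DatabasePool,
      onStateChange: (state: ServiceState) => void,
      maxDelayMs = 30_000,
    ): Promise<void> {
      let delayMs = 500;
      onStateChange("degraded");
      while (true) {
        try {
          await pool.ping();
          onStateChange("healthy"); // recovery is automatic, no manual restart
          return;
        } catch {
          await new Promise((resolve) => setTimeout(resolve, delayMs));
          delayMs = Math.min(delayMs * 2, maxDelayMs); // back off up to a ceiling
        }
      }
    }

This is the behavior targeted by Action Item 2 below.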

Action Items

  1. Immediate Remediation (Completed):

    1. The idempotency table data retention policy has been reduced from 14 days to 4 days.
    2. The affected disk volume size has been significantly increased.
  2. Action Item 1: Fix Alerting (In Progress):

    1. Task: Conduct a thorough investigation into why the disk-space-full alerts did not fire. Validate thresholds, notification channels, and end-to-end functionality of the monitoring agent.
    2. Goal: Ensure we receive critical alerts with enough lead time to prevent future capacity-related outages (a sketch of this kind of check appears after this list).
  3. Action Item 2: Improve Service Resiliency (In Progress):

    1. Task: Improve the GraphQL services to be more resilient to transient or extended database outages.
    2. Goal: Services should enter a degraded state but recover automatically and gracefully within seconds of the database becoming available, without requiring manual intervention.
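
For Action Item 1, the kind of proactive check we expect monitoring to perform is sketched below: compare free space on the database volume against a threshold and page the on-call team long before the volume fills. The threshold, path, and notifier interface are assumptions for illustration; the real work is auditing why the existing alerts did not fire.

    // Illustrative sketch only -- thresholds, paths, and the notifier are assumed.

    interface VolumeStats {
      totalBytes: number;
      freeBytes: number;
    }

    interface Notifier {
      page(message: string): Promise<void>; // e.g. the on-call alerting channel
    }

    // Alert while there is still time to act, well before the volume is actually full.
    async function checkDatabaseVolume(
      getStats: (path: string) => Promise<VolumeStats>,
      notify: Notifier,
      volumePath = "D:\\SQLData",   // hypothetical data volume
      warnAtFreePercent = 20,       // page at 20% free, not at 0%
    ): Promise<void> {
      const stats = await getStats(volumePath);
      const freePercent = (stats.freeBytes / stats.totalBytes) * 100;
      if (freePercent <= warnAtFreePercent) {
        await notify.page(
          `Disk space low on ${volumePath}: ${freePercent.toFixed(1)}% free ` +
            `(threshold ${warnAtFreePercent}%).`,
        );
      }
    }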
Posted Oct 01, 2025 - 13:36 CDT

Resolved

This incident has been resolved. We apologize for the inconvenience.
Posted Sep 30, 2025 - 21:20 CDT

Update

We are seeing some stabilization, but are continuing to investigate. Thank you all for your patience while we work through this issue.
Posted Sep 30, 2025 - 21:05 CDT

Update

We are continuing to investigate this issue.
Posted Sep 30, 2025 - 20:55 CDT

Investigating

We are aware that the SK app is throwing a "Bad Gateway" error. The Tech Team is investigating the issue. Stay tuned for updates!
Posted Sep 30, 2025 - 20:46 CDT
This incident affected: Scorekeeper, Member Services (Pool League) and Franchise Services (Nexus, Resource Library).