On Tuesday evening, September 30th, several of our services, including Scorekeeper, Member Services, and Nexus, were unavailable from 8:33 PM to 9:19 PM CDT.
The outage occurred because a core part of our system ran out of disk space. This component employs a technique called idempotency, which serves as a safeguard to ensure that every action taken in Scorekeeper, such as Turn Over, occurs exactly once, even if the request is sent multiple times.
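To illustrate the idea, the following is a minimal Python sketch of an idempotency safeguard backed by a table. It is not taken from the production system: the in-memory SQLite database, the column names, and the functions `open_store`, `run_once`, and `purge_older_than` are all hypothetical. It also shows why such a table grows with every unique request and needs a time-based purge.

```python
import sqlite3
import time


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with a hypothetical idempotency table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idempotency ("
        " key TEXT PRIMARY KEY,"
        " result TEXT NOT NULL,"
        " created_at REAL NOT NULL)"
    )
    return conn


def run_once(conn, key, action, now=None):
    """Run action() only the first time a key is seen; replays get the stored result.

    Simplified: a production version would need to guard against two
    concurrent requests with the same key (for example, insert-first
    inside a transaction).
    """
    row = conn.execute(
        "SELECT result FROM idempotency WHERE key = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # duplicate request: do not repeat the action
    result = action()
    created_at = time.time() if now is None else now
    conn.execute(
        "INSERT INTO idempotency (key, result, created_at) VALUES (?, ?, ?)",
        (key, result, created_at),
    )
    conn.commit()
    return result


def purge_older_than(conn, max_age_days, now=None):
    """Delete rows older than the retention window; return how many were removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    cur = conn.execute("DELETE FROM idempotency WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Every distinct request key adds one row, so the table's size scales with request volume multiplied by the retention window.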
Due to the rapid growth in Scorekeeper usage from new leagues and Tri-Annual tournament testing, this system generated data faster than we anticipated, and the storage filled up. This prevented any new information from being processed, resulting in our applications becoming unavailable. A monitoring system that should have warned us about the low storage failed to send an alert.
Our team received the first notification at 8:33 PM and immediately began investigating. The underlying storage issue was resolved by 9:00 PM. It then took an additional 19 minutes to restart our applications and fully restore service to all users.
We sincerely apologize for the disruption this caused to our users and the impact it had on our operations. We are committed to learning from this incident and improving the reliability of our platform.
The disk volume holding the idempotency table reached 100% capacity. This prevented SQL Server from executing a file growth operation, causing database write requests to time out and leading to a cascading failure in the GraphQL (API) services, including our authentication service.
The immediate cause was a full disk volume, but two contributing factors led to the outage:
Growth of the idempotency table became unsustainable due to a recent, rapid increase in Scorekeeper volume from wider league adoption and new tournament usage.
The 19-minute recovery time (from 9:00 PM to 9:19 PM) after the database was healthy is unacceptable. The GraphQL servers were not resilient to a prolonged database outage and failed to recover gracefully, necessitating a manual restart.
Immediate Remediation (Completed):
The idempotency table data retention policy has been reduced from 14 days to 4 days.
Action Item 1: Fix Alerting (In Progress):
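As a rough illustration of the kind of check a low-storage alert relies on, here is a minimal Python sketch. It does not describe the actual monitoring stack, which this report does not specify. The function names and the 20% and 10% thresholds are hypothetical, and delivery of the alert itself is left out.

```python
import shutil


def classify(free_fraction: float,
             warn_below: float = 0.20,
             critical_below: float = 0.10) -> str:
    """Map a free-space fraction (0.0 to 1.0) to an alert level.

    The thresholds are illustrative assumptions, not values from the incident.
    """
    if free_fraction < critical_below:
        return "critical"
    if free_fraction < warn_below:
        return "warning"
    return "ok"


def check_volume(path: str) -> str:
    """Measure free space on the volume containing `path` and classify it."""
    usage = shutil.disk_usage(path)
    return classify(usage.free / usage.total)
```

A check like this only helps if something also verifies that the check runs and that its notifications are delivered, since in this incident the alert was never sent.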
Action Item 2: Improve Service Resiliency (In Progress):
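One common way to let a service ride out a temporary database outage without a manual restart is to retry failed operations with exponential backoff. The Python sketch below illustrates that general pattern only. It is an assumption, not the fix the team has chosen, and `call_with_backoff` and its parameters are hypothetical names.

```python
import time


def call_with_backoff(operation,
                      attempts: int = 5,
                      base_delay: float = 0.5,
                      max_delay: float = 30.0,
                      retry_on=(TimeoutError, ConnectionError),
                      sleep=time.sleep):
    """Call operation(), retrying on transient errors with doubling delays.

    The wait after failed attempt n (counting from 0) is
    min(max_delay, base_delay * 2**n). The last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

Passing `sleep` as a parameter keeps the function testable without real waiting. A production version would typically also add jitter so that many servers do not retry in lockstep.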