I’ll keep this brief, as the outage was short and had a simple cause.
We’ve been investing in automated monitors to alert us when performance issues begin to develop. The goal is to catch problems well before they impact league night. Last night, we were testing one of these monitors, and under SK traffic the monitor itself degraded performance.
As reports came in, the person working on the monitor quickly noticed that the slowdown coincided with the test and disabled the monitor. Performance returned to normal immediately.
This monitor had been tested in other environments for several days without issue. However, the scale of SK traffic exposed a performance impact that did not appear elsewhere.
We’ll continue improving our testing approach to better account for production-level traffic before rolling out changes like this.