Dashboard failing to load
Overview
Problem: Authentication service experienced a failure for approximately 12 minutes.
Impact: Dashboard users experienced authentication failures and were logged out and unable to log back in. Downstream services experienced cascading throughput drops.
Root Causes: The root cause was to do with a faulty software upgrade which overwrote important system configuration.
Chronology
2026-01-29 (all times in UTC)
23:17:00 - software upgrade started
23:24:22 - first alert came through
23:29:00 - all upgrade changes were reverted configuration reapplied
23:33:00 - all services operational
Analysis
A software upgrade overwrote existing routing maps, preventing service routing. This caused authentication failures, an inability to use the dashboard, and a failure in API calls.
Mitigations
We expanded our local testing by adding additional microservice upgrade tests to our test cluster