On November 26th and 27th, customers may have experienced issues creating Journeys or delays in messages being sent from Journeys.
At this time, all Journeys functionality is restored and your messages are being sent in real time once again.
My team and I sincerely apologize for any disruption this may have caused. We understand the critical role this service plays in your business, especially during this busy holiday season, and we are committed to providing reliable and uninterrupted service.
We're conducting a thorough post-mortem analysis of the incident. Here is what we know so far about the issue and the steps we're taking to prevent it from recurring:
Root Cause:
- On Tuesday, Nov 26th, one of our primary Journey data stores encountered an issue during a planned scaling operation in preparation for Black Friday. As a result, some Journeys failed to launch, and processing was delayed.
- On Wednesday, Nov 27th, we experienced a separate incident related to fanning out updates to a large number of subscriptions under a single user record. To mitigate this issue, we blocked the problematic user records. The system then began to recover, and services gradually returned to normal operation.
- Our engineering team has now successfully scaled all services to accelerate recovery and has restored the full functionality of Journeys.
Measures our team has taken to prevent further disruption during this holiday period:
- Proactive User Record Management: Implemented safeguards that prevent a single user record from accumulating an excessively large number of subscriptions.
- Enhanced Monitoring and Alerting: Increased the sensitivity of our monitoring and alerting systems for critical Journey services, lowering the paging threshold so that engineers are alerted and can respond to potential issues sooner.
- Scaled Infrastructure: Kept our infrastructure scaled up and over-provisioned throughout the US Thanksgiving holiday week to accommodate increased traffic.
- Increased On-Call Support: Assigned additional engineers to on-call duty during the week to provide immediate support and address any issues that arise.
- Infrastructure Update Moratorium: Temporarily restricted any non-critical infrastructure updates during the week to minimize the risk of unintended disruptions to the Journey service.
Thank you for your understanding and patience.