We are experiencing delays with activating and processing Journeys
Affected components
Offline Job Processing
Updates

Write-up published

Read it here

Resolved

On November 26th and 27th, customers may have experienced issues creating journeys or delays in messages being sent from journeys.

At this time, all Journeys functionality is restored and your messages are being sent in real-time once again.

My team and I sincerely apologize for any disruption this may have caused. We understand the critical role this service plays in your business, especially during this busy holiday season, and we are committed to providing reliable and uninterrupted service.

We're conducting a thorough post-mortem analysis of the incident. Here are some current insights into the issue and the steps we’re taking to prevent it from occurring again:

Root Cause:

  • On Tuesday, Nov 26th, one of our primary Journey data stores encountered an issue during a planned scaling operation in preparation for Black Friday. As a result, some Journeys failed to launch, and processing was delayed.

  • On Wednesday, Nov 27th, we experienced a separate incident related to fanning out updates to a large number of subscriptions under a single user record. To mitigate this issue, we blocked the problematic user records. Subsequently, the system began to recover, and services gradually returned to normal operations.

  • Our engineering team has now successfully scaled all services to accelerate recovery and has restored the full functionality of Journeys.

Measures our team has taken to prevent further disruption during this holiday period:

  • Proactive User Record Management: Implemented safeguards that prevent any single user record from accumulating a very large number of subscriptions.

  • Enhanced Monitoring and Alerting: Increased the sensitivity of our monitoring and alerting systems for critical Journey services, lowering the paging threshold to expedite response times for potential issues.

  • Scaled Infrastructure: Maintained a scaled and over-provisioned infrastructure throughout the US Thanksgiving holiday week to accommodate increased traffic and ensure optimal performance.

  • Increased On-Call Support: Assigned additional engineers to on-call duty during the week to provide immediate support and address any potential issues that may arise.

  • Infrastructure Update Moratorium: Temporarily restricted any non-critical infrastructure updates during the week to minimize the risk of unintended disruptions to the Journey service.
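
As a rough illustration of the "Proactive User Record Management" measure above, a per-record subscription cap could look something like the sketch below. This is not our production code: the `UserRecord` class, the `SubscriptionLimitExceeded` exception, and the limit of 1000 are all hypothetical names and values chosen for the example, and the actual threshold is not stated in this report.

```python
from dataclasses import dataclass, field

# Hypothetical limit for illustration; the real threshold is not given in this report.
MAX_SUBSCRIPTIONS_PER_USER = 1000


class SubscriptionLimitExceeded(Exception):
    """Raised when adding a subscription would push a user record over its cap."""


@dataclass
class UserRecord:
    user_id: str
    max_subscriptions: int = MAX_SUBSCRIPTIONS_PER_USER
    subscriptions: set[str] = field(default_factory=set)

    def add_subscription(self, subscription_id: str) -> None:
        # Re-adding an existing subscription is a no-op and never fails.
        if subscription_id in self.subscriptions:
            return
        # Reject the write before the record grows large enough to make
        # fanning out an update to every subscription expensive.
        if len(self.subscriptions) >= self.max_subscriptions:
            raise SubscriptionLimitExceeded(
                f"user {self.user_id} already has {len(self.subscriptions)} subscriptions"
            )
        self.subscriptions.add(subscription_id)
```

Rejecting the write at the point of insertion keeps the size of any later fan-out bounded by the cap, rather than relying on blocking oversized records after the fact.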

Thank you for your understanding and patience.

Wed, Nov 27, 2024, 10:37 PM

Resolved

This incident has been resolved. We will continue to monitor our systems closely.

Wed, Nov 27, 2024, 10:18 PM (19 minutes earlier)

Monitoring

We have published a postmortem of this incident.

Wed, Nov 27, 2024, 08:51 PM (1 hour earlier)

Monitoring

Less than 1% of users in any given Journey will experience a processing delay of 5+ minutes; all other operations have resumed as normal.

We are still monitoring current progress and still consider this issue to be open. If progress continues as it currently is, this incident will be fully resolved within the hour.

Customers are good to activate their Journeys and resume business as usual.

Wed, Nov 27, 2024, 07:41 PM (1 hour earlier)

Monitoring

Lag across the majority of partitions in Journeys is near real time, with a single partition still experiencing lag greater than 5 minutes.

This means that a fraction of the users in a Journey will experience processing delays, but the majority of customers should be able to use Journeys as normal now. Specifically, only ~10% of users will currently experience 5+ minutes of latency in Journeys.
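
To make the relationship between lagging partitions and affected users concrete, here is a minimal sketch. It assumes users are spread roughly evenly across partitions, which this update does not state, and the function name, the partition numbering, and the lag values in the usage note are hypothetical.

```python
def lagging_fraction(
    lag_seconds_by_partition: dict[int, float],
    threshold_seconds: float = 300.0,
) -> float:
    """Return the share of partitions whose lag exceeds the threshold.

    Under the assumption that users are evenly distributed across
    partitions, this approximates the share of users seeing delays.
    """
    if not lag_seconds_by_partition:
        return 0.0
    lagging = sum(
        1 for lag in lag_seconds_by_partition.values() if lag > threshold_seconds
    )
    return lagging / len(lag_seconds_by_partition)
```

For example, with four partitions where only one is more than 5 minutes behind, `lagging_fraction({0: 2.0, 1: 1.0, 2: 400.0, 3: 3.0})` returns `0.25`.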

Wed, Nov 27, 2024, 07:29 PM (12 minutes earlier)

Monitoring

Processing of the job backlog is up, and we are working through it. The team is monitoring closely. Some Journeys can be activated at this point, but processing for Journeys is still delayed.

Wed, Nov 27, 2024, 06:51 PM (37 minutes earlier)

Monitoring

We have identified other issues and have applied more fixes. We are currently monitoring progress closely and adjusting as we go.

Wed, Nov 27, 2024, 05:12 PM (1 hour earlier)

Monitoring

We are continuing to monitor for any further issues.

Wed, Nov 27, 2024, 07:00 AM (10 hours earlier)

Monitoring

We have applied a fix and are currently monitoring the results.

Wed, Nov 27, 2024, 01:01 AM (5 hours earlier)

Identified

We are continuing to work on a fix for this issue.

Wed, Nov 27, 2024, 12:05 AM (56 minutes earlier)

Identified

We are continuing to work on a fix for this issue.

Tue, Nov 26, 2024, 08:34 PM (3 hours earlier)

Identified

We are continuing to work on a fix for this issue.

Tue, Nov 26, 2024, 07:35 PM (58 minutes earlier)

Identified

The issue has been identified and a fix is being implemented.

Tue, Nov 26, 2024, 03:36 PM (3 hours earlier)

Investigating

We are continuing to investigate this issue.

Tue, Nov 26, 2024, 01:53 PM (1 hour earlier)

Investigating

We are continuing to investigate this issue.

Tue, Nov 26, 2024, 12:47 PM (1 hour earlier)

Investigating

We are investigating increased lag while processing the state of journeys. This is impacting the ability to activate new journeys, and existing journeys will experience delays.

Tue, Nov 26, 2024, 09:00 AM (3 hours earlier)