We are experiencing delays with activating and processing Journeys
Incident Report for OneSignal
Postmortem

On November 26th and 27th, customers may have experienced issues creating Journeys or delays in messages being sent from Journeys.

At this time, all Journeys functionality is restored and your messages are being sent in real time once again.

My team and I sincerely apologize for any disruption this may have caused. We understand the critical role this service plays in your business, especially during this busy holiday season, and we are committed to providing reliable and uninterrupted service.

We're conducting a thorough post-mortem analysis of the incident. Here are some current insights into the issue and the steps we’re taking to prevent it from occurring again:

Root Cause:

  • On Tuesday, Nov 26th, one of our primary Journey data stores encountered an issue during a planned scaling operation in preparation for Black Friday. As a result, some Journeys failed to launch, and processing was delayed.
  • On Wednesday, Nov 27th, we experienced a separate incident related to fanning out updates to a large number of subscriptions under a single user record. To mitigate this issue, we blocked the problematic user records. Subsequently, the system began to recover, and services gradually returned to normal operations.
  • Our engineering team has now successfully scaled all services to accelerate recovery and has restored the full functionality of Journeys.

Measures our team has taken to prevent further disruption during this holiday period:

  • Proactive User Record Management: Implemented measures to proactively prevent any single user record from accumulating an excessively large number of subscriptions.
  • Enhanced Monitoring and Alerting: Increased the sensitivity of our monitoring and alerting systems for critical Journey services, lowering the paging threshold to expedite response times for potential issues.
  • Scaled Infrastructure: Maintained a scaled and over-provisioned infrastructure throughout the US Thanksgiving holiday week to accommodate increased traffic and ensure optimal performance.
  • Increased On-Call Support: Assigned additional engineers to on-call duty during the week to provide immediate support and address any potential issues that may arise.
  • Infrastructure Update Moratorium: Temporarily restricted any non-critical infrastructure updates during the week to minimize the risk of unintended disruptions to the Journey service.

Thank you for your understanding and patience.

Posted Nov 27, 2024 - 22:38 UTC

Resolved
This incident has been resolved. We will continue to monitor our systems closely.
Posted Nov 27, 2024 - 22:18 UTC
Update
We have published a postmortem of this incident.
Posted Nov 27, 2024 - 20:51 UTC
Update
Fewer than 1% of users in any given Journey will experience a processing delay of more than 5 minutes; all other operations have resumed as normal.

We are still monitoring progress and still consider this issue to be open. If recovery continues at the current rate, this incident will be fully resolved within the hour.

Customers can activate their Journeys and resume business as usual.
Posted Nov 27, 2024 - 19:41 UTC
Update
Lag across a majority of partitions in Journeys is near real time, with a single partition still experiencing lag greater than 5 minutes.

This means that a fraction of the users in a Journey will experience processing delays, but the majority of customers should be able to use Journeys as normal now. Specifically, right now, only ~10% of users will experience 5+ minutes of latency in Journeys.
Posted Nov 27, 2024 - 19:29 UTC
Update
Job processing is back up and we are working through the backlog. The team is monitoring closely. Some Journeys can be activated at this point, but processing for Journeys is still delayed.
Posted Nov 27, 2024 - 18:51 UTC
Update
We have identified other issues and have applied more fixes. We are currently monitoring progress closely and adjusting as we go.
Posted Nov 27, 2024 - 17:12 UTC
Update
We are continuing to monitor for any further issues.
Posted Nov 27, 2024 - 07:00 UTC
Monitoring
We have applied a fix and are currently monitoring the results.
Posted Nov 27, 2024 - 01:01 UTC
Update
We are continuing to work on a fix for this issue.
Posted Nov 27, 2024 - 00:05 UTC
Update
We are continuing to work on a fix for this issue.
Posted Nov 26, 2024 - 20:34 UTC
Update
We are continuing to work on a fix for this issue.
Posted Nov 26, 2024 - 19:35 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 26, 2024 - 15:36 UTC
Update
We are continuing to investigate this issue.
Posted Nov 26, 2024 - 13:53 UTC
Update
We are continuing to investigate this issue.
Posted Nov 26, 2024 - 12:47 UTC
Investigating
We are investigating increased lag while processing the state of Journeys. This is impacting the ability to activate new Journeys, and existing Journeys will experience a delay.
Posted Nov 26, 2024 - 09:00 UTC
This incident affected: Offline Job Processing.