Incident Report (May 5, 2026): Server Performance Instability on Offision

On May 5, 2026, the Offision production server experienced unstable loading performance caused by a crash-and-reboot loop. A null pointer exception, triggered by a single customer’s edge case in the Pub/Sub module, was not wrapped in a try/catch block, so it crashed the entire service. Each automatic restart re-processed the same data and crashed again, degrading performance for all users.

Impact

  • Affected Scope: All customers.
  • Services Affected: The production application server, specifically the Pub/Sub messaging service.
  • User Experience: Users saw slow loading times and intermittent service unavailability during the repeated server reboots.

Root Cause

A corner case in a single customer’s data triggered a null pointer exception inside the Pub/Sub module. The affected code path had no try/catch block to guard against unexpected null values, so the unhandled exception propagated up and crashed the entire service. Although the service restarted automatically, the same customer data triggered the same exception on every restart, locking the server into a continuous crash-and-reboot loop. This restart cycle is what all customers perceived as slowness.
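
The affected code is internal to Offision, but the failure pattern is a common one. The following minimal Java sketch, with hypothetical names (Message, PubSubDispatcher, handle), illustrates how an unguarded handler call lets a NullPointerException from one message escape the dispatch loop and take down the process:

```java
import java.util.Queue;

// Hypothetical message shape; the real Offision types are not public.
record Message(String customerId, String payload) {}

class PubSubDispatcher {

    // Before the fix: the handler call is not wrapped in try/catch, so a
    // NullPointerException thrown for one message escapes the loop and
    // crashes the service. After a restart, the same message is
    // redelivered and the crash repeats: the crash-and-reboot loop.
    void dispatchLoop(Queue<Message> queue) {
        while (true) {
            Message msg = queue.poll();
            if (msg == null) continue; // nothing queued yet
            handle(msg);               // unguarded call
        }
    }

    void handle(Message msg) {
        // The edge case: payload was null for one customer's message,
        // so trim() throws NullPointerException.
        String normalized = msg.payload().trim();
        System.out.println("processed: " + normalized);
    }
}
```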

Timeline (May 5, 2026 – HKT)

  • 10:30: The issue was first reported.
  • 10:35: The issue was confirmed by the Offision team. The total time to detection was 5 minutes.
  • 10:45: The server was rebooted to restore service. The total time to mitigation was 15 minutes.
  • 11:50: The root cause was identified.
  • 12:00: The issue was fixed in the code.
  • 14:10: The fix was deployed to production in version 4.3.11. The total time to resolution was 3 hours and 40 minutes.

Note on Deployment Delay: While the code fix was ready at 12:00 HKT, deployment was delayed until 14:10 HKT. This was partly due to a concurrent outage of the Ubuntu archive mirror (archive.ubuntu.com), which slowed the master image build needed to ship the version 4.3.11 fix and blocked the build pipeline.

Resolution and Next Steps

To resolve the incident immediately, our team deployed a quick fix (version 4.3.11) on May 5, 2026, at 14:10 HKT. The fix patches the null pointer code path in the Pub/Sub module so the exception is no longer thrown.
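
The exact patch is not public; as a rough illustration, the quick fix amounts to guarding the null case before it can throw. This sketch reuses the hypothetical Message type from the earlier example:

```java
class PatchedHandler {
    void handle(Message msg) {
        String payload = msg.payload();
        if (payload == null) {
            // v4.3.11-style quick fix: treat the null payload as bad
            // input instead of letting a NullPointerException escape
            // and crash the whole service.
            System.err.println("dropping message with null payload from customer "
                    + msg.customerId());
            return;
        }
        String normalized = payload.trim();
        System.out.println("processed: " + normalized);
    }
}
```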

To prevent this class of issue from happening again, we are undertaking the following long-term fixes and action items:

  • We deployed the quick fix (v4.3.11) that guards the specific null pointer code path.
  • We will harden the overall Pub/Sub mechanism so that null pointer exceptions, as well as other unexpected runtime errors, cannot bring down the entire service.
  • We will refactor the Pub/Sub module to isolate exception handling per message, so that a single customer’s bad data cannot affect other customers (see the sketch after this list).
  • We will review other shared services for similar unhandled exception risks.
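
As one possible shape for the planned per-message isolation (hypothetical names again, reusing the Message type from the first sketch; not the final design), the dispatch loop catches failures per message and quarantines the offending message instead of letting it crash the shared service:

```java
import java.util.Queue;

class IsolatedDispatcher {

    void dispatchLoop(Queue<Message> queue, Queue<Message> deadLetter) {
        while (true) {
            Message msg = queue.poll();
            if (msg == null) continue;
            try {
                handle(msg);
            } catch (RuntimeException e) {
                // One customer's bad data is quarantined to a dead-letter
                // queue instead of crashing the service for everyone.
                deadLetter.add(msg);
                System.err.println("handler failed for customer "
                        + msg.customerId() + ": " + e);
            }
        }
    }

    void handle(Message msg) {
        // Process the message; may still throw on unexpected input.
    }
}
```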
