{"id":447,"date":"2026-05-06T04:08:11","date_gmt":"2026-05-06T04:08:11","guid":{"rendered":"https:\/\/offision.com\/blog\/?p=447"},"modified":"2026-05-06T04:09:12","modified_gmt":"2026-05-06T04:09:12","slug":"incident-report-5-5-2026-server-performance-instability-on-offision","status":"publish","type":"post","link":"https:\/\/offision.com\/blog\/2026\/05\/06\/incident-report-5-5-2026-server-performance-instability-on-offision\/","title":{"rendered":"Incident Report (5-5-2026): Server Performance Instability on Offision"},"content":{"rendered":"\n<p id=\"p-rc_fb55db1867f81a91-37\">On May 5, 2026, the Offision production server experienced unstable loading performance. This incident was caused by a recurring crash-and-reboot loop. A null pointer exception, which originated from a single customer&#8217;s edge case within the Pub\/Sub module, crashed the entire service. Because the exception was not wrapped in a try\/catch block, the service repeatedly crashed and restarted, ultimately degrading performance and slowing the server for all users.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Impact<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Affected Scope:<\/strong> The incident impacted all customers.<\/li>\n\n\n\n<li><strong>Services Affected:<\/strong> The production application server, specifically the Pub\/Sub messaging service, was affected.<\/li>\n\n\n\n<li><strong>User Experience:<\/strong> Users experienced slow loading times. Users also experienced intermittent service unavailability due to the repeated server reboots.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"p-rc_fb55db1867f81a91-41\"><strong>Root Cause<\/strong> <\/h2>\n\n\n\n<p id=\"p-rc_fb55db1867f81a91-41\">A corner case from a single customer triggered a null pointer exception inside the Pub\/Sub module. 
The affected code path had no try\/catch block to guard against unexpected null values, so the unhandled exception propagated upward and crashed the entire service. Although the service restarted automatically, the same customer&#8217;s data triggered the same exception again after each restart, trapping the server in a continuous crash-and-reboot loop. These repeated restarts caused the slowness that all customers experienced.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Timeline (May 5, 2026 &#8211; HKT)<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>10:30:<\/strong> The issue was first reported.<\/li>\n\n\n\n<li><strong>10:35:<\/strong> The issue was confirmed by the Offision team. The total time to detection was 5 minutes.<\/li>\n\n\n\n<li><strong>10:45:<\/strong> The server was rebooted to restore service. The total time to mitigation was 15 minutes.<\/li>\n\n\n\n<li><strong>11:50:<\/strong> The root cause was identified.<\/li>\n\n\n\n<li><strong>12:00:<\/strong> The issue was fixed in the code.<\/li>\n\n\n\n<li><strong>14:10:<\/strong> The fix was deployed to production in version 4.3.11. The total time to resolution was 3 hours and 40 minutes.<\/li>\n<\/ul>\n\n\n\n<p id=\"p-rc_fb55db1867f81a91-48\"><em>Note on Deployment Delay:<\/em> Although the code fix was ready at 12:00 HKT, deployment was delayed until 14:10 HKT. The delay was partly caused by a concurrent outage of the Ubuntu archive mirror (archive.ubuntu.com), which slowed the master image build required to ship the 4.3.11 fix and held up the build pipeline.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"p-rc_fb55db1867f81a91-49\"><strong>Resolution and Next Steps<\/strong> <\/h2>\n\n\n\n<p id=\"p-rc_fb55db1867f81a91-49\">As the immediate fix, our team deployed version 4.3.11 to production on May 5, 2026, at 14:10 HKT. 
This release patched the null pointer case in the Pub\/Sub module so that the exception is no longer thrown.<\/p>\n\n\n\n<p>To prevent this class of issue from recurring, we have taken or are undertaking the following actions:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>We deployed the quick fix (v4.3.11) to eliminate this specific null pointer exception.<\/li>\n\n\n\n<li>We will improve the overall Pub\/Sub mechanism so that null pointer exceptions and other unexpected runtime errors cannot bring down the entire service.<\/li>\n\n\n\n<li>We will refactor the Pub\/Sub module to isolate exception handling per message handler, so that a single customer&#8217;s bad data cannot affect other customers.<\/li>\n\n\n\n<li>We will review other shared services for similar unhandled exception risks.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>On May 5, 2026, the Offision production server experienced unstable loading performance. This incident was caused by a recurring crash-and-reboot loop. A null pointer exception, which originated from a single customer&#8217;s edge case within the Pub\/Sub module, crashed the entire service. 
Because the exception was not wrapped in a try\/catch block, the service repeatedly crashed [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":65,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[189,160],"tags":[236],"class_list":["post-447","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-local-support","category-products-news","tag-incident-report"],"_links":{"self":[{"href":"https:\/\/offision.com\/blog\/wp-json\/wp\/v2\/posts\/447","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/offision.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/offision.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/offision.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/offision.com\/blog\/wp-json\/wp\/v2\/comments?post=447"}],"version-history":[{"count":2,"href":"https:\/\/offision.com\/blog\/wp-json\/wp\/v2\/posts\/447\/revisions"}],"predecessor-version":[{"id":449,"href":"https:\/\/offision.com\/blog\/wp-json\/wp\/v2\/posts\/447\/revisions\/449"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/offision.com\/blog\/wp-json\/wp\/v2\/media\/65"}],"wp:attachment":[{"href":"https:\/\/offision.com\/blog\/wp-json\/wp\/v2\/media?parent=447"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/offision.com\/blog\/wp-json\/wp\/v2\/categories?post=447"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/offision.com\/blog\/wp-json\/wp\/v2\/tags?post=447"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}