11/22/2019 – v2 ACS001 Core Failure

Incident Report (Resolved)

Event Description: ACS001 Crash
Event Start Time: 2019-11-22 11:42 PM EST
Event End Time: 2019-11-23 01:29 AM EST
RFO Issue Date: 2019-11-25

Affected Services

Phones registered to Atlanta lost registration. Devices configured for SRV or UDP failed over to other clusters. Devices configured for TCP or manually registered to core1-atl did not regain registration.

Event Summary

On November 22nd, 2019, at 23:42 EST, the ACS cluster began crashing repeatedly. Several phones lost registration and the ability to make or receive calls.

Event Timeline

November 22nd, 2019

23:42 First crash reported by monitoring systems. Phones lost registration. Notice placed in partner server.
23:43 Failover verified to SJE and NYJ clusters.

November 23rd, 2019

00:00 Issue verified as isolated to Atlanta servers.
00:25 Rolled back the 40.2 updates in case they were the cause of the issue.
00:57 Services in Atlanta resumed and were functional. Endpoints configured for UDP or SRV registered back to the Atlanta cluster. Cause not yet determined.
01:29 Atlanta cluster remained online. All UDP and SRV phones remained registered. Some reports of call history not functioning; investigation to continue in the morning.
17:12 Call history page restored. All services functional.

Root Cause Analysis

In troubleshooting with NetSapiens and our own senior engineers, we determined that malformed TCP packets were causing a crash in the TCP stack. We believe the packets were isolated to a single device, but more testing is needed. Normally this would not affect services. However, the repeated crashes and subsequent core dumps quickly filled the server’s storage. Once storage was full, normal functions could no longer be processed.

Future Preventative Action

While not a permanent fix, the decision was made to block TCP device registration. This affected fewer than 1% of our total registered devices. Devices that were configured for TCP must be reconfigured to use UDP. We are continuing to work with NetSapiens senior engineering staff to determine how a single errant device could crash the stack. Finally, we have instituted additional safeguards that immediately move core dump files off the server to prevent storage from filling up again.
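
The core dump safeguard mentioned above could look something like the following minimal Python sketch. The actual safeguard is not described in this report: the function names, the `core*` filename pattern, and the directory arguments are assumptions for illustration only, and a plain archive directory stands in for off-server storage.

```python
import shutil
from pathlib import Path


def offload_core_dumps(dump_dir, archive_dir, pattern="core*"):
    """Move files matching `pattern` out of dump_dir and return their names.

    Hypothetical helper: `archive_dir` stands in for off-server storage
    (for example, a mounted remote volume).
    """
    dump_dir = Path(dump_dir)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in sorted(dump_dir.glob(pattern)):
        if path.is_file():
            # shutil.move falls back to copy-then-delete across filesystems.
            shutil.move(str(path), str(archive_dir / path.name))
            moved.append(path.name)
    return moved


def disk_usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total
```

A routine like this would typically run on a schedule or be triggered after a crash, with `disk_usage_fraction` feeding an alert threshold; those operational details are not specified in this report.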

Update 11/26/19: Worked with NetSapiens (NS) engineering to isolate the issue to TCP connections in SIP trunks only. SIP trunk TCP functionality was left disabled. TCP functionality for endpoints was restored and devices re-registered successfully. Systems have been stable since. NS will continue to follow up regarding SIP trunk TCP.

Updated on January 28, 2020
