Early this morning, at approximately 5:30 AM MST, our MX01 SmarterMail service crashed. Although the Windows service manager kept attempting to restart the service, it was unable to remain stable. SmarterTools was contacted and started investigating within 15 minutes of us submitting the ticket. After two development builds and two completely separate issues, they were able to get the server back online. Below is the report of what caused the server to crash initially and to remain offline.
- A client received a single email that caused a stack overflow in the HTML Agility Pack module of SmarterMail. This was the cause of subsequent crashes.
- An email in the spool was causing an unhandled exception within the SpamAssassin pattern-matching engine. This caused the initial crash.
Issue #2 came as a surprise because we do not use SpamAssassin; our anti-spam system relies on Commtouch. The offending email was removed and the spool system was able to start successfully. SmarterTools has released a patch for this, and we have received the development build that addresses the issue. We will need to apply this upgrade shortly to prevent the problem from recurring.
Issue #1 was also a surprise, as we had no idea the email was causing trouble. The server would only have failed to start if we had performed an upgrade or restarted it for any reason. The issue has been resolved in the build we are currently running.
We are very pleased that the SmarterTools development and support teams were able to respond quickly, diagnose both issues, and deliver an all-new build.
We would also like to thank our clients for their patience during this time. We understand that outages are annoying and get in the way of day-to-day activities. Unfortunately, this was an issue that no one saw coming and that we could not have prepared for.
Thank you for choosing ASPnix!
And to address a question we received… “Would a failover server have prevented this?”
Unfortunately, no. The first mail server would have crashed, and then the second one would have started up and crashed as well, because the main and failover servers share the same data, including the bad data that caused the crash.