Medchat - Application Service Interruption – Incident details

Application Service Interruption

Resolved
Operational
Started 3 months ago
Lasted about 6 hours

Affected

Authentication

Operational from 11:06 PM to 11:20 PM, Major outage from 11:20 PM to 2:51 AM, Operational from 2:51 AM to 5:19 AM

Medchat Auth Application

Operational from 11:06 PM to 11:20 PM, Major outage from 11:20 PM to 2:51 AM, Operational from 2:51 AM to 5:19 AM

Google SSO

Operational from 11:06 PM to 11:20 PM, Major outage from 11:20 PM to 2:51 AM, Operational from 2:51 AM to 5:19 AM

Custom OIDC SSO

Operational from 11:06 PM to 11:20 PM, Major outage from 11:20 PM to 2:51 AM, Operational from 2:51 AM to 5:19 AM

Custom SAML SSO

Operational from 11:06 PM to 11:20 PM, Major outage from 11:20 PM to 2:51 AM, Operational from 2:51 AM to 5:19 AM

Live Chat

Operational from 11:06 PM to 11:20 PM, Major outage from 11:20 PM to 2:51 AM, Operational from 2:51 AM to 5:19 AM

Updates
  • Resolved
    Resolved

    While Microsoft is keeping their incident tickets open, they have confirmed that the majority of their impacted services have now recovered. Their updated dashboard shows that all services in the affected region are back online.

    Root cause per Microsoft: A backend cluster management workflow deployed a configuration change that blocked backend access between a subset of Azure Storage clusters and compute resources in the Central US region. As a result, the compute resources restarted automatically when they lost connectivity to their virtual disks.

    The Medchat team also conducted some health checks to ensure the system is back up and fully functional. Closing this incident.


  • Monitoring
    Update

    While the Medchat application is back online and appears to be fully functional, Microsoft is continuing to send status updates (the latest at 9:10 pm PT) stating that the incident is still active and that customers should continue to see increasing recovery as residual and downstream impact mitigation progresses.

    The Medchat team will continue to monitor system health and status updates.

  • Monitoring
    Update

    Microsoft update @7:33 pm PT

    • Status: Incident is now mitigated

    • Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.


    We are currently performing checks on the Medchat Application to ensure overall system health; we will continue monitoring for another hour or so.

  • Monitoring
    Monitoring

    Microsoft update at 6:31 pm PT:


    Current Status: ... We’ve determined the underlying cause and are currently working towards mitigation. We will start to see incremental recovery in next 90 minutes. The next update will be provided in 60 minutes, or as events warrant.

  • Identified
    Update

    Update from Microsoft @4:56 pm PT:


    Current status: We have determined this issue was impacted by an underlying storage outage in the Central US region that the services were dependent upon. Once the underlying storage outage is mitigated, this impact will be resolved. The next update will be in 2 hours, or as events warrant.

  • Identified
    Update

    Latest update from Microsoft @4:45 pm PT


    Impact Statement: Starting at 21:56 UTC on 18 Jul 2024, a subset of customers may experience issues with multiple Azure services in the Central US region including failures with service management operations and connectivity or availability of services.

    Current Status: We are aware of this issue and have engaged multiple teams to investigate. As part of the investigation, we are reviewing previous deployments, and are running other workstreams to investigate for an underlying cause. The next update will be provided in 60 minutes, or as events warrant. 

    This message was last updated at 23:45 UTC on 18 July 2024

  • Identified
    Identified

    We have confirmed that the issue is caused by an outage at Microsoft Azure. From their website:


    Investigating issues in the Central US region

    Impact Statement: Starting at approximately 21:56 UTC on 18 Jul 2024, a subset of customers may experience issues with multiple Azure services in the Central US region including failures with service management operations and connectivity or availability of services.

    Current Status: We are aware of this issue and are actively investigating. The next update will be provided in 60 minutes, or as events warrant.

    This message was last updated at 23:17 UTC on 18 July 2024

  • Investigating
    Investigating

    We are currently investigating this incident.

    At the moment, we're seeing a widespread outage at Microsoft Azure, Medchat's cloud services provider. Microsoft has just posted a status update saying that they are investigating issues in their Central US region.