Overall System Status:
At approximately 8:15 AM, Dartmouth's Application Infrastructure Group (AIG) was made aware that some users could not log in to email hosted in the cloud. Many people were already logged in to email and did not notice a problem. By 8:30 AM, AIG had determined that the issue was systemic and could affect all Dartmouth Office 365 users. AIG opened a Severity 1 ticket with Microsoft and notified the community.
Further investigation showed that Dartmouth users who tried to log in to Office 365 (known on campus as new Blitz) could not access their email or calendar accounts. However, users who had signed in to Office 365 before 8:00 AM had access to mail and calendar service and kept that access throughout the day, unless they signed out and tried to sign in again.
The Dartmouth investigation began as soon as the symptoms were reported. It focused on the Active Directory Federation Service (ADFS) housed at Dartmouth, which manages account credential information (username and password). When a user logs in to Office 365 from Dartmouth, Dartmouth ADFS authenticates the user before they are allowed to access their mail at Microsoft. Dartmouth could not find any local problems with ADFS and, after 30 minutes of investigation, called Microsoft Office 365 support.
Dartmouth worked with 4 Microsoft engineers before the Microsoft ADFS product specialist started working with Dartmouth.
The issue did involve ADFS and Dartmouth's connection to Microsoft. ADFS provides federation between the Microsoft and Dartmouth environments. For the two domains to communicate, a trust relationship must be established between them; in practice, the servers share certificate information. These certificates expire after one year. In our case, an internal ADFS process created a new certificate three weeks prior to expiration (May 15th) and then began using it on May 1. When ADFS started to use the new certificate, the Microsoft servers rejected any connection that presented it, because they had not been told to trust it. To resolve the issue, specific commands had to be run to make the Microsoft servers accept the new certificate.
Once the issue was identified, it took about 30 minutes to run the proper commands, which cleared the problem. Email began flowing immediately, and Computing Services notified the community that the email-related issues were resolved.
Microsoft did not assign the correct resources to this ticket for several hours, despite numerous requests from Dartmouth and our Microsoft account executives. Dartmouth believes the case was mishandled internally at Microsoft. As soon as an ADFS specialist was assigned to the case, the problem was identified and resolved within one hour.
Microsoft is providing Dartmouth with instructions on how to manually manage the cert process on ADFS, so that Microsoft servers will know that there is a new cert that should be accepted.
This particular problem won't happen again, but other problems may happen. Therefore, Dartmouth is requiring Microsoft to provide an improved problem escalation process. It took too long to get to the right Microsoft specialist. If we had gotten to the ADFS specialist early in the day, the problem would have been resolved within an hour.
ADFS runs on a server at Dartmouth and is used to authenticate users of Office 365. We run this service at Dartmouth because we chose not to have usernames and passwords in the Microsoft cloud.
When accessing Office 365:
1. Users enter their login credentials (username and password).
2. ADFS checks the credentials against its database of credentials.
3. To complete access to the Office 365 mail/calendar account, Dartmouth's ADFS connects to the Microsoft servers; the Microsoft servers must recognize and "trust" traffic from Dartmouth's ADFS before they open the connection to the user.
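The sign-in sequence above, and the way it failed on the day of the outage, can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class names, the `sign_in` function, and the thumbprint strings are hypothetical and do not correspond to any real ADFS or Office 365 interface. The model shows only that valid credentials are not sufficient when the cloud side does not recognize the certificate the federation server is currently using.

```python
from dataclasses import dataclass, field


@dataclass
class CloudService:
    """Hypothetical stand-in for the Microsoft side of the federation trust."""
    trusted_thumbprints: set = field(default_factory=set)

    def accepts(self, thumbprint: str) -> bool:
        # The cloud side only honors certificates it has been told to trust.
        return thumbprint in self.trusted_thumbprints


@dataclass
class FederationServer:
    """Hypothetical stand-in for the on-premises ADFS server."""
    active_thumbprint: str
    credentials: dict

    def authenticate(self, username: str, password: str):
        # Return the active certificate's thumbprint on success, else None.
        if self.credentials.get(username) == password:
            return self.active_thumbprint
        return None


def sign_in(adfs: FederationServer, cloud: CloudService,
            username: str, password: str) -> str:
    thumbprint = adfs.authenticate(username, password)
    if thumbprint is None:
        return "bad credentials"
    if not cloud.accepts(thumbprint):
        return "rejected: untrusted certificate"
    return "signed in"


# Illustrative values only.
cloud = CloudService({"OLD-CERT"})
adfs = FederationServer("OLD-CERT", {"user": "secret"})
print(sign_in(adfs, cloud, "user", "secret"))  # signed in

adfs.active_thumbprint = "NEW-CERT"            # automatic rollover on the ADFS side
print(sign_in(adfs, cloud, "user", "secret"))  # rejected: untrusted certificate

cloud.trusted_thumbprints.add("NEW-CERT")      # manual step on the cloud side
print(sign_in(adfs, cloud, "user", "secret"))  # signed in
```

The middle call corresponds to the outage: the credentials are correct, but the sign-in fails until the new certificate is registered on the cloud side.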
The core issue is that the process for managing the trust relationship and the authentication token certificates between the on-premises ADFS system and the Microsoft system is undocumented. What follows is the process Microsoft designed to re-establish the federation trust. The process is partially automated and partially manual, which is the root cause of the issue.
The Microsoft Engineer described the process this way:
1. Prior to the certificate expiration, the ADFS servers will automatically generate a new certificate and put it aside for later use.
2. In a set number of days, this certificate will then become active. The exact number is still being determined.
3. An exchange admin will then need to go into the ADFS environment during this window of time and enable the new certificate on the Microsoft environment.
If the process had been designed to be completely manual, then Dartmouth's AIG would have known that the certificate was going to expire and would have completed the process in full.
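The window in which the manual step must happen can be computed from the expiration date. The sketch below is an illustration only: the function names are hypothetical, the year 2012 is inferred from this report's date, and the offsets of 21 and 14 days are assumptions taken from the dates in this report (a certificate generated three weeks before a May 15th expiration and put into use on May 1). Microsoft has not yet confirmed the exact number of days.

```python
from datetime import date, timedelta


def rollover_window(expires: date, generate_days_before: int,
                    promote_days_before: int):
    """Return (generated_on, promoted_on) for an automatic certificate rollover."""
    if not 0 <= promote_days_before < generate_days_before:
        raise ValueError("promotion must fall after generation and not after expiry")
    generated_on = expires - timedelta(days=generate_days_before)
    promoted_on = expires - timedelta(days=promote_days_before)
    return generated_on, promoted_on


def in_manual_window(today: date, expires: date, generate_days_before: int,
                     promote_days_before: int) -> bool:
    """True while the new certificate exists but is not yet in use."""
    generated_on, promoted_on = rollover_window(
        expires, generate_days_before, promote_days_before)
    return generated_on <= today < promoted_on


# Assumed values based on the dates in this report; not confirmed by Microsoft.
EXPIRES = date(2012, 5, 15)
generated_on, promoted_on = rollover_window(EXPIRES, 21, 14)
print(generated_on, promoted_on)                            # 2012-04-24 2012-05-01
print(in_manual_window(date(2012, 4, 27), EXPIRES, 21, 14))  # True
print(in_manual_window(date(2012, 5, 1), EXPIRES, 21, 14))   # False
```

A check like `in_manual_window` could be run daily to alert administrators while the window is open, which would provide the visibility that the partially automated process lacked.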
Last Updated: 5/2/12