Whether we're performing routine maintenance out-of-hours, or we're in the midst of a global pandemic, robust on-call is essential for any technical organisation.
As with Google, Netflix and many other companies, small organisations as well as planet-sized ones, GitHub found that as its product offerings grew in number and complexity, it became critical to evolve its on-call strategy to keep pace with its 56 million, and growing, users.
GitHub previously had a monolithic on-call structure but transformed it so that each of its 50+ engineering teams is responsible for the code it maintains. This included assigning ownership of 16,000 files from the monolithic Ruby on Rails codebase to those teams while dealing with educational hurdles and addressing work-life balance.
This was not purely a logistical effort; Mary Moore Simmons, Director of Engineering at GitHub, also had to work through a range of cultural and educational challenges, including the arrival of COVID-19, cultivating a blameless culture, creating training, dealing with differing levels of criticality, and focusing on long-term success.
She detailed the journey in a recent blog post, distilling her learnings and challenges for the benefit of IT professionals globally. In fact, putting aside GitHub’s size for a moment, the problems GitHub experienced are not so dissimilar to those of any other organisation.
Under the previous monolithic structure, the on-call rotation was large, with a 24-hour on-call period. Consequently, engineers were only actually on-call about four times per year, and many never gained the context needed to handle a shift with confidence.
Compounding the situation, the monitoring and documentation were not well maintained because the on-call rotation was spread so thinly and engineers only had to deal with it for 24 hours at a time. Without determined effort, the result was noisy alerts and poor runbooks.
Then, because most engineers weren't confident with the monolithic on-call shift, the same small group of people who knew the platform best were involved in every production incident, causing an imbalance in on-call responsibilities and taking their time away from the other projects they were involved in.
Simmons knew something had to be done. The first step was logistical: assigning ownership of specific code to specific engineering teams. With a 16,000+ file codebase in one monolith, this was no small effort. To tackle it, GitHub rolled out a new system to associate files with services, and services with teams. For example, components of the API belong to the apps team while the permissions model belongs to the authorisation team.
This detail is now pulled into an internal Service Catalog, so any GitHub staff member can identify unambiguously which engineering team owns which service.
To ensure compliance, a new lint rule was added that prevented any code being updated in, or added to, the monolith without the ownership information being supplied.
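As a rough illustration of the idea, and not GitHub's actual implementation, the files-to-services-to-teams mapping and the ownership lint rule might be sketched as follows. The file patterns, service names and function names here are hypothetical; only the two example ownerships (API components to the apps team, the permissions model to the authorisation team) come from the description above.

```python
from fnmatch import fnmatch

# Hypothetical mapping of file patterns to services (paths are illustrative).
FILE_TO_SERVICE = {
    "app/api/*": "api",
    "app/models/permissions/*": "permissions",
}

# Hypothetical mapping of services to owning teams.
SERVICE_TO_TEAM = {
    "api": "apps",
    "permissions": "authorisation",
}


def owning_team(path):
    """Return the team that owns `path`, or None if no pattern matches."""
    for pattern, service in FILE_TO_SERVICE.items():
        if fnmatch(path, pattern):
            return SERVICE_TO_TEAM.get(service)
    return None


def lint_ownership(changed_paths):
    """Return the changed paths that have no owning team.

    A lint step could fail the change whenever this list is non-empty.
    """
    return [path for path in changed_paths if owning_team(path) is None]


if __name__ == "__main__":
    unowned = lint_ownership(["app/api/users.rb", "lib/orphan.rb"])
    for path in unowned:
        print(f"No owner recorded for {path}")
```

Routing files through services rather than mapping them straight to teams means that when a service changes hands, only one entry needs to be updated.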
Monitoring and alerting were split up so that each team set up monitoring for its own area only, and ultimately many no-longer-needed alerts were decommissioned entirely.
Fittingly, to help the many diverse teams know what they needed to do, GitHub “ate its own dog food,” creating GitHub issues for every team with clear checklists.
It wasn't all smooth sailing. GitHub began this journey back in 2019, some seven months before COVID-19 was declared a global pandemic. The added stress of a pandemic magnified anxieties and necessitated shifting the project management to a higher-touch, empathy-first approach.
Many engineers had never been on-call in the past and lacked experience with operational best practice. To cater for this, Simmons and her team designed and delivered training, created significant tooling and documentation, and opened Slack channels where anyone could ask for help.
Understandably, some engineers were anxious about the impact on-call would have on their lives. How would they respond to a page within minutes while attempting to do everyday tasks like grocery shopping? Simmons and her team worked with engineers to understand their concerns, documented tips and tricks from experienced on-call engineers, and worked with people one-on-one where needed. They also reinforced that team members are there to support each other; someone could take over on-call for a couple of hours if a colleague needed to go for a run or handle childcare, for example. In fact, leveraging GitHub’s global presence meant they could lean on team members in other parts of the world to cover on-call during their ordinary working hours.
An essential message from Simmons and her team was that of a blameless culture. Another anxiety they found was among engineers who were concerned about letting their team down while working on-call. As an organisation, GitHub reinforced that mistakes are OK, outages happen, and people who bravely work on something they’re not familiar with when on-call ought to be celebrated.
Engineering teams have different levels of criticality; some need resolution within minutes, while others can wait until the next business day. Some engineers were concerned this created an unfair balance, but GitHub sees this as a self-resolving problem. Some engineers want to work on more business-critical and technically complex systems, while others have different interests, and thus each engineer will naturally gravitate to teams with the operational rigour they identify most strongly with.
Importantly - and a lesson for any organisation - the whole on-call experience needed to feed back into itself to make the overall experience better. When not responding to pages, the person on-call ought to be actively updating runbooks, tuning noisy alerts, scripting or automating on-call tasks, and fixing the underlying technical debt.
By the end of this journey, Simmons notes, GitHub's incident resolution time had improved, but the journey is not over. Nor will it ever be; organisations need to constantly improve their best practices, and the cultural changes Simmons and her team identified need continual promotion.
They also need continual feedback. To that end, Simmons states, GitHub will regularly survey engineers about their on-call experience so that it keeps learning and improving in its drive for excellence, and continues to be the trusted home for all developers.