Monday, 11 January 2021 13:16

How GitHub revamped its on-call strategy for over 50 engineering teams


Global open-source software host, GitHub, serves 56 million users and is the custodian of billions of lines of open-source code. With such a product, on-call is part of life, but it doesn’t have to come at the expense of work-life balance or accepting technical debt.

Whether we're performing routine maintenance out-of-hours, or we're in the midst of a global pandemic, robust on-call is essential for any technical organisation.

Like Google, Netflix and many other companies, small or planet-sized, GitHub found that as its product offerings grew in number and complexity, it became critical to evolve its on-call strategy to keep pace with its 56 million, and growing, users.

GitHub previously had a monolithic on-call structure but transformed it so that each of its 50+ engineering teams is responsible for the code it maintains. This included assigning ownership of 16,000 files from the monolithic Ruby on Rails codebase to those teams, while dealing with educational hurdles and addressing work-life balance.

This was not purely a logistical effort; Mary Moore Simmons, Director of Engineering at GitHub, also scoped out the various cultural and educational hurdles, including the arrival of COVID-19, cultivating a blameless culture, creating training, dealing with differing levels of criticality, and focusing on long-term success.

She detailed the adventure in a recent blog post, distilling her learnings and challenges for the benefit of IT professionals globally. In fact, putting aside GitHub’s size for a moment, the problems GitHub experienced are not so dissimilar to any other organisation.

Previously, GitHub's monolith spanned a huge number of products and features. Most engineers did not have enough familiarity with great swathes of the codebase to feel confident when responding to on-call incidents. This meant frequent escalations to another team, and the on-call engineer often felt more like a switchboard operator than an engineer.

Additionally, the on-call rotation was large, with a 24-hour on-call period. Consequently, engineers were on call only about four times per year, and many never gained enough context to build that confidence.
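As a back-of-envelope illustration of those figures, the arithmetic can be sketched as below. The rotation size is not stated in the source; the number produced here is only what the quoted figures would imply, assuming exactly one engineer is on call at a time and shifts are shared evenly. The function names are hypothetical.

```python
DAYS_PER_YEAR = 365


def shifts_per_year(rotation_size, shift_days=1):
    """Shifts each engineer takes per year, assuming one person is on call
    at a time and the rotation is shared evenly (an assumption)."""
    return DAYS_PER_YEAR / (rotation_size * shift_days)


def implied_rotation_size(shifts_per_engineer, shift_days=1):
    """Rotation size implied by how often each engineer is on call."""
    return DAYS_PER_YEAR / (shifts_per_engineer * shift_days)


if __name__ == "__main__":
    # 24-hour shifts, roughly four shifts per engineer per year.
    print(implied_rotation_size(4))
```

Under these assumptions, four 24-hour shifts a year corresponds to a rotation of roughly ninety engineers, which helps explain why few of them built up on-call context.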

Compounding the situation, monitoring and documentation were not well maintained, because the rotation was spread across so many engineers and each had to deal with it for only 24 hours at a time. Without determined effort, the result was noisy alerts and poor runbooks.

Then, because most engineers weren't confident with the monolithic on-call shift, the same small group of people who knew the platform best were involved in every production incident, causing an imbalance in on-call responsibilities and taking their time away from the other projects they were involved in.

Simmons knew something had to be done. The first step was logistical: assigning ownership of specific code to specific engineering teams. With a 16,000+ file codebase in one monolith, this was no small effort. To resolve it, GitHub rolled out a new system to associate files with services, and services with teams. For example, components of the API belong to the apps team while the permissions model belongs to the authorisation team.

This detail is now pulled into an internal Service Catalog, so any GitHub staff member can identify unambiguously which engineering team owns which service.
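The two-level association described above (files to services, services to teams) can be sketched as a simple lookup. This is a minimal illustration, not GitHub's actual system: the path patterns, service names and the `owning_team` function are all hypothetical, and only the apps and authorisation team examples come from the article.

```python
from fnmatch import fnmatch

# Hypothetical mapping of path patterns to services (paths are invented).
FILE_TO_SERVICE = [
    ("app/api/*", "api"),
    ("app/models/permissions/*", "permissions"),
]

# Hypothetical mapping of services to teams, mirroring the article's examples.
SERVICE_TO_TEAM = {
    "api": "apps",
    "permissions": "authorisation",
}


def owning_team(path):
    """Return the team that owns `path`, or None if no pattern matches."""
    for pattern, service in FILE_TO_SERVICE:
        if fnmatch(path, pattern):
            return SERVICE_TO_TEAM.get(service)
    return None


if __name__ == "__main__":
    print(owning_team("app/api/repos.rb"))
    print(owning_team("lib/unmapped.rb"))
```

Keeping services as an intermediate layer means a team reorganisation only touches the service-to-team table, not thousands of per-file entries.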

To ensure compliance, a new lint rule was added that prevented any code being updated in, or added to, the monolith without the ownership information being supplied.
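A check of this kind might look like the sketch below. It is an assumption-laden illustration rather than GitHub's lint rule: the owned path patterns and the `unowned_files` and `lint` functions are hypothetical, and a real rule would read ownership from wherever it is actually recorded.

```python
from fnmatch import fnmatch

# Hypothetical list of path patterns that already have an owner recorded.
OWNED_PATTERNS = ["app/api/*", "app/models/permissions/*"]


def unowned_files(changed_paths, owned_patterns=OWNED_PATTERNS):
    """Return the changed paths that match no ownership pattern."""
    return [
        path
        for path in changed_paths
        if not any(fnmatch(path, pattern) for pattern in owned_patterns)
    ]


def lint(changed_paths):
    """Report unowned files; return a non-zero status if any were found."""
    missing = unowned_files(changed_paths)
    for path in missing:
        print(f"ownership missing: {path}")
    return 1 if missing else 0


if __name__ == "__main__":
    raise SystemExit(lint(["app/api/repos.rb", "lib/new_feature.rb"]))
```

Returning a non-zero status lets a continuous integration job fail the change until ownership information is supplied.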

Monitoring and alerting were split up so that each team set up monitoring for its own area only, and ultimately many no-longer-needed alerts were decommissioned entirely.

Helpfully, so that the many diverse teams knew what they needed to do, GitHub “ate its own dog food,” creating GitHub issues for every team with clear checklists.

It wasn't all smooth sailing. In fact, GitHub began this journey back in 2019, some seven months before COVID-19 was announced as a global pandemic. The added stress of a pandemic magnified anxieties and necessitated changing the project management to a higher-touch, empathy-first approach.

Many engineers had never been on-call in the past and lacked experience with operational best practice. To cater for this, Simmons and her team designed and delivered training, created significant tooling and documentation, and opened Slack channels where anyone could ask for help.

Understandably, some engineers were anxious about the impact on-call would have on their lives. How would they respond to a page within minutes while attempting everyday tasks like grocery shopping? Simmons and her team worked with engineers to understand their concerns, documented tips and tricks from experienced on-call engineers, and worked with people one-on-one where needed. They also reinforced that team members are there to support each other; someone could take over on-call for a couple of hours if a colleague needed to go for a run or handle childcare, for example. In fact, leveraging GitHub’s global presence meant they could lean on team members in other parts of the world to take over on-call duties during their own ordinary working hours.
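The idea of leaning on colleagues in other time zones can be sketched as a small lookup of who is currently inside their working day. Everything here is hypothetical: the names, the UTC offsets and the 09:00 to 17:00 window are assumptions, fixed offsets ignore daylight saving, and weekends are not considered.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical teammates and their fixed UTC offsets in hours.
TEAMMATES = {"alice": -8, "bruno": 1, "chen": 10}


def in_working_hours(utc_now, utc_offset_hours, start=9, end=17):
    """True if the local hour at the given offset falls in [start, end)."""
    local = utc_now.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return start <= local.hour < end


def available_cover(utc_now, teammates=TEAMMATES):
    """Names of teammates who are within their working hours right now."""
    return sorted(
        name
        for name, offset in teammates.items()
        if in_working_hours(utc_now, offset)
    )


if __name__ == "__main__":
    print(available_cover(datetime.now(timezone.utc)))
```

A real schedule would more likely live in a paging tool; the sketch only shows why a globally distributed team can hand off on-call without anyone working through the night.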

An essential message Simmons and her team conveyed was that of a blameless culture. They found that another source of anxiety was engineers' concern about letting their team down while working on-call. As an organisation, GitHub reinforced that mistakes are OK and outages happen, and that people who bravely work on something they’re not familiar with when on-call ought to be celebrated.

The engineering teams have different levels of criticality; some need resolution within minutes, while others can wait until the next business day. Some engineers were concerned this created an unfair imbalance, but GitHub sees this as a self-resolving problem. Some engineers want to work on more business-critical and technically complex systems, while others have different interests, and thus each engineer will naturally select teams with the operational rigour they identify most strongly with.

Importantly - and a lesson for any organisation - the whole on-call experience needed to feed back into itself to make the overall experience better. When not responding to pages, the person on-call ought to be active: updating runbooks, tuning noisy alerts, scripting or automating on-call tasks, and fixing the underlying technical debt.

By the end of this journey, Simmons notes, GitHub's incident resolution time had improved, but the journey is not over. Nor will it ever be; organisations need to constantly improve their best practices, and the cultural changes Simmons and her team identified need continual promotion.

They also need continual feedback, and to that end Simmons states they will regularly survey engineers about their on-call experience, so that GitHub keeps learning and improving in its drive for excellence and continues to be the trusted home for all developers.




David M Williams

David has been computing since 1984 where he instantly gravitated to the family Commodore 64. He completed a Bachelor of Computer Science degree from 1990 to 1992, commencing full-time employment as a systems analyst at the end of that year. David subsequently worked as a UNIX Systems Manager, Asia-Pacific technical specialist for an international software company, Business Analyst, IT Manager, and other roles. David has been the Chief Information Officer for national public companies since 2007, delivering IT knowledge and business acumen, seeking to transform the industries within which he works. David is also involved in the user group community, the Australian Computer Society technical advisory boards, and education.
