Author's Opinion

The views in this column are those of the author and do not necessarily reflect the views of iTWire.


Monday, 10 November 2008 17:53

How two of the world's largest websites use Linux for high availability

By David M Williams
Pop quiz: you have a web site and you want it to be popular. It must scale to tens of thousands, hundreds of thousands, even millions of visitors. It has to be snappy and responsive. What server platform will you host it on? Here's what two of the world's most popular sites – Wikipedia and Digg – went with, and it wasn't Windows.

I have no doubt whatsoever you know about Wikipedia, so an introduction is probably unnecessary: it is a comprehensive online encyclopaedia that anyone can edit, which keeps its content current but has also made it controversial for its potential for abuse.

Wikipedia is rated by Alexa as having a traffic rank of 8. Alexa is an independent website monitoring company that provides site ranking information based on a variety of sources over a rolling three-month period. Primarily, it measures the number of individual pages viewed by individual people.

By rating Wikipedia an '8', Alexa is saying that Wikipedia is the eighth most popular web site of every site the company collects data on. By contrast, my personal blog – low traffic, little content, updated infrequently – has a paltry rank of 7,476,670. This site, iTWire.com, is ranked 39,326. If you're curious, there's an interesting article comparing Wikipedia's Alexa ranking against many, many other web sites.

If you have a site as massive as Wikipedia, hit by as many people, as many times a day, as it is, what are you going to do? We're talking more than 10 million distinct articles, in 250 different languages, served to over 684 million people per year.

The answer is that Wikipedia has a massive server farm driving its web site – some 400 servers, in fact. And each and every one runs Linux.

Previously, Wikipedia had a mix of Red Hat and Fedora Linux installations. These aren't surprising choices. Red Hat Enterprise Linux is a very well known and popular enterprise-grade server product. Fedora comes from the same company; it is Red Hat's community-supported distribution, which omits some of its big brother's enterprise grunt.

You'll appreciate that the systems administration burden of managing upgrades, patches and software compatibility is eased considerably if every server runs the exact same platform. Therefore, Wikipedia decided to migrate every server to a single Linux distribution across the board. It chose Ubuntu Linux, which is arguably the most popular Linux distribution today.

Now, here’s something I really want to hit on: how many systems administrators does it take to manage 400 Linux servers? Let me tell you!

Five.

That's right: it takes just five full-time IT staff to run each and every one of Wikipedia's 400 Linux servers.

Actually, I exaggerate. These five people are Wikipedia's IT team in general; they have other duties to perform besides running the servers. So we're talking about a massive web site that demands high availability, handled at a ratio of 80 servers per person.

Now, I don't want to be presumptuous, but do you believe a single Windows administrator could maintain 80 servers by him or herself? What do you think the effort involved in distributing patches would be? Or, perhaps I should ask, what would the cost be if you chose to implement Microsoft System Center, Symantec Altiris or other such big hitters?

These servers fill a variety of roles: there are the primary web servers, database servers and caching proxy servers. The move to Ubuntu wasn't effortless; in fact, Wikipedia spread the project over almost two years.

However, the results have been tremendous. Wikipedia’s Chief Technology Officer Brion Vibber said that everything “has gotten a lot simpler. Mass upgrades can be done more easily, and the data center can be managed as a unit.”

“We can run the same combination everywhere, and it does the same thing. Everything is a million times easier,” he said.

Some commentators online have criticised the choice of Ubuntu Linux. Make no mistake, though: nobody has argued Wikipedia should have opted for any version of Microsoft's Windows server products. Instead, debate arose over whether CentOS Linux – a free distribution compiled from the Red Hat Enterprise Linux sources – or another distribution would have been a better choice.

Nevertheless, Ubuntu or not, the fact remains that it is Linux that Wikipedia chose to run its server farm, spread over data centres in Tampa, Florida, as well as in South Korea and Amsterdam.

Linux has proven successful for Wikipedia. It has enjoyed reliable uptime and performance, and it has scaled as its article count and user base have grown.

And, importantly for a non-profit foundation, there is no software cost for Wikipedia whatsoever. Can you imagine just how much a Windows solution would cost? Instead, 100% of the infrastructure funding can be spent on hardware, bandwidth and server hosting.

What's more, huge slabs of administration can be scripted and automated far more easily and elegantly than is possible in a Windows environment, thanks to the remote management that has always been an integral component of Linux, with servers naturally operating in a 'headless' fashion.
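
To give a flavour of what that scripting looks like, here is a minimal sketch – assuming key-based SSH logins and a hypothetical servers.txt host list, not Wikipedia's actual tooling – of how a lone administrator might push a package upgrade across an entire Ubuntu fleet from one Python script:

    #!/usr/bin/env python3
    """Sketch: run an unattended package upgrade across a fleet of headless
    Linux servers over SSH. Assumes key-based SSH logins and a hypothetical
    servers.txt listing one hostname per line -- an illustration only, not
    Wikipedia's actual tooling."""

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    UPGRADE = "sudo apt-get update && sudo apt-get -y upgrade"

    def upgrade(host):
        # BatchMode stops ssh prompting for input; the box is headless, after all.
        result = subprocess.run(["ssh", "-o", "BatchMode=yes", host, UPGRADE],
                                capture_output=True, text=True)
        return host, result.returncode

    with open("servers.txt") as f:
        hosts = [line.strip() for line in f if line.strip()]

    # Twenty servers at a time: one administrator, hundreds of machines.
    with ThreadPoolExecutor(max_workers=20) as pool:
        for host, code in pool.map(upgrade, hosts):
            print(host, "ok" if code == 0 else "failed (%d)" % code)

Because every box runs the same distribution, the same command works everywhere – which is precisely why Wikipedia's standardisation on a single distribution pays off.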

How about Digg then? And what relevance does all this have for those of us who are mere mortal web site owners?

Chances are, if you're reading iTWire, you're also no stranger to Digg, a very popular Web 2.0 technology news website.

Digg combines social bookmarking, blogging and RSS syndication with non-hierarchical editorial control. Digg's users submit stories, and it is the users themselves who decide, by popular vote, which stories make the homepage.

Digg works because it has a large user base – some 22 million individuals – who actively promote good stories to the homepage and vote down weaker stories or those with only limited appeal.

Digg isn't quite up to Wikipedia's level of general Internet awareness, but it is still right up there as one of the most heavily trafficked web sites on the Internet. Alexa ranks it at 276.

Digg handles billions of page requests a month, which necessitates a solid and reliable infrastructure. In fact, Digg is well known for its ability to bring less-capable web sites to their knees: a story that hits the Digg front page can send an abnormal spike of thousands upon thousands of new visitors to the site it links to. Yet Digg's own ordinary daily traffic far exceeds anything it sends to others; it must cope with loads other websites can only dream of.

Like Wikipedia, Digg opted for a Linux solution – a choice that continues despite an advertising deal with Microsoft.

Digg's infrastructure is so massive that its Systems Engineering Lead would have difficulty giving you an exact count of the servers. There are web servers, database servers, and even six specialised database servers just to implement the recommendation engine.

Debian Linux is used across the board, running a mixture of free open source software along with some custom-written specialised apps.

A request to Digg's site first hits a load balancer: a pool of servers that distributes incoming requests, serves cached data, and monitors its peers so that if any server fails, the others swiftly take on its load and users don't even notice.

From the load balancer, web requests are handed to application servers – a combination of Apache, PHP, Memcached and Gearman – which serve up web pages and marshal database connections as required.
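
Digg's application layer is PHP, but the caching pattern that Memcached enables is easy to sketch in Python. Everything below is illustrative – get_story_from_db() is a hypothetical stand-in for a real MySQL query – yet the check-the-cache-first logic is the standard pattern:

    import json
    from pymemcache.client.base import Client  # pip install pymemcache

    cache = Client(("localhost", 11211))

    def get_story_from_db(story_id):
        # Hypothetical placeholder for a real database query.
        return {"id": story_id, "title": "Example story", "diggs": 42}

    def get_story(story_id):
        key = "story:%d" % story_id
        cached = cache.get(key)                 # 1. try the cache first
        if cached is not None:
            return json.loads(cached)           # hit: the database is never touched
        story = get_story_from_db(story_id)     # 2. miss: fall through to the database
        cache.set(key, json.dumps(story), expire=60)  # 3. keep it for 60 seconds
        return story

Under Digg-scale load, the vast majority of requests are answered from memory this way, and only a small fraction ever reach the database servers.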

The databases are all MySQL and are broken up across four masters with a load of slaves. All database writes go to the masters and all reads to the slaves.
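
That split is simple to express in code. In this minimal sketch, the hostnames, credentials and the random choice of master are all assumptions for illustration – a real deployment would route each write to a specific master by key – using MySQL Connector/Python:

    import random
    import mysql.connector  # pip install mysql-connector-python

    MASTERS = ["db-m1", "db-m2", "db-m3", "db-m4"]   # the four write masters
    SLAVES = ["db-s1", "db-s2", "db-s3"]             # in reality, many more

    def connect(host):
        return mysql.connector.connect(host=host, user="app",
                                       password="secret", database="digg")

    def write(sql, params=()):
        # Writes always go to a master (picked at random purely for this
        # sketch; a real shard scheme routes each write by key).
        conn = connect(random.choice(MASTERS))
        try:
            cur = conn.cursor()
            cur.execute(sql, params)
            conn.commit()
        finally:
            conn.close()

    def read(sql, params=()):
        # Reads are spread across the slaves, keeping load off the masters.
        conn = connect(random.choice(SLAVES))
        try:
            cur = conn.cursor()
            cur.execute(sql, params)
            return cur.fetchall()
        finally:
            conn.close()

The pay-off is that read capacity scales by simply adding slaves, while the masters carry only the far smaller write load.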

This setup works successfully for Digg and for its many millions of visitors. It provides massive uptime, enormous scalability and unbelievable reliability.

Many of us can only dream of having web traffic so heavy that such high availability is a concern. Even so, we can all learn from the genuine lessons that sites like Wikipedia and Digg can teach us, based on their vast real-world experience with traffic of such volume.

And one such lesson is that Linux simply works and can be counted on. If you want performance and reliability, think Linux.

David M Williams

David has been computing since 1984 where he instantly gravitated to the family Commodore 64. He completed a Bachelor of Computer Science degree from 1990 to 1992, commencing full-time employment as a systems analyst at the end of that year. David subsequently worked as a UNIX Systems Manager, Asia-Pacific technical specialist for an international software company, Business Analyst, IT Manager, and other roles. David has been the Chief Information Officer for national public companies since 2007, delivering IT knowledge and business acumen, seeking to transform the industries within which he works. David is also involved in the user group community, the Australian Computer Society technical advisory boards, and education.
