Wikipedia is rated by Alexa as having a traffic rank of 8. Alexa is an independent website monitoring company that provides site ranking information based on a variety of sources over a rolling three-month period. Primarily, it measures the number of individual pages visited by individual people.
By rating Wikipedia an ‘8’, Alexa is saying that Wikipedia is the 8th most popular web site out of every site the company collects data on. For contrast, my personal blog – low traffic, little content, updated infrequently – has a paltry rank of 7,476,670. This site, iTWire.com, is ranked 39,326. If you’re interested, here’s an article comparing Wikipedia’s Alexa ranking against many, many other web sites.
If you have a site as massive as Wikipedia, hit as many times a day, by as many people, as it is, what are you going to do? We’re talking more than 10 million distinct articles, in 250 different languages, served to over 684 million people per year.
The answer is that Wikipedia have a massive server farm driving their web site – some 400 servers, in fact. And each and every one runs Linux.
Previously, Wikipedia had a mix of Red Hat and Fedora Linux installations. These aren’t surprising choices. Red Hat Linux is a very well known and popular enterprise-grade server product, and Fedora comes from the same company: it is Red Hat’s community-supported distribution, which omits some of its big brother’s grunt.
You’ll appreciate that the systems administration burden of managing upgrades and patches and software compatibilities is made easier if every server has the exact same platform. Therefore, Wikipedia decided to migrate every server to a single Linux distribution across the board. They chose Ubuntu Linux, which is arguably the most popular Linux release today.
Now, here’s something I really want to hit on: how many systems administrators does it take to manage 400 Linux servers? Let me tell you!
That’s right, it takes just five full time IT staff to run each and every one of Wikipedia’s 400 Linux servers.
Now, I don’t want to be presumptuous, but do you believe one single Windows administrator can maintain 80 servers by him or herself? What do you think the effort involved in distributing patches would be? Or, perhaps I should ask what the cost would be, if you chose to implement Microsoft System Center or Symantec Altiris or other such big hitters?
These servers serve a variety of purposes: there are the primary web servers, database servers and also caching proxy servers. The move to Ubuntu wasn’t effortless; in fact, Wikipedia broke the project up over almost two years.
However, the results have been tremendous. Wikipedia’s Chief Technology Officer Brion Vibber said that everything “has gotten a lot simpler. Mass upgrades can be done more easily, and the data center can be managed as a unit.”
“We can run the same combination everywhere, and it does the same thing. Everything is a million times easier,” he said.
Some commentators online have criticised the choice of Ubuntu Linux. Make no mistake, though: nobody has argued Wikipedia should have opted for any version of Microsoft’s Windows server products. Instead, debate arose over whether CentOS Linux – a free distribution compiled from the Red Hat Enterprise Linux sources – or other distributions would have been a better choice.
Nevertheless, Ubuntu or not, the fact remains that it is Linux which Wikipedia chose to run its server farm, spread over data centres in Tampa, Florida, as well as in South Korea and Amsterdam.
Linux has proven successful for Wikipedia: reliable uptime, solid performance, and the ability to scale as the article count and user base have grown.
And importantly, for a non-profit foundation, there is no software cost for Wikipedia whatsoever. Can you imagine just how much a Windows solution would cost? 100% of the funding for infrastructure can be spent on hardware, bandwidth and server hosting.
What’s more, huge slabs of administration can be scripted and automated far more easily and elegantly than in a Windows environment, because remote, distributed management has always been an integral part of Linux, with servers naturally operating in a ‘headless’ fashion.
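To make that concrete, here’s a minimal sketch of the kind of fleet automation a uniform platform allows. The host names and the upgrade command are hypothetical, and a real deployment would use proper tooling; this just shows how naturally headless servers lend themselves to scripting.

```python
# Hypothetical sketch: run the same maintenance command across a fleet
# of headless Linux servers over ssh. Host names are made up.
import subprocess

SERVERS = ["web01", "web02", "db01"]  # hypothetical host names

def build_command(host, remote_cmd):
    """Compose the ssh invocation that runs a command on one server."""
    return ["ssh", host, remote_cmd]

def patch_all(servers, remote_cmd="sudo apt-get -y upgrade", dry_run=True):
    """Run the same upgrade command on every server in the fleet.

    With dry_run=True the commands are only returned, not executed,
    so you can inspect what would happen before committing.
    """
    commands = [build_command(host, remote_cmd) for host in servers]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

Because every server runs the identical platform, the one command works everywhere; that is precisely the simplification Wikipedia gained by standardising.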
How about Digg then? And what relevance does all this have for those of us who are mere mortal web site owners?
Chances are if you’re reading iTWire then you’re also no stranger to Digg, a very popular Web 2.0 technology news website.
Digg works because it has a large user base – some 22 million individuals – who actively promote good stories to the homepage and vote down weaker stories or those with only a limited audience or amount of interest.
Digg isn’t quite up to Wikipedia’s level of general Internet awareness but it is still right up there as one of the top trafficked web sites on the Internet. Alexa ranks it at 276.
Digg handles billions of page requests a month, which necessitates a solid and reliable infrastructure. In fact, Digg is well known for its ability to bring less-capable web sites to their knees: when a story hits the Digg front page, the site it links to receives an abnormal spike of thousands upon thousands of new visitors. Yet Digg’s own ordinary daily usage far exceeds any traffic it sends to others; it must cope with loads other websites can only dream of.
Like Wikipedia, Digg opted for a Linux solution – which continues, despite an advertising deal with Microsoft.
Digg’s infrastructure is so massive that the Systems Engineering Lead would have difficulty giving you an exact count of the servers. There are web servers, and database servers, and even six specialised database servers just to implement the recommendation engine.
Debian Linux is used across the board, along with a mixture of free open source software and some custom-written specialised apps.
A request to Digg’s site first hits a load balancer; this is actually a pool of servers which balance incoming requests, cache data, and monitor each other, so that if any server fails the others swiftly take on its load and users don’t even notice.
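The failover behaviour described above can be sketched in a few lines. This is a toy model, not Digg’s actual balancer: real ones run active health checks, but the principle of skipping a dead backend in the rotation is the same.

```python
from itertools import cycle

class LoadBalancer:
    """Toy round-robin balancer: backends marked down are skipped, so a
    failed server's load moves to the survivors without users noticing.
    A real balancer would probe health actively; this is hand-marked."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(backends)
        self._rotation = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend in round-robin order."""
        for _ in range(len(self.backends)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")
```

When one backend goes down, requests simply flow around it; from the outside nothing appears to have changed.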
After the load balancer, web requests are handed to application servers – a combination of Apache, PHP, Memcached and Gearman – which serve up web pages and marshal database connections as required.
The databases are all MySQL and are broken up across four masters with a load of slaves. All database writes go to the masters and all reads to the slaves.
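That read/write split can be expressed as a tiny routing rule. This is a hypothetical sketch of the idea, not Digg’s code: writes must go to a master so they replicate out, while reads can be spread across the slaves, accepting slightly stale data in exchange for scale.

```python
import itertools

class DBRouter:
    """Sketch of read/write splitting over MySQL replication: all
    writes go to a master, reads rotate round-robin across slaves.
    Names and the statement-sniffing rule are illustrative only."""

    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE"}

    def __init__(self, masters, slaves):
        self._masters = itertools.cycle(masters)
        self._slaves = itertools.cycle(slaves)

    def connection_for(self, statement):
        # Anything that modifies data must hit a master so the change
        # replicates; everything else can be served by a slave.
        verb = statement.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return next(self._masters)
        return next(self._slaves)
```

Scaling reads is then just a matter of adding more slaves behind the router, which is exactly why this topology suits a read-heavy site like Digg.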
This setup works successfully for Digg and for its many millions of visitors. It provides massive uptime, enormous scalability and unbelievable reliability.
Many of us can only dream of web traffic so heavy that such high availability is a concern. Even so, we can all still learn from the genuine lessons that sites like Wikipedia and Digg can teach us, based on their vast real-world experience with data of such volume.
And one such lesson is that Linux simply works and can be counted on. If you want performance and reliability, think Linux.