Thursday, 25 October 2018 09:49

Amazon's Vogels claims outage report 'silly and misleading'


Amazon chief technology officer Werner Vogels has claimed that the problems faced by the company's website on Prime Day, its made-up holiday in July meant to boost sales, had nothing to do with Amazon Web Services or the new database that Amazon is using. He described a CNBC article about the problems as "silly and misleading".

Vogels did not mention the fact that the CNBC article — which was quoted by iTWire among other publications — had actually said that the problem had to do with the process of migration from Oracle's database to PostgreSQL, an open-source database sold to Amazon by a provider named Aurora.

In a tweet, to which he had appended a statement, Vogels said the company's fulfilment centres had migrated 92% of their databases from Oracle to PostgreSQL "with better availability, less bugs and patches, less troubleshooting and less cost".

CNBC said 13 warehouses had moved from Oracle to PostgreSQL before Prime Day. It made no mention of AWS in its report.

"Never let facts interrupt a 'good story'," Vogels wrote. "Tried to help reporter get it right, but clickbait won."

The CNBC report was based on internal Amazon documents running to 25 pages which showed that the retail giant's technical staff struggled to find out the root cause of the issues on Prime Day.

Vogels said the website issues had resulted from a problem in the Amazon retail software stack. He did not specify which part of the stack was affected.

The internal documents that CNBC cited were said to have been "apparently obtained" and, according to Vogels, detailed "a completely unrelated issue in a single fulfilment centre (out of more than 185 worldwide)".

This had "led to the slowing of processing in the fulfilment operations and a slight delay in shipping of products from that facility alone", Vogels claimed, not giving the location of the facility in question. CNBC had said the Prime Day issues had arisen at a time when Amazon had an issue at one of its bigger warehouses in Ohio.

Vogels said: "There was never an outage at the facility, and the issue only resulted in delaying shipping of about 1% of packages for a short period of time (unnoticeable to customers)."

CNBC had said that, according to the internal Amazon report, 15,000 packages had been delayed and about US$90,000 in labour costs wasted.

The CNBC story itself did not claim that there was an outage at the Ohio facility, though its headline did. But the Amazon website did go down, according to a report in The Verge which said it had crashed. Such a failure is commonly referred to as an outage.

Vogels said the application in question — which he did not name — "accidentally created an excessive number of savepoints, despite the team knowing that Aurora and Oracle handle savepoints differently".

This appears to tally with the CNBC report which had said that the Prime Day issue was connected to the fact that Oracle and PostgreSQL handle savepoints differently.
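For readers unfamiliar with the term, a savepoint marks a point inside a database transaction to which the transaction can later be rolled back without abandoning the whole transaction. The sketch below is illustrative only and is not taken from Amazon's code or from the documents CNBC cited. It uses Python's built-in sqlite3 module as a stand-in for Oracle or Aurora, and the table name, function names and sample rows are all hypothetical. It shows how an application can end up creating one savepoint per row, and how the same work can be done with none.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical 'items' table."""
    # isolation_level=None lets us issue BEGIN/COMMIT/SAVEPOINT ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, qty INTEGER)")
    return conn


def insert_with_savepoints(conn, rows):
    """Insert rows, wrapping each one in its own savepoint.

    Returns the number of savepoints created (one per row).
    """
    created = 0
    conn.execute("BEGIN")
    for i, (item_id, qty) in enumerate(rows):
        conn.execute(f"SAVEPOINT sp_{i}")
        created += 1
        try:
            conn.execute(
                "INSERT INTO items (id, qty) VALUES (?, ?)", (item_id, qty)
            )
        except sqlite3.IntegrityError:
            # Undo only this row's work; the rest of the transaction survives.
            conn.execute(f"ROLLBACK TO SAVEPOINT sp_{i}")
        conn.execute(f"RELEASE SAVEPOINT sp_{i}")
    conn.execute("COMMIT")
    return created


def insert_without_savepoints(conn, rows):
    """Insert the same rows in one transaction with no savepoints.

    Duplicate ids are skipped by the statement itself (INSERT OR IGNORE).
    """
    conn.execute("BEGIN")
    conn.executemany(
        "INSERT OR IGNORE INTO items (id, qty) VALUES (?, ?)", rows
    )
    conn.execute("COMMIT")
    return 0


def count_rows(conn):
    """Return how many rows the 'items' table currently holds."""
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]


if __name__ == "__main__":
    sample = [(1, 5), (2, 3), (2, 9), (3, 1)]  # the third row is a duplicate id
    db = make_db()
    print("savepoints created:", insert_with_savepoints(db, sample))
    print("rows stored:", count_rows(db))
```

Both functions leave the same rows in the table; the only difference is how many savepoints the database is asked to track along the way. How costly that bookkeeping is depends on the database engine, which is the kind of difference both Vogels and CNBC refer to.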

"This created a temporary situation where the database was very slow and the application experienced intermittent timeouts," Vogels wrote.

The internal Amazon documents said the "degradation resulted in lags and complete outages".

Said Vogels: "The problem was quickly diagnosed and completely resolved by simply removing the unnecessary savepoints that had been inadvertently left in the retail application. No changes were required in Aurora."

CNBC said Amazon had failed to come up with a contingency plan in the event of any error in its replacement database. Vogels made no mention of this in his statement.


Sam Varghese


Sam Varghese has been writing for iTWire since 2006, a year after the site came into existence. For nearly a decade thereafter, he wrote mostly about free and open source software, based on his own use of this genre of software. Since May 2016, he has been writing across many areas of technology. He has been a journalist for nearly 40 years in India (Indian Express and Deccan Herald), the UAE (Khaleej Times) and Australia (Daily Commercial News (now defunct) and The Age). His personal blog is titled Irregular Expression.


