Wednesday, 24 February 2016 16:26

Nimble puts data science to work for customers


A key part of Nimble Storage's strategy is InfoSight, its cloud-based analytics system covering storage and other aspects of the technology stack.

Today's infrastructures are too complex for anyone to really understand what's going on, and doing your own analytics is also more complex than most organisations care to deal with. So Nimble's InfoSight has the storage arrays collect a wide variety of system health data and send it to the company for analysis, with the results and recommendations made available to customers.

The array firmware was designed to reliably collect a wide range of performance-related data, and Nimble uses other vendors' APIs (notably those provided by VMware) to collect data about other aspects of the stack. Vice president of analytics and support Rod Bagg said Nimble is planning to add support for other hypervisors including Hyper-V, as well as collecting data from Windows and other operating systems.

In all, some 100 billion data points are collected every four hours from deployed systems.

Bagg (who wrote much of the original code for InfoSight) said the data collected by the company shows that only 46% of the problems that lead to users having to wait for applications to deliver the requested information are storage related. The rest are down to configuration issues (28%), interoperability issues (11%), not following best practices (8%) and host, compute or virtual machine issues (7%).

For analytics to work effectively, the data has to be accurate. "It doesn't take much to mess up an algorithm," warned data scientist Mark Cooke. Also, it can be difficult to predict which pieces of data will be relevant ahead of time, so you need to be confident that all the data being collected is accurate.

That's one of the reasons why Nimble arrays were designed from the outset to collect detailed data about their operation.

Developing and applying a variety of mathematical models to the data collected from its customers' systems means Nimble is able to deliver very high availability - currently 99.9997% - by recommending corrective action before potential issues become real.
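As a rough illustration of what that availability figure means in practice (simple arithmetic, not Nimble's own methodology), the annual downtime implied by an availability percentage can be computed directly:

```python
# Illustrative arithmetic only: downtime implied by an availability figure.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def annual_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year at a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

print(round(annual_downtime_minutes(99.9997), 1))  # about 1.6 minutes per year
print(round(annual_downtime_minutes(99.9), 1))     # about 525.6 minutes per year
```

At 99.9997%, total downtime works out to well under two minutes a year, compared with nearly nine hours at the more common "three nines".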

A simple example is that once it has been determined that a NimbleOS update conflicts with a certain version of a hypervisor, that combination is blacklisted and the update won't be installed on arrays being used with that hypervisor version. Automated updating resumes once a subsequent release overcomes that conflict.

Data scientist Shannon Loomis likened some of these "weird corner case conditions" to recessive genes - it's only where two or more factors coincide that problems occur.

Another example outlined by Cooke was first seen when a customer's array was suffering from intermittently slow write performance. An investigation revealed it was caused by two drives simultaneously reaching a marginal condition where they both reported as being OK but weren't actually operating as normal. The arrays now detect this condition and fail a drive when it gets into that state so it gets replaced before performance really suffers.
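The kind of rule that catches such a case might look like the sketch below: a drive can self-report as healthy while its write latency has drifted far outside the norm. The field names and threshold are invented for illustration, not taken from Nimble's firmware:

```python
# Hypothetical sketch: flag drives that report healthy but show marginal
# write latency. The 50ms threshold and record layout are invented.
def marginal_drives(drives, latency_limit_ms=50.0):
    """Return IDs of drives that self-report OK but have marginal latency."""
    return [d["id"] for d in drives
            if d["smart_ok"] and d["write_latency_ms"] > latency_limit_ms]

drives = [
    {"id": "d0", "smart_ok": True,  "write_latency_ms": 4.0},
    {"id": "d1", "smart_ok": True,  "write_latency_ms": 180.0},  # marginal
    {"id": "d2", "smart_ok": False, "write_latency_ms": 200.0},  # already failed
]
print(marginal_drives(drives))  # ['d1']
```

Proactively failing a drive in this state means it gets replaced under warranty before the array's performance visibly degrades.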

The enormous data set allows rapid root cause analysis across the technology stack. Bagg described an example where a customer had been talking to another supplier about a performance problem for six weeks without resolution. Even though the customer did not think it was a storage issue, it turned to Nimble for help, and the problem was very quickly identified as a faulty network interface card.

Nimble's hardware design means that in most cases it is possible to upgrade for increased performance or capacity separately. Within the scalability limits of a particular array model, additional storage trays can be installed while retaining the current controller, or the controller can be upgraded without having to buy more storage. Either way, the upgrades can be performed non-disruptively.

The analytics are able to predict future hardware requirements early enough to suit a customer's procurement processes, and they are smart enough to reveal situations where a second limitation will soon come into play. For example, expanding the cache size may prove to be a necessary but only temporary fix if that soon results in a lack of CPU capacity.

It's also worth noting that such forecasts are not presented as a spot value along the lines of "you will run out of storage space on this date," but as a prediction interval. The uncertainty is the important part of the forecast, not the predicted value, said Cooke.

That's because if you're probably going to need a controller upgrade sometime between 3 March and 27 May, it usually makes sense to maintain performance by scheduling the upgrade to occur before that period even if the most likely day for reaching controller saturation is sometime in April.
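The idea of forecasting a crossing date with an interval rather than a point can be illustrated with a least-squares trend and a crude error band. The data, thresholds and interval construction here are invented for illustration and are far simpler than InfoSight's actual models:

```python
# Illustrative only: fit a linear trend to daily usage and estimate when it
# crosses capacity, with a crude interval from the residual spread.
import statistics

def forecast_capacity_day(usage, capacity):
    """Fit y = a + b*x by least squares; return (low, point, high) estimates
    of the day on which usage crosses capacity."""
    n = len(usage)
    xs = list(range(n))
    xbar = statistics.fmean(xs)
    ybar = statistics.fmean(usage)
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, usage)) / \
        sum((x - xbar) ** 2 for x in xs)
    a = ybar - b * xbar
    # Residual spread widens the band around the crossing point.
    resid_sd = statistics.stdev(y - (a + b * x) for x, y in zip(xs, usage))
    point = (capacity - a) / b
    spread = 2 * resid_sd / b
    return point - spread, point, point + spread

low, point, high = forecast_capacity_day(
    [50, 52, 51, 54, 56, 55, 58, 60], capacity=100)
# Scheduling the upgrade before `low` maintains performance even though
# `point` is the single most likely crossing day.
```

The point of the sketch is the last two lines: acting on the early edge of the interval, not the most likely date, is what keeps performance from degrading.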

The models also simplify hardware sizing decisions when customers are about to buy their first Nimble array. The intended workloads are fed in, and the output is a 'shopping list' of Nimble arrays along with an indication of the uncertainty around the predicted levels of resource consumption.

Nimble today launched its new all-flash AF series of storage arrays.

Disclosure: the writer travelled to San Francisco as a guest of Nimble.






Stephen Withers


Stephen Withers is one of Australia's most experienced IT journalists, having begun his career in the days of 8-bit 'microcomputers'. He covers the gamut from gadgets to enterprise systems. In previous lives he has been an academic, a systems programmer, an IT support manager, and an online services manager. Stephen holds an honours degree in Management Sciences and a PhD in Industrial and Business Studies.


