Today's infrastructures are too complex for anyone to really understand what's going on, and doing your own analytics is also more complex than most organisations care to deal with. So Nimble's InfoSight has the storage arrays collect a wide variety of system health data and send it to the company for analysis, with the results and recommendations made available to customers.
The array firmware was designed to reliably collect a wide range of performance-related data, and Nimble uses other vendors' APIs (notably those provided by VMware) to collect data about other aspects of the stack. Vice president of analytics and support Rod Bagg said Nimble is planning to add support for other hypervisors including Hyper-V, as well as collecting data from Windows and other operating systems.
In all, some 100 billion data points are collected every four hours from deployed systems.
For analytics to work effectively, the data has to be accurate. "It doesn't take much to mess up an algorithm," warned data scientist Mark Cooke. Also, it can be difficult to predict which pieces of data will be relevant ahead of time, so you need to be confident that all the data being collected is accurate.
That's one of the reasons why Nimble arrays were designed from the outset to collect detailed data about their operation.
Developing and applying a variety of mathematical models to the data collected from its customers' systems means Nimble is able to deliver very high availability - currently 99.9997% - by recommending corrective action before potential issues become real.
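To put that availability figure in perspective, the arithmetic below converts an availability percentage into expected downtime per year. It is an illustrative calculation only: the function name and the 365.25-day year are assumptions, not anything published by Nimble.

```python
def downtime_seconds_per_year(availability_percent: float,
                              days_per_year: float = 365.25) -> float:
    """Expected unavailable time per year, in seconds, for a given availability."""
    # Fraction of the year the system is expected to be unavailable.
    unavailable_fraction = 1.0 - availability_percent / 100.0
    seconds_per_year = days_per_year * 24 * 60 * 60
    return unavailable_fraction * seconds_per_year


# 99.9997% availability works out to roughly a minute and a half a year.
print(round(downtime_seconds_per_year(99.9997)))  # 95
```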
A simple example is that once it has been determined that a NimbleOS update conflicts with a certain version of a hypervisor, that combination is blacklisted and the update won't be installed on arrays being used with that hypervisor version. Automated updating resumes once a subsequent release overcomes that conflict.
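The gating logic described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Nimble's implementation: the release and hypervisor version strings, the `BLOCKED_COMBINATIONS` set and the `update_allowed` function are all hypothetical.

```python
from typing import Set, Tuple

# (NimbleOS release, hypervisor version) pairs known to conflict.
# These entries are made up for illustration.
BLOCKED_COMBINATIONS: Set[Tuple[str, str]] = {
    ("os-2.0", "hv-5.5"),
}


def update_allowed(os_release: str,
                   hypervisor_version: str,
                   blocked: Set[Tuple[str, str]] = BLOCKED_COMBINATIONS) -> bool:
    """Return True unless this exact release/hypervisor pairing is blocked."""
    return (os_release, hypervisor_version) not in blocked


# The conflicting release is skipped for arrays on hv-5.5, while the
# later release, which is not on the blocked list, is offered again.
releases = ["os-1.9", "os-2.0", "os-2.1"]
offered = [r for r in releases if update_allowed(r, "hv-5.5")]
print(offered)  # ['os-1.9', 'os-2.1']
```

Keying the list on the pair rather than on the release alone means arrays running other hypervisor versions still receive the update.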
Data scientist Shannon Loomis likened some of these "weird corner case conditions" to recessive genes - it's only where two or more factors coincide that problems occur.
Another example outlined by Cooke was first seen when a customer's array was suffering from intermittently slow write performance. An investigation revealed it was caused by two drives simultaneously reaching a marginal condition where they both reported as being OK but weren't actually operating as normal. The arrays now detect this condition and fail a drive when it gets into that state so it gets replaced before performance really suffers.
The enormous data set allows rapid root cause analysis across the technology stack. Bagg described an example where a customer had been talking to another supplier about a performance problem for six weeks without resolution. Even though the customer did not think it was a storage issue, it turned to Nimble for help, and the problem was very quickly identified as a faulty network interface card.
Nimble's hardware design means that in most cases it is possible to upgrade for increased performance or capacity separately. Within the scalability limits of a particular array model, additional storage trays can be installed while retaining the current controller, or the controller can be upgraded without having to buy more storage. Either way, the upgrades can be performed non-disruptively.
The analytics are able to predict future hardware requirements early enough to suit a customer's procurement processes, and they are smart enough to reveal situations where a second limitation will soon come into play. For example, expanding the cache size may prove to be a necessary but only temporary fix if that soon results in a lack of CPU capacity.
It's also worth noting that such forecasts are given not as a spot value along the lines of "you will run out of storage space on this date," but as a prediction interval. The uncertainty is the important part of the forecast, not the predicted value, said Cooke.
That's because if you're probably going to need a controller upgrade sometime between 3 March and 27 May, it usually makes sense to maintain performance by scheduling the upgrade to occur before that period even if the most likely day for reaching controller saturation is sometime in April.
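A small sketch of that scheduling rule follows. It assumes the forecast arrives as an interval with a most likely date; the `SaturationForecast` class, the `schedule_upgrade_by` function, the safety margin and the year used in the example are all hypothetical, chosen only to illustrate why the lower bound matters more than the point estimate.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SaturationForecast:
    earliest: date      # lower bound of the prediction interval
    most_likely: date   # point estimate
    latest: date        # upper bound of the prediction interval


def schedule_upgrade_by(forecast: SaturationForecast,
                        margin_days: int = 7) -> date:
    """Pick a deadline ahead of the interval's lower bound.

    The point estimate is deliberately ignored: saturation could plausibly
    arrive at any time from `earliest` onwards.
    """
    return forecast.earliest - timedelta(days=margin_days)


# The year is arbitrary; only the 3 March to 27 May span comes from the text.
forecast = SaturationForecast(
    earliest=date(2016, 3, 3),
    most_likely=date(2016, 4, 15),
    latest=date(2016, 5, 27),
)
deadline = schedule_upgrade_by(forecast)
print(deadline.isoformat())
```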
The models also simplify hardware sizing decisions when customers are about to buy their first Nimble array. The intended workloads are fed in, and the output is a 'shopping list' of Nimble arrays along with an indication of the uncertainty around the predicted levels of resource consumption.
Nimble today launched its new all-flash AF series of storage arrays.
Disclosure: the writer travelled to San Francisco as a guest of Nimble.