Another feature of cloud computing is that providers typically charge separately for storage and processing. So researchers at Swinburne University of Technology have been exploring the management of raw and intermediate data.
Is it better to keep the raw data and recreate intermediate datasets as required, or should you keep both? "The trade-off is going to be between storage cost and computation cost," said John Grundy, who works in the University's Center for Computing and Engineering Software Systems (SUCCESS). "Finding this balance is complex, and there are currently no decision-making tools to advise on whether to store or delete intermediate datasets, and if to store, which ones."
Funded by the Australian Research Council, Prof Grundy, Yun Yang and Jinjun Chen (who is now with the University of Technology, Sydney) have developed a mathematical model that takes into account the size of the original dataset, the amount of intermediate data stored, and the rates charged by service providers.
What adds to the complexity is that intermediate datasets are not necessarily generated directly from the original data, but may instead be derived from other intermediate results. So the team also developed an intermediate data-dependency graph (IDG) to help users decide whether they are better off spending money on storage or computation for intermediate datasets.
Prof Yang pointed out that datasets can be huge. Astronomers may log as much as 1 GB per second. The researchers produced six intermediate datasets from a particular astronomical dataset, and determined the costs of regenerating or storing them based on Amazon's published prices.
For one hour of observation data from the telescope, with intermediate data kept for 30 days, the minimum-cost strategy came to $200. Storing no intermediate data and regenerating it when needed cost $1000, and storing all intermediate data cost $390.
"We could delete the intermediate datasets that were large in size but with lower generation expenses, and save the ones that were costly to generate, even though small in size," Prof Yang said.
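The store-or-regenerate trade-off can be illustrated with a small Python sketch. It is not the researchers' model: the dataset names, prices, usage rates and the simple linear dependency chain are all made-up assumptions, and the exhaustive search is only practical for a handful of datasets. The one idea it borrows from the article is that a deleted dataset may have to be rebuilt from other intermediate results, so its regeneration cost depends on what else has been kept.

```python
from itertools import product

# Hypothetical figures for illustration only; none come from the study.
# The datasets form a chain: raw data -> d1 -> d2 -> d3.
#   store: cost per day to keep the dataset
#   gen:   cost to compute it from its direct predecessor
#   uses:  expected number of times it is needed per day
CHAIN = [
    {"name": "d1", "store": 4.0, "gen": 1.0, "uses": 0.5},  # large, cheap to make
    {"name": "d2", "store": 0.2, "gen": 6.0, "uses": 0.5},  # small, costly to make
    {"name": "d3", "store": 3.0, "gen": 2.0, "uses": 1.0},
]


def daily_cost(chain, stored):
    """Total cost per day for a given store/delete choice.

    stored[i] is True if chain[i] is kept. The raw data is assumed to be
    always available, so the walk back stops there at the latest.
    """
    total = 0.0
    for i, dataset in enumerate(chain):
        if stored[i]:
            total += dataset["store"]
            continue
        # Deleted: rebuild from the nearest stored predecessor, paying the
        # generation cost of every deleted dataset passed on the way.
        regen = dataset["gen"]
        j = i - 1
        while j >= 0 and not stored[j]:
            regen += chain[j]["gen"]
            j -= 1
        total += regen * dataset["uses"]
    return total


def cheapest_plan(chain):
    """Try every store/delete combination and return the cheapest one."""
    options = product([False, True], repeat=len(chain))
    best = min(options, key=lambda stored: daily_cost(chain, stored))
    return best, daily_cost(chain, best)


plan, cost = cheapest_plan(CHAIN)
kept = [d["name"] for d, keep in zip(CHAIN, plan) if keep]
print(f"keep {kept}, cost per day {cost:.2f}")
```

With these invented numbers the cheapest plan keeps only d2, the small dataset that is expensive to generate, and deletes the two larger ones that are cheap to rebuild, which is the pattern Prof Yang describes.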
The researchers are working on models that will allow these decisions to be made on the fly.
The research is not only applicable to public cloud services such as Amazon. Such models could also be employed by users of internal IT services that are charged on a utility basis.