Scientific Data in the Cloud

by Florian Hauer – labfolder.

As the amount of scientific data grows dramatically, so does the need to process and store it. Particle colliders, telescopes, new sequencing methods and high-throughput analyses create amounts of data that are predicted to exceed the worldwide digital storage capacity in 2015.

Facing these challenges, scientists are looking for new ways of processing and storing large quantities of complex data. Cloud services are currently becoming increasingly popular among consumers and enterprises. But can the cloud also offer solutions for handling scientific data?

What is the cloud?

The scientific challenge of grinding through massive amounts of data is not new, and one very successful approach has been to combine a large number of computers into a grid or a cluster. How is the cloud different? A cluster is usually defined as a parallel and distributed computer system with high interconnection bandwidth, and a grid as a set of connected clusters that may be geographically far apart from each other 1.

Clouds, in turn, can be defined as dynamic networks of computers and virtual machines that are offered as a more or less unified computational environment, based on a service-level agreement (SLA) between the service provider and the user 1. Depending on the business model and the agreement, cloud capacities might be used as a platform or infrastructure for data storage and processing (Platform- or Infrastructure-as-a-Service, PaaS and IaaS), or as a hosted system that provides software functionality (Software-as-a-Service, SaaS).

Cloud capacities are offered by a number of companies, through services including Amazon Elastic Compute Cloud (EC2), Google Compute Engine, Microsoft Azure and Sun's Network.com (for an overview, see 1).

Scientific cloud applications

The idea behind using the cloud in the sciences is that researchers and groups can rapidly employ large amounts of computational power without having to buy and administer the necessary hardware. Scientists usually have fluctuating workloads, with periods of high computational demand followed by quieter ones. Cloud computing could allow scientists to pay only for the resources they actually use and to save high maintenance costs during the low seasons of data harvest. A recent study showed that, if there is no extended need for the long-term storage of large amounts of data, using commercial cloud capacities may be a cost-effective way to flexibly handle typical scientific data challenges 2.
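The pay-per-use argument can be made concrete with a little arithmetic. The following Python sketch compares paying only for consumed core-hours with owning enough hardware to cover the busiest month. All prices and workload figures are hypothetical placeholders chosen for illustration, not values from the study cited above, and the function names are made up for this example.

```python
"""Compare pay-per-use cloud cost with owned hardware for a bursty workload."""

# All figures below are hypothetical placeholders, not real prices.
CLOUD_PRICE_PER_CORE_HOUR = 0.10  # assumed on-demand price per core-hour
OWNED_COST_PER_CORE_YEAR = 300.0  # assumed purchase plus maintenance, per core

# Assumed core-hours needed per month: two busy months, ten quiet ones.
MONTHLY_CORE_HOURS = [500, 500, 20000, 500, 500, 500, 500, 30000, 500, 500, 500, 500]


def cloud_cost(core_hours_per_month, price_per_core_hour):
    """Pay only for the core-hours that are actually consumed."""
    return sum(core_hours_per_month) * price_per_core_hour


def owned_cost(core_hours_per_month, cost_per_core_year, hours_per_month=730):
    """Own enough cores to cover the busiest month and pay for them all year."""
    # Ceiling division: cores needed to finish the peak month's work in time.
    peak_cores = -(-max(core_hours_per_month) // hours_per_month)
    return peak_cores * cost_per_core_year


if __name__ == "__main__":
    print("cloud:", cloud_cost(MONTHLY_CORE_HOURS, CLOUD_PRICE_PER_CORE_HOUR))
    print("owned:", owned_cost(MONTHLY_CORE_HOURS, OWNED_COST_PER_CORE_YEAR))
```

With a flatter workload or different prices the comparison can easily tip the other way, and the sketch ignores storage and data transfer costs, which the cited study identifies as the main caveat.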

There are a number of examples of scientists who have used, and are still using, commercial clouds.

Apart from commercial cloud platforms, a number of dedicated scientific platforms are emerging. The intention of these platforms is to provide an infrastructure that is optimized for the quick deployment of scientific workflows. FutureGrid (3), Eucalyptus (4) and Helix Nebula (5) are only a few examples of dedicated scientific cloud frameworks.

The increasing importance of cloud computing is impressively reflected in the number of publications found on PubMed using the term ‘cloud computing’: from virtually no publications mentioning cloud computing until 2008, the number has risen dramatically within a few years, with a growth curve indicating that we are only seeing the beginning of a trend with a huge impact.

Figure: Number of annual publications about cloud computing.

Security in the cloud

In science, where the value of data can be extremely high, security concerns are well justified. Since data is transferred to and from the cloud via the internet, the communication might be intercepted. On the cloud server, unauthorized access to the data may be attempted by other tenants of the cloud, by the cloud provider or by third parties that gain access to the data storage systems. Furthermore, data loss and unintentional deletion by the service provider may occur.

Although the cloud differs from classical online storage systems because of its multi-tenant architecture, the security measures do not differ much. Encryption is an effective means of protecting against unauthorized read access at any level. It is possible to encrypt the user-cloud communication or the content stored on the cloud, with the encryption done either by the user before upload or by the service provider, using a key that is known only to the user. Unauthorized access by other tenants can be effectively prevented by deploying strong virtual machine managers and operating systems that ensure separation between processes. Regular backups, including remote backups, have proven an effective measure against data loss. Taken together, the cloud does not pose any technical security risks that are per se new or specific to the cloud. Moreover, standard solutions for these problems exist (for review, see 6).
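Encrypting data on the user's side before upload can be sketched in a few lines of Python. This example assumes the third-party `cryptography` package and its Fernet recipe (authenticated symmetric encryption); the choice of library and the helper names `encrypt_for_upload` and `decrypt_after_download` are assumptions made for illustration and are not prescribed by the review cited above. The upload itself is left out, since it depends on the provider.

```python
from cryptography.fernet import Fernet, InvalidToken


def encrypt_for_upload(plaintext: bytes, key: bytes) -> bytes:
    """Encrypt data locally; only the returned token is sent to the cloud."""
    return Fernet(key).encrypt(plaintext)


def decrypt_after_download(token: bytes, key: bytes) -> bytes:
    """Decrypt a downloaded token; raises InvalidToken for a wrong key or altered data."""
    return Fernet(key).decrypt(token)


if __name__ == "__main__":
    # The key is generated and kept on the user's side and never uploaded.
    key = Fernet.generate_key()
    raw = b"sample_id,measurement\nA1,0.42\n"  # made-up example data

    token = encrypt_for_upload(raw, key)
    assert decrypt_after_download(token, key) == raw

    try:
        decrypt_after_download(token, Fernet.generate_key())
    except InvalidToken:
        print("a different key cannot read the data")
```

In this arrangement the provider only ever stores ciphertext, so protecting the data comes down to protecting the key. Losing the key means losing access to the data, which makes key backup part of the design.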

New to the concept of the cloud, however, is that the service provider technically does have access to the data stored on the cloud by the user. Depending on the service-level agreement (SLA), the cloud service provider might be allowed to perform undesired activities with the user's data. Thus, one of the priorities of scientific cloud users is to ensure that, according to the SLA, they retain full ownership of the data, including the sole authority to copy, delete or manipulate it. Any access to the data, including access by the service provider, should be tracked carefully in order to ensure that the SLA is fulfilled correctly 7.

Third-party access, e.g. by the authorities of the country in which the cloud server is installed, is another issue to keep in mind when choosing a cloud service. In several countries, such as Russia, China or even the UK and the US, government access to cloud server data is relatively easy, and even decryption can be enforced 8. In other countries, such as most European countries, Canada, Australia and Japan, the authorities may access the cloud only with a warrant 8. Scientists in search of a suitable cloud should thus carefully check whether a cloud provider can tell them in which countries its servers are located, and whether the legal framework in these countries suits the security policy they wish to apply to their data.

Conclusions

Cloud computing offers many benefits for scientific applications and is predicted to grow in importance. The growing importance and prevalence of scientific cloud services, however, does not relieve scientists of the need to carefully check the provider's service-level agreement (SLA) as well as the server and data locations and the local jurisdiction. This is especially true if legally sensitive data (such as patients' personal data) is stored and processed in the cloud. Technically, the cloud is safe even for demanding scientific architectures, and a growing number of service providers will progressively adapt their security measures and SLAs to the needs of scientists.

Bibliography

1. Buyya, R., Yeo, C. S., Venugopal, S., Broberg, J. & Brandic, I. Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems 25, 599–616 (2009).

2. Berriman, G. B., Deelman, E., Juve, G., Rynge, M. & Vöckler, J.-S. The application of cloud computing to scientific workflows: a study of cost and performance. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371, 20120066 (2013).

3. FutureGrid

4. Eucalyptus

5. Helix Nebula

6. Ryan, M. D. Cloud computing security: The scientific challenge, and a survey of solutions. The Journal of Systems & Software (2013). doi:10.1016/j.jss.2012.12.025

7. Djemame, K. et al. Legal issues in clouds: towards a risk inventory. (2013).

8. Business Software Alliance – BSA: Global Cloud Computing Scorecard (2012) 
