data grids

Data Grids Darshan R. Kapadia Gregor von Laszewski http://grid.rit.edu 1

Post on 23-Feb-2016

Page 1: Data Grids

http://grid.rit.edu

Data Grids

Darshan R. Kapadia, Gregor von Laszewski

Page 2: Data Grids

Grids

• We’ve seen computational grids – collections of computing clusters and protocols/software in order to submit jobs, distribute work, schedule jobs, monitor status, etc.

• But how do we manage collections of data on a grid – not just the computations / programs themselves?

Page 3: Data Grids

Data GRID

1. Lothar A T Bauerdick (2003). Grid Tools and the LHC Data Challenges. LHC Symposium. May 3, 2003.

Page 4: Data Grids

Why data grids?

• The immense computational demands of many scientific applications are often coupled with massive amounts of data.

• These data sets must be shared by a virtual organization (VO), or by multiple VOs, for a variety of computations.

• Distributing jobs to diverse geographic computing resources also requires distributing data collections for processing and storing output.

Page 5: Data Grids

Data Grid Challenges

• Storage capacity for massive quantities of data
• Distribute data sets to dispersed geographic locations to complete jobs in a grid
• Maximize the computation-to-communication ratio
• Aggregation of results, data coherency
  – Who has “the” copy of the data set?
• Need to do all of this securely and robustly

Page 6: Data Grids

Functions of Data GRID

• Data Access
  – How do we access and manage data?
    • Storage Resource Brokers
    • UNIX File Systems, Distributed File Systems, HTTP servers, etc.
  – How do we transfer data?
• Metadata Access
  – Data about data!
• Replica Management
  – Create/delete copies of data
  – Replica “catalogs”
• Replica Selection
  – Locating the best data replica to use for an application
  – Determine subset of data required for a job
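The replica management and replica selection functions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Replica` and `ReplicaCatalog` classes, the `gsiftp://` URLs, and the cost model (fixed latency plus size divided by bandwidth) are hypothetical and do not come from any particular grid toolkit.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Replica:
    url: str               # physical location of one copy (hypothetical URL)
    bandwidth_mbps: float  # assumed measured throughput to the client
    latency_s: float       # assumed fixed connection setup cost


@dataclass
class ReplicaCatalog:
    """Maps a logical file name to the physical replicas that hold it."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name: str, replica: Replica) -> None:
        # "Create a copy": record a new physical location for the file.
        self.entries.setdefault(logical_name, []).append(replica)

    def unregister(self, logical_name: str, url: str) -> None:
        # "Delete a copy": forget one physical location.
        kept = [r for r in self.entries.get(logical_name, []) if r.url != url]
        if kept:
            self.entries[logical_name] = kept
        else:
            self.entries.pop(logical_name, None)

    def lookup(self, logical_name: str) -> list:
        return list(self.entries.get(logical_name, []))


def estimated_transfer_time(replica: Replica, size_mb: float) -> float:
    # Megabytes to megabits, divided by throughput, plus the fixed latency.
    return replica.latency_s + (size_mb * 8.0) / replica.bandwidth_mbps


def select_replica(catalog: ReplicaCatalog, logical_name: str, size_mb: float) -> Replica:
    """Pick the replica with the lowest estimated transfer time."""
    candidates = catalog.lookup(logical_name)
    if not candidates:
        raise KeyError(f"no replica registered for {logical_name}")
    return min(candidates, key=lambda r: estimated_transfer_time(r, size_mb))


catalog = ReplicaCatalog()
catalog.register("climate/run42.nc", Replica("gsiftp://site-a.example.org/run42.nc", 100.0, 0.5))
catalog.register("climate/run42.nc", Replica("gsiftp://site-b.example.org/run42.nc", 1000.0, 2.0))
best = select_replica(catalog, "climate/run42.nc", size_mb=1000.0)
print(best.url)
```

With this cost model the "best" replica depends on how much data the job needs: a small subset favors the low-latency site, while a full data set favors the high-bandwidth one.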

Page 7: Data Grids

Earth System GRID

• The Earth System Grid (ESG) integrates supercomputers with large-scale data and analysis servers located at numerous national labs and research centers to create a powerful environment for next generation climate research.

• Participating Organizations
  – Argonne National Laboratory
  – Lawrence Berkeley National Laboratory
  – Lawrence Livermore National Laboratory
  – Los Alamos National Laboratory
  – National Center for Atmospheric Research
  – Oak Ridge National Laboratory
  – University of Southern California/Information Sciences Institute

http://www.earthsystemgrid.org/

Page 8: Data Grids

High Energy Physics Application

Allcock, B., Bester, J., Bresnahan, J., Chervenak, A. L., Foster, I., Kesselman, C., Meder, S., Nefedova, V., Quesnel, D., & Tuecke, S. (2002). Data Management and Transfer in High Performance Computational Grid Environments. Parallel Computing Journal, 28(5), 749-771.

Page 9: Data Grids

Data GRID Architecture

1. Chervenak, A., Deelman, E., Kesselman, C., Allcock, B., Foster, I., & Nefedova, V., et al. (2003). High-performance remote access to climate simulation data: a challenge problem for data grid technologies. Parallel Comput., 29(10), 1335-1356.

Page 10: Data Grids

Data Grid Design

• Mechanism Neutrality
• Policy Neutrality
• Compatibility with Grid Infrastructure
• Uniformity of Information Infrastructure

Page 11: Data Grids

Core Data GRID services

• Storage System and Data Access
  – Data Abstraction: Storage System
  – Data Access
• Metadata Services

Page 12: Data Grids

High Level Data Grid Components

• Replica Management
• Replica Selection and Data Filtering

Page 13: Data Grids

GASS

• Globus Access to Secondary Storage [4]
  – NOT a distributed file system
  – Unix (C-style) fopen/fclose
  – Default behavior is to transfer the entire file from the remote site into a local cache when the file is opened
  – GASS also provides finer-tuned control:
    • Pre-stage/Post-stage file accesses
    • Cache management
  – No cache coherency (changes made to the remote file do not get propagated to caches)
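The default read behavior described above (copy the whole file into a local cache on open, with no coherency afterwards) can be illustrated with a toy Python model. This is not the GASS API: `WholeFileCache`, the dictionary standing in for remote servers, and the example URL are all assumptions made for the sketch, and only the read path is modeled.

```python
import os
import tempfile


class WholeFileCache:
    """Toy model of open-time whole-file caching; not the GASS API."""

    def __init__(self, remote_store: dict, cache_dir: str):
        self.remote_store = remote_store  # url -> bytes, stands in for remote servers
        self.cache_dir = cache_dir
        self.cached = {}                  # url -> path of the local copy
        self._next_id = 0                 # gives every cache entry a unique file name

    def open(self, url: str, mode: str = "rb"):
        if url not in self.cached:
            # First open: transfer the entire remote file into the cache.
            local_path = os.path.join(self.cache_dir, f"entry{self._next_id}")
            self._next_id += 1
            with open(local_path, "wb") as f:
                f.write(self.remote_store[url])
            self.cached[url] = local_path
        # Later opens are served from the local copy without contacting the remote.
        return open(self.cached[url], mode)

    def evict(self, url: str) -> None:
        # Explicit cache management: drop the local copy so the next open refetches.
        path = self.cached.pop(url, None)
        if path is not None:
            os.remove(path)


def demo():
    url = "https://data.example.org/run42.dat"  # hypothetical file URL
    remote = {url: b"version 1"}
    with tempfile.TemporaryDirectory() as tmp:
        cache = WholeFileCache(remote, tmp)
        with cache.open(url) as f:
            first = f.read()
        remote[url] = b"version 2"   # the remote file changes after it was cached
        with cache.open(url) as f:
            stale = f.read()         # the cached copy does not see the change
        cache.evict(url)
        with cache.open(url) as f:
            fresh = f.read()         # refetched after explicit eviction
    return first, stale, fresh


print(demo())
```

The second read returns the old contents because nothing propagates remote changes to the cache; only an explicit eviction forces a new transfer.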

Page 14: Data Grids

GASS (contd.)

Commands
• globus_gass_fopen
• globus_gass_fclose
• File names are URLs

Page 15: Data Grids

GridFTP

• GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth, wide-area networks.

• Based on FTP (RFC-959)
• Extended for higher performance, flexibility, and robustness
  – Parallel data sources, parallel transfers
  – Partial file transfers
  – Transfer restart capabilities
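The three extensions listed above fit together naturally: a file is split into byte ranges (partial transfers), several ranges move at once (parallel transfers), and the set of completed ranges is kept so an interrupted transfer can resume (restart). The Python sketch below illustrates that idea only. It does not speak the GridFTP protocol; the in-memory `source`, the function names, and the `fail_on` parameter that simulates a dropped connection are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_size: int, block_size: int) -> list:
    """Return (offset, length) pairs covering [0, total_size)."""
    return [(off, min(block_size, total_size - off))
            for off in range(0, total_size, block_size)]


def fetch_range(source: bytes, offset: int, length: int) -> bytes:
    # Stand-in for a partial file transfer of one byte range.
    return source[offset:offset + length]


def transfer(source: bytes, dest: bytearray, done: set,
             block_size: int = 4, streams: int = 3, fail_on=None) -> None:
    """Copy source into dest, skipping ranges already recorded in `done`."""
    pending = [r for r in split_ranges(len(source), block_size) if r not in done]

    def work(rng):
        offset, length = rng
        if fail_on is not None and offset == fail_on:
            raise ConnectionError(f"simulated drop at offset {offset}")
        dest[offset:offset + length] = fetch_range(source, offset, length)
        return rng

    errors = []
    # Several ranges are in flight at once, like parallel data streams.
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(work, rng) for rng in pending]
        for fut in futures:
            try:
                done.add(fut.result())  # record progress for a later restart
            except ConnectionError as exc:
                errors.append(exc)
    if errors:
        raise errors[0]


data = b"abcdefghijklmnopqrstuvwxyz"
dest = bytearray(len(data))
done = set()
try:
    transfer(data, dest, done, fail_on=8)   # first attempt loses one range
except ConnectionError:
    pass
transfer(data, dest, done)                  # restart moves only the missing range
print(bytes(dest) == data)
```

Because progress is tracked per range, the restart re-sends only the block that failed rather than the whole file.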

Page 16: Data Grids

GridFTP

• Can use GSI (Grid Security Infrastructure) for security.
• TeraGrid has three clients which utilize GridFTP:
  – UberFTP (recommended)
  – globus-url-copy (preferred for scripting)
  – tgcp (deprecated)

Page 17: Data Grids

Amazon Simple Storage Service (Amazon S3™)

• Amazon S3 is storage for the Internet.

• Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web.

Page 18: Data Grids

AWS S3 Functionalities

• Write, read, and delete objects containing from 1 byte to 5 gigabytes of data each (the per-object limit when these slides were written; Amazon has since raised it). The number of objects you can store is unlimited.

• Each object is stored in a bucket and retrieved via a unique, developer-assigned key.

• Authentication mechanisms are provided to ensure that data is kept secure from unauthorized access. Objects can be made private or public, and rights can be granted to specific users.

• Uses standards-based REST and SOAP interfaces designed to work with any Internet-development toolkit.
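The bucket/key model and the access rules listed above can be mimicked with a small in-memory Python class. This is a toy model for illustration, not the Amazon S3 API: `ObjectStore`, its method names, and the user names are hypothetical, and the size check simply encodes the 5 GB figure quoted on this slide.

```python
class ObjectStore:
    """Toy in-memory model of buckets, keys, and read grants; not the S3 API."""

    MAX_OBJECT_BYTES = 5 * 1024**3  # the 5 GB per-object figure quoted on the slide

    def __init__(self):
        # bucket name -> {key: {"owner", "data", "public", "readers"}}
        self.buckets = {}

    def put(self, bucket: str, key: str, data: bytes, owner: str, public: bool = False) -> None:
        if not 1 <= len(data) <= self.MAX_OBJECT_BYTES:
            raise ValueError("object size out of range")
        # The developer-assigned key is unique within its bucket.
        self.buckets.setdefault(bucket, {})[key] = {
            "owner": owner, "data": data, "public": public, "readers": set()}

    def grant_read(self, bucket: str, key: str, owner: str, user: str) -> None:
        obj = self.buckets[bucket][key]
        if obj["owner"] != owner:
            raise PermissionError("only the owner can grant access")
        obj["readers"].add(user)

    def get(self, bucket: str, key: str, user=None) -> bytes:
        obj = self.buckets[bucket][key]
        if obj["public"] or user == obj["owner"] or user in obj["readers"]:
            return obj["data"]
        raise PermissionError(f"{user!r} may not read {bucket}/{key}")

    def delete(self, bucket: str, key: str, user: str) -> None:
        obj = self.buckets[bucket][key]
        if user != obj["owner"]:
            raise PermissionError("only the owner can delete")
        del self.buckets[bucket][key]


store = ObjectStore()
store.put("climate-data", "run42/output.nc", b"...", owner="alice")
store.grant_read("climate-data", "run42/output.nc", owner="alice", user="bob")
print(store.get("climate-data", "run42/output.nc", user="bob"))
```

The real service exposes the same operations over REST and SOAP requests with authenticated callers; the sketch only captures the data model.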

Page 19: Data Grids

Replica Management

Venugopal, S., Buyya, R., & Ramamohanarao, K. A Taxonomy of Data Grids for Distributed Data Sharing, Management, and Processing.

Page 20: Data Grids

Conclusion

• A data grid must maintain large amounts of data, so its architecture is unique.

• Data grids are very important for the future, as future applications will require large amounts of data.

Page 21: Data Grids

References

1. http://www.earthsystemgrid.org/

2. Chervenak, A., Deelman, E., Kesselman, C., Allcock, B., Foster, I., & Nefedova, V., et al. (2003). High-performance remote access to climate simulation data: a challenge problem for data grid technologies. Parallel Comput., 29(10), 1335-1356.

3. Allcock, W., Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., & Tuecke, S. (2001). The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications, 23, 187-200.

4. Bester, J., Foster, I., Kesselman, C., Tedesco, J., & Tuecke, S. (1999). GASS: A Data Movement and Access Service for Wide Area Computing Systems. Paper presented at the Proceedings of IOPADS'99.

5. Allcock, B., Bester, J., Bresnahan, J., Chervenak, A. L., Foster, I., Kesselman, C., Meder, S., Nefedova, V., Quesnel, D., & Tuecke, S. (2002). Data Management and Transfer in High Performance Computational Grid Environments. Parallel Computing Journal, 28(5), 749-771.