Technology and Infrastructure Support for Large Scale Information
Marcio Faerman
The Brazilian National Education and Research Network - [email protected]
Generating Large Data Collections
• Large data volumes can be generated much faster than they can be analyzed
  – Instrument Observations
    • Particle Accelerators (CERN LHC)
    • Telescopes, Satellites
    • Sensor Networks
    • Virtual Observatories
  – Large Model Simulations
    • High resolution, very complex
• Scientific Experiments
  – Medical imaging (fMRI): ~ 1 GByte per measurement (day)
  – Bio-informatics queries: 500 GByte per database
  – Satellite world imagery: ~ 5 TByte/year
  – Current particle physics: 1 PByte per year
  – LHC physics (2007): 10-30 PByte per year
  – LSST Astronomy (2012): 5 PByte per year
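To put the yearly volumes above in perspective, the following minimal sketch converts them into the average sustained rate an infrastructure would need just to keep up with production. It assumes decimal units (1 PByte = 10^15 bytes) and a 365-day year; the function name `sustained_gbps` is only illustrative.

```python
# Average sustained data rate implied by a yearly data volume.
# Assumptions: decimal units (1 PByte = 10**15 bytes), 365-day year.

SECONDS_PER_YEAR = 365 * 24 * 3600
BYTES_PER_PBYTE = 10**15


def sustained_gbps(pbytes_per_year: float) -> float:
    """Return the average rate in Gbit/s needed to keep up with the volume."""
    bits_per_year = pbytes_per_year * BYTES_PER_PBYTE * 8
    return bits_per_year / SECONDS_PER_YEAR / 1e9


# Yearly volumes quoted on the slide, in PByte per year.
volumes = {
    "Satellite world imagery": 0.005,
    "Current particle physics": 1,
    "LHC physics (low estimate)": 10,
    "LHC physics (high estimate)": 30,
    "LSST astronomy": 5,
}

for name, pbytes in volumes.items():
    print(f"{name}: {sustained_gbps(pbytes):.3f} Gbit/s on average")
```

Under these assumptions, 1 PByte per year corresponds to roughly 0.25 Gbit/s sustained around the clock, and 30 PByte per year to roughly 7.6 Gbit/s. Real traffic is bursty, so peak requirements would be higher.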
Challenges Managing Large Volume Data
• Scalability
  – What works for small datasets does not necessarily work for large collections
• Data Integrity
  – At terabyte scale, failures and data corruption are very likely to occur
  – Is data provenance reliable?
• Efficiency
  – Data should be accessed at a rate which keeps work feasible
  – More data means a need for more speed
• Distributed Access
  – Data can be at a remote (and possibly unknown) location
• Infrastructure Management
  – Heterogeneous
  – Distributed
  – Prone to failures
  – Very complex
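One common way to address the data integrity concern above is to record a checksum when a file is ingested and recompute it later. The sketch below is only an illustration using Python's standard `hashlib`; the helper names `file_sha256` and `is_intact` are hypothetical and not part of any tool named in these slides. The file is read in chunks so that very large files never have to fit in memory.

```python
import hashlib
import os
import tempfile


def file_sha256(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str, expected_sha256: str) -> bool:
    """Return True if the file still matches the checksum recorded earlier."""
    return file_sha256(path) == expected_sha256


# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "sample.dat")
    with open(sample, "wb") as handle:
        handle.write(b"raw instrument data" * 1000)

    recorded = file_sha256(sample)  # checksum stored at ingestion time
    print("before corruption:", is_intact(sample, recorded))

    with open(sample, "r+b") as handle:  # simulate a single corrupted byte
        handle.seek(10)
        handle.write(b"\x00")
    print("after corruption:", is_intact(sample, recorded))
```

A checksum detects corruption but says nothing about where the data came from; provenance still has to be tracked separately.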
Challenges – Getting to Know your Data
• Extract knowledge from raw data files
  – Data product derivation
    • Visualization
    • Relationships
    • Patterns
    • New derived quantities
  – Cross-institutional and cross-disciplinary collaborations
    • "What if" experiments
      – Your data with our model?
• Dataset Access
  – Multiple formats
    • Each sensor or simulation has its own storage format
  – Federated collections
  – Discovery by content
Technological Response
• Integration of compute, communication, storage and instrument resources into a powerful infrastructure
  – Information Grids
  – Very powerful infrastructure
  – Economy of scale
• Serves a broad range of customers
  – Biologists, physicists, government, industry
• Infrastructure is heterogeneous, distributed, very complex
• Middleware and Data Oriented tools act as facilitators to tackle data management complexities
Open Access and Preservation Functionalities
• Federated Digital Libraries
  – Integration of distributed repositories
  – Access control: you decide who can see it
  – Organize the data in collections
  – Describe your data: metadata
• Data Grids
  – Access to efficient parallel I/O systems
  – Hierarchical Systems
    • Disk caches, tapes
    • Often distributed
  – Analysis, Data Mining
  – Visualization
  – Workflow based systems
  – Transaction based data ingestion
    • Data provenance, data fingerprinting
  – "What if" virtual lab
• End User Oriented Portals
  – "I deal with the data in the way it makes sense to me"
Middleware and Tools
• Data Management
  – Storage Resource Broker (SRB)
  – Globus Data Management
  – L-Store
  – IBP
  – Storage Resource Manager (SRM)
• Data Representation Libraries
  – HDF5
  – NetCDF
• Portals
  – OGCE
  – JSR 168
Today’s Reality
• Exceptional achievements by early adopters
• Integration between domain scientists (data users and producers) is still a challenge
  – Need much more cross-disciplinary interaction
• Emphasis on scale and performance
• Failures are still a taboo
  – Frustration factor should be addressed in partnership with users
  – Focus on failure recovery and quality of service is getting more attention
e-Infrastructure Workshop, NUDI/USP, São Paulo, 07.05.2007
Grid Initiatives around the World
[Map figure. Initiatives labeled: HEPGrid, Ringrid, EELA, SPRACE, UCRAV, OurGrid, UNAM, SINAPAD, CL Grid]
Networking in Latin America
[Map figure. Networks labeled: RNP-BR, REUNA-CL, CUDI-MX, RAAP-PE, REACCIUN-VE]
Brazilian National Research And Education Network - RNP
• In November 2005 the RNP network infrastructure was entirely renovated. It consists of:
  – A multigigabit core connecting 10 capitals at 2.5 and 10 Gbps
  – Connections at 34 Mbps to 11 capitals
  – Connections of up to 16 Mbps to 6 capitals
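The difference between these link classes becomes concrete when estimating how long a large dataset takes to move. The sketch below computes an idealized transfer time for 1 TByte over each of the link speeds listed above. It assumes decimal units, a fully dedicated link and no protocol overhead, so real transfers would take longer; `transfer_hours` is an illustrative name.

```python
# Idealized transfer time for a dataset over each RNP link class.
# Assumptions: decimal units, dedicated link, no protocol overhead.

TBYTE = 10**12  # bytes


def transfer_hours(size_bytes: float, link_bits_per_second: float) -> float:
    """Return the ideal transfer time in hours for the given link speed."""
    return size_bytes * 8 / link_bits_per_second / 3600


# Link speeds quoted on the slide, in bits per second.
links_bps = {
    "10 Gbps core": 10e9,
    "2.5 Gbps core": 2.5e9,
    "34 Mbps connection": 34e6,
    "16 Mbps connection": 16e6,
}

for name, bps in links_bps.items():
    print(f"1 TByte over {name}: {transfer_hours(TBYTE, bps):.2f} hours")
```

Under these assumptions, 1 TByte moves in well under an hour on the 10 Gbps core but needs several days at 16 Mbps.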
Infrastructure for e-Science
Community Metropolitan Networks
• It is not enough to bring high speed connectivity to each city; it must also reach the university campus and the research lab.
• The metropolitan network is the solution
  – Infrastructure sharing to support:
    • Interconnection of the campuses of each partner institution
    • Access to the RNP national network backbone
  – This sharing substantially reduces deployment costs
  – Preferably, the infrastructure will be owned by the partners themselves (reducing operating costs)
• Pilot: The Metrobel project in the city of Belém do Pará in the Amazon region
Metrobel – Belém Metropolitan Network
Redecomep Project (2005-7)
• Following Metrobel, the Brazilian Ministry of Science and Technology is supporting the Community Networks for Education and Research (Redecomep) Project with R$ 39.7 M (~ US$ 19.0 M) through Finep (Dec 2004)
• Goals:
  – Extend the metropolitan optical network to 26 other cities with RNP points of presence
  – Promote integration in the metropolitan area
  – High speed access to the RNP point of presence
Next steps
• Integration between network, data repositories, compute, storage resources and applications
  – Identify who needs better connectivity
  – Developing Brazilian cyberinfrastructure
  – Generally uncoordinated funding for infrastructure resources
  – Funding agencies and partners need a broad vision of application requirements and cyberinfrastructure integration
• RNP is working with scientific communities and infrastructure providers on an e-Science/Infrastructure initiative in Brazil
JRU-Brazil: 22 members in EELA-2

#   STATE  INSTITUTION                    E-SCIENCE COMMUNITIES
1   SP     CCE / USP                      (e-INFRASTRUCTURE only)
2   RJ     CEFET-RJ                       e-GOVERNMENT, e-INDUSTRY
3   RJ     FCM / UERJ                     BIOMED
4   RJ     FIOCRUZ                        BIOMED, e-EDUCATION
5   SP     IAG / USP                      CLIMATE
6   RJ     IME                            BIOMED
7   SP     INCOR / USP                    BIOMED
8   SP     INPE                           CLIMATE
9   RJ     LNCC                           BIOMED
10  RJ     ON                             PHYSICS
11  BR     RNP (NREN)                     (e-INFRASTRUCTURE only)
12  SP     SPRACE / UNESP                 PHYSICS
13  PB     UFCG                           CLIMATE, EARTH-SCIENCE
14  RJ     UFF                            (e-INFRASTRUCTURE only)
15  MG     UFJF                           BIOMED
16  MS     UFMS                           BIOMED
17  RS     UFRGS                          CLIMATE
18  RJ     UFRJ (coordinator for EELA-2)  BIOMED, PHYSICS, e-EDUCATION, CLIMATE
19  RS     UFSM                           CLIMATE
20  DF     UnB                            BIOMED
21  RJ     UNILASALLE                     e-EDUCATION
22  SP     UNISANTOS                      BIOMED, e-LEARNING, e-GOVERNMENT
Developing Together
• Information infrastructure is being redefined in Brazil and Latin America
• Now is the time to have as much cross-disciplinary interaction as possible to define needs, partnerships and investments
• Please contact us
THANK YOU!