Developing a Reference Architecture for Scientific Data Systems
Dan Crichton
April 2009
Background
• Employed by Jet Propulsion Laboratory since 1995; prior software engineering positions at Hughes Aircraft Company and in private industry
• MS in Computer Science, USC 1996
• Program Manager for
– Planetary Data System Engineering in Solar System Exploration Directorate
– Data Systems and Technology in Earth and Technology Directorate
• Principal Investigator for
– Informatics Center, Early Detection Research Network, National Cancer Institute
– Facilitating Integration of NASA and Earth System Grid, NASA
– Object Oriented Data Technology
Science data systems
• Covers a wide variety of science disciplines
– Solar system exploration
– Astrophysics
– Earth science
– Biomedicine
– etc.
• Each has its own communities, standards and systems
• How do you define a reference architecture vs. a point solution?
[Figure: end-to-end science data system. Elements shown: Spacecraft and Scientific Instruments; Spacecraft / lander; Relay Satellite; Data Acquisition and Command; Mission Operations; Instrument / Sensor Operations; Science Data Processing; Science Data Archive; Data Analysis and Modeling; Science Team; External Science Community. Information exchanged: Primitive Information Object; Simple Information Object; Telemetry Information Package; Instrument Planning Information Object; Planning Information Object; Science Information Package; Science Products - Information Objects.]
• Common Meta Models for Describing Space Information Objects
• Common Data Dictionary end-to-end
Increasing data volumes
Increased emphasis on usability and analysis of the data across the end-to-end system
Mining/discovery
Increasing diversity of data sets and complexity for integrating across missions/experiments (e.g., common information model for describing the data)
Increasing distribution of coordinated processing and operations (e.g., federation)
Increased pressure to reduce cost of supporting new missions
Increasing desire for PIs to have integrated tool sets to work with data products within their own environments (e.g., to perform their own generation and distribution)
[Figure: Archive Volume Growth, Planetary Science Archive. Accumulated volume in TBytes (axis 0 to 90) by year, 1990 to 2008.]
Architecture: why do I care?
• Data system costs per mission, project, investigation, etc. are high
• Technology infusion is limited
• Need to capture and leverage domain knowledge and experience across projects
Architecture: what is it?
The fundamental organization of a system embodied in its components, their relationships to each other, and to the environment, and the principles guiding its design and evolution. (ANSI/IEEE Std. 1471-2000)
Architects: what are they?
Effective Architects have…
• Years of experience
• Holistic view of domain
– Look at both aesthetics and practical details
– Variable technical depth
• Lifecycle roles
– Strong involvement up-front
– May oversee development
– Chooses stable steps in development
Effective Architects are not…
• Lone inventors or scientists
– The architect is a good communicator and politician -- architectures must be sold and explained and their integrity maintained
– Architecting is not a science, but depends on science
• Purely technologists
– Architecture is a strategy
• “Top level only” designers
– Details are often critical
• Collaborators
– A coherent vision is critical; they drive it
• A viewpoint is a template for constructing a view
• A view is a description of the entire system from the perspective of a set of related concerns. A view is composed of one or more models.
• A model is an abstraction or representation of some aspect of a thing
• Examples: RM-ODP, FEAF, TOGAF, etc
The viewpoint is where you look from
The view is what you see
(Project Managers, Engineers, Scientists, Business Analysts, …)
Planetary science example
Defining the reference architecture
• In science data systems, construction of multiple architecture views of a system is critical
– Process
– Information/Data
– Technology
• We find the “views” are similar, but models can be domain specific
– This is the opportunity to develop a reusable reference architecture if the “patterns” can be extracted
Domain Specific Software Architectures*
• Domain model
– Leverage experts who have the “holistic” view and can drive the need for product lines
– An unambiguous view is critical (in fact, this has been a problem in science arenas)
• Reference requirements
– Drives the reference architecture
– However, it is critical to map domain models to reference requirements in order to understand the solution space
• Reference architecture
– Satisfies an abstracted set of functions from the reference requirements
– It’s engineered for the “ilities”: reusability, extensibility and configurability
– It demonstrates the separation of functional elements of the architecture
* Tracz, Will, Domain-Specific Software Architecture, ACM SIGSOFT, 1995
[Figure: PDS-2010 System Architecture, organized into Process Architecture, Data Architecture and Technology Architecture. Labels shown: Ingest (Receive, Validate, Accept); Archive (APG, PAG); Archive; Peer Review Archive Tools; Preservation Planning; Administration; Data Distribution; Query/Access; Data Node Integration; Information Model; Data Formats; Data Dictionary; Grammar; Archive Organization; Data Standards; Technology Standards; Catalog/Data Mgmt; Storage; Portal; Search; User Tools/Services; Deep Archive; Data Movement; Distributed Infrastructure.]
• Data/Information Architecture
• Components, middleware, and communication
• NOTE: Process is implicit here
[Figure: two peer systems, each with the same layer stack, sharing something in common at every layer.]
• Information Object: Information Exchange - Science, Mission, etc., Data Products, Observations, SLE Objects, ...
• Domain Model: Common or Mediated Domain Models - Planetary Data Systems, EOSDIS, ...
• Metamodel: Common or Mediated Metamodel - DEDSL, ISO 11179, UML
• Information Components: Common Functions - Registry, Repository, ...
• Middleware and Messaging: Common Messaging - SOAP, JMS, ...
• Comm Layer: Common Protocols - TCP/IP, ...
Layer groupings shown: Communications; Software/Application; Data Architecture/Content
Software product lines
• This is about strategy more than technology
• Goal is a software product line that
– Implements our reference architecture
– Allows for construction of core software components that can be reused across projects and science disciplines
– Can demonstrate sufficient cost and schedule benefits without sacrificing flexibility in meeting requirements and adapting to technology change
Object Oriented Data Technology
• Represents both a reference architecture AND a software product line for science data systems
• Exploits common patterns
• Delivers reusable software components as building blocks for construction of higher order data systems
• Applied to multiple science disciplines
• Funded originally back in 1998; runner up for NASA Software of the Year in 2003
• Heavily used by NASA and NIH projects
[Figure: Object Oriented Data Technology Framework. Labels shown: OODT/Science Web Tools; Archive Client; Navigation Service; Query Service; Product Service; Profile Service; Archive Service; Profile XML Data; Data System 1; Data System 2; Other Service 1; Other Service 2; Bridge to External Services.]
Architectural principles*
• Separate the technology and the information architecture
• Encapsulate the messaging layer to support different messaging implementations
• Encapsulate individual data systems to hide uniqueness
• Provide data system location independence
• Require that communication between distributed systems use metadata
• Define a model for describing systems and their resources
• Provide scalability in linking both number of nodes and size of data sets
• Allow systems using different data dictionaries and metadata implementations to be integrated
• Leverage existing software, where possible (e.g., open source, etc.)
* Crichton, D., Hughes, J. S., Hyon, J., Kelly, S. “Science Search and Retrieval using XML”, Proceedings of the 2nd National Conference on Scientific and Technical Data, National Academy of Science, Washington DC, 2000.
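Two of these principles, encapsulating the messaging layer and providing location independence, can be illustrated with a minimal Python sketch. Everything named here (`Transport`, `InProcessTransport`, `DataSystemClient`, the name-server dictionary) is hypothetical and is not the OODT API; it only shows how client code can depend on an abstract transport and a logical system name rather than on a protocol or host.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict


class Transport(ABC):
    """Hypothetical messaging interface; hides how a request reaches a remote system."""

    @abstractmethod
    def send(self, location: str, message: str) -> str:
        ...


class InProcessTransport(Transport):
    """Stand-in transport that calls handlers directly instead of using a network."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], str]] = {}

    def register(self, location: str, handler: Callable[[str], str]) -> None:
        self._handlers[location] = handler

    def send(self, location: str, message: str) -> str:
        return self._handlers[location](message)


class DataSystemClient:
    """Depends only on Transport and a logical name, not on a host or protocol."""

    def __init__(self, transport: Transport, name_server: Dict[str, str]) -> None:
        self._transport = transport
        self._name_server = name_server  # logical system name -> current location

    def query(self, system_name: str, metadata_query: str) -> str:
        location = self._name_server[system_name]
        return self._transport.send(location, metadata_query)


transport = InProcessTransport()
transport.register("node-a", lambda q: f"results for {q}")
client = DataSystemClient(transport, {"imaging": "node-a"})
answer = client.query("imaging", "TARGET_NAME = MARS")
print(answer)
```

Swapping in a different `Transport` subclass, or pointing the logical name at another location, would leave `DataSystemClient` unchanged, which is the point of both principles.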
Architectural focus
• Consistent distributed capabilities
– Resource discovery (data, metadata, services, etc.), “grid-ing” loosely coupled science system, workflow management
– On-demand, shared services (e.g., processing, translation, etc.)
– Deploy high throughput data movement mechanisms
• End-to-end capabilities across the science environment
– Reduce local software solutions that do not scale
– Increasing importance in developing an “enterprise” approach with common services
• Build value-added services and capabilities on top of the infrastructure
Exploiting common patterns
• How data is managed (registry/repository, information objects themselves)…
• How data is generated, captured, etc. (e.g., workflow and data processing)…
• How data is accessed (metadata, data)…
• How information is discovered…
• How data is distributed (e.g., transformed)…
• How data is visualized…
What does OODT do?
• Tie together loosely coupled distributed heterogeneous data systems into a virtual data grid
• Support critical functions
– Data Production and workflow
– Data Distribution
– Data Discovery (including query optimization across highly distributed systems)
– Data Access
• An architectural approach first, an implementation second
– Adapt to different distributed computing deployments
– Promotes a REST-style architectural pattern for search and retrieval
• Scalability in linking together large, distributed data sets
OODT data architecture focus
• On types of and relationships among a software system’s data
• Decomposition of data within a software system to its logical components and interactions
– Components: Data Elements, Data Dictionary, Data Models of individual data sources
– Interactions: Mappings between Data Dictionary and Data Models, Data Element structural comparison
• Some standards currently exist for data architecture
– ISO: ISO-11179 Standardization and Specification of Data Elements
– Dublin Core Metadata Initiative: Dublin Core Data Elements to describe any electronic resource
• Specifications for the Data Architecture
– Common XML schema for managing information about data resources
– Common XML schema for messaging between distributed services
– Methods for integrating existing domain models within architecture
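The "mappings between Data Dictionary and Data Models" interaction can be sketched as follows. This is an illustrative Python example, not taken from OODT or PDS: the dictionary entries, the source names `node_a` and `node_b`, and the local field names are all invented for the illustration.

```python
from typing import Dict

# Hypothetical common data dictionary: element name -> definition.
DATA_DICTIONARY = {
    "target_name": "Name of the observed body",
    "start_time": "Start of the observation (UTC)",
}

# Hypothetical per-source data models: local field name -> common element name.
SOURCE_MAPPINGS = {
    "node_a": {"TARGET": "target_name", "OBS_START": "start_time"},
    "node_b": {"body": "target_name", "t0": "start_time"},
}


def to_common(source: str, record: Dict[str, str]) -> Dict[str, str]:
    """Translate one source record into common data dictionary terms.

    Fields with no mapping into the dictionary are dropped.
    """
    mapping = SOURCE_MAPPINGS[source]
    common = {}
    for local_name, value in record.items():
        element = mapping.get(local_name)
        if element in DATA_DICTIONARY:
            common[element] = value
    return common


record_a = to_common("node_a", {"TARGET": "MARS", "OBS_START": "2002-02-19", "extra": "x"})
record_b = to_common("node_b", {"body": "MARS"})
print(record_a, record_b)
```

Once both sources are expressed in dictionary terms, a single query on `target_name` can be answered by either, even though their local data models differ.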
Resource Metadata Model (package: profile)
[Class diagram]
• Profile: holds one profileAttributes (ProfileAttributes), one resourceAttributes (ResourceAttributes) and one elements Map of ProfileElement entries (many). Keys are Strings, equal to elements’ names.
• ProfileAttributes: id: String; version: String; statusID: String; securityType: String; parent: String; children: List; regAuthority: String; revisionNotes: List; dataDictID: String
• ResourceAttributes (based on Dublin Core): identifier: String; title: String; formats: List; description: String; creators: List; subjects: List; publishers: List; contributors: List; dates: List; sources: List; languages: List; coverages: List; rights: List; contexts: List; aggregation: String; clazz: String; locations: List
• ProfileElement (based on ISO/IEC 11179): name: String; id: String; desc: String; type: String; unit: String; synonyms: List; obligation: boolean; maxOccurrence: int; comments: String
– EnumeratedProfileElement: values: List
– RangedProfileElement: min: double; max: double
– UnspecifiedProfileElement
Request/Response Model
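As a reading aid, the resource metadata model can be written down as Python dataclasses. This is a sketch that mirrors the attribute names in the diagram, not the OODT Java classes themselves; only a subset of attributes is included, the default values are assumptions, and the `add` helper is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProfileElement:
    name: str
    id: str = ""
    desc: str = ""
    type: str = ""
    unit: str = ""
    synonyms: List[str] = field(default_factory=list)
    obligation: bool = False
    max_occurrence: int = 1  # default is an assumption
    comments: str = ""


@dataclass
class EnumeratedProfileElement(ProfileElement):
    values: List[str] = field(default_factory=list)


@dataclass
class RangedProfileElement(ProfileElement):
    min: float = 0.0
    max: float = 0.0


@dataclass
class ProfileAttributes:
    id: str
    version: str = ""
    reg_authority: str = ""
    data_dict_id: str = ""


@dataclass
class ResourceAttributes:
    identifier: str
    title: str = ""
    formats: List[str] = field(default_factory=list)
    locations: List[str] = field(default_factory=list)


@dataclass
class Profile:
    profile_attributes: ProfileAttributes
    resource_attributes: ResourceAttributes
    # Keys are element names, as noted in the diagram.
    elements: Dict[str, ProfileElement] = field(default_factory=dict)

    def add(self, element: ProfileElement) -> None:
        """Hypothetical helper that keeps the map keyed by element name."""
        self.elements[element.name] = element


profile = Profile(ProfileAttributes(id="profile-1"), ResourceAttributes(identifier="res-1"))
profile.add(RangedProfileElement(name="LATITUDE", unit="deg", min=-90.0, max=90.0))
profile.add(EnumeratedProfileElement(name="TARGET_NAME", values=["MARS", "MOON"]))
print(sorted(profile.elements))
```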
[Class diagram, package nasa.pds.xmlquery]
• XMLQuery: resultModeId: String; propogationType: String; propogationLevels: String; maxResults: int; kwqString: String; numResults: int; mimeAccept: List
– Associations (one each): queryHeader; selectSet; fromSet; whereSet; result
• QueryHeader: id: String; title: String; description: String; type: String; statusID: String; securityType: String; revisionNote: String; dataDictID: String
• QueryResult: list: List
• QueryElement: role: String; value: String
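A request in this model is a small structured object that is serialized to XML for transport between services. The following Python sketch shows the idea using only the standard library. The XML layout, the role names (`name`, `operator`, `literal`) and the default for `max_results` are assumptions made for the illustration and are not the actual nasa.pds.xmlquery schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryElement:
    role: str
    value: str


@dataclass
class Query:
    keyword_query: str                      # kwqString in the diagram
    max_results: int = 100                  # maxResults; default is an assumption
    mime_accept: List[str] = field(default_factory=list)
    select_set: List[QueryElement] = field(default_factory=list)
    where_set: List[QueryElement] = field(default_factory=list)


def to_xml(query: Query) -> str:
    """Serialize a query to an illustrative XML document (not the real schema)."""
    root = ET.Element("query")
    ET.SubElement(root, "kwqString").text = query.keyword_query
    ET.SubElement(root, "maxResults").text = str(query.max_results)
    for mime in query.mime_accept:
        ET.SubElement(root, "mimeAccept").text = mime
    for tag, elements in (("selectSet", query.select_set), ("whereSet", query.where_set)):
        parent = ET.SubElement(root, tag)
        for element in elements:
            ET.SubElement(parent, "element", role=element.role).text = element.value
    return ET.tostring(root, encoding="unicode")


request = Query(
    keyword_query="TARGET_NAME = MARS",
    max_results=25,
    mime_accept=["text/xml"],
    where_set=[
        QueryElement("name", "TARGET_NAME"),
        QueryElement("operator", "="),
        QueryElement("literal", "MARS"),
    ],
)
document = to_xml(request)
print(document)
```

Because the request is plain XML, the receiving service does not need to share code with the sender, only the schema.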
OODT software components
• Profile Service – A server-based registry that is able to either serve local XML profiles or plug into an existing catalog. This component provides resource discovery.
• Product Service – A server component that plugs into existing repositories and serves products. This includes translation services, etc.
• Catalog and Archive Service – Transaction-based server that catalogs and archives products, providing profile and product servers for discovery and distribution
• Query Service – Provides query management across distributed services to enable discovery.
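How a query service might fan a request out across several profile servers can be sketched in a few lines of Python. This is a simplified illustration, not OODT code: profile servers are modelled as plain functions, and the node names and catalog contents are made up.

```python
from typing import Callable, Dict, List

# A profile server is modelled as a function: keyword -> resource identifiers.
ProfileServer = Callable[[str], List[str]]


def make_profile_server(catalog: Dict[str, List[str]]) -> ProfileServer:
    """Build a stand-in profile server backed by an in-memory catalog."""
    def search(keyword: str) -> List[str]:
        return catalog.get(keyword, [])
    return search


class QueryService:
    """Sends one query to every registered profile server and merges the answers."""

    def __init__(self, servers: Dict[str, ProfileServer]) -> None:
        self._servers = servers

    def query(self, keyword: str, max_results: int = 10) -> List[str]:
        results: List[str] = []
        for node, search in self._servers.items():
            for resource in search(keyword):
                results.append(f"{node}:{resource}")
        return results[:max_results]


service = QueryService({
    "imaging": make_profile_server({"MARS": ["img-001", "img-002"]}),
    "geosciences": make_profile_server({"MARS": ["geo-117"], "MOON": ["geo-042"]}),
})
hits = service.query("MARS")
print(hits)
```

The caller sees one merged result list and never addresses an individual node, which matches the discovery role described for the Query Service above.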
[Figure: OODT Reusable Data Grid Framework]
1. Science data tools and applications (Visualization Tools, Analysis Tools, Web Search Tools) use “APIs” (the OODT API) to connect to a virtual data repository
2. Middleware creates the data grid infrastructure connecting distributed heterogeneous systems and data
3. Repositories for storing and retrieving many types of data (Mission Data Repositories, Biomedical Data Repositories, Engineering Data Repositories)
• Common Meta Models for Describing Space Information Objects
• Common Data Dictionary end-to-end
Query Integration
[Figure: XML Requests go out and Information Objects come back between client interfaces and servers. Labels shown: Web I/F; Desktop I/F; Name Server / Service Registry (WSDL); Node 1 Profile Server / Registry Server; Repository Product Server / Repository/Archive Server; Product Catalogs; Science Products.]
OODT software implementation
• OODT is Open Source
• Developed using open source software (i.e. Java/J2EE and XML)
• Implemented reusable, extensible Java-based software components
– Core software for building and connecting data management systems
• Provided messaging as a “plug-in” component that can be replaced independent of the other core components
– Messaging components include: CORBA, Java RMI, JXTA, Web Services, etc.
– REST seems to have prevailed
• Provided client APIs in Java, C++, HTTP, Python, IDL
• Simple installation on a variety of platforms (Windows, Unix, Mac OS X, etc.)
• Used international data architecture standards
– ISO/IEC 11179 – Specification and Standardization of Data Elements
– Dublin Core Metadata Initiative
– W3C’s Resource Description Framework (RDF) from Semantic Web Community
EDRN Knowledge Environment
• EDRN has been a pioneer in the use of informatics technologies to support biomarker research
• EDRN has developed a comprehensive infrastructure to support biomarker data management across EDRN’s distributed cancer centers
– Twelve institutions are sharing data
– Same architectural framework as planetary science
• It supports capture and access to a diverse set of information and results
– Biomarkers
– Proteomics
– Biospecimens
– Various technologies and data products (image, micro-satellite, …)
– Study Management
[Figure: Integrated Cancer Resources. Data and computers interconnected to form a virtual database of specimens, images, assays, biomarkers, etc.]
• Often unique, one of a kind missions
– Can drive technological changes
• Instruments are competed and developed by academic, industry and industrial partners
– Highly distributed acquisition and processing across partner organizations
– Highly diverse data sets given heterogeneity of the instruments and the targets (i.e. solar system)
• Missions are required to share science data results with the research community requiring:
– Common domain information model used to drive system implementations
– Expert scientific help to the user community on using the data
– Peer-review of data results to ensure quality
– Distribution of data to the community
• Planetary science data from NASA (and some international) missions is deposited into the Planetary Data System
[Figure: elements of a space mission data system. Labels shown: One or More Spacecraft; One or More Instruments; A Space Tracking Network; A Ground Tracking Network; A Spacecraft Control Center; An Instrument Control Center; A Science Facility; Commodity Space Communications Systems; Commodity Space Navigation Systems. Source: A. Hooke, NASA/JPL]
Planetary Data System
• NASA’s official archive for research results from solar system exploration
• Distributed across the United States at “PDS Nodes”
– 8 nodes including both science nodes and support nodes
– Data and Services reside at each node
• Unified by a common data architecture and broad technical architecture
[Figure: map of PDS nodes. Labels shown: NAIF/JPL; Small Bodies/UMD; Atmospheres/New Mexico State; Geosciences/Washington University; Planetary Plasma/UCLA; Rings/SETI; Radio Science/Stanford; Engineering/JPL; Imaging/USGS; Imaging/JPL; Mars Odyssey THEMIS/ASU Data Node; MRO-HiRISE/UofA Data Node.]
The data architecture is key
• The planetary community has developed a diverse model that is enforced and used in data management
– NASA-led, but ESA, ISRO, JAXA, etc. are leveraging planetary science data standards
• Core “information” model that has been used to describe every type of data from NASA’s planetary exploration missions and instruments
– ~4000 different types of data
• Unique to planetary, but the concept of models and how they apply to science data is not
[Figure: PDS Image Label (ODL), PDS Image Class (Object-Oriented), and an image; relationship label: Describes.]
• Pre-Oct 2002, no unified view across distributed operational planetary science data repositories
– Science data distributed across the country
– Science data distributed on physical media
• Planetary data archive increasing from 4 TBs in 2001 to 100 TBs in 2009
– Traditional distribution infeasible due to cost and system constraints
– Mars Odyssey could not be distributed using traditional method
• Current work with the OODT Data Grid Framework has provided the technology for NASA’s planetary data management infrastructure to
– Support online distribution of science data to planetary scientists
– Enable interoperability between nine institutions
– Support real-time access to data products
– Provide uniform software interfaces to all Mars Odyssey data, allowing scientists and developers to link in their own tools
– Operational October 1, 2002
• Moving to multi-terabyte online data movement in 2009
[Image: 2001 Mars Odyssey]
The architecture reuse opportunity
• While planetary has unique constraints and requirements, the broader architecture patterns are exhibited in other science areas
– Planetary can be very unforgiving when it comes to system failures
• Biology and Earth, for example
– Are distributed
– Have similar pipelines and processes
– Focus on instruments that perform observations and then analysis of those instruments
– Work with data in similar ways
– Are PI and science-driven
• “To thrive, the field that links biologists and their data urgently needs structure, recognition and support. The exponential growth in the amount of biological data means that revolutionary measures are needed for data management, analysis and accessibility. Online databases have become important avenues for publishing biological data.” – Nature Magazine, September 2008
• The capture and sharing of data to support collaborative research is leading to new opportunities to examine data in many sciences
– NASA routinely releases “data analysis programs” to analyze and process existing data
[Figure: EDRN Data Repositories]
• Initiated in 2000, renewed in 2005
• 100+ Researchers (both members and associated members)
• ~40+ Research Institutions
• Mission of EDRN
– Discover, develop and validate biomarkers for cancer detection, diagnosis and risk assessment
– Conduct correlative studies/trials to validate biomarkers as indicators of early cancer, pre-invasive cancer, risk, or as surrogate endpoints
– Develop quality assurance programs for biomarker testing and evaluation
– Forge public-private partnerships
• Leverage building distributed planetary science data systems for biomedicine
[Figure: EDRN end-to-end data flow. Labels shown: Instrument; Instrument Operations; Laboratory Biorepository; Local Laboratory Science Data System; Science Data Processing; eCAS - EDRN Biorepository; Analysis Team; EDRN Bioinformatics Tools; Publish Data Sets; Data Distribution (EDRN Public Portal); EDRN Researchers; External Science Community.]
EDRN’s Ontology Model
• EDRN has developed a high-level ontology model for biomarker research which provides standards for the capture of biomarker information across the enterprise
• Specific models are derived from this high-level model
– Model of biospecimens
– Model for each class of science data
• EDRN is specifically focusing on a granular model for annotating biomarkers, studies and scientific results
• EDRN has a set of EDRN Common Data Elements which is used to provide standard data elements and values for the capture and exchange of data
[Figures: EDRN Biomarker Ontology Model; EDRN CDE Tools]
The EDRN Knowledge Environment
• ESIS -- EDRN Study Information System
• eCAS -- EDRN Catalog and Archive System
• ERNE -- EDRN Resource Network Exchange
• BMDB -- NCI Biomarker DB
Moving to an integrated semantic architecture
• Semantic science portal driven by the EDRN ontology
– Schema loaded into the ontology via RDFS (and Protégé)
– Metadata from distributed applications dumped into the portal via RDF
• Moving EDRN towards a “pure” model-driven environment
Other science areas
• Earth Science
– Leveraged OODT software framework for constructing ground data systems for earth science missions
– Used OODT Catalog and Archive Service software
– Constructed “workflows”: execution of “processors” based on a set of rules
• Medical Research
– Support for distributed analysis of pediatric intensive care units
• Climate Research
– Support for distributed modeling
[Image: SeaWinds on ADEOS II (Launched Dec 2002)]
Related work…
• The plethora of middleware, e-science and grid efforts…
• Major agency efforts in physical and life sciences…
• Standards efforts…
• All the technology support (but see my message on next slide as an architect!)
My message…
• Distributed service architectures
– Not anything new (my experience with them goes back to the early 1990s)
– But, often, newer technologies and approaches are seen as a panacea
• Technology is not a replacement for a conceptual architecture
– My experience is that definition of the architecture independent of technology is critical
– The goal should be stability in the architecture model; the selection of appropriate technology will change over time
– This is why an architect is much more of a strategist than a technologist
More preaching…
• Think about the entire system and identify the abstractions
– You need the holistic view
– What are the patterns?
• Will an architecture framework help (separation of process, data, technology, etc. views)?
– Can these evolve independently?
Resources
(1) Tracz, Will. Domain-Specific Software Architecture. ACM SIGSOFT, 1995.
(2) D. Crichton, S. Kelly, C. Mattmann, Q. Xiao, J. S. Hughes, J. Oh, M. Thornquist, D. Johnsey, S. Srivastava, L. Esserman, and B. Bigbee. A Distributed Information Services Architecture to Support Biomarker Discovery in Early Detection of Cancer. In Proceedings of the 2nd IEEE International Conference on e-Science and Grid Computing, pp. 44, Amsterdam, the Netherlands, December 4th-6th, 2006.
(3) C. Mattmann, D. Crichton, N. Medvidovic and S. Hughes. A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications. In Proceedings of the 28th International Conference on Software Engineering (ICSE06), pp. 721-730, Shanghai, China, May 20th-28th, 2006.