Navigational Data Management
Martin Greenwald, Josh Stillerman, John Wright
MIT – Plasma Science & Fusion Center
IAEA TM on Fusion Data Processing June 2, 2017
“I have a system for storing my data and getting it back, aren’t I done?”
IAEA TM 2017 Navigational Data Management 2
● Probably not.
● Collecting data has never been easier, but…
● We’re struggling to keep up with the rapidly growing volume and complexity of scientific data.
Our Thesis
● The challenge is all about giving this mountain of data meaning and putting it into context
● Context requires metadata describing relationships among data objects – “navigational metadata”
● In general, our existing approach to capturing and exploiting this class of metadata has been ad hoc and inadequate
● This hampers data discovery and the ability to assemble coherent, complete, useful data sets.
Discovering and Understanding Data Depends On Context
● Historical attempts at organizing human knowledge: systems for ordering and categorizing
● Data discovery relies on “adjacency” to find other interesting data
● Historically we’ve each built a set of ad hoc, domain-specific tools to store, explore, and retrieve this relationship metadata.
● Similar issues confront all data intensive areas of research.
● Can we solve these problems in our own domain?
● Can we generalize these to provide solutions across a broader set of domains?
It’s no accident
Organizing knowledge is an old problem
Define adjacencies – ease searching & browsing
Background: Our Approach To This Problem Arose From A Decades-Long Process of Generalization & Abstraction
● Each step made the collection and organization of data easier – salient examples include
● MDSplus (www.mdsplus.org)
– Provides a well-characterized, hierarchical description of diagnostics, setup, calibration measurements, analyzed data, metadata, data acquisition workflow
– A single organizational hierarchy dominates (tree)
● MPO (Metadata, Provenance, Ontology) (mpo.psfc.mit.edu)
– Captures experimental, analysis and simulation workflows
– A single organization schema dominates (directed acyclic graph)
● Electronic Logbook database & schema
● These are particular examples of a general, graph-representation of data and relationships
Complexity: What Sorts Of Data Might Exist From A Typical Experiment?
● Hierarchical data stores with raw and processed data (~10^5 named data objects per shot)
● Relational databases with “high level” results
● Electronic logbooks & annotation
● Experimental proposals
● Run Plans & Summaries
● Data provenance systems
● Data catalogs
● Data dictionaries, name space management
● Information about experimental campaigns & plans
● Publications & presentations
● Information about researchers, authors
● Simulation inputs & outputs
● Source code management systems
● Facility information, with details of experiment, measurement systems
● Document, drawing management systems
● QA, QC information
● Work breakdown structure (WBS) for projects
All Of Those Data Are Linked In Multiple and Complex Ways
DIBBs Award 1640829 IAEA TM2017 - Navigational Data Management - MIT 7
List of all experimental proposals with links to runs where they were executed
Full text of experimental proposals
List of all experimental runs and dates with links to proposals
Summary information about each run
Summary information about each shot
Logbook – shot by shot annotation
MDSplus tree with links to all of the raw and processed data
Currently, Our Capture of This Relationship Web Is Incomplete, Ad Hoc and Asymmetric
● Incomplete
– Some relationships are explicitly represented in databases
– Some are implicit in data or text
– Some are only known or accessible by particular users
– Some are not recorded and can be forgotten and lost forever
● Ad Hoc
– We’ve added this information as needs arise
– Schemas, vocabulary are not always consistent
– Level of detail is uneven
● Asymmetric
– Example: The logbook (annotation) points to interesting data, but the data do not point back to the annotation (there are many, many other examples)
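The asymmetry in the logbook example can be illustrated with a minimal sketch. It assumes the links are available as (annotation, data) pairs; the identifiers and the `build_indexes` helper are hypothetical and are not taken from the C-Mod logbook schema.

```python
from collections import defaultdict

# Hypothetical one-way links as recorded today: logbook entry -> shot it annotates.
annotates = [
    ("logbook:entry-17", "shot:1050426022"),
    ("logbook:entry-18", "shot:1050426022"),
    ("logbook:entry-19", "shot:1050426023"),
]

def build_indexes(links):
    """Index each link in both directions so either end can find the other."""
    forward = defaultdict(list)   # annotation -> data it points to
    reverse = defaultdict(list)   # data -> annotations that point to it
    for source, target in links:
        forward[source].append(target)
        reverse[target].append(source)
    return forward, reverse

forward, reverse = build_indexes(annotates)

# Starting from the data, we can now find the annotation.
print(reverse["shot:1050426022"])
```

Storing each relationship once and indexing it from both ends is one way to make every link navigable in either direction.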
Organization of Data – By Diagnostic System
[Figure: tree diagram. Node labels: Top; Thomson Scattering, Interferometer, ECE, VB; Results, Hardware, Raw Data, Raw Brightness; Te, ne, Zeff]
The C-Mod data system has ~10^5 such nodes with significant metadata for each node
Organization of Data – By Physics Parameter
[Figure: the same diagram with added nodes Plasma Parameters, Density and Temperatures. Other node labels: Top; Thomson Scattering, Interferometer, ECE, VB; Results, Hardware, Raw Data, Raw Brightness; Te, ne, Zeff]
Such a graph would also be linked to descriptions of experiments, annotation, etc.
Organization of Data – By Data Provenance
[Figure: the same diagram with an added Zeff Calculation node. Other node labels: Top; Thomson Scattering, Interferometer, ECE, VB; Results, Hardware, Raw Data, Raw Brightness; Te, ne, Zeff]
Such a graph would also be linked to descriptions of analysis codes, annotation, etc.
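Organizing by provenance amounts to following "derived from" edges back from a result. The sketch below assumes an illustrative set of edges loosely based on the node labels in the figure; the actual inputs to the Zeff calculation are not stated on the slide, and the `upstream` helper is hypothetical.

```python
# Hypothetical provenance edges: each object lists the objects it was computed from.
derived_from = {
    "Zeff": ["Zeff calculation"],
    "Zeff calculation": ["VB raw brightness", "Te", "ne"],
    "Te": ["Thomson scattering raw data"],
    "ne": ["Thomson scattering raw data"],
}

def upstream(obj, graph):
    """Return every object that `obj` depends on, directly or indirectly."""
    seen = set()
    stack = [obj]
    while stack:
        current = stack.pop()
        for parent in graph.get(current, []):
            if parent not in seen:   # a shared ancestor is visited only once
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(upstream("Zeff", derived_from)))
```

Because two results can share an ancestor, the structure is a directed acyclic graph rather than a tree, which is why the traversal tracks visited nodes.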
As We Navigate, The Data May Be Organized Into Different Topologies
We Need A Systematic Approach To Represent These Relationships
● A complex data store can be represented as a collection of data objects
– With attributes (metadata)
– With relationships to other data objects (navigational metadata)
● These relationships organize the data objects into multiple organizational paradigms
– Graphs of different topologies
– Trees, Lists, DAGs, Clouds, etc
● Note: Each data object is typically a member of several organizational schemes
– Example: We could organize a data tree by diagnostic hardware (e.g. Thomson Scattering), by physical measurement (e.g. Te as measured by TS, ECE, etc.), or by its use in an analysis chain (e.g. kinetic MHD equilibrium)
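A minimal sketch of one data object belonging to several organizational schemes, assuming that each relationship carries a type and that a scheme is the set of relationships of one type. The `Relation` class, the relationship type names and the object names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    source: str   # id of the object the edge starts from
    target: str   # id of the object the edge points to
    kind: str     # relationship type, which selects the organizational scheme

relations = [
    Relation("Thomson Scattering", "TS Te", "diagnostic_has_result"),
    Relation("ECE", "ECE Te", "diagnostic_has_result"),
    Relation("Temperatures", "TS Te", "parameter_measured_by"),
    Relation("Temperatures", "ECE Te", "parameter_measured_by"),
    Relation("TS Te", "kinetic MHD equilibrium", "input_to_analysis"),
]

def view(relations, kind):
    """Project the full relationship web onto one scheme: parent -> children."""
    tree = {}
    for r in relations:
        if r.kind == kind:
            tree.setdefault(r.source, []).append(r.target)
    return tree

def schemes_of(relations, obj):
    """List every scheme in which `obj` takes part."""
    return sorted({r.kind for r in relations if obj in (r.source, r.target)})

print(view(relations, "parameter_measured_by"))
print(schemes_of(relations, "TS Te"))
```

The same "TS Te" object appears under its diagnostic, under the temperature parameter, and as an input to an analysis chain, without being stored three times.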
We’ve Just Begun A Small Project To Add These Capabilities
● Goals
– Systematically represent and expose data relationships
– Data driven
o Meta-Schema (the collection of relationship types and properties) defined as data
o Instances (actual relationships between specific records) stored as data
– Extensible - meet new needs without refactoring or writing (much) new code
– Broadly applicable, domain independent
● Work Products: a toolset to build and navigate the relationship web (schema and instances)
– API (For building, populating and interrogating databases)
– GUI (For traversing, browsing, searching and displaying)
● Targeting data managers, not directly end users
– Build complex data systems from existing & new components
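The "data driven" goal can be sketched as follows, assuming the meta-schema is a plain mapping from relationship type to allowed endpoint types. This illustrates the idea only and is not the project's API; all type names, identifiers and the `add_relation` helper are hypothetical.

```python
# Meta-schema stored as plain data: relationship type -> allowed endpoint types.
meta_schema = {
    "executed_in": {"source": "proposal", "target": "run"},
    "contains": {"source": "run", "target": "shot"},
    "annotates": {"source": "logbook_entry", "target": "shot"},
}

# Hypothetical data objects: id -> object type.
objects = {"MP-123": "proposal", "run-0426": "run", "1050426022": "shot"}

def add_relation(instances, kind, source, target):
    """Append an instance only if it conforms to the meta-schema."""
    rule = meta_schema.get(kind)
    if rule is None:
        raise ValueError(f"unknown relationship type: {kind}")
    if objects.get(source) != rule["source"] or objects.get(target) != rule["target"]:
        raise ValueError(f"{kind} cannot link {source} to {target}")
    instances.append({"kind": kind, "source": source, "target": target})

instances = []
add_relation(instances, "executed_in", "MP-123", "run-0426")
add_relation(instances, "contains", "run-0426", "1050426022")

# Extending the system is a data change, not a code change.
meta_schema["cites"] = {"source": "publication", "target": "shot"}
objects["paper-1"] = "publication"
add_relation(instances, "cites", "paper-1", "1050426022")
```

Adding the "cites" relationship required no change to `add_relation`, which is the sense in which the design is extensible without writing (much) new code.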
System Needs To Represent Only Two Kinds Of Elements
● Data objects
– With attributes (= metadata)
– Includes pointer (URI) to data & protocol
● Relations
– With attributes
● We can recognize these as the nodes and edges that define a mathematical graph
● Example from MPO system – data preparation for GYRO code
Not Intended As Stand-alone: Would Not Replace All Other Data Stores
● Point to data objects in these external stores via URI – agnostic to the type of data being referred to
● URI includes protocol – indicates software ecosystem needed to read & navigate data store
– MDSplus “mdsplus://cmod/1050426022/electrons/Thomson/results/te”
– HDF5
– File system “file://server_name.domain/home/username/gs2/run_1234/inputs/grad_te”
– RDB
– Etc.
● Full functionality requires that each data store allow access to data, metadata, schema and navigation via API
– External data could range from transparent to opaque, depending on the capabilities of its underlying system
– Some systems could be retrofitted to improve access to information
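A minimal sketch of protocol-based dispatch on the URI, using the two example URIs above. The `READERS` registry, its reader names and the `describe` helper are assumptions made for illustration; the slides do not specify how a reader is selected.

```python
from urllib.parse import urlsplit

# Hypothetical registry: URI scheme -> software ecosystem needed to read the store.
READERS = {"mdsplus": "MDSplus client", "file": "file system"}

def describe(uri):
    """Split a data-object URI into protocol, authority and path, and pick a reader."""
    parts = urlsplit(uri)
    return {
        "protocol": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
        # Stores without a registered reader stay opaque to the system.
        "reader": READERS.get(parts.scheme, "opaque"),
    }

for uri in (
    "mdsplus://cmod/1050426022/electrons/Thomson/results/te",
    "file://server_name.domain/home/username/gs2/run_1234/inputs/grad_te",
):
    print(describe(uri))
```

The relationship store only needs the URI string. Interpreting the path is left to the reader for that protocol, which keeps the system agnostic to the type of data being referred to.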
Initial Implementation, R&D
● Refactor existing annotations (logbook), Experimental Proposals, Runs, Shots, etc into new system as proof of principle
● Compare various database technologies (Relational, NoSQL, Graph (Neo4j, OrientDB))
– How hard is it to set up?
– Can it represent our objects?
– How hard is it to populate?
– How hard is it to query? How fast to execute?
– How hard to write applications against?
– How robust? Can it scale?
● Write an initial Single Page Application (SPA) using a modern web front-end framework (VueJS, Angular, Polymer, …)
● Iterate
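As one data point for the relational candidate, the sketch below stores nodes and edges in two tables and answers a traversal query with a recursive common table expression. It uses Python's built-in sqlite3 module purely for illustration; the table layout, identifiers and relationship names are assumptions, not the project's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id TEXT PRIMARY KEY, kind TEXT NOT NULL);
    CREATE TABLE edge (source TEXT REFERENCES node(id),
                       target TEXT REFERENCES node(id),
                       kind   TEXT NOT NULL);
""")
conn.executemany("INSERT INTO node VALUES (?, ?)", [
    ("MP-123", "proposal"), ("run-0426", "run"),
    ("1050426022", "shot"), ("entry-17", "logbook_entry"),
])
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", [
    ("MP-123", "run-0426", "executed_in"),
    ("run-0426", "1050426022", "contains"),
    ("entry-17", "1050426022", "annotates"),
])

# Everything reachable from a starting node, following edges forward to any depth.
REACHABLE = """
    WITH RECURSIVE reach(id) AS (
        SELECT target FROM edge WHERE source = ?
        UNION
        SELECT edge.target FROM edge JOIN reach ON edge.source = reach.id
    )
    SELECT id FROM reach ORDER BY id
"""
reachable = [row[0] for row in conn.execute(REACHABLE, ("MP-123",))]
print(reachable)
```

Writing the same query against a graph database would be a natural way to compare "how hard is it to query?" across the candidate technologies.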
Challenges: Potentially Infinite Scope
● Taken to its limits, this representation of information can grow without bounds
– Semantic web running into difficulties – trying to bring “meaning” to information on the web
– Truncation will be key, limit scope to useful relationships
– Granularity – some data objects must be treated as opaque, without internal structure as far as this system is concerned (example – long time series)
– Follow 80:20 rule – 80% of the value attained with the first 20% of the effort
● These issues also impact applications and visualization
– Visual browsing & data discovery are only practical if they match human cognition
– Adopt a “filter, then browse” paradigm
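A minimal sketch of "filter, then browse", assuming the relationship web is available as (source, type, target) triples. The filter keeps only the relationship types of interest, and the browse step is a breadth-first walk truncated at a fixed depth. The edges and the `filter_then_browse` helper are hypothetical.

```python
from collections import deque

# Hypothetical relationship web: (source, relationship type, target).
edges = [
    ("MP-123", "executed_in", "run-0426"),
    ("run-0426", "contains", "1050426022"),
    ("run-0426", "contains", "1050426023"),
    ("entry-17", "annotates", "1050426022"),
    ("1050426022", "has_signal", "Te"),
]

def filter_then_browse(edges, start, kinds, max_depth):
    """Keep only edges of the wanted types, then walk outward a bounded number of hops."""
    neighbours = {}
    for source, kind, target in edges:
        if kind in kinds:                      # filter step
            neighbours.setdefault(source, []).append(target)
    depth = {start: 0}
    queue = deque([start])
    while queue:                               # browse step: breadth-first search
        node = queue.popleft()
        if depth[node] == max_depth:
            continue                           # truncate: do not expand further
        for nxt in neighbours.get(node, []):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth

print(filter_then_browse(edges, "MP-123", {"executed_in", "contains"}, max_depth=2))
```

Bounding both the relationship types and the depth keeps the subgraph handed to a visualization small enough to browse.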
Challenges: Costs & Mitigations
● For this class of metadata to be useful, it has to be populated
– Effort required for data managers and for users
– Costs are localized, benefits are general
● Ease the slope to entry
– Drive development from acknowledged use cases
– Encode existing data relationship systems
– Mine existing data sets
– Automate metadata capture wherever possible
– Create an easy-to-use API and provide plenty of examples
– Build example applications, templates
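Mining existing data sets could start from relationships that are implicit in free text. The sketch below assumes that shot numbers are 10-digit integers (as in 1050426022) and that logbook entries are available as plain strings. The entry text, identifiers and the `mine_annotations` helper are made up for illustration.

```python
import re

# Assumption: shot numbers are 10-digit integers such as 1050426022.
SHOT_PATTERN = re.compile(r"\b\d{10}\b")

def mine_annotations(entries):
    """Turn shot numbers mentioned in free text into explicit relationship records."""
    relations = []
    for entry_id, text in entries.items():
        # De-duplicate so that a shot mentioned twice yields one relationship.
        for shot in sorted(set(SHOT_PATTERN.findall(text))):
            relations.append({"kind": "annotates", "source": entry_id, "target": shot})
    return relations

entries = {
    "entry-17": "Good H-mode on 1050426022; repeat of 1050426021 with more power.",
    "entry-18": "No shots mentioned here.",
}
mined = mine_annotations(entries)
for relation in mined:
    print(relation)
```

A pass like this captures relationships automatically, lowering the cost of populating the metadata for data managers and users.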
END