Ground: Managing Metadata in the Big Data Ecosystem Vikram Sreekanti AMP Lab, U.C. Berkeley

Post on 14-Apr-2017

TRANSCRIPT

Page 1: Ground: Managing Metadata in the Big Data Ecosystem

Ground: Managing Metadata in the Big Data Ecosystem
Vikram Sreekanti
AMP Lab, U.C. Berkeley

Page 2

What is data?

• 20th Century Data: Accounting
  • “02/16: Sally withdrew $100 from checking.”

• 21st Century Data: Raw materials
  • Sally’s online purchase records…
  • ... and timeseries data from her FitBit
  • ... and popular films for various demographics
  • ... and weather forecast for the next 48 hours.

Page 3

What is Metadata?

• Data about data
• This used to be so simple!

Page 4

What is Metadata?

• Data about data
• This used to be so simple!

• But... schema on use
• One of many changes

Page 5

What is Metadata?

• Analysis
• Interoperability
• Interpretation
• Reproducibility
• Governance & The Collective

Page 6

Analysis

Page 7

Case: Data Analysis

[Diagram labels: Data, Wrangle, Analyze, Visualize, Results]

METAMNESIA

Page 8

“One of the things that my research advisor Mike Harrison taught me to do is to WRITE THINGS DOWN.”
—JIM GRAY

I’M IN THE FLOW.
WRITE THINGS DOWN.
TENSION

Page 9

You will never know your data better than when you are wrangling and analyzing it.

The flow state

Page 10

TAKE ACTION

Data Analytics Infrastructure team

“Write down what you can, we’ll fill in the rest.”

Page 11

Taking Action: Football

• Video data Annotations.
• Metadata from manual annotation

Page 12

Taking Action: Football

• Video data Annotations.
• Passive metadata: sensor streams
• NFL + MS = Cool.

Page 13

Taking Action: Football

• Video data Annotations.
• Passive metadata: sensor streams
• NFL + MS = Cool.

• Metadata + Simulation
• NFL + MS + EA = POV.

Page 14

Taking Action: Data Analysis

Capture what people do with data. Augment as appropriate. Interpolate as needed.

Page 15

Analysis
• tap the flow
• fill in the rest
metadata-on-use

Interoperability

Page 16

CASE: Data Debugging

Page 19

Relationships

Master Data on Customers
Call detail from HDFS
Data Wrangling Script
Python
Numpy
Churn Analysis
Hypothesis
Wrangle

Page 20

Versioned Relationships

Python: v2.7
Numpy: v1.9.3
Wrangle: v3.0
Master Data on Customers: MDM 10/11/15
Call detail from HDFS: v1.26
Data Wrangling Script: git hash 0x6987a68a9876b7
Churn Analysis: git hash 0x987667e876f033
Hypothesis
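
The versioned relationships on this slide can be sketched as a small graph of pinned artifacts. This is a minimal illustration, not Ground's API: the `VersionedItem` type, the `depends_on` mapping, and the `pinned_inputs` helper are hypothetical, and the edge directions are assumed because the slide's arrows did not survive in the transcript. Only the names and version strings come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedItem:
    """An artifact pinned to one specific version (hypothetical type)."""
    name: str
    version: str


# Names and versions as listed on the slide.
python = VersionedItem("Python", "v2.7")
numpy = VersionedItem("Numpy", "v1.9.3")
wrangle = VersionedItem("Wrangle", "v3.0")
customers = VersionedItem("Master Data on Customers", "MDM 10/11/15")
calls = VersionedItem("Call detail from HDFS", "v1.26")
script = VersionedItem("Data Wrangling Script", "git hash 0x6987a68a9876b7")
churn = VersionedItem("Churn Analysis", "git hash 0x987667e876f033")

# ASSUMPTION: these edges are illustrative; the transcript does not record
# which item points at which.
depends_on = {
    churn: [script, python, numpy],
    script: [customers, calls, wrangle],
}


def pinned_inputs(item, graph):
    """Return every versioned item reachable from `item`, depth-first."""
    seen = []
    stack = list(graph.get(item, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(graph.get(current, []))
    return seen


for dep in pinned_inputs(churn, depends_on):
    print(f"{dep.name}: {dep.version}")
```

Because every node carries an exact version, walking the graph from the analysis yields the full set of inputs needed to rerun it.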

Page 21

Common ground?

• SW market exploding
• n² connections
• Need a shared place to Write it down, Link it up

Page 22

Interpretation

Analysis
• tap the flow
• fill in the rest
metadata-on-use

Interoperability
• metadata as protocol
• general formats
standards-on-use

Page 23

CASE: Recommender System

• Consider a recommender system like Netflix
• Consists of data (user views & ratings, movie features) and a statistical model.
• For any piece of data: “Sally watched The Shining”
• This fact is much more meaningful with a model: the model makes the recommendations!
• The model is also no good without data: data is used to train & refine the model.

Page 24

CASE: Recommender System

• Any machine learning system has this coupling.
• Interpretation of the data depends on the model we choose.
• Models are parametrized by data.
• The meaning & value of data in any context is the coupling of model + data.
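
The coupling of model and data can be shown with a toy example. The sketch below is illustrative only: the viewing records (apart from "Sally watched The Shining") and both models are made up, and neither is how any real recommender works. The same fact about Sally leads to different recommendations depending on which model interprets it, and each model is built entirely from the data.

```python
from collections import Counter

# Hypothetical viewing records as (user, film) pairs.
# Only "Sally watched The Shining" comes from the slide.
views = [
    ("Sally", "The Shining"),
    ("Bob", "The Shining"), ("Bob", "Alien"),
    ("Ann", "The Shining"), ("Ann", "Alien"),
    ("Raj", "Up"), ("Lee", "Up"), ("Kim", "Up"),
]


def train_popularity(views):
    """Model A: recommend the most-watched film the user has not seen."""
    counts = Counter(film for _, film in views)

    def recommend(user):
        seen = {film for viewer, film in views if viewer == user}
        unseen = Counter({f: c for f, c in counts.items() if f not in seen})
        return unseen.most_common(1)[0][0]

    return recommend


def train_cowatch(views):
    """Model B: recommend what viewers with overlapping history watched."""
    by_user = {}
    for user, film in views:
        by_user.setdefault(user, set()).add(film)

    def recommend(user):
        seen = by_user[user]
        scores = Counter()
        for other, films in by_user.items():
            if other != user and films & seen:
                scores.update(films - seen)
        return scores.most_common(1)[0][0]

    return recommend


print(train_popularity(views)("Sally"))  # Up
print(train_cowatch(views)("Sally"))     # Alien
```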

Page 25

Reproducibility

Analysis
• tap the flow
• fill in the rest
metadata-on-use

Interoperability
• metadata as protocol
• general formats
standards-on-use

Interpretation
• models interpret data
• data conditions models
(data + metadata)-on-use

Page 26

Can metadata cure cancer?

Page 27

No.

Page 28

But it’s going to be useful.

Page 29

Case: Cancer Genomics

General population data (“1000 genomes”)
Compare
Clustering Algorithm
Patient Data
Put leukemia cells on slide
Robot puts chemistry on slides
Robot puts slide on gene sequencer
× 1000 patients

Page 30

Data Lineage

Back to tissue and bar codes on slides!

Logical & Physical
• Tissue
• Data (and metadata)
• Code

Page 31

It gets messier

General population data (“1000 genomes”)
Compare
Put leukemia cells on slide
Robot puts chemistry on slides
Robot puts slide on gene sequencer
× 1000 patients
Parameter Sweep
Parameters
Clustering Algorithm
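
A parameter sweep multiplies the number of runs whose provenance has to be tracked. The sketch below shows one way to record a lineage entry per run, including the runs that fail. Everything in it is assumed for illustration: `cluster` is a stand-in rather than a real clustering algorithm, and the parameter names, the input version label, and the log format are hypothetical.

```python
import itertools


def cluster(data, k, threshold):
    """Stand-in for the clustering step (hypothetical, not a real algorithm)."""
    if k > len(data):
        raise ValueError("k exceeds number of samples")
    return {"k": k, "kept": [x for x in data if x >= threshold]}


def sweep(data, data_version, grid):
    """Run every parameter combination and record one lineage entry per run."""
    log = []
    keys = sorted(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        entry = {"input": data_version, "params": params}
        try:
            entry["result"] = cluster(data, **params)
            entry["status"] = "success"
        except ValueError as err:
            # Failed runs are part of the lineage too.
            entry["status"] = "failure"
            entry["error"] = str(err)
        log.append(entry)
    return log


runs = sweep([0.2, 0.5, 0.9], "patients-v1", {"k": [2, 5], "threshold": [0.1, 0.6]})
for run in runs:
    print(run["params"], run["status"])
```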

Page 32

Analysis
• tap the flow
• fill in the rest
metadata-on-use

Interoperability
• metadata as protocol
• general formats
standards-on-use

Reproducibility
• instrumentation
• lineage: success & failure
lab notebook-on-use

Governance & The Collective

Interpretation
• models interpret data
• data conditions models
(data + metadata)-on-use

Page 33

Back at the Enterprise

We’re talking Governance.
• And self-service for end users

Page 34

CASE: Jupyter Notebook

• An electronic lab notebook
• Evolution of IPython Notebook
• Writing it down since 2011

Page 35

Running a Class from Notebooks

Assignments are notebooks
• Students create versions
• Staff solution is a version

Grading
• Execute each notebook on some data
• Annotating the notebook with grades
• Updating a grades spreadsheet

Page 36

Homework Governance

Skools ’n rools!
• Students can’t see each other’s HW
• Students can’t see solution
  • Unless they’ve turned in theirs and it’s after April 12 and they have a Berkeley login
• Graders can’t see student names
• Students can’t update grade spreadsheet
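
The rules above can be written down as small policy checks. This is a sketch under stated assumptions, not anything from the course's actual tooling: the `Student` type, the field names, and the `"staff"` role are hypothetical. The slide gives no year for April 12, so the release date is passed in as a parameter.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Student:
    """Hypothetical record of the facts the rules depend on."""
    name: str
    turned_in: bool
    has_berkeley_login: bool


def can_see_homework(viewer: Student, owner: Student) -> bool:
    """Students can't see each other's homework, only their own."""
    return viewer == owner


def can_see_solution(student: Student, today: date, release: date) -> bool:
    """All three conditions from the slide must hold."""
    return student.turned_in and today > release and student.has_berkeley_login


def grader_view(submission: dict) -> dict:
    """Graders can't see student names: drop the identifying field."""
    return {key: value for key, value in submission.items() if key != "student"}


def can_update_grades(role: str) -> bool:
    """Students can't update the grade spreadsheet. ASSUMPTION: staff can."""
    return role == "staff"
```

Even this short list produces conditions that depend on time, identity, and prior actions, which is the kind of metadata a governance layer has to capture.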

Page 37

Collective Intelligence

Rules should be a small part of school.

If we do things well…
• People get smarter
• Educational software gets smarter
• Organizations get smarter

Fueled by observing, learning, iterating.

Write things down, fill in later.

Page 38

Collective, Intelligent Governance

By the people. Emergent governance.
• Sandbox → Annotations → Awareness → Reuse → Debate → Consensus

For the people. Collective Intelligence emerges.

http://blogs.forrester.com/michele_goetz/15-09-24-are_data_preparation_tools_changing_data_governance

Page 39

Analysis
• tap the flow
• fill in the rest
metadata-on-use

Interoperability
• metadata as protocol
• general formats
standards-on-use

Reproducibility
• instrumentation
• lineage: success & failure
lab notebook-on-use

Governance & The Collective
• by & for the people
• collective intelligence
governance-on-use

Interpretation
• models interpret data
• data conditions models
(data + metadata)-on-use

Page 40

What we’re doing: Ground

• Focus our design on useful & interesting challenges for real problems
• Develop a general but expressive data model
• Don’t prescribe design principles; support as many as possible

Page 41

Data Model: Core

• “Thing”: basic logical object
• Immutable
• Every “Thing” has a version history.

Models
Usage
Versions
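
One way to read the core model is as an append-only history of immutable versions. The sketch below is an assumption-laden illustration, not Ground's implementation: the `Version` fields, the `commit` method, and the idea of storing metadata as sorted key/value tags are all hypothetical. A version can name an earlier parent, which allows a history to branch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Version:
    """One immutable snapshot of a Thing's metadata (hypothetical layout)."""
    number: int
    parent: Optional[int]
    tags: Tuple[Tuple[str, str], ...]


class Thing:
    """A logical object whose state only changes by adding versions."""

    def __init__(self, name: str):
        self.name = name
        self._history = []

    def commit(self, tags: dict, parent: Optional[int] = None) -> Version:
        # Default to extending the latest version; an explicit parent
        # starts a branch from an earlier one.
        if parent is None and self._history:
            parent = self._history[-1].number
        version = Version(len(self._history) + 1, parent, tuple(sorted(tags.items())))
        self._history.append(version)
        return version

    def history(self) -> Tuple[Version, ...]:
        return tuple(self._history)


table = Thing("call_detail")
v1 = table.commit({"schema": "a"})
v2 = table.commit({"schema": "b"})
v3 = table.commit({"schema": "c"}, parent=v1.number)  # branch from v1
```

Nothing is ever overwritten, so any earlier state can still be inspected after later commits.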

Page 42

Design Principle: Immutability & Versioning

• Recall versioning
• Reproducibility = time travel.
• Alternate histories: What-if?

Python: v2.7
Numpy: v1.9.3
Wrangle: v3.0
Master Data on Customers: MDM 10/11/15
Call detail from HDFS: v1.26
Data Wrangling Script: git hash 0x6987a68a9876b7
Churn Analysis: git hash 0x987667e876f033
Hypothesis

Page 43

Data Model: Mantle

• Structures, Nodes, Edges, Graphs
• Subclasses of “Thing”
• Allows for modeling of data

Models
Usage
Versions
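
The subclass relationship can be sketched in a few lines. This is a hypothetical illustration rather than Ground's class hierarchy: the constructors, `add_edge`, and `neighbors` are made-up names, and Structures are left out because the slide does not say what they contain. The example models a table and one of its columns as a small graph.

```python
class Thing:
    """Minimal stand-in for the core logical object (hypothetical)."""

    def __init__(self, name):
        self.name = name


class Node(Thing):
    """An entity being described, e.g. a table or a column."""


class Edge(Thing):
    """A named, directed relationship between two nodes."""

    def __init__(self, name, source, target):
        super().__init__(name)
        self.source = source
        self.target = target


class Graph(Thing):
    """A named collection of nodes and the edges between them."""

    def __init__(self, name):
        super().__init__(name)
        self.nodes = {}
        self.edges = []

    def add_edge(self, edge):
        # Register both endpoints so the graph knows every node it touches.
        for node in (edge.source, edge.target):
            self.nodes[node.name] = node
        self.edges.append(edge)

    def neighbors(self, node):
        return [edge.target for edge in self.edges if edge.source is node]


table = Node("calls_table")
column = Node("duration_column")
schema = Graph("calls_schema")
schema.add_edge(Edge("has_column", table, column))
```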

Page 44

Data Model: Crust

• Lineage relationships between “Thing”s

Models

Usage

Versions

Page 45

Design Principle: Lineage

• Lineage is fundamental to any metadata system
• Versions track the state of things (“who”, “what”, and “when”)
• Lineage captures causes & influences (“how”)
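
The difference between versions and lineage can be made concrete: a version says what a thing was, while a lineage edge says how one thing produced another. The sketch below is illustrative only. The `LineageEdge` type and `explain` helper are hypothetical, and the identifiers such as `slide#17` are invented labels loosely based on the cancer genomics case.

```python
from typing import NamedTuple


class LineageEdge(NamedTuple):
    source: str  # the input that was used
    target: str  # the output that was produced
    how: str     # the process that turned source into target


# Hypothetical lineage records; the identifiers are made up.
edges = [
    LineageEdge("slide#17", "reads#17", "gene sequencer"),
    LineageEdge("reads#17", "clusters@run3", "clustering algorithm"),
    LineageEdge("1000 genomes", "clusters@run3", "compare"),
]


def explain(target, edges):
    """Walk lineage backwards and return every edge that led to `target`."""
    steps = []
    frontier = [target]
    visited = set()
    while frontier:
        current = frontier.pop()
        if current in visited:
            continue
        visited.add(current)
        for edge in edges:
            if edge.target == current:
                steps.append(edge)
                frontier.append(edge.source)
    return steps


for step in explain("clusters@run3", edges):
    print(f"{step.source} -> {step.target} via {step.how}")
```

Following the edges backwards from a result leads to the physical slide it came from.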

Page 46

Design Principle: Postel’s Law

• “Be conservative in what you do, be liberal in what you accept from others.”
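
Applied to metadata, Postel's Law suggests accepting many spellings of the same fact while always emitting one canonical form. The function below is a minimal sketch of that idea for version tags. The name `normalize_version`, the accepted input forms, and the `v<digits>` output format are assumptions made for illustration.

```python
def normalize_version(raw):
    """Accept several spellings of a version tag; always emit 'v1.2.3' style."""
    text = str(raw).strip()
    # Liberal in what we accept: an optional "version" or "v" prefix, any case.
    if text.lower().startswith("version"):
        text = text[len("version"):]
    text = text.strip().lstrip("vV").strip()
    parts = text.split(".")
    if not text or not all(part.isdigit() for part in parts):
        raise ValueError(f"unrecognized version: {raw!r}")
    # Conservative in what we do: exactly one output format.
    return "v" + ".".join(parts)


print(normalize_version("Version 3.0"))  # v3.0
print(normalize_version(" 1.9.3 "))      # v1.9.3
```

Input that cannot be interpreted is still rejected with an error rather than guessed at.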

Page 47

Design Principle: Neutrality

• Open-source & vendor-neutral
• Related to Postel’s Law: Be as diverse as possible while still being useful.

Page 48

Check out what we’ve done so far: https://github.com/ground-metadata/ground

Reach out if you’re interested: @viksree

Most slides were taken from Joe Hellerstein’s Strata NYC 2015 Talk: “Time to go Meta (on use)”.