Transcript
Page 1: “Creating Data Repositories..”

“Creating Data Repositories..”

Sanjay Rao

ECE Dept, Purdue University

Page 2: “Creating Data Repositories..”

Group Members

• Dave Maltz
• Rebecca Isaacs
• Ratul Mahajan
• Yin Zhang
• Aditya Akella
• David Kotz
• Charles DiFatta
• …..

Page 3: “Creating Data Repositories..”

Motivation

• Network Management Research:
  – Barrier to entry is high
  – Data/insights from operators/industry critical
• Examples:
  – Failure characterization of enterprise network
  – VLAN characterization and use
  – Configuration Management

Page 4: “Creating Data Repositories..”

What happens today..?

• End-user centric measurement studies
  – Network “black-box”: no operator involvement
  – Real need: “white-box”
• Campus Networks
  – Difficulties in bootstrapping relationships with operators
• Enterprise/Operator Network
  – Sprint or AT&T (Microsoft with end-user)
  – Limited pool of researchers
• Data across multiple enterprises??
• Trends over many years??

Page 5: “Creating Data Repositories..”

Bottom line

• Need a data repository
  – Contributors from operators, researchers, industry
  – Accessible to all researchers
• Facilitate research much like PlanetLab
• Vital to have “critical mass” of researchers on Network Management
  – Research along high-impact real problems

Page 6: “Creating Data Repositories..”

Data Sharing: what inhibits it?

• Sensitivity of data
  – Security Issues (firewall policies, network structure)
  – Privacy Issues (records of individual activity)
• Proprietary nature of data
  – E.g. how many calls were received, mobility models
  – Possible to have others use it?
• “Secret weapon” for research
  – Competition vs. collaboration
• Inertia / too much effort

Page 7: “Creating Data Repositories..”

Solutions

• Carrots/sticks to promote data sharing
  – “Must release data” to publish
  – IMC: best paper award only to work releasing data
• Technical ways to address concerns with sharing

Page 8: “Creating Data Repositories..”

Positive Example

Example: HSARPA “PREDICT”: make research on network security possible.
Firewalls and IDS network security data.

Page 9: “Creating Data Repositories..”

Research: Anonymization

• Hiding provider, hiding individual information
• Need framework to reason about it
  – What trade-offs do you make?
  – What risks are posed?
  – How to expose trade-offs in a way we can appreciate?
• Anonymization very domain specific
  – E.g. configuration file vs. packet trace
  – Are there common themes?
• Other Models:
  – NDA-based
  – “Give me a question” -> “return answer”
  – “Exploratory” nature of research
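
To make the anonymization trade-off concrete, here is a minimal sketch of prefix-preserving pseudonymization of IPv4 addresses. It is not a method named on the slides. It is one illustrative point in the design space, loosely in the style of schemes such as Crypto-PAn but built on HMAC-SHA256 for brevity. The function name and the key are assumptions made for this example only.

```python
import hashlib
import hmac
import ipaddress


def anonymize_ipv4(addr: str, key: bytes) -> str:
    """Pseudonymize an IPv4 address while preserving shared prefixes.

    Illustrative sketch only: two inputs that share their first k bits
    map to outputs that also share their first k bits.
    """
    original = int(ipaddress.IPv4Address(addr))
    result = 0
    for i in range(32):
        # The i most significant input bits seen so far (0 when i == 0).
        prefix = original >> (32 - i)
        # Keyed pseudorandom bit that depends only on that prefix and its length.
        digest = hmac.new(key, f"{i}:{prefix}".encode(), hashlib.sha256).digest()
        flip = digest[0] & 1
        bit = (original >> (31 - i)) & 1
        result = (result << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(result))


# Hypothetical key; in practice the data contributor would keep it secret.
KEY = b"example-key-not-for-real-use"
a = anonymize_ipv4("10.1.2.3", KEY)
b = anonymize_ipv4("10.1.2.200", KEY)
```

The trade-off is visible in the output: `a` and `b` still fall in the same /24, so subnet structure survives for analysis, but that same structure is what an attacker with outside knowledge could use to re-identify the network.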

Page 10: “Creating Data Repositories..”

Community effort: Cooperate on IRB

• Social Sciences:
  – Lots of experience with IRB
• Networking:
  – Lack of clear guidelines on IRB process
  – Admins feel happier if IRB can “sanction” things
• As community:
  – Must appreciate need/process for IRB
  – Develop guidelines for IRB process
  – Share IRB documents

Page 11: “Creating Data Repositories..”

Creating shareable data

• 75% of time spent figuring out how to use data
• Researcher needs vary
  – Different forms of data
  – Historical vs. streaming
    • Dated? Trending?
  – Assumptions made/gaps in data
  – “timing info crucial at sub-RTT level”?
• Sharing hard, many idiosyncrasies
  – Data collection infrastructure, annotate

Page 12: “Creating Data Repositories..”

User Diagnostics

• One-on-one: exact data provided
• Create shared repository(ies)
  – What data do most users want?
  – Is that 20% of stuff most critical to provide?
• Data Collection Tools
• Meta-data part of problem
  – Create data in standard formats
  – “Observatory”:
    • How to discover, describe, explain data
    • Access policy, use policy
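
As a rough illustration of the meta-data point, the sketch below shows what a catalog entry for one contributed dataset might look like. The slides do not define a schema, so every field name, the default policy values, and the sample record are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetRecord:
    # All field names are hypothetical, chosen for illustration only.
    name: str
    description: str                  # "describe, explain" the data
    data_format: str                  # e.g. a standard capture or log format
    known_gaps: list = field(default_factory=list)  # assumptions made, gaps in data
    access_policy: str = "request-only"   # who may obtain the data
    use_policy: str = "research-only"     # what they may do with it


def to_catalog_entry(record: DatasetRecord) -> str:
    """Serialize a record to JSON so a repository can index it."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def matches(record: DatasetRecord, keyword: str) -> bool:
    """Very simple discovery: case-insensitive keyword match."""
    keyword = keyword.lower()
    return keyword in record.name.lower() or keyword in record.description.lower()


record = DatasetRecord(
    name="campus-vlan-configs",
    description="Anonymized VLAN configuration snapshots from one campus network",
    data_format="text",
    known_gaps=["no snapshots during maintenance windows"],
)
entry = to_catalog_entry(record)
```

Keeping access policy and use policy as explicit fields next to the description lets a contributor state the conditions once, where every prospective user will see them.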

Page 13: “Creating Data Repositories..”

Other

• Streaming Data: Online vs. Offline
• Scalable collection:
  – What to collect? Over how long?
  – Compression techniques
  – Fine-grained: overhead, coarse-grained: information loss
• What does it take to build this infrastructure?
  – Get all types of data as painlessly as possible
  – Massage, orchestrate data to fit researcher needs
  – Simple APIs to get data out – fast analysis tools
  – Federated Access
  – Data Management: lifecycle of data
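
The fine-grained versus coarse-grained trade-off can be shown with a small sketch. The sample values are invented for illustration, and the averaging function is just one possible way to coarsen data, not something prescribed by the slides.

```python
def downsample(samples, window):
    """Average consecutive fixed-size windows of fine-grained samples."""
    if window <= 0:
        raise ValueError("window must be positive")
    averaged = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


# Hypothetical per-second byte counts: a quiet link with one short burst.
fine = [100] * 59 + [6000] + [100] * 60
# Per-minute averages: 60x fewer values to store.
coarse = downsample(fine, 60)

storage_ratio = len(fine) / len(coarse)
peak_fine = max(fine)
peak_coarse = max(coarse)
```

Here the coarse series needs 60 times fewer values, but the one-second burst of 6000 bytes shrinks to a per-minute average below 200, which is the information loss the slide refers to.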

Page 14: “Creating Data Repositories..”

Action Items

• Community-Wide Efforts:
  – Initiate efforts to create data repository
    • How to manage? Who contributes? Who arbitrates?
    • How much storage? Lifecycle: how long to store data?
  – Create IRB guidelines for networking data
• Research:
  – Anonymization
  – Usage diagnostics -> what to collect, release: widely applicable
  – Data Collection Tools, metadata information
• Industry, operators must be as actively involved as possible

