“Creating Data Repositories..”
Sanjay Rao
ECE Dept, Purdue University
Group Members
• Dave Maltz
• Rebecca Issacs
• Ratul Mahajan
• Yin Zhang
• Aditya Akella
• David Kotz
• Charles DiFatta
• …..
Motivation
• Network Management Research:
  – Barrier to entry is high
  – Data/insights from operators/industry critical
• Examples:
  – Failure characterization of enterprise network
  – VLAN characterization and use
  – Configuration Management
What happens today..?
• End-user centric measurement studies
  – Network “black-box”: no operator involvement
  – Real need: “white-box”
• Campus Networks
  – Difficulties in bootstrapping relationships with operators
• Enterprise/Operator Network
  – Sprint or AT&T (Microsoft with end-user)
  – Limited pool of researchers
• Data across multiple enterprises?
• Trends over many years?
Bottomline
• Need a data repository
  – Contributors from operators, researchers, industry
  – Accessible to all researchers
• Facilitate research much like PlanetLab
• Vital to have “critical mass” of researchers on Network Management
  – Research along high-impact real problems
Data Sharing: what inhibits it?
• Sensitivity of data
  – Security issues (firewall policies, network structure)
  – Privacy issues (records of individual activity)
• Proprietary nature of data
  – E.g. how many calls were received, mobility models
  – Possible to have others use it?
• “Secret weapon” for research
  – Competition vs. collaboration
• Inertia / too much effort
Solutions
• Carrots/sticks to promote data sharing
  – “Must release data” to publish
  – IMC: best paper award only to work releasing data
• Technical ways to address concerns with sharing
Positive Example
Example: HSARPA “PREDICT”: make research on network security possible.
Firewalls and IDS network security data
Research: Anonymization
• Hiding provider, hiding individual information
• Need framework to reason about it
  – What trade-offs do you make?
  – What risks are posed?
  – How to expose trade-offs in a way we can appreciate?
• Anonymization very domain specific
  – E.g. configuration file vs. packet trace
  – Are there common themes?
• Other Models:
  – NDA-based
  – “Give me a question” -> “return answer”
  – “Exploratory” nature of research
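One theme that might be common to configuration files and packet traces is consistent pseudonymization of IP addresses. The following is only a minimal sketch under stated assumptions, not a method proposed in the talk: the function names, the regular expression, and the key are hypothetical, and a keyed hash (HMAC-SHA256) stands in for whatever scheme a real repository would choose. It also makes one trade-off explicit: the mapping is stable, so the same host can be followed across records, but subnet structure is lost and truncating to 32 bits allows collisions.

```python
import hashlib
import hmac
import ipaddress
import re

# Candidate dotted-quad tokens; validity is checked separately below.
IPV4_TOKEN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def pseudonymize_ip(address: str, key: bytes) -> str:
    """Map one IPv4 address to a stable pseudonym under a secret key."""
    packed = ipaddress.IPv4Address(address).packed
    digest = hmac.new(key, packed, hashlib.sha256).digest()
    # Truncating to 32 bits keeps the output shaped like an address,
    # at the cost of possible collisions and lost subnet structure.
    return str(ipaddress.IPv4Address(digest[:4]))


def anonymize_text(text: str, key: bytes) -> str:
    """Replace every valid IPv4 address in a config line or trace record."""

    def replace(match):
        token = match.group(0)
        try:
            return pseudonymize_ip(token, key)
        except ipaddress.AddressValueError:
            return token  # not a real address (e.g. 999.1.1.1); leave as-is

    return IPV4_TOKEN.sub(replace, text)


if __name__ == "__main__":
    key = b"hypothetical-secret-key"  # assumption: held only by the data owner
    line = "access-list 101 permit ip host 10.1.2.3 host 10.1.2.4"
    print(anonymize_text(line, key))
```

Because the same function applies to any text record, it hints at a shared building block across domains, while what else must be hidden (hostnames, community strings, payloads) remains domain specific.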
Community effort: Cooperate on IRB
• Social Sciences:
  – Lots of experience with IRB
• Networking:
  – Lack of clear guidelines on IRB process
  – Admins feel happier if IRB can “sanction” things
• As community:
  – Must appreciate need/process for IRB
  – Develop guidelines for IRB process
  – Share IRB documents
Creating shareable data
• 75% of time spent figuring out how to use data
• Researcher needs vary
  – Different forms of data
  – Historical vs. streaming
    • Dated? Trending?
  – Assumptions made/gaps in data
  – “Timing info crucial at sub-RTT level”?
• Sharing is hard, many idiosyncrasies
  – Data collection infrastructure, annotation
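One way to cut the time spent figuring out how to use a dataset is to ship a small machine-readable annotation with it. The sketch below is purely illustrative: the `DatasetDescriptor` class, its field names, and the example values are assumptions, not an existing standard or anything defined in the talk. It records the form of the data, whether it is historical or streaming, the timestamp resolution, and the known assumptions and gaps.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DatasetDescriptor:
    """Hypothetical annotation shipped alongside a shared dataset."""
    name: str
    data_form: str                 # e.g. "packet trace", "router configuration"
    delivery: str                  # "historical" or "streaming"
    timestamp_resolution_us: int   # finest timing granularity, in microseconds
    assumptions: List[str] = field(default_factory=list)
    known_gaps: List[str] = field(default_factory=list)

    def supports_timing_analysis(self, required_resolution_us: int) -> bool:
        """True if timestamps are at least as fine as the analysis needs."""
        return self.timestamp_resolution_us <= required_resolution_us

    def to_json(self) -> str:
        """Serialize the annotation so it can travel with the data."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    # All values here are made up for illustration.
    descriptor = DatasetDescriptor(
        name="example-campus-trace",
        data_form="packet trace",
        delivery="historical",
        timestamp_resolution_us=1000,
        assumptions=["clocks synchronized across capture points"],
        known_gaps=["capture restarted once; some records missing"],
    )
    print(descriptor.to_json())
    # A study needing 100-microsecond timing would learn up front
    # that this dataset is too coarse for it.
    print(descriptor.supports_timing_analysis(100))
```

A researcher whose question depends on sub-RTT timing could then rule a dataset in or out from the annotation alone, before requesting access.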
User Diagnostics
• One-on-one: exact data provided
• Create shared repository(ies)
  – What data do most users want?
  – Is that 20% of stuff most critical to provide?
• Data Collection Tools
• Meta-data part of problem
  – Create data in standard formats
  – “Observatory”:
    • How to discover, describe, explain data
    • Access policy, use policy
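The question of which small share of the data serves most users can be explored from a log of data requests. The following is a rough sketch, assuming a hypothetical request log that simply lists the name of the dataset asked for each time; the function name and the example dataset names are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List


def most_requested(requests: Iterable[str], coverage: float = 0.8) -> List[str]:
    """Return the fewest datasets, most popular first, that together
    account for at least `coverage` of all requests."""
    counts = Counter(requests)
    total = sum(counts.values())
    selected: List[str] = []
    covered = 0
    for name, count in counts.most_common():
        if covered / total >= coverage:
            break
        selected.append(name)
        covered += count
    return selected


if __name__ == "__main__":
    # Made-up request log: dataset names are placeholders.
    log = ["flows"] * 6 + ["configs"] * 3 + ["syslog"] * 1
    print(most_requested(log, coverage=0.8))  # ['flows', 'configs']
```

Taking datasets in decreasing order of popularity gives the smallest set that reaches the target share, which is one simple way to decide what to collect and release first.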
Other
• Streaming Data: Online vs. Offline
• Scalable collection:
  – What to collect? Over how long?
  – Compression techniques
  – Fine-grained: overhead, coarse-grained: information loss
• What does it take to build this infrastructure?
  – Get all types of data as painlessly as possible
  – Massage, orchestrate data to fit researcher needs
  – Simple APIs to get data out; fast analysis tools
  – Federated Access
  – Data Management: lifecycle of data
Action Items
• Community-Wide Efforts:
  – Initiate efforts to create data repository
    • How to manage? Who contributes? Who arbitrates?
    • How much storage? Lifecycle: how long to store data?
  – Create IRB guidelines for networking data
• Research:
  – Anonymization
  – Usage diagnostics -> what to collect, release: widely applicable
  – Data Collection Tools, metadata information
• Industry, operators must be as actively involved as possible