Managing Spreadsheets
Michael Cafarella
Zhe Shirley Chen, Jun Chen, Junfeng Zhang, Dan Prevo
University of Michigan
New England Database Summit
February 1, 2013
Spreadsheets: The Good Parts
A “Swiss Army Knife” for data: storing, sharing, transforming
Sophisticated users who are not DBAs
Contain lots of data, found nowhere else
Everyone uses them; almost wholly ignored by DB community
Thanks, Jeremy!
Spreadsheets: The Awful Parts
Users toss in data, worry about schemas later (well, never)
Spreadsheets designed for humans, not query processors
No explicit schemas: poor data integrity (Zeeberg et al, 2004)
Integration very hard
• Tumor suppressor gene Deleted in Esophageal Cancer 1
• aka DEC1
• aka (according to Excel) 01-DEC
A Data Tragedy
Spreadsheets build, then entomb, our best, most expensive data
  >400,000 just from ClueWeb09
  From gov’ts, WTO, many other sources
  How many inside firewall?
Application vision: ad hoc integration & analysis for any dataset
Challenge: recover relations from any spreadsheet, w/ little human effort
Closeup
Desired tuple:
One hierarchy error yields many bad tuples
Too many datasets to process manually
Agenda
Spreadsheets: An Overview
Extracting Data
  Hierarchy Extraction
  Manual Repairs
Experimental Results
Demo
Related and Future Work
Extracting Tuples
1. Extract frame, attribute hierarchy trees
2. Map values to attributes; create tuples
3. Apply manual repairs, repeat
How many repairs for 100% accuracy?
Yields tuples, not relations
We won’t discuss: relation assembly
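Step 2 can be sketched as follows: a minimal illustration, assuming each attribute hierarchy is stored as a child-to-parent map (a representation chosen here, not taken from the talk). The labels and the value are made up for the example. It also shows why one wrong parent link corrupts every tuple beneath it.

```python
def path(node, parent):
    """Walk from a header cell up to its root; return labels root-first."""
    out = []
    while node is not None:
        out.append(node)
        node = parent.get(node)
    return out[::-1]

def make_tuple(value, left_leaf, top_leaf, left_parent, top_parent):
    """Annotate one data cell with its full LEFT and TOP attribute paths."""
    return tuple(path(left_leaf, left_parent) + path(top_leaf, top_parent) + [value])

# Hypothetical hierarchies: "Pickup" sits under "Trucks", "2009" under "Sales".
left_parent = {"Pickup": "Trucks"}
top_parent = {"2009": "Sales"}
row = make_tuple(120, "Pickup", "2009", left_parent, top_parent)
```

Every data cell in the "Pickup" row reuses the same LEFT path, so a single bad entry in `left_parent` would mislabel the whole row.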
1. Frame Detection
Key assumption: inputs are data frames
  Locate metadata in top/left regions
  Locate data in center block
  ~72% of spreadsheets fit; others not relational
Each non-empty row labeled one of TITLE, HEADER, DATA, FOOTNOTE
Reconstruct regions from labels
Infer labels with linear-chain Conditional Random Field (Lafferty et al, 2001)
  Layout features: has bold cell? Merged cell?
  Text features: contains ‘table’, ‘total’? Indented text? Numeric cells? Year cells?
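A rough sketch of the two non-learning pieces of this step: computing per-row features like those listed, and rebuilding regions from a row-label sequence. The CRF itself is not implemented here. The function names, the input format (a list of cell strings plus two layout flags), and the year pattern are assumptions made for illustration.

```python
import re
from itertools import groupby

YEAR = re.compile(r"^(19|20)\d{2}$")  # assumed notion of a "year cell"

def _is_number(text):
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False

def row_features(cells, bold=False, merged=False):
    """Features for one non-empty row; bold/merged come from the sheet's layout."""
    texts = [c for c in cells if c.strip()]
    joined = " ".join(texts).lower()
    return {
        "has_bold": bold,
        "has_merged": merged,
        "contains_table": "table" in joined,
        "contains_total": "total" in joined,
        "indented": bool(texts) and texts[0] != texts[0].lstrip(),
        "numeric_cells": sum(_is_number(c) for c in texts),
        "year_cells": sum(bool(YEAR.match(c.strip())) for c in texts),
    }

def regions_from_labels(labels):
    """Group consecutive identical row labels into (label, first_row, last_row)."""
    regions, start = [], 0
    for label, run in groupby(labels):
        n = len(list(run))
        regions.append((label, start, start + n - 1))
        start += n
    return regions
```

In a full system the feature dictionaries would feed a trained linear-chain CRF, and `regions_from_labels` would run on its predicted label sequence.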
2. Hierarchy Extraction
1. One task for TOP, one for LEFT
2. Create boolean random var for each candidate parent relationship
3. Build conditional random field to obtain best variable assignment
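To make step 2 concrete, here is a minimal sketch that creates one boolean variable per candidate parent relationship. How candidates are actually chosen is not stated on the slide; this sketch assumes every ordered pair of distinct header cells in a region is a candidate.

```python
from itertools import permutations

def candidate_parent_vars(cells):
    """One boolean variable per ordered (parent, child) pair, initially False."""
    return {(parent, child): False for parent, child in permutations(cells, 2)}

# Three header cells give 3 * 2 = 6 candidate relationships.
variables = candidate_parent_vars(["Trucks", "Pickup", "Van"])
```

Under this assumption the variable count grows quadratically with the number of header cells in a region.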
2. Hierarchy Extraction
CRFs use potential functions to incorporate features
Node potentials represent single parent/child match
  Share style? Near each other? WS-separated?
Edge potentials tie pairs of parent/child decisions
  Share style pairs? Share text? Indented similarly?
Spreadsheet potentials ensure a legal tree
  One-parent potential: -∞ weight for multiple parents
  Directional potential: -∞ weight when parent edges go in opposite directions
Run Loopy Belief Propagation for node + edge; post-inference test and repair for spreadsheet
Real sheets yielded 1K-8K variables; inference <0.13 sec
Approach adapted from (Pimplikar, Sarawagi, 2012)
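The slide does not say how the post-inference test and repair works, so the following is only one plausible sketch. It assumes inference yields a belief score per (parent_row, child_row) pair, then enforces the two spreadsheet constraints greedily: keep edges in the dominant direction, and give each child its single best-scoring parent. The threshold and the greedy strategy are assumptions.

```python
def repair_to_legal_tree(scores, threshold=0.5):
    """scores: {(parent_row, child_row): belief that the parent link holds}."""
    edges = {e: s for e, s in scores.items() if s > threshold}

    # Directional constraint: keep only edges pointing in the dominant direction.
    down = sum(s for (p, c), s in edges.items() if c > p)
    up = sum(s for (p, c), s in edges.items() if c < p)
    keep_down = down >= up
    edges = {(p, c): s for (p, c), s in edges.items() if (c > p) == keep_down}

    # One-parent constraint: each child keeps its highest-scoring parent.
    best = {}
    for (p, c), s in edges.items():
        if c not in best or s > best[c][1]:
            best[c] = (p, s)
    return {c: p for c, (p, _) in best.items()}

scores = {(0, 1): 0.9, (0, 2): 0.6, (1, 2): 0.8, (3, 2): 0.7, (2, 0): 0.2}
tree = repair_to_legal_tree(scores)
```

Because all surviving edges point the same way along the row order, the result cannot contain a cycle.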
3. Manual Repair
User reviews, repairs extraction
Goal: reduce user burden
  Extractor makes repeated mistakes, either within spreadsheet or within corpus
  Headache for user to repeat fixes
Our sol’n: after each repair, add repair potentials to CRF
  Links user-repaired nodes to a set of nodes throughout CRF
  Incorporates info on node similarity
  Edges are generated heuristically
After each repair, re-run inference
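As an illustration of how repair links might be generated, the sketch below picks the nodes that a repair potential would tie to a user-repaired node. The actual heuristic is not given in the slides; the similarity test here (same style and same indent level) and the node fields are hypothetical.

```python
def similar(a, b):
    """Hypothetical heuristic: cells match if style and indent level agree."""
    return a["style"] == b["style"] and a["indent"] == b["indent"]

def repair_links(repaired_id, nodes):
    """Ids of nodes to link to the repaired node via a repair potential."""
    target = nodes[repaired_id]
    return [i for i, n in nodes.items() if i != repaired_id and similar(target, n)]

nodes = {
    0: {"style": "bold", "indent": 0},
    1: {"style": "plain", "indent": 1},
    2: {"style": "bold", "indent": 0},
    3: {"style": "plain", "indent": 1},
}
linked = repair_links(1, nodes)
```

After adding potentials for the linked nodes, inference would be re-run so that one fix can correct the same mistake elsewhere.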
Experiments
General survey of spreadsheet use
Evaluate:
  Standalone extraction accuracy
  Manual repair effectiveness
Test sets:
  SAUS: 1,322 files from 2010 Statistical Abstract of the United States
  WEB: 410,554 files from 51,252 domains, crawled from ClueWeb09
Spreadsheets in the Wild
Very common for Web-published gov’t data
Domain               # files  % total
bts.gov               12,435    3.03%
census.gov             7,862    1.91%
stat.co.jp             6,633    1.62%
bankofengland.co.uk    5,520    1.34%
ers.usda.gov           4,328    1.05%
agr.gc.ca              4,186    1.02%
wto.org                3,863    0.94%
doh.wa.gov             3,579    0.87%
nsf.gov                2,770    0.67%
nces.ed.gov            2,177    0.53%
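The "% total" column appears to be each domain's share of the 410,554 files in the WEB test set. A quick check under that assumption (the helper name is made up):

```python
WEB_TOTAL = 410_554  # files in the WEB test set, per the Experiments slide

files = {
    "bts.gov": 12_435,
    "census.gov": 7_862,
    "wto.org": 3_863,
    "nces.ed.gov": 2_177,
}

def percent_of_total(n, total=WEB_TOTAL):
    """Share of the corpus as a percentage, rounded to two decimals."""
    return round(100 * n / total, 2)

shares = {domain: percent_of_total(n) for domain, n in files.items()}
```

The computed shares match the table for these rows, which supports reading the column as a fraction of the whole WEB corpus.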
Standalone Extraction
100 random H-Sheets from SAUS, WEB
Three metrics
  Pairs: parent/child pairs labeled correctly (F1)
  Tuples: relational tuples labeled correctly (F1)
  Sheets: % of sheets labeled 100% correctly
Two methods
  Baseline uses just formatting, position
  Hierarchy uses our approach
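For reference, the Pairs metric can be computed as a standard F1 over sets of (parent, child) pairs. This is the usual definition of F1, not code from the talk, and the example pairs are invented.

```python
def f1(predicted, gold):
    """Harmonic mean of precision and recall over two collections of pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical sheet: the extractor attaches "a3" to the wrong parent.
gold = {("A", "a1"), ("A", "a2"), ("A", "a3")}
predicted = {("A", "a1"), ("A", "a2"), ("B", "a3")}
score = f1(predicted, gold)
```

The same function applies to the Tuples metric by passing sets of extracted tuples instead of pairs.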
Manual Repair: Effectiveness
Gather 10 topic areas from SAUS, WEB
Expert provides ground-truth hierarchies
Extract; repeatedly repair and recompute
Manual Repair: Ordering
Good ordering: errors steadily decrease
Bad: extended periods of slow decrease
End-To-End Extraction
What is overall utility of our extractor?
Final metric: Correct tuples per manual repair

Dataset    # Tuples  # Errors  # Repairs  Tuples/Repair
SAUS R50     530.76      5.46       2.06         257.65
SAUS Arts    454.8      25.4       13.1           34.72
SAUS Fin.    266.1      29.9       13.5           19.71
WEB R50      520.28     11.38       3.84         135.49
WEB BTS       65.6       2.7        1             65.6
WEB USDA     350.3       6.8        1.7          206.06
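The Tuples/Repair column is consistent with dividing # Tuples by # Repairs for each row. A short check using the table's own numbers (the function name is made up):

```python
# (# Tuples, # Repairs) per dataset, copied from the table above.
rows = {
    "SAUS R50": (530.76, 2.06),
    "SAUS Arts": (454.8, 13.1),
    "SAUS Fin.": (266.1, 13.5),
    "WEB R50": (520.28, 3.84),
    "WEB BTS": (65.6, 1),
    "WEB USDA": (350.3, 1.7),
}

def tuples_per_repair(tuples, repairs):
    """Tuples obtained per manual repair, rounded as in the table."""
    return round(tuples / repairs, 2)

ratios = {name: tuples_per_repair(t, r) for name, (t, r) in rows.items()}
```

Each computed ratio agrees with the reported value to two decimals.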
Demo Details
Ran SAUS corpus through extractor
Simple ad hoc integration analysis tool on top of extracted data
  Early version of relation reconstruction
  Early version of data ranking, join finding
Related Work
Spreadsheet as interface
  (Witkowski et al, 2003), (Liu et al, 2009)
Spreadsheet extraction
  User-provided rules
    (Ahmad et al, 2003), (Hung et al, 2011)
  No explicit user rules
    (Abraham and Erwig, 2007), (Cunha et al, 2009)
Ad hoc integration for found data
  (Cafarella et al, 2009), (Pimplikar and Sarawagi, 2012), (Yakout et al, 2012)
Semi-automatic data programming
  Wrangler (Guo et al, 2011)
Conclusions and Future Work
Spreadsheet extraction opens new datasets
Manual repair ensures accuracy, low user burden
Ongoing and Future Work
  Relation assembly
  Data relevance ranking
  Join finding