Post on 19-Dec-2015
Web Data Management
Raghu Ramakrishnan
- 2 -Research
QUIQ Lessons
• Structured data management powers scalable collaboration environments
• ASP
• Multi-tenancy
• Massively distributed
• Fine-grained permissions, hierarchical ACLs
• RDBMSs were a lousy fit
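One way to picture "hierarchical ACLs" in a multi-tenant setting is a tree of resources in which a permission granted on a node also applies to everything beneath it. The sketch below is only an illustration under that assumption; the slides do not describe QUIQ's actual design, and the names `Node`, `grant`, and `is_allowed` are hypothetical.

```python
from typing import Dict, Optional, Set


class Node:
    """A resource in a hierarchy, e.g. tenant -> forum -> thread (hypothetical)."""

    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent
        # Permissions granted directly on this node: user -> set of permissions.
        self.acl: Dict[str, Set[str]] = {}

    def grant(self, user: str, permission: str) -> None:
        self.acl.setdefault(user, set()).add(permission)


def is_allowed(node: Node, user: str, permission: str) -> bool:
    """Walk from the node up to the root; any ancestor's grant is inherited."""
    current: Optional[Node] = node
    while current is not None:
        if permission in current.acl.get(user, set()):
            return True
        current = current.parent
    return False


# Assumed example hierarchy for one tenant.
tenant = Node("tenant-a")
forum = Node("forum", parent=tenant)
thread = Node("thread", parent=forum)

tenant.grant("alice", "read")   # coarse grant, inherited by forum and thread
thread.grant("bob", "write")    # fine-grained grant on a single thread
```

Each check walks one path to the root, so its cost grows with the depth of the hierarchy rather than with the number of resources.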
Cloud Computing: Computing as a Service
[Figure: cloud computing taxonomy]
• CPU intensive
– Packaged software
– High-throughput (e.g., Condor)
• Data intensive
– Analytic (e.g., SSDS, Hadoop)
– “Transactional” storage & serving (e.g., PNUTS, S3, SSDS, UDB)
Implications
• Data management as a service
– Scientists and others who’ve resisted (installing, maintaining, and) using DBMSs will find it much easier to reap the benefits
– “Data centers” and “Computing Centers” will come into vogue again
• Hosted back-ends and RAD tools will make Web application development accessible to all
– The Web is becoming open (e.g., OpenSocial, OpenID)
• Ideas will be the most valuable currency, not the wherewithal to build complex systems
• Paradigm shifts possible for how we do research in many fields
– Build applications that embed your algorithms and test them directly in the field: computer scientists can interact directly with users (ironically, this would still be a breakthrough of sorts after four decades!)
– Many other disciplines (e.g., sociology, microeconomics) can design and conduct online experiments involving unprecedented numbers of participants
PNUTS: DB in the Cloud
[Figure: the Parts table, drawn as three identical replicas]
ID  StockNumber  Status
E   75656        C
A   42342        E
B   42521        W
C   66354        W
D   12352        E
F   15677        E
CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
  …
)
• Parallel database
• Geographic replication
• Indexes and views
• Structured, flexible schema
• Hosted, managed infrastructure
Basic Consistency Model
Goal:
• Make it easier for applications to reason about updates and cope with asynchrony: an alternative to “transactions” in an asynchronous world
• What happens to a record with primary key “Brian”?
Guarantees:
• Every reader will always see some consistent, but possibly stale version
• Readers can request a more up-to-date version, but may pay extra latency
– Special case: Critical read (writer/readers see their own writes)
• Writers can verify that the record is still at the version they expect
[Figure: timeline of the record over time, split into three generations]
Generation 1: Record inserted, Update, Update, Delete (v. 1, v. 2, v. 3)
Generation 2: Record inserted, Update, Update, Update, Delete (v. 1, v. 2, v. 3, v. 4)
Generation 3: Record inserted, Delete (v. 1)
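The guarantees on this slide can be sketched as a toy, single-process model of one record. This is an illustration only, not the PNUTS implementation: the class and method names (`Record`, `read_any`, `read_critical`, `read_latest`, `test_and_set_write`, `propagate`) are assumptions made for the example, as is the choice to route every write through a single master copy and to model asynchronous replication as an explicit `propagate` call.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Copy:
    """One copy of the record: a version number plus the value at that version."""
    version: int = 0
    value: Optional[dict] = None


class VersionMismatch(Exception):
    """Raised when a writer's expected version is no longer current."""


class Record:
    """Toy model (assumption): one master copy and several lagging replicas."""

    def __init__(self, n_replicas: int = 2) -> None:
        self.master = Copy()
        self.replicas: List[Copy] = [Copy() for _ in range(n_replicas)]

    def write(self, value: dict) -> int:
        # Every write goes through the master, so versions form a single timeline.
        self.master = Copy(self.master.version + 1, dict(value))
        return self.master.version

    def test_and_set_write(self, expected_version: int, value: dict) -> int:
        # The writer verifies the record is still at the version it expects.
        if self.master.version != expected_version:
            raise VersionMismatch(
                f"expected v.{expected_version}, found v.{self.master.version}"
            )
        return self.write(value)

    def propagate(self, i: int) -> None:
        # Stand-in for asynchronous replication: replica i catches up to the master.
        m = self.master
        self.replicas[i] = Copy(m.version, None if m.value is None else dict(m.value))

    def read_any(self, i: int) -> Copy:
        # Cheap read: a consistent version, but possibly a stale one.
        return self.replicas[i]

    def read_critical(self, i: int, required_version: int) -> Copy:
        # Returns a copy at least as new as required_version; falls back to the
        # master (extra latency in a real system) if the replica is behind.
        r = self.replicas[i]
        return r if r.version >= required_version else self.master

    def read_latest(self) -> Copy:
        # Most up-to-date version, at the cost of always going to the master.
        return self.master


rec = Record()
v1 = rec.write({"name": "Brian", "status": "busy"})
stale = rec.read_any(0)            # replica 0 has not caught up yet
mine = rec.read_critical(0, v1)    # the writer sees its own write
```

A critical read passes the version returned by the caller's own write, which is how "writers/readers see their own writes" without forcing every read to the newest copy.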
Lots of Issues to Re-think
• Massive distribution & replication
– Asynchrony
– Availability
– Consistency
• DBA to the world
– Auto-tuning
– Multi-tenancy
– Access control (granularity, online IDs)
– Encryption
• App-support
– Caching
Querying the Web
• Search will become more semantic, with best-effort match-making between:
– Query intent (NLP, query logs …)
– Interpreted web content
• Deep web has a lot of structured data
– How we get a handle on it is an interesting problem
– But this is only part of the problem … lots of data not here
• Semantic web isn’t working
• Site-wrapping doesn’t scale
• Solutions?
– Domain-wrapping
– Mass collaboration
– ??