Factual presentation for PG West 2010
What is Factual.com?
Factual is a platform for sharing, mashing, and publishing open data.
Crowd-Sourced Data
… is terrific!
• Verifiable
• Vote-driven
• Customizable
Demo
Data Storage
Goal:
• 10M tables
• 1B rows (summarized)
• 10B inputs (or "votes")
Raw storage:
• 1TB per input server
• 100MB+ per dataset
What does all this "scale" mean?
Map-Reduce is the right architecture for us:
• High-volume storage
• Scales (with the right design)
• Shards and partitions in-place
• Minimal downtime
• Throwaway intermediary stages
What does all this "scale" mean?
• Hard to profile
• Hard to predict which table will get "hot"
• Performance tuning has to be general, unless we're on a Service Level Agreement and can devote DBA resources (not our core strength)
• Map-Reduce is not real-time
Data Storage
Challenges
• Summarization operations are memory-intensive
• N-way merging is expensive (i.e., slow)
• Streaming is necessary to serve back full summaries
• Common use case is just the first N rows
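The last three bullets fit together: if each source of rows is already sorted, an N-way merge can be streamed, and a "first N rows" request never has to read the rest. A minimal Python sketch of that idea follows. The `(sort key, payload)` row shape and the shard names are assumptions for illustration, not Factual's actual data model.

```python
import heapq
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, str]  # (sort key, payload): a hypothetical row shape


def merged_rows(shards: Iterable[Iterable[Row]]) -> Iterator[Row]:
    """Lazily merge already-sorted shards into one sorted stream."""
    # heapq.merge holds one pending row per shard, not whole shards.
    return heapq.merge(*shards, key=lambda row: row[0])


def first_n(shards: Iterable[Iterable[Row]], n: int) -> List[Row]:
    """Return the first n merged rows and stop reading after that."""
    return list(islice(merged_rows(shards), n))


# Three hypothetical shards, each sorted by key.
shard_a = [(1, "a1"), (4, "a4"), (9, "a9")]
shard_b = [(2, "b2"), (3, "b3"), (8, "b8")]
shard_c = [(5, "c5"), (6, "c6")]

top_four = first_n([shard_a, shard_b, shard_c], 4)
```

Serving a full summary would iterate `merged_rows` to the end instead of slicing it.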
Emerging Patterns
• Many reads
• (Relatively) few new rows
• (Very) few row updates
• Infrequent (< 1 per day) table-wide re-summarizations
High Availability
Votestore
• 3x Redundancy
High Availability
Problem: Summarization is slow.
Solution: Build a caching layer.
Cache
• 3x Replication
• "Dumb" load balancing
• Server affinity (via ZooKeeper)
Metaphor Shear
Why PostgreSQL?
Pros
• End-user expectations map to RDBMS world
• Indexing on common operations
  o (ORDER BY, WHERE)
• Full-text search
• Latitude/longitude/geo functions with PostGIS
• Aggregation on summarized results
• Built-in persistence
Metaphor Shear
Why PostgreSQL?
Cons
• No built-in "versioning"
• Re-summarization, though infrequent, is expensive
• Need to map Lisp-based query language to SQL
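To give a feel for the last con, here is a minimal Python sketch of translating a prefix-notation filter into a parameterized SQL WHERE clause. The slides do not show Factual's query language, so the tuple syntax, the operator set, and the column names below are all hypothetical. The `%s` placeholders assume a driver that uses that parameter style (psycopg2 does).

```python
import re
from typing import Any, List, Tuple

COMPARISONS = {"=", "<", ">", "<=", ">=", "!="}
CONNECTIVES = {"and": " AND ", "or": " OR "}
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def to_where(expr: tuple) -> Tuple[str, List[Any]]:
    """Translate a prefix-notation filter into (sql_fragment, parameters)."""
    op, *args = expr
    if op in CONNECTIVES:
        parts: List[str] = []
        params: List[Any] = []
        for sub in args:
            sql, sub_params = to_where(sub)  # recurse into nested filters
            parts.append(sql)
            params.extend(sub_params)
        return "(" + CONNECTIVES[op].join(parts) + ")", params
    if op in COMPARISONS:
        column, value = args
        if not IDENTIFIER.match(column):
            raise ValueError(f"bad column name: {column!r}")
        # Values travel as parameters, never as inlined SQL text.
        return f"{column} {op} %s", [value]
    raise ValueError(f"unsupported operator: {op!r}")


# Hypothetical filter: (and (= city "Los Angeles") (>= rating 4))
query = ("and", ("=", "city", "Los Angeles"), (">=", "rating", 4))
where_sql, where_params = to_where(query)
```

The recursion mirrors the nesting of the expression, which is why prefix-notation languages map onto SQL boolean expressions fairly directly. Ordering, aggregation, and versioning are where the mapping gets harder.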
High Availability
Why PostgreSQL?
Other considerations
• Must pro-actively store attributes
• Schema changes are expensive
• Handling "upsert" operations is awkward
• Deletes are difficult (but infrequent)
• (related) No concept of row merge
Demo
Cache Consistency
ACID? Not really...
High concurrency favored over database-style transactions
Eventually Consistent
Consistency Challenges
Cache Invalidation
• How do I handle new inputs?
  o Shield the Input Store
     - Low-priority - shield the input store
     - Row-level invalidations
  o Lazy
     - Fetch updated rows on summary request
     - Leverage Postgres to track invalidations
  o Decouple
     - From Input API call
     - Async notification
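A minimal Python sketch of the "row-level invalidation plus lazy fetch" idea above. The slides say invalidations are tracked in Postgres. This sketch swaps in an in-memory set so it stays self-contained, and `SummaryCache`, `fake_summarize`, and the row shape are hypothetical names for illustration.

```python
from typing import Callable, Dict, List, Set


class SummaryCache:
    """Cache of summarized rows with row-level, lazily applied invalidations."""

    def __init__(self, summarize_row: Callable[[str], dict]) -> None:
        self._summarize_row = summarize_row  # expensive call into the input store
        self._rows: Dict[str, dict] = {}
        self._stale: Set[str] = set()        # stand-in for a Postgres-tracked list

    def invalidate(self, row_id: str) -> None:
        """Record that a row has new inputs without touching the input store."""
        self._stale.add(row_id)

    def get(self, row_id: str) -> dict:
        """Serve a row, re-summarizing only if it is missing or marked stale."""
        if row_id in self._stale or row_id not in self._rows:
            self._rows[row_id] = self._summarize_row(row_id)
            self._stale.discard(row_id)
        return self._rows[row_id]


calls: List[str] = []


def fake_summarize(row_id: str) -> dict:
    """Hypothetical summarizer that counts how often it is asked to work."""
    calls.append(row_id)
    return {"id": row_id, "version": len(calls)}


cache = SummaryCache(fake_summarize)
cache.get("r1")              # first request summarizes the row
cache.get("r1")              # served from cache
cache.invalidate("r1")       # cheap: only marks the row
cache.invalidate("r1")       # a second new input costs nothing extra
refreshed = cache.get("r1")  # one re-summarization covers both invalidations
```

The input store is only hit when a reader actually asks for a stale row, which is the "shield" effect. In the decoupled version, `invalidate` would be driven by an asynchronous notification rather than by the input API call itself.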
Consistency Challenges
Cache Instance Management
• How do we handle query changes?
  o Filtering out spam inputs
  o Changing the aggregation function
  o Giving more weight to the table owner's votes
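To see why such changes force a new cached copy, here is a small Python sketch in which the same votes produce different summaries under different settings. The `Vote` shape, the weighted-plurality rule, and the weights are assumptions for illustration. The slides do not describe Factual's actual aggregation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Vote:
    user: str
    value: str
    is_spam: bool = False


def summarize(votes: Iterable[Vote], owner: str, owner_weight: float = 1.0) -> str:
    """Pick the winning value for one cell by weighted plurality."""
    totals: Dict[str, float] = defaultdict(float)
    for vote in votes:
        if vote.is_spam:
            continue  # spam inputs never reach the tally
        totals[vote.value] += owner_weight if vote.user == owner else 1.0
    if not totals:
        raise ValueError("no usable votes")
    return max(totals, key=totals.get)


votes = [
    Vote("alice", "90210"),
    Vote("bob", "90211"),
    Vote("carol", "90211"),
    Vote("spambot", "00000", is_spam=True),
]
plain = summarize(votes, owner="alice")                        # every vote counts once
weighted = summarize(votes, owner="alice", owner_weight=3.0)   # owner outvotes two users
```

Here `plain` and `weighted` differ even though no new input arrived, so a cache keyed only on the table would serve the wrong answer after the query changes.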
Consistency Challenges
Cache Instance Management
• Simple Re-cache
  o Dump the current cached copy, and re-cache.
  o Slow
  o Poor user experience
Consistency Challenges
Cache Instance Management
• Better solution: Double Buffering
  o Reload new version in background
  o Continue to serve current table
     - "closest match" warning
  o Allow switch-back
     - Continue to accept invalidations against old table
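One way to read the double-buffering bullets, sketched in Python. The class and method names are hypothetical, the "background" reload is reduced to a staged dictionary, and the "closest match" flag reflects one interpretation of the slide: readers are told they are seeing the current table rather than the version they asked for.

```python
from typing import Dict, Optional, Set, Tuple


class Buffer:
    """One cached version of a table plus its pending invalidations."""

    def __init__(self, rows: Dict[str, dict]) -> None:
        self.rows = rows
        self.stale: Set[str] = set()


class DoubleBufferedTable:
    def __init__(self, rows: Dict[str, dict]) -> None:
        self.active = Buffer(rows)
        self.pending: Optional[Buffer] = None   # new version being loaded
        self.previous: Optional[Buffer] = None  # old version kept for switch-back

    def start_reload(self, rows: Dict[str, dict]) -> None:
        """Stage a new version; a real system would fill it in the background."""
        self.pending = Buffer(rows)

    def read(self, row_id: str) -> Tuple[dict, bool]:
        """Return (row, closest_match); the flag is True while a reload is staged."""
        return self.active.rows[row_id], self.pending is not None

    def invalidate(self, row_id: str) -> None:
        """Record the invalidation against every version, including the old one."""
        for buf in (self.active, self.pending, self.previous):
            if buf is not None:
                buf.stale.add(row_id)

    def swap(self) -> None:
        if self.pending is None:
            raise RuntimeError("no staged version to swap in")
        self.previous, self.active, self.pending = self.active, self.pending, None

    def switch_back(self) -> None:
        if self.previous is None:
            raise RuntimeError("no previous version to return to")
        self.active, self.previous = self.previous, self.active


table = DoubleBufferedTable({"r1": {"zip": "90211"}})
table.start_reload({"r1": {"zip": "90210"}})
row_during, warn_during = table.read("r1")  # old data, flagged as closest match
table.swap()
row_after, warn_after = table.read("r1")    # new data, no warning
table.invalidate("r1")                      # lands on both new and old versions
table.switch_back()
row_back, _ = table.read("r1")
old_table_stale = "r1" in table.active.stale
```

Because invalidations keep landing on the retained old version, switching back does not serve rows that silently missed updates.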
Performance
Encoding-compliant tablespaces
• Support UTF-8, non-Latin sort orders
Select tables get SSD-based PostgreSQL caching
• See Jignesh Shah's terrific slides from PgEast 2009
• http://blogs.sun.com/jkshah/entry/effects_of_flash_ssd_on
• 20x improvement in random reads (IO pattern for unclustered index reads)
• 2x improvement on sequential writes (generally pretty smooth)
What's next?
Encoding-compliant tablespaces
• Support UTF-8, non-Latin sort orders
Select tables get SSD-based PostgreSQL caching
• See Jignesh Shah's terrific slides from PgEast 2009
• http://blogs.sun.com/jkshah/entry/effects_of_flash_ssd_on
• 20x improvement in random reads (IO pattern for unclustered index reads)
• 2x improvement on sequential writes (generally pretty smooth)
How can I use Factual?
Web UI
• Dataset Creation
• Workbench
  http://www.factual.com/
APIs
• Server API
  http://wiki.developer.factual.com/FrontPage
• Visualizations
  http://wiki.developer.factual.com/Factual-Visualization-Documentation
Questions