Factual presentation for PG West 2010
Post on 18-Jul-2015
Factual
Eric Lui
Software Engineer, Data Storage
eric@factual.com
What is Factual.com?
Factual is a platform for sharing, mashing, and publishing open data.
Crowd-Sourced Data
… is terrific!
• Verifiable
• Vote-driven
• Customizable
Demo
Data Storage
Goal:
• 10M tables
• 1B rows (summarized)
• 10B inputs (or "votes")
Raw storage
• 1TB per input server
• 100MB+ per dataset
What does all this "scale" mean?
Map-Reduce is the right architecture for us:
• High volume storage
• Scales (with the right design)
• Shards and partitions in-place
• Minimal downtime
• Throwaway intermediary stages
What does all this "scale" mean?
• Hard to profile
• Hard to predict which table will get "hot"
• Performance tuning has to be general, unless we're on a Service Level Agreement and can devote DBA resources (not our core strength)
• Map-Reduce is not real time
Data Storage
Challenges
• Summarization operations are memory-intensive
• N-way merging is expensive (i.e., slow)
• Streaming is necessary to serve back full summaries
• Common use case is just the first N rows
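The last three challenges fit together: if each input source is already sorted, an N-way merge can be streamed, and a request for only the first N rows never has to materialize the full summary. A minimal sketch, assuming hypothetical in-memory shards of `(row_id, value)` tuples sorted by `row_id`; this is an illustration, not Factual's actual merge code:

```python
import heapq
from itertools import islice

def merge_sorted_shards(shards, key=None):
    """Lazily merge already-sorted iterables into one sorted stream."""
    return heapq.merge(*shards, key=key)

def first_n(rows, n):
    """Consume only the first n rows of a (possibly huge) stream."""
    return list(islice(rows, n))

# Hypothetical shards, each sorted by row_id.
shard_a = [(1, "alpha"), (4, "delta"), (7, "eta")]
shard_b = [(2, "beta"), (5, "epsilon")]
shard_c = [(3, "gamma"), (6, "zeta")]

merged = merge_sorted_shards([shard_a, shard_b, shard_c])
top3 = first_n(merged, 3)
```

Because `heapq.merge` is lazy, rows past the requested prefix are never pulled from the shards.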
Emerging Patterns
• Many Reads
• (Relatively) Few New rows
• (Very) Few row Updates
• Infrequent (< 1 per day) table-wide re-summarizations
High Availability
Votestore
• 3x Redundancy
High Availability
Problem: Summarization is slow. Solution: Build a caching layer.
Cache
• 3x Replication
• "Dumb" load balancing
• Server Affinity (via Zookeeper)
Metaphor Shear
Why PostgreSQL?
Pros
• End-user expectations map to RDBMS world
• Indexing on common operations
  o (ORDER BY, WHERE)
• Full-text search
• Latitude/longitude/geo functions with PostGIS
• Aggregation on summarized results
• Built-in persistence
Metaphor Shear
Why PostgreSQL?
Cons
• No built-in "versioning"
• Re-summarization, though infrequent, is expensive
• Need to map Lisp-based query language to SQL
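To give a feel for the last point, here is a minimal sketch of translating a prefix (Lisp-style) filter into a parameterized SQL `WHERE` clause. The slides do not show the query language's syntax, so the nested-list form, the operator set, and the function name `to_sql` are all assumptions; the `%s` placeholders follow the convention of common Python PostgreSQL drivers.

```python
# Hypothetical operator whitelist; the real query language is not shown in the slides.
ALLOWED_OPS = {"=", "<", ">", "<=", ">=", "!="}

def to_sql(expr):
    """Translate a nested prefix expression into (where_clause, params)."""
    op, *args = expr
    if op in ("and", "or"):
        parts, params = [], []
        for sub in args:
            sql, sub_params = to_sql(sub)
            parts.append(f"({sql})")
            params.extend(sub_params)
        return f" {op.upper()} ".join(parts), params
    if op in ALLOWED_OPS:
        column, value = args
        if not column.isidentifier():
            raise ValueError(f"unsafe column name: {column!r}")
        # Values travel as parameters, never as inline SQL text.
        return f'"{column}" {op} %s', [value]
    raise ValueError(f"unsupported operator: {op!r}")

query = ["and", ["=", "state", "CA"], [">", "population", 100000]]
where_clause, params = to_sql(query)
```

Keeping values out of the SQL string and whitelisting operators are the two choices that matter here; everything else is illustrative.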
High Availability
Why PostgreSQL?
Other considerations
• Must proactively store attributes
• Schema changes are expensive
• Handling "upsert" operations is awkward
• Deletes are difficult (but infrequent)
• (related) No concept of row merge
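The awkwardness of upserts can be sketched as the classic update-then-insert pattern; PostgreSQL releases of this era had no `INSERT ... ON CONFLICT` (it arrived in 9.5). The example below uses Python's built-in `sqlite3` purely as a stand-in so it runs without a server, and the `summary` table and `upsert_row` helper are hypothetical:

```python
import sqlite3

def upsert_row(conn, row_id, value):
    """Try UPDATE first; INSERT only if no existing row was touched."""
    cur = conn.execute(
        "UPDATE summary SET value = ? WHERE row_id = ?", (value, row_id)
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO summary (row_id, value) VALUES (?, ?)", (row_id, value)
        )
    conn.commit()

# sqlite3 in-memory database used only as a runnable stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summary (row_id INTEGER PRIMARY KEY, value TEXT)")

upsert_row(conn, 1, "first")
upsert_row(conn, 1, "second")   # updates the existing row
upsert_row(conn, 2, "other")    # inserts a new row

rows = conn.execute("SELECT row_id, value FROM summary ORDER BY row_id").fetchall()
```

With concurrent writers, two sessions can both see zero updated rows and both attempt the insert, so a real deployment would need locking or a retry on the unique-key violation.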
Demo
Cache Consistency
ACID? Not really...
High concurrency favored over database-style transactions
Eventually Consistent
Consistency Challenges
Cache Invalidation
• How do I handle new inputs?
  o Shield the Input Store
    - Low-priority: shield the input store
    - Row-level invalidations
  o Lazy
    - Fetch updated rows on summary request
    - Leverage PostgreSQL to track invalidations
  o Decouple
    - From Input API call
    - Async notification
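The row-level, lazy approach above can be sketched as follows. This is a minimal in-memory illustration: the slides say invalidations are tracked in PostgreSQL, whereas this sketch uses a Python set, and `LazyRowCache` and `summarize_row` are hypothetical names.

```python
class LazyRowCache:
    """Row-level cache that marks rows stale and re-fetches them only on read."""

    def __init__(self, summarize_row):
        self._summarize_row = summarize_row  # expensive call into the input store
        self._rows = {}
        self._stale = set()

    def invalidate(self, row_id):
        """Record that a row changed; deliberately does no fetching."""
        self._stale.add(row_id)

    def get(self, row_id):
        """Return the cached row, refreshing it first if missing or stale."""
        if row_id in self._stale or row_id not in self._rows:
            self._rows[row_id] = self._summarize_row(row_id)
            self._stale.discard(row_id)
        return self._rows[row_id]

# Hypothetical input store and a call log to show when it is actually hit.
store = {"r1": "v1"}
calls = []

def summarize(row_id):
    calls.append(row_id)
    return store[row_id]

cache = LazyRowCache(summarize)
cache.get("r1")
cache.get("r1")            # served from cache, no second store call
store["r1"] = "v2"
cache.invalidate("r1")     # cheap; the input store is not touched here
value = cache.get("r1")    # the refresh happens on this read
```

The input store is hit once per stale row per read, rather than once per incoming vote.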
Consistency Challenges
Cache Instance Management
• How do we handle query changes?
  o Filter out spam inputs
  o Change the aggregation function
  o Give more weight to table owner's votes
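Each of these query changes alters how raw votes are summarized into a cell value, which is why the cached copy has to be rebuilt. A minimal sketch, assuming votes are `(voter, value)` pairs and the summary is the value with the highest total weight; the function name, the weighting scheme, and the spam predicate are illustrative assumptions:

```python
from collections import defaultdict

def summarize_cell(votes, owner=None, owner_weight=1.0, is_spam=lambda vote: False):
    """Pick the value with the highest total weight among non-spam votes."""
    totals = defaultdict(float)
    for voter, value in votes:
        if is_spam((voter, value)):
            continue
        totals[value] += owner_weight if voter == owner else 1.0
    if not totals:
        return None
    return max(totals, key=totals.get)

# Hypothetical votes on a single cell.
votes = [
    ("alice", "A"),
    ("bob", "B"),
    ("carol", "B"),
    ("spambot", "C"),
    ("spambot", "C"),
    ("spambot", "C"),
]

def from_spambot(vote):
    return vote[0] == "spambot"

plain = summarize_cell(votes)
no_spam = summarize_cell(votes, is_spam=from_spambot)
owner_favored = summarize_cell(
    votes, owner="alice", owner_weight=3.0, is_spam=from_spambot
)
```

The same inputs yield three different summaries, so a cached table built under one query definition cannot be reused under another.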
Consistency Challenges
Cache Instance Management
• Simple Re-cache
  o Dump the current cached copy, and re-cache.
  o Slow
  o Poor user experience
Consistency Challenges
Cache Instance Management
• Better solution: Double Buffering
  o Reload new version in background
  o Continue to serve current table
    - "closest match" warning
  o Allow switch-back
    - Continue to accept invalidations against old table
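Double buffering can be sketched as two table versions behind one reference that is swapped under a lock. This is a minimal sketch with hypothetical names; it does not model the "closest match" warning or invalidations arriving against the old table:

```python
import threading

class DoubleBufferedTable:
    """Serve one table version while another is built, then swap."""

    def __init__(self, rows, version):
        self._lock = threading.Lock()
        self._active = (version, rows)
        self._previous = None

    def read(self):
        """Return the (version, rows) pair currently being served."""
        with self._lock:
            return self._active

    def swap_in(self, rows, version):
        """Publish a fully loaded new version; keep the old one for switch-back."""
        with self._lock:
            self._previous = self._active
            self._active = (version, rows)

    def switch_back(self):
        """Return to the previously served version."""
        with self._lock:
            if self._previous is None:
                raise RuntimeError("no previous version to switch back to")
            self._active, self._previous = self._previous, self._active

def reload_in_background(table, build_rows, version):
    """Build the new version on a worker thread, then swap it in."""
    def work():
        table.swap_in(build_rows(), version)
    worker = threading.Thread(target=work)
    worker.start()
    return worker

table = DoubleBufferedTable({"r1": "old"}, version=1)
worker = reload_in_background(table, lambda: {"r1": "new"}, version=2)
worker.join()                      # readers keep getting some complete version
after_reload = table.read()
table.switch_back()
after_switch_back = table.read()
```

Readers never see a half-loaded table: until `swap_in` runs they get version 1, afterwards version 2.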
Performance
Encoding-compliant tablespaces
• Support UTF-8, non-Latin sort orders
Select tables get SSD-based PostgreSQL caching
• See Jignesh Shah's terrific slides from PgEast 2009
• http://blogs.sun.com/jkshah/entry/effects_of_flash_ssd_on
• 20x improvement in random reads (IO pattern for unclustered index reads)
• 2x improvement on sequential writes (generally pretty smooth)
What's next?
How can I use Factual?
Web UI
• Dataset Creation
• Workbench
http://www.factual.com/
APIs
• Server API
http://wiki.developer.factual.com/FrontPage
• Visualizations
http://wiki.developer.factual.com/Factual-Visualization-Documentation
Questions