The Internet as a Single Database
DESCRIPTION
A sneak peek at how we are building Datafiniti
TRANSCRIPT
![Page 1: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/1.jpg)
The Internet as a Single Database
Technologies Used & Lessons Learned
Houston Code Camp, August 2011
Shion Deysarkar
CEO, Datafiniti
![Page 2: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/2.jpg)
What does that mean?
Places, people, news, URLs, products, etc., etc.
All web data in one, unified format
Accessible as if you were querying a database
![Page 3: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/3.jpg)
Why build such a thing?
Web crawling is kludgy and unintuitive
Our users needed a better way of getting web data
Developers deserve something better than current APIs
![Page 4: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/4.jpg)
Why build such a thing?
Because it would be awesome!
![Page 5: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/5.jpg)
Not an easy task…
The Challenges
![Page 6: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/6.jpg)
The Challenges
There’s a lot of data on the web
100 million registered domains
Maybe only 100,000 have interesting stuff? (Which ones?)
Some sites have millions or billions of data points
![Page 7: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/7.jpg)
The Challenges
It’s all structured differently!
Do we have to write web crawls for each website?
Writing 100,000 web crawlers seems... not fun
![Page 8: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/8.jpg)
The Challenges
Data can conflict
How do we know which data is correct?

| Website | Name | Categories | Address | Zip Code | Neighborhood | Phone |
|---|---|---|---|---|---|---|
| Yelp | Max's Wine Dive | Wine Bars, American (New), Music Venues | 4720 Washington Ave | 77007 | Washington Corridor, Rice Military, The Heights | (713) 880-8737 |
| Citysearch | Max's Wine Dive | Restaurants, Wine Bars | 4720 Washington Ave #B | 77007 | Washington Ave., Memorial Park, Central | (713) 880-8737 |
| Urbanspoon | Max's Wine Dive | American, International | 4720 Washington Ave | 77007 | Rice Military | (713) 880-8737 |
| Google | Max's Wine Dive | Wine Bar, American Restaurant | 4720 Washington Ave | 77007-5436 | | (713) 880-8737 |
| Zagat | Max's Wine Dive | Eclectic, Int'l | 4720 Washington Ave. | 77007 | Heights | 713-880-8737 |

![Page 9: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/9.jpg)
So let’s start at the beginning:
Data Collection
![Page 10: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/10.jpg)
Data Collection
Building a scalable web crawler
Cloud or local data center? Neither.
Grid computing (think SETI@home)
1000s of home PCs that exchange time & bandwidth for $
Crawl very fast for relatively little $
![Page 11: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/11.jpg)
Data Collection
Building a scalable web crawler
Coding 1000s of extraction apps
Abstract away everything but pattern matching and link generation
Build a framework that handles all the kludgy work:
- Link following & de-duplication
- Result formatting & storage
- Throttle rates & crawling behavior
- Any other crawling activity not specific to a website’s structure
- Load lightweight, website-specific apps into above framework
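The split described above can be sketched in Python. This is only an illustration of the idea, not Datafiniti's actual framework: the class names `SiteApp` and `CrawlFramework`, the regex patterns, and the canned `FAKE_SITE` pages are all assumptions made up for the example. The site-specific app contains nothing but pattern matching and link generation; the framework owns link following, de-duplication, throttling, and result storage.

```python
import re
import time
from collections import deque
from urllib.parse import urljoin


class SiteApp:
    """Website-specific part: only pattern matching and link generation."""

    # Hypothetical patterns for an imaginary listing site.
    record_pattern = re.compile(r'<h1 class="name">(?P<name>[^<]+)</h1>')
    link_pattern = re.compile(r'href="(?P<href>/biz/[^"]+)"')

    def extract(self, html):
        return [m.groupdict() for m in self.record_pattern.finditer(html)]

    def links(self, html, base_url):
        return [urljoin(base_url, m.group("href"))
                for m in self.link_pattern.finditer(html)]


class CrawlFramework:
    """Generic part: link following, de-duplication, throttling, storage."""

    def __init__(self, app, fetch, delay_seconds=0.0):
        self.app = app
        self.fetch = fetch                  # callable: url -> HTML text
        self.delay_seconds = delay_seconds  # throttle between requests
        self.seen = set()                   # URLs already queued
        self.results = []                   # uniformly formatted records

    def crawl(self, start_url, max_pages=100):
        queue = deque([start_url])
        self.seen.add(start_url)
        pages = 0
        while queue and pages < max_pages:
            url = queue.popleft()
            html = self.fetch(url)
            pages += 1
            for record in self.app.extract(html):
                self.results.append({"source_url": url, **record})
            for link in self.app.links(html, url):
                if link not in self.seen:   # de-duplicate before queueing
                    self.seen.add(link)
                    queue.append(link)
            time.sleep(self.delay_seconds)
        return self.results


# Canned pages stand in for real HTTP fetches so the sketch runs offline.
FAKE_SITE = {
    "http://example.com/":
        '<a href="/biz/a">A</a> <a href="/biz/b">B</a> <a href="/biz/a">A again</a>',
    "http://example.com/biz/a":
        '<h1 class="name">Cafe A</h1> <a href="/biz/b">B</a>',
    "http://example.com/biz/b":
        '<h1 class="name">Cafe B</h1>',
}

framework = CrawlFramework(SiteApp(), fetch=FAKE_SITE.__getitem__)
records = framework.crawl("http://example.com/")
for record in records:
    print(record)
```

Supporting another website in this sketch means writing another small `SiteApp` with different patterns; the framework code stays untouched.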
![Page 12: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/12.jpg)
Data Collection
Building a scalable web crawler
Coding 1000s of extraction apps
Abstract away everything but pattern matching and link generation
![Page 13: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/13.jpg)
Data Collection
Building a scalable web crawler
Coding 1000s of extraction apps
Abstract away everything but pattern matching and link generation
![Page 14: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/14.jpg)
Data Collection
Building a scalable web crawler
Current peak performance: 4.32 billion URLs per month
Deploying 20 new website crawls every month
Easy to scale crawling performance (just add grid nodes)
Easy to scale deployment (just add contractors)
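To put the peak figure in perspective, here is a quick back-of-the-envelope conversion. The 30-day month is an assumption; the slide only gives the monthly total.

```python
DAYS_PER_MONTH = 30            # assumption: the slide does not define "month"
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = 24 * 60 * 60

urls_per_month = 4_320_000_000  # peak figure from the slide

urls_per_day = urls_per_month / DAYS_PER_MONTH
urls_per_minute = urls_per_day / MINUTES_PER_DAY
urls_per_second = urls_per_day / SECONDS_PER_DAY

print(f"{urls_per_day:,.0f} URLs/day")
print(f"{urls_per_minute:,.0f} URLs/minute")
print(f"{urls_per_second:,.0f} URLs/second")
```

Under that assumption the peak works out to 144 million URLs per day, or 100,000 per minute.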
![Page 15: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/15.jpg)
Now for step 2! (step 1 took us 3 years >_<)
Data Storage
![Page 16: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/16.jpg)
Data Storage
Building a scalable data store
What we’re dealing with:
TBs (eventually PBs) of data
Billions of rows, thousands of columns (maybe more)
Don’t want to deal with sharding
Don’t actually care about ACID
Do care about high throughput and fault tolerance
![Page 17: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/17.jpg)
Data Storage
Building a scalable data store
NoSQL (Cassandra) >> MySQL (for us)
Can increase throughput and storage linearly by adding nodes
Virtually unlimited and variable # of columns
Much faster read/write
Some challenges
- Doesn’t yet support all the select features you’re used to
- Not a mature technology yet, expect frequent updates
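The "variable # of columns" point can be pictured with plain Python dictionaries. This is not Cassandra's API, only a rough in-memory picture of a sparse, wide-row layout; the row keys and column names are hypothetical.

```python
from collections import defaultdict

# row key -> {column name -> value}; each row stores only the columns it has.
table = defaultdict(dict)


def put(row_key, column, value):
    table[row_key][column] = value


# Hypothetical row keys and column names, for illustration only.
put("maxs-wine-dive", "name", "Max's Wine Dive")
put("maxs-wine-dive", "yelp:categories", "Wine Bars, American (New), Music Venues")
put("maxs-wine-dive", "zagat:phone", "713-880-8737")
put("some-news-article", "headline", "Example headline")

# No schema change was needed to add a new kind of column,
# and rows do not pay for columns they do not use.
all_columns = sorted({column for row in table.values() for column in row})
print(all_columns)
print(len(table["maxs-wine-dive"]), "columns in the place row")
```

A fixed-schema table would need a column (mostly empty) for every attribute any website ever reports; here a place row and a news row simply carry different columns.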
![Page 18: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/18.jpg)
Data Storage
Building a scalable data store
Choosing Cassandra over other NoSQL databases
More active community, seems to be gaining traction most quickly
Impressive production-scale examples
Backed by corporations (DataStax) and some really smart people
Integrated with other relevant technologies
- Solr for text search
- Hadoop for batch-style processing
- Though it’s true that some high-profile projects have scrapped it
![Page 19: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/19.jpg)
Data Storage
Building a unified database of everything
Normalizing separate data points that represent the same thing
Co-occurrence: most popular choice wins

| Website | Name | Categories | Address | Zip Code | Neighborhood | Phone |
|---|---|---|---|---|---|---|
| Yelp | Max's Wine Dive | Wine Bars, American (New), Music Venues | 4720 Washington Ave | 77007 | Washington Corridor, Rice Military, The Heights | (713) 880-8737 |
| Citysearch | Max's Wine Dive | Restaurants, Wine Bars | 4720 Washington Ave #B | 77007 | Washington Ave., Memorial Park, Central | (713) 880-8737 |
| Urbanspoon | Max's Wine Dive | American, International | 4720 Washington Ave | 77007 | Rice Military | (713) 880-8737 |
| Google | Max's Wine Dive | Wine Bar, American Restaurant | 4720 Washington Ave | 77007-5436 | | (713) 880-8737 |
| Zagat | Max's Wine Dive | Eclectic, Int'l | 4720 Washington Ave. | 77007 | Heights | 713-880-8737 |

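A minimal sketch of the co-occurrence idea, using the values from the table above. The `normalize` rules (digits-only phone numbers, five-digit zip codes, lowercased addresses without a trailing period) are assumptions chosen for this example, not Datafiniti's actual normalization logic.

```python
import re
from collections import Counter

# Values copied from the table above.
observations = {
    "Yelp":       {"address": "4720 Washington Ave",    "zip": "77007",      "phone": "(713) 880-8737"},
    "Citysearch": {"address": "4720 Washington Ave #B", "zip": "77007",      "phone": "(713) 880-8737"},
    "Urbanspoon": {"address": "4720 Washington Ave",    "zip": "77007",      "phone": "(713) 880-8737"},
    "Google":     {"address": "4720 Washington Ave",    "zip": "77007-5436", "phone": "(713) 880-8737"},
    "Zagat":      {"address": "4720 Washington Ave.",   "zip": "77007",      "phone": "713-880-8737"},
}


def normalize(field, value):
    """Reduce formatting differences so equal values compare equal."""
    if field == "phone":
        return re.sub(r"\D", "", value)     # keep digits only
    if field == "zip":
        return value[:5]                    # drop the ZIP+4 suffix
    if field == "address":
        return value.rstrip(".").lower()    # "Ave." and "Ave" are the same
    return value


def most_popular(field):
    """Return (value, vote count) for the value most sources agree on."""
    votes = Counter(normalize(field, row[field]) for row in observations.values())
    return votes.most_common(1)[0]


for field in ("address", "zip", "phone"):
    print(field, most_popular(field))
```

Four of the five sources agree on the street address once formatting is normalized, so the "#B" variant from Citysearch loses the vote.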
![Page 20: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/20.jpg)
Data Storage
Building a unified database of everything
Normalizing separate data points that represent the same thing
Trusted sources: put more weight on sources that tend to be right

| Website | Name | Categories | Address | Zip Code | Neighborhood | Phone |
|---|---|---|---|---|---|---|
| Yelp | Max's Wine Dive | Wine Bars, American (New), Music Venues | 4720 Washington Ave | 77007 | Washington Corridor, Rice Military, The Heights | (713) 880-8737 |
| Citysearch | Max's Wine Dive | Restaurants, Wine Bars | 4720 Washington Ave #B | 77007 | Washington Ave., Memorial Park, Central | (713) 880-8737 |
| Urbanspoon | Max's Wine Dive | American, International | 4720 Washington Ave | 77007 | Rice Military | (713) 880-8737 |
| Google | Max's Wine Dive | Wine Bar, American Restaurant | 4720 Washington Ave | 77007-5436 | | (713) 880-8737 |
| Zagat | Max's Wine Dive | Eclectic, Int'l | 4720 Washington Ave. | 77007 | Heights | 713-880-8737 |

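Trusted-source weighting can be sketched as a weighted vote over the neighborhood values from the table above. The trust weights below are invented purely for illustration; the talk does not say how sources are scored, and real weights would presumably come from measuring how often each source turns out to be right.

```python
from collections import defaultdict

# Neighborhood values copied from the table above (Google lists none).
neighborhoods = {
    "Yelp":       ["Washington Corridor", "Rice Military", "The Heights"],
    "Citysearch": ["Washington Ave.", "Memorial Park", "Central"],
    "Urbanspoon": ["Rice Military"],
    "Google":     [],
    "Zagat":      ["Heights"],
}

# Hypothetical trust weights, NOT taken from the talk.
trust = {"Yelp": 3, "Citysearch": 1, "Urbanspoon": 1, "Google": 3, "Zagat": 2}


def normalize(name):
    """Treat 'The Heights' and 'Heights' as the same neighborhood."""
    name = name.lower().rstrip(".")
    return name[4:] if name.startswith("the ") else name


def weighted_votes(values_by_source, weights):
    """Each source adds its trust weight to every value it reports."""
    scores = defaultdict(int)
    for source, values in values_by_source.items():
        for value in values:
            scores[normalize(value)] += weights[source]
    return dict(scores)


scores = weighted_votes(neighborhoods, trust)
winner = max(scores, key=scores.get)
print(scores)
print("winner:", winner)
```

With equal weights, "rice military" and "heights" would tie at two votes each; the weights break the tie in favor of whichever value the more trusted sources report.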
![Page 21: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/21.jpg)
Data Storage
Building a unified database of everything
Identifying interesting data on a random web page
![Page 22: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/22.jpg)
Yay, step 3! (step 2 took us 3 months :D)
Data Retrieval
![Page 23: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/23.jpg)
Data Retrieval
Building an easy way to get lots of data fast
Making the right choices for our API
Single channel for all data retrieval
- RESTful API so anyone can develop with it
- All external and internal functionality uses the same API (easier to manage)
As user-friendly and intuitive as possible
- SQL-style querying on a NoSQL database
- JSON default output, but will also support CSV and XML
- SSL authentication with token
Briefly considered using a 3rd-party service like Mashery
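As a rough idea of what calling such an API might look like from a client, the sketch below builds (but does not send) an HTTPS request carrying a SQL-style query and a token. The endpoint URL, the `q` and `format` parameter names, the `Authorization: Token ...` header, and the `places` table are all assumptions for illustration; the slides do not specify the real interface.

```python
from urllib.parse import parse_qs, urlencode, urlparse
from urllib.request import Request

# Hypothetical endpoint; the real API's URL is not given in the talk.
API_BASE = "https://api.example.com/v1/data"

SUPPORTED_FORMATS = {"json", "csv", "xml"}  # JSON is the default per the slide


def build_query_request(sql, token, output_format="json"):
    """Build an HTTPS GET request for a SQL-style query (nothing is sent)."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {output_format}")
    params = urlencode({"q": sql, "format": output_format})
    return Request(
        f"{API_BASE}?{params}",
        headers={"Authorization": f"Token {token}"},  # assumed header scheme
    )


request = build_query_request(
    "SELECT name, phone FROM places WHERE zip = '77007'",
    token="YOUR_TOKEN",
)

# Decode the query string again to show what the server would receive.
decoded = parse_qs(urlparse(request.full_url).query)
print(request.full_url)
print(decoded)
```

Because internal and external callers would go through the same request shape, there is only one retrieval path to secure, throttle, and document.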
![Page 24: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/24.jpg)
Put it all together… (step 3 took 3 weeks!!!)
Sneak Peek
![Page 25: The Internet as a Single Database](https://reader035.vdocuments.site/reader035/viewer/2022062704/5562bc9ed8b42a13618b4c4d/html5/thumbnails/25.jpg)
Sign up for the beta at http://www.datafiniti.net
Follow us @Datafiniti
Launching Soon