TRANSCRIPT
Big Tools for Big Data: Analytics and Management at Web Scale
IIPC General Assembly, Singapore, May 2010
Lewis Crawford
Web Archiving Programme Technical Lead, British Library
Big Data: “the Petabyte Age”
The Internet Archive stores about 2 petabytes of data and grows by around 20 TB a month
The Large Hadron Collider produces about 15 PB a year
At the BL:
The Selective Web Archive is growing at 200 GB a month
A conservative estimate for the Domain Crawl is 100 TB
The problem of big data
We can process data very quickly, but we can read and write it only slowly:
1990: 1 GB disk at 4.4 MB/s; reading the whole disk took a little under 4 minutes
2010: 1 TB disk at 100 MB/s; reading the whole disk takes about 2.5 hours
The solution!
Solution: parallel reads
1 HDD = 100 MB/sec 1000 HDDs = 100 GB/sec
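The arithmetic behind these figures can be sanity-checked in a few lines. This is a rough sketch using the nominal transfer rates quoted above, ignoring seek time and controller overhead:

```python
# Back-of-the-envelope check of the read-time figures above.
# Assumes purely sequential reads at the quoted transfer rates.

def read_time_seconds(capacity_bytes, rate_bytes_per_sec, drives=1):
    """Time to read the full capacity, split evenly across `drives` disks."""
    return capacity_bytes / (rate_bytes_per_sec * drives)

MB = 10**6
GB = 10**9
TB = 10**12

# 1990: 1 GB disk at 4.4 MB/s -> roughly 4 minutes
t_1990 = read_time_seconds(1 * GB, 4.4 * MB)
# 2010: 1 TB disk at 100 MB/s -> 10,000 s, about 2.8 hours
t_2010 = read_time_seconds(1 * TB, 100 * MB)
# 1,000 disks in parallel: the same terabyte in 10 seconds
t_parallel = read_time_seconds(1 * TB, 100 * MB, drives=1000)

print(f"1990: {t_1990/60:.1f} min, 2010: {t_2010/3600:.1f} h, parallel: {t_parallel:.0f} s")
```

The last line is the whole argument for Hadoop: aggregate bandwidth scales linearly with the number of spindles, provided the work can be split across them.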
Hadoop
2002: Nutch crawler (Doug Cutting)
2003: GFS paper http://labs.google.com/papers/gfs.html
2004: MapReduce paper http://labs.google.com/papers/mapreduce.html
2005: Nutch moves to the MapReduce model with NDFS
2006: NDFS and the MapReduce implementation move out of Nutch to become Hadoop, a new subproject under Lucene
2008: Top-level project at Apache
2009: 17 clusters with 24,000 nodes at Yahoo!
1 TB sorted in 62 seconds
100 TB sorted in 173 minutes
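The MapReduce model in the timeline above is easiest to see with the canonical word-count example. The sketch below is a minimal in-process simulation of the map, shuffle, and reduce phases, not Hadoop code; function and variable names are illustrative:

```python
from collections import defaultdict

def map_fn(_, line):
    # map: emit (word, 1) for every word in an input line
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # reduce: sum all the counts collected for one word
    yield word, sum(counts)

def mapreduce(records, map_fn, reduce_fn):
    # shuffle: group intermediate values by key, as the framework would
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(kv for k, vs in sorted(groups.items())
                   for kv in reduce_fn(k, vs))

lines = enumerate(["big tools for big data", "big data at web scale"])
result = mapreduce(lines, map_fn, reduce_fn)
print(result)  # {'at': 1, 'big': 3, 'data': 2, 'for': 1, ...}
```

In real Hadoop the map and reduce functions run on different nodes, with the shuffle done over the network; the programming model, however, is exactly this small.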
Hadoop Users
Yahoo!
More than 100,000 CPUs in over 25,000 computers running Hadoop.
Our biggest cluster: 4,000 nodes (2*4-CPU boxes with 4*1 TB disk and 16 GB RAM).
Used to support research for ad systems and web search, and for scaling tests to support development of Hadoop on larger clusters.
Baidu, the leading Chinese-language search engine
Hadoop is used to analyse search logs and do mining work on the web page database. We handle about 3,000 TB per week. Our clusters vary from 10 to 500 nodes.
Facebook
We use Hadoop to store copies of internal log and dimension data sources, and as a source for reporting/analytics and machine learning.
Currently we have two major clusters: an 1,100-machine cluster with 8,800 cores and about 12 PB raw storage, and a 300-machine cluster with 2,400 cores and about 3 PB raw storage. Each (commodity) node has 8 cores and 12 TB of storage.
http://wiki.apache.org/hadoop/PoweredBy
NutchWAX!
Hadoop@BL
IBM Digital Democracy for the BBC
BigSheets!
BigSheets and the open source stack
Hadoop: top-level Apache project; distributed processing and file system
Pig: open source contributed by Yahoo!; a SQL-‘like’ programming language
BigSheets: IBM Research licence; an ‘insight engine’ built on a spreadsheet paradigm
Analytics: the meta tag example
Extract metadata tags from all HTML files in the 2005 General Election Collection
Extract ‘keywords’ from the meta tags
Record all HTML pages into three separate ‘bags’ where the keywords contained:
Tory, Conservative / Labour / Liberal, Lib Dem, Liberal Democrat
Analyse single words and pairs of words in each of those ‘bags’ of data
Generate tag clouds from the 50 most common words
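The ‘bags’ step above can be sketched as a simple classification over each page’s keywords. The party term lists come from the slide; the page data, URLs, and helper names below are made up for illustration and are not the BL’s actual code:

```python
# Bucket pages into party 'bags' by which keywords appear in their
# meta 'keywords' tag. The page data here is invented for the example.
PARTY_TERMS = {
    "conservative": {"tory", "conservative"},
    "labour": {"labour"},
    "libdem": {"liberal", "lib dem", "liberal democrat"},
}

def bags_for_page(meta_keywords):
    """Return the set of bags a page falls into, given its keywords string."""
    kw = meta_keywords.lower()
    return {party for party, terms in PARTY_TERMS.items()
            if any(term in kw for term in terms)}

pages = {
    "a.html": "election, tory policy, tax",
    "b.html": "labour manifesto, election",
    "c.html": "lib dem campaign, liberal democrat",
}
bags = {url: bags_for_page(kw) for url, kw in pages.items()}
print(bags)
```

In BigSheets each bag then feeds a word and word-pair count, from which the 50 most common terms are turned into a tag cloud.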
Data management
robots.txt example
Robots.txt continued…
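The robots.txt slides themselves are not reproduced in this transcript. As a sketch of the kind of per-host check a crawler applies, Python’s standard-library parser can evaluate rules directly; the rules and agent names below are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is barred from /private/,
# except an agent called 'bl_crawler' which may fetch anything.
rules = """\
User-agent: *
Disallow: /private/

User-agent: bl_crawler
Disallow:
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

generic_ok = rp.can_fetch("SomeBot", "http://example.org/private/page.html")
special_ok = rp.can_fetch("bl_crawler", "http://example.org/private/page.html")
print(generic_ok, special_ok)  # False True
```

Run at scale over an archive, checks like this make it possible to report which hosts restrict crawling and how those policies change over time.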
Data management
High-level management tool with a spreadsheet paradigm
Clean user interface
Straightforward programming model (UDFs)
Use cases:
ARC to WARC migration
Information package (SIP) generation
CDX indexes / Lucene indexes
JHOVE object validation / verification
Object format migration
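One of the use cases above, CDX index generation, is essentially a map step over capture records. The sketch below formats a simplified CDX line; the field set, record dict, and file names are illustrative (real CDX files declare a fixed field order in their header):

```python
def cdx_line(record):
    """Format one simplified CDX line from a capture record dict.
    A sketch only; real CDX files have a header-declared field order."""
    return " ".join([
        record["url_key"],      # canonicalised URL (the sort key)
        record["timestamp"],    # 14-digit capture timestamp
        record["url"],          # original URL
        record["mime"],         # MIME type
        str(record["status"]),  # HTTP status code
        str(record["offset"]),  # byte offset within the (W)ARC file
        record["filename"],     # (W)ARC file name
    ])

rec = {
    "url_key": "org,example)/",
    "timestamp": "20100501120000",
    "url": "http://example.org/",
    "mime": "text/html",
    "status": 200,
    "offset": 0,
    "filename": "BL-crawl-00001.warc.gz",
}
print(cdx_line(rec))
```

Because each line depends only on one record, the work parallelises trivially across a Hadoop cluster, with a final sort producing the index.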
Slash page crawl: election sites extraction
Slash page (home page) of known UK domains; data discarded after processing
Generate a list of election terms (political parties, MORI election tags)
Extract text from the HTML pages using an HTML tag-density algorithm
Identify all web pages that contain these terms
Identify sites that contain two or more of the terms
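Tag-density extraction, mentioned in the steps above, can be approximated by scoring each line of HTML by its text-to-markup ratio and keeping only text-heavy lines. This is a rough illustration of the idea, not the algorithm the BL used; the threshold and sample HTML are invented:

```python
import re

TAG = re.compile(r"<[^>]+>")

def text_density(line):
    """Fraction of a line's characters that are text rather than markup."""
    stripped = TAG.sub("", line)
    return len(stripped.strip()) / max(len(line.strip()), 1)

def extract_text(html, threshold=0.5):
    """Keep lines whose text-to-markup ratio exceeds the threshold."""
    kept = []
    for line in html.splitlines():
        if line.strip() and text_density(line) > threshold:
            kept.append(TAG.sub("", line).strip())
    return " ".join(kept)

html = """<div><a href="/">home</a></div>
<p>The candidate set out the party's manifesto for the election.</p>
<div class="nav"><a href="/a">a</a><a href="/b">b</a></div>"""
text = extract_text(html)
print(text)
```

Navigation and boilerplate lines are mostly markup and fall below the threshold, so the election-term matching runs only against the body text of each page.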
Slash Page Data
Text Extracted Using Tag Density Algorithm
Election Key Terms
Results
Pie Chart Visualization
Seeds With 2 Or More Terms
Manual Verification
Other potential digital material
Digital Books
Datasets
19th Century Newspapers
Back to analytics and the next generation of access tools
Automatic classification: WebDewey, Library of Congress Subject Headings, machine learning
Faceted Lucene indexes for advanced search functionality
Engage directly with the Higher Education community
Access tool with a researcher focus? BL three-year Research Behaviour Study
Thank you!
http://uk.linkedin.com/in/lewiscrawford
(LinkedIn, from the Hadoop Users slide:)
3x30 Nehalem-based node grids, with 2*4 cores, 16 GB RAM, and 8*1 TB storage using ZFS in a JBOD configuration.
Hadoop and Pig for discovering People You May Know and other fun facts.