Big Data Research Progress

Chao
Jan 22, 2013

Big Data Lab
• Big Data@CSAIL, MIT
– http://bigdata.csail.mit.edu/
– 23 nodes
– GROWING BIG LINKED DATA FROM SEED: BUILDING A DEMO
– VISION MACHINE: LEARNING ONLINE FROM 25 MILLION IMAGES
– NATURAL LANGUAGE INTERFACE FOR BIG DATA
– SCIDB
– MACHINE LEARNING
– SOCIAL: CONDENSR
– SOCIAL: TWITINFO
– SOCIAL: INFLUENCE MODELING
– …

Big Data Lab

• NASA tournament lab
– http://www.nasa.gov/directorates/heo/ntl/
• Big data challenge
– http://open.nasa.gov/blog/2012/10/03/nasa-tournament-labs-big-data-challenge/
– Apply the process of open innovation to conceptualizing new and novel approaches to using “big data” information sets from various U.S. government agencies, e.g., health, energy and earth science.


Big Data People
• Jimmy Lin (University of Maryland)
– http://www.umiacs.umd.edu/~jimmylin/
• Ron Bekkerman (LinkedIn)
– http://people.cs.umass.edu/~ronb/
• Misha Bilenko (MSR)
– http://research.microsoft.com/en-us/um/people/mbilenko/
• John Langford (Yahoo! Research)
– http://hunch.net/~jl/

Tutorial

• Scaling Up Machine Learning: Parallel and Distributed Approaches
• KDD 2011
• Ron Bekkerman (LinkedIn), Misha Bilenko (MSR) and John Langford (Yahoo! Research)
• http://hunch.net/~large_scale_survey/

Tutorial

• State-of-the-art platforms and algorithm choices
• Hardware options (from FPGAs and GPUs to multi-core systems and commodity clusters)
• Programming frameworks (including CUDA, MPI, MapReduce, and DryadLINQ)
• Learning settings (e.g., semi-supervised and online learning)
• Example-driven, covering a number of popular algorithms (e.g., boosted trees, spectral clustering, belief propagation) and diverse applications (e.g., speech recognition and object recognition in vision)

Parallelization: platform choices

Platform          Communication Scheme   Data size
Peer-to-Peer      TCP/IP                 Petabytes
Virtual Clusters  MapReduce / MPI        Terabytes
HPC Clusters      MPI / MapReduce        Terabytes
Multicore         Multithreading         Gigabytes
GPU               CUDA                   Gigabytes
FPGA              HDL                    Gigabytes

The Book

• Cambridge University Press
• Due in November 2011
• 21 chapters
• Covering
– Platforms
– Algorithms
– Learning setups
– Applications

Chapter contributors

[Figure: chapter contributors, labeled by chapter number (2 through 21)]

New age of big data

• The world has gone mobile
– 5 billion cellphones produce daily data
• Social networks have gone online
– Twitter produces 200M tweets a day
• Crowdsourcing is the reality
– Labeling of 100,000+ data instances is doable within a week

Big Data Data

• DATA.GOV
– http://www.data.gov/developers/community/developers
– Data portal provided by the US government


Big Data in Q&A

• It is estimated that 2.5 quintillion bytes of new data are created daily with an estimated 80% of this produced as "unstructured" data

• IBM Watson deep Q&A
– http://www.research.ibm.com/articles/watson.shtml
– Evidence-based decision support
– Jeopardy!
– Provide a single correct answer with confidence
– Analyze over 200 million pages in three seconds
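To put the figures on this slide on a common scale, the short Python sketch below converts them into exabytes per day and pages per second. The input numbers are taken from the slide as stated; the variable names are illustrative only, and decimal units (1 exabyte = 10^18 bytes) are assumed.

```python
# Figures quoted on the slide (taken as given, not independently verified).
BYTES_CREATED_PER_DAY = 2.5e18   # "2.5 quintillion bytes" per day
UNSTRUCTURED_SHARE = 0.80        # "an estimated 80%" unstructured
PAGES_ANALYZED = 200_000_000     # "over 200 million pages"
ANALYSIS_SECONDS = 3             # "in three seconds"

BYTES_PER_EXABYTE = 1e18         # assumption: decimal (SI) exabyte

# Daily volume expressed in exabytes, and the unstructured portion of it.
daily_exabytes = BYTES_CREATED_PER_DAY / BYTES_PER_EXABYTE
unstructured_exabytes = daily_exabytes * UNSTRUCTURED_SHARE

# Throughput implied by the Watson figure.
pages_per_second = PAGES_ANALYZED / ANALYSIS_SECONDS

print(f"{daily_exabytes:.1f} EB/day, {unstructured_exabytes:.1f} EB/day unstructured")
print(f"about {pages_per_second / 1e6:.1f} million pages per second")
```

Under these assumptions the slide's numbers correspond to roughly 2.5 exabytes of new data per day, about 2 exabytes of it unstructured, and a reading rate on the order of 67 million pages per second.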

Big Data in Q&A

• IBM Watson deep Q&A
– Health care
  • 2011, pilot program with WellPoint, whose affiliated health plans cover one in nine Americans
  • 2012, partnership with Memorial Sloan-Kettering Cancer Center, where work is under way to teach Watson about oncology diagnosis and treatment options

Big Data Blog

• http://whatsthebigdata.com/
– News and events about Big Data
• http://www.greenplum.com/industry-buzz/big-data/research-papers
– News and research papers about Big Data


Big Data Publication
• Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture
• http://arxiv.org/pdf/1210.7350v1.pdf
• Architecture behind Twitter's real-time related query suggestion and spelling correction service
– First implementation: typical Hadoop-based analytics stack, did not meet the latency requirement
– Second implementation: system deployed in production, custom in-memory processing engine

Big Data Publication
• Fast Candidate Generation for Two-Phase Document Ranking: Postings List Intersection with Bloom Filters
• http://www.umiacs.umd.edu/~jimmylin/publications/Asadi_Lin_CIKM2012.pdf
• Most modern web search engines employ a two-phase ranking strategy: a candidate list of documents is generated using a “cheap” but low-quality scoring function, which is then reranked by an “expensive” but high-quality method
• Candidate generation for conjunctive query processing in this context
• Fast, approximate postings list intersection algorithms based on Bloom filters
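The idea of approximate postings list intersection with Bloom filters can be illustrated with a minimal Python sketch. This is not the algorithm from the paper, only a generic illustration under stated assumptions: the class `BloomFilter`, the helpers `build_filter` and `approximate_intersection`, and the parameter choices (16 bits per element, 4 hash functions, SHA-256 double hashing) are all hypothetical. The sketch scans the shortest postings list and keeps a document ID only if the Bloom filters of all other lists report it as possibly present.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter over integer document IDs (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, doc_id: int):
        # Derive num_hashes bit positions by double hashing one SHA-256 digest.
        digest = hashlib.sha256(str(doc_id).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, doc_id: int) -> None:
        for pos in self._positions(doc_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, doc_id: int) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(doc_id)
        )


def build_filter(postings, bits_per_element=16, num_hashes=4):
    """Build a Bloom filter holding every document ID in one postings list."""
    bloom = BloomFilter(max(8, bits_per_element * len(postings)), num_hashes)
    for doc_id in postings:
        bloom.add(doc_id)
    return bloom


def approximate_intersection(postings_lists):
    """Scan the shortest list and probe Bloom filters of the other lists.

    The result may contain false positives but never misses a true match.
    """
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    shortest, others = ordered[0], ordered[1:]
    filters = [build_filter(postings) for postings in others]
    return [doc_id for doc_id in shortest if all(doc_id in f for f in filters)]


# Hypothetical postings lists for a three-term conjunctive query.
postings = {
    "big": [1, 3, 5, 8, 13, 21],
    "data": [2, 3, 5, 7, 13, 34, 55],
    "lab": [3, 13, 21, 99],
}
candidates = approximate_intersection(list(postings.values()))
print(candidates)
```

Because a Bloom filter can report false positives but not false negatives, the candidate list is a superset of the exact intersection. That trade-off suits a two-phase setup, where the second, more expensive ranking stage can discard any spurious candidates.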

Big Data Publication
• Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling
– http://www.umiacs.umd.edu/~jimmylin/publications/Ture_Lin_NAACL-HLT2012.pdf
• Large-Scale Machine Learning at Twitter
– http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
• Smoothing Techniques for Adaptive Online Language Models: Topic Tracking in Tweet Streams
– http://www.umiacs.umd.edu/~jimmylin/publications/Lin_etal_KDD2011.pdf

Big Data Book

• Data-Intensive Text Processing with MapReduce

• http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf
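For readers new to the MapReduce model that the book title refers to, here is a minimal single-process sketch of word counting in Python. It only imitates the map, shuffle, and reduce stages in memory; it is not taken from the book and does not use Hadoop or any real MapReduce framework. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: emit a (word, 1) pair for every token in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the partial counts for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run the three stages in sequence over a dict of doc_id -> text."""
    intermediate = [
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    ]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Hypothetical input documents.
docs = {"d1": "big data lab", "d2": "big data big"}
counts = word_count(docs)
print(counts)
```

In a real cluster deployment the mappers and reducers would run on different machines and the shuffle would move data over the network; this sketch keeps everything in one process to show only the data flow.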
