
Big Data Research Progress

Chao
Jan 22, 2013

Big Data Lab

• Big Data@CSAIL, MIT
– http://bigdata.csail.mit.edu/
– 23 nodes
– GROWING BIG LINKED DATA FROM SEED: BUILDING A DEMO
– VISION MACHINE: LEARNING ONLINE FROM 25 MILLION IMAGES
– NATURAL LANGUAGE INTERFACE FOR BIG DATA
– SCIDB
– MACHINE LEARNING
– SOCIAL: CONDENSR
– SOCIAL: TWITINFO
– SOCIAL: INFLUENCE MODELING
– …

Big Data Lab

• NASA Tournament Lab
– http://www.nasa.gov/directorates/heo/ntl/

• Big data challenge
– http://open.nasa.gov/blog/2012/10/03/nasa-tournament-labs-big-data-challenge/
– Apply the process of open innovation to conceptualizing new and novel approaches to using “big data” information sets from various U.S. government agencies, e.g., health, energy and earth science.

Big Data People

• Jimmy Lin (University of Maryland)
– http://www.umiacs.umd.edu/~jimmylin/
• Ron Bekkerman (LinkedIn)
– http://people.cs.umass.edu/~ronb/
• Misha Bilenko (MSR)
– http://research.microsoft.com/en-us/um/people/mbilenko/
• John Langford (Yahoo! Research)
– http://hunch.net/~jl/

Tutorial

• Scaling Up Machine Learning: Parallel and Distributed Approaches
• KDD 2011
• Ron Bekkerman (LinkedIn), Misha Bilenko (MSR) and John Langford (Yahoo! Research)
• http://hunch.net/~large_scale_survey/

Tutorial

• State-of-the-art platforms and algorithm choices
• Hardware options (from FPGAs and GPUs to multi-core systems and commodity clusters)
• Programming frameworks (including CUDA, MPI, MapReduce, and DryadLINQ)
• Learning settings (e.g., semi-supervised and online learning)
• Example-driven, covering a number of popular algorithms (e.g., boosted trees, spectral clustering, belief propagation) and diverse applications (e.g., speech recognition and object recognition in vision)

Parallelization: platform choices

Platform          Communication scheme   Data size
Peer-to-Peer      TCP/IP                 Petabytes
Virtual Clusters  MapReduce / MPI        Terabytes
HPC Clusters      MPI / MapReduce        Terabytes
Multicore         Multithreading         Gigabytes
GPU               CUDA                   Gigabytes
FPGA              HDL                    Gigabytes
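To make the "MapReduce" entry in the table concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word count as the task. It is an illustration only: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and do not come from Hadoop or any other framework, and a real cluster would run the mappers and reducers on different machines.

```python
from collections import defaultdict


def word_count_mapper(doc):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for word in doc.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Combine all counts emitted for one word."""
    return sum(values)


def map_phase(docs, mapper):
    """Apply the mapper to each input record independently."""
    for doc in docs:
        yield from mapper(doc)


def shuffle(pairs):
    """Group intermediate values by key (the communication step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count(docs):
    """Run the three phases in sequence inside one process."""
    intermediate = map_phase(docs, word_count_mapper)
    return reduce_phase(shuffle(intermediate), sum_reducer)


if __name__ == "__main__":
    print(word_count(["big data", "Big lab"]))
```

The shuffle step is the only point where data from different input records meets, which is why it is the communication cost that dominates on a cluster.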

The Book

• Cambridge University Press
• Due in November 2011
• 21 chapters
• Covering
– Platforms
– Algorithms
– Learning setups
– Applications

Chapter contributors

[Figure: contributors to chapters 1 to 21]

New age of big data

• The world has gone mobile
– 5 billion cellphones produce daily data
• Social networks have gone online
– Twitter produces 200M tweets a day
• Crowdsourcing is the reality
– Labeling of 100,000+ data instances is doable within a week

Big Data Data

• DATA.GOV
– http://www.data.gov/developers/community/developers
– Data portal provided by US government

Big Data in Q&A

• It is estimated that 2.5 quintillion bytes of new data are created daily with an estimated 80% of this produced as "unstructured" data

• IBM Watson deep Q&A
– http://www.research.ibm.com/articles/watson.shtml
– Evidence-based decision support
– Jeopardy!
– Provide a single correct answer with confidence
– Analyze over 200 million pages in three seconds

Big Data in Q&A

• IBM Watson deep Q&A
– Health care
  • 2011, pilot program with WellPoint, whose affiliated health plans cover one in nine Americans
  • 2012, partnership with Memorial Sloan-Kettering Cancer Center, where work is under way to teach Watson about oncology diagnosis and treatment options

Big Data Blog

• http://whatsthebigdata.com/
– News and events about Big Data
• http://www.greenplum.com/industry-buzz/big-data/research-papers
– News and research papers about Big Data

Big Data Publication

• Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture
• http://arxiv.org/pdf/1210.7350v1.pdf
• Architecture behind Twitter's real-time related query suggestion and spelling correction service
– First implementation: typical Hadoop-based analytics stack, did not meet the latency requirement
– Second implementation: system deployed in production, custom in-memory processing engine

Big Data Publication

• Fast Candidate Generation for Two-Phase Document Ranking: Postings List Intersection with Bloom Filters
• http://www.umiacs.umd.edu/~jimmylin/publications/Asadi_Lin_CIKM2012.pdf
• Most modern web search engines employ a two-phase ranking strategy: a candidate list of documents is generated using a “cheap” but low-quality scoring function, and is then reranked by an “expensive” but high-quality method
• Candidate generation for conjunctive query processing in this context
• A fast, approximate postings list intersection algorithm based on Bloom filters
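The idea of approximate postings list intersection can be sketched as follows: scan the shortest postings list and test each document ID against a Bloom filter built for every other query term. This is a toy illustration under stated assumptions, not the data layout or algorithm from the Asadi and Lin paper. The `BloomFilter` class, `build_filters`, `candidate_docs`, and the sizing parameters (`bits_per_doc`, `num_hashes`) are all hypothetical names chosen for this example.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over integer document IDs."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True for every added item; may also be True for others
        # (false positives), never False for an added item.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def build_filters(index, bits_per_doc=16, num_hashes=4):
    """Build one Bloom filter per term from its postings list."""
    filters = {}
    for term, postings in index.items():
        bf = BloomFilter(max(8, bits_per_doc * len(postings)), num_hashes)
        for doc_id in postings:
            bf.add(doc_id)
        filters[term] = bf
    return filters


def candidate_docs(query_terms, index, filters):
    """Approximate conjunctive (AND) candidate generation."""
    if not query_terms or any(t not in index for t in query_terms):
        return []
    terms = sorted(query_terms, key=lambda t: len(index[t]))
    shortest, rest = terms[0], terms[1:]
    # Scan the shortest list; probe the other terms' filters.
    return [d for d in index[shortest] if all(d in filters[t] for t in rest)]


if __name__ == "__main__":
    index = {"big": [1, 2, 3, 5, 8], "data": [2, 3, 4, 8, 9], "lab": [3, 8, 10]}
    filters = build_filters(index)
    print(candidate_docs(["big", "data", "lab"], index, filters))
```

Because a Bloom filter has no false negatives, every document in the exact intersection is returned; a few extra documents may slip through as false positives. In a two-phase setup that is tolerable, since the second-phase reranker rescores all candidates anyway.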

Big Data Publication

• Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling
– http://www.umiacs.umd.edu/~jimmylin/publications/Ture_Lin_NAACL-HLT2012.pdf
• Large-Scale Machine Learning at Twitter
– http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
• Smoothing Techniques for Adaptive Online Language Models: Topic Tracking in Tweet Streams
– http://www.umiacs.umd.edu/~jimmylin/publications/Lin_etal_KDD2011.pdf

Big Data Book

• Data-Intensive Text Processing with MapReduce

• http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf