INSTITUTE OF COMPUTING TECHNOLOGY

BigDataBench: a Big Data Benchmark Suite from Web Search Engines

Wanling Gao, Yuqing Zhu, Zhen Jia, Chunjie Luo, Lei Wang, Jianfeng Zhan, Yongqiang He, Shiming Gong, Xiaona Li, Shujie Zhang, and Bizhu Qiu

Third Workshop on Architectures and Systems for Big Data (ASBD 2013), held in conjunction with ISCA-40

http://prof.ict.ac.cn/BigDataBench

Acknowledgements

• This work is supported by the Chinese 973 project (Grant No. 2011CB302502), the Hi-Tech Research and Development (863) Program of China (Grant No. 2011AA01A203, No. 2013AA01A213), the NSFC project (Grant No. 60933003, No. 61202075), the BNSF project (Grant No. 4133081), and Huawei funding.

Requirements for Big Data Benchmarking

Source: "Proposal for a Big Data Benchmark Repository", Andries Engelbrecht, WBDB 2012

To truly reflect
• Use cases
• Requirements

To rapidly evolve
• New workloads
• New use cases

To widely cover
• Application domains
• Data types

Challenges

Current state: immature
"We Don't Know Enough to Make a Big Data Benchmark Suite: An Academia-Industry View", Yanpei Chen, UC Berkeley/Cloudera, WBDB 2012

Big Data System
• System Complexity
• Use Case Diversity
• Data Scale
• Rapid System Evolution

Big Data Benchmarks
• Develop Models
• Identify Typical Behavior
• Reproduce Behavior
• Keep Pace with Changes

Vagueness, Variety, Volume, Velocity

State‐of‐Practice and State‐of‐Art 

Sort: MinuteSort, JouleSort, TeraByte Sort
• Only one application: a one-size-fits-all solution?

GridMix, Hadoop microbenchmarks
• Generate data sets randomly
• Ignore the characteristics of real-world data

BigBench
• From electronic commerce
• Only describes a specification, without an implementation

BigDataBench: an open source project 

Available from http://prof.ict.ac.cn/BigDataBench

Data generation tools
• Generate GB, TB, or PB-scale data from seed data

BigDataBench 1.0 Beta version
• Six benchmarks from web search engines

Outline

• Motivation
• Benchmarking Methodology
• Case Studies
• Future Work

Benchmarking Methodology

Following An Incremental Approach

Considering Variety of Workloads

Methodology of Generating Big Data

Investigate Application Domain

Why do we choose search engines as the first step?

So many application domains

Popularity of Search Engines

92% of online adults use search engines to find information on the web (Pew Internet study).

Page Views & Daily Visitors

Search engines are the most important Internet service workload.
• Search engines make up 40% of the top 20 websites
• Source: http://www.alexa.com/topsites/global;0

[Pie chart "Top 20 websites": slices of 40%, 25%, 15%, 5%, and 15%; legend: Search Engine, Social Network, Electronic Commerce, Media Streaming]

Benchmarking Methodology

Following An Incremental Approach

Considering Variety of Workloads

Methodology of Generating Big Data

Our two practices

Surveying common open-source search engines
Backed by practitioners from several industry partners
• From [industry partner logos]

Building a semantic search engine (Chinese): ProfSearch
• Search scientists or professionals
• 251,564 researchers across 260 universities and institutes
• http://prof.ict.ac.cn/

Overview of a Typical Search Engine

Algorithms in a Typical Search Engine

• Graph mining
• Grep & segmentation
• PageRank
• Word count
• Sort
• Vector calculation

ProfSearch

Crawler Workloads
• Scrapy

Analysis Workloads
• SVM, Naïve Bayes, K-means, HMM, CRFs, LSA, LDA

Store and Management Workloads
• HDFS: storing unstructured web pages
• HIVE: storing semi-structured intermediate data
• MySQL: storing structured data extracted from the web

Web Service Workloads
• Sphinx

Three “V” of Big Data

Variety (one of the 3 Vs) means different:
• Computation
• Memory access patterns
• I/O access patterns

Diversity of typical workloads must be considered!

Workloads in BigDataBench 1.0 Beta

Analysis Workloads
Simple but representative operations
• Sort, Grep, Wordcount
Highly recognized algorithms
• Naïve Bayes, SVM

Search Engine Service Workloads
Widely deployed services
• Nutch Server

Features of Workloads

Workloads    | Resource Characteristic | Computing Complexity | Instructions
-------------+-------------------------+----------------------+---------------------------------------------
Sort         | I/O bound               | O(n*lgn)             | Integer comparison dominant
Wordcount    | CPU bound               | O(n)                 | Integer comparison and calculation dominant
Grep         | Hybrid                  | O(n)                 | Integer comparison dominant
Naïve Bayes  | /                       | O(m*n)               | Floating-point computation dominant
SVM          | /                       | O(M*n)               | Floating-point computation dominant
Nutch Server | I/O & CPU bound         |                      | Integer comparison dominant

m: the length of the dictionary
M: the number of support vectors * dimension
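To make the complexity column concrete, the three base operations can be sketched on a single machine as follows. This is only an illustration of the operations themselves, not the Hadoop implementations shipped with BigDataBench; the function names and the sample input are assumptions made for this sketch.

```python
import re
from collections import Counter


def sort_records(records):
    # Comparison-based sort: O(n log n) comparisons.
    return sorted(records)


def word_count(lines):
    # A single pass over the input: O(n) in the number of words.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def grep(lines, pattern):
    # A single pass over the input, keeping the lines that match.
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Hypothetical sample input, for illustration only.
sample = ["big data benchmark", "web search engine", "big data suite"]
```

Calling `word_count(sample)` counts each word once per occurrence, and `grep(sample, "search")` keeps only the lines containing the pattern.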

A variety of workloads is included

Workloads
• Off-line
  • Base Operations: Sort (I/O bound), Wordcount (CPU bound), Grep (hybrid)
  • Machine Learning: Naïve Bayes, SVM
• On-line
  • Nutch Server

Benchmarking Methodology

Following An Incremental Approach

Considering Variety of Workloads

Methodology of Generating Big Data

Big Data Puzzle

Big Data
• Confidential to companies
• Difficult to download
• Small-scale data is easy to get

Methodology of Generating Big Data

To preserve the characteristics of real-world data:

Small-scale data → characteristic analysis → expand → big data

Characteristics analyzed
• Semantic: word frequency
• Locality
  • Temporally: word reuse distance
  • Spatially: word distribution in document
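The analyze-then-expand idea can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the BigDataBench tool: it preserves only the word frequency of the seed text (a unigram model) and ignores the locality characteristics (word reuse distance and word distribution in documents). The function names and the seed string are hypothetical.

```python
import random
from collections import Counter


def fit_word_frequency(seed_text):
    """Estimate word frequencies from small-scale seed text."""
    counts = Counter(seed_text.split())
    words = list(counts)
    weights = [counts[w] for w in words]
    return words, weights


def expand(seed_text, target_words, rng=None):
    """Generate target_words words whose frequencies follow the seed's."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    words, weights = fit_word_frequency(seed_text)
    return rng.choices(words, weights=weights, k=target_words)


# Hypothetical seed data, for illustration only.
seed = "big data benchmark big data suite search engine big"
synthetic = expand(seed, target_words=10_000)
```

The synthetic output can be made arbitrarily large by raising `target_words`, while its vocabulary and relative word frequencies stay close to those of the seed.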

Scalable Data Generation Tool

Request Generation
Real query trace → Timing Model, Semantic Model, Locality Model → Synthetic query rate

Input Data Generation
Small-scale data → Timing Model, Semantic Model, Locality Model → Synthetic big data
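As a rough illustration of request generation, the sketch below derives a timing model from a small query trace by collecting its inter-arrival gaps, then resamples gaps and queries to produce a longer synthetic request stream. This is an assumed, simplified stand-in for the timing and semantic models named on the slide, not the tool's actual method; the trace values and function names are hypothetical.

```python
import random


def fit_timing_model(timestamps):
    """Return the inter-arrival gaps observed in a sorted query trace."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]


def generate_requests(queries, timestamps, n, rng=None):
    """Produce n synthetic (time, query) pairs by resampling the trace."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    gaps = fit_timing_model(timestamps)
    now, requests = 0.0, []
    for _ in range(n):
        now += rng.choice(gaps)              # timing: reuse an observed gap
        requests.append((now, rng.choice(queries)))  # query: reuse an observed query
    return requests


# Hypothetical trace (arrival times in seconds), for illustration only.
trace_times = [0.0, 0.4, 0.5, 1.3, 1.4]
trace_queries = ["hadoop", "big data", "hadoop", "svm", "hadoop"]
requests = generate_requests(trace_queries, trace_times, n=1000)
```

Because queries are drawn from the trace with repetition, frequently issued queries stay frequent in the synthetic stream.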

BigDataBench‐Data Analysis Workloads

Data Generation Tool → Big Data → Wordcount, Grep, Naïve Bayes, Sort, SVM
• Big data: generated from small-scale data, using HDFS as the storage
• Workloads: representative workloads in search engines

BigDataBench‐Service Workload

Data Generation Tool → Massive Requests → Search engine
• Massive requests: generated from small-scale query sequences and timing sequences
• Search engine: uses the local file system as the storage

Outline

• Motivation
• Benchmarking Methodology
• Case Studies
• Future Work

Use Case 1: System Evaluation

Using BigDataBench 1.0 Beta

Data scale
• 10 GB – 2 TB

Hadoop configuration
• 1 master and 14 slave nodes

System Evaluation

A threshold for each workload
• 100 MB ~ 1 TB
• The system is fully loaded when the data volume exceeds the threshold

Sort is an exception
• An inflexion point (10 GB ~ 1 TB)
• Data processing rate decreases after this point
• Global data access requirements
  • I/O and network bottleneck

System performance is dependent on applications and data volumes.
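One way to locate such a threshold or inflexion point is to compute the data processing rate (data volume divided by run time) at each scale and look for the peak. The sketch below assumes that definition of the rate; the measurements are made-up numbers for illustration only and are not results from the talk.

```python
def processing_rates(measurements):
    """Map each (data_volume_gb, runtime_s) pair to (volume, rate in GB/s)."""
    return [(volume, volume / runtime) for volume, runtime in measurements]


def peak(measurements):
    """Return the (volume, rate) pair with the highest processing rate."""
    return max(processing_rates(measurements), key=lambda pair: pair[1])


# Hypothetical (volume in GB, run time in seconds) pairs; not measured data.
runs = [(10, 100.0), (100, 500.0), (1000, 4000.0), (2000, 10000.0)]
peak_volume, peak_rate = peak(runs)
```

With these invented numbers the rate rises up to the 1000 GB run and drops afterwards, which is the shape the slide describes for Sort.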

Use case 2: Architecture Research

Using BigDataBench 1.0 Beta

Data scale
• 10 GB – 2 TB

Hadoop configuration
• 1 master and 14 slave nodes

Use case 2: Architecture Research

Some micro-architectural events tend towards stability once the data volume increases to a certain extent

Cache and TLB behaviors show different trends with increasing data volumes for different workloads
• L1I_miss/1000ins: increases for Sort, decreases for Grep
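A metric such as L1I_miss/1000ins is a raw miss count normalized by the number of instructions. The sketch below shows that normalization together with one possible stability check; the 5% relative tolerance and both function names are assumptions for illustration, since the slides do not define "stability" numerically.

```python
def misses_per_kilo_instructions(misses, instructions):
    """Cache or TLB misses per 1000 instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000.0 * misses / instructions


def has_stabilized(series, tolerance=0.05):
    """True if the last value is within `tolerance` (relative) of the previous one.

    `series` holds one metric value per data volume, in increasing volume order.
    """
    if len(series) < 2:
        raise ValueError("need at least two data points")
    previous, last = series[-2], series[-1]
    return abs(last - previous) <= tolerance * abs(previous)
```

For example, 5,000 misses over 1,000,000 instructions gives 5.0 misses per 1000 instructions.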

Search engine service experiments

The same phenomenon is observed
• Micro-architectural events tend towards stability once the index size increases to a certain extent

Big data imposes challenges on architecture research, since large-scale simulation is time-consuming

Index size: 2 GB ~ 8 GB
Segment size: 4.4 GB ~ 17.6 GB

Outline

• Motivation
• Benchmarking Methodology
• Case Studies
• Conclusion and Future Work

Conclusion (1)

We create a big data benchmark suite from web search engines
• Data generation tools and six workloads

First open-source project on big data benchmarking
• http://prof.ict.ac.cn/BigDataBench
• Downloads are welcome

Conclusions (2)

Workloads + Data Volumes → the peak data processing rates of big data systems

The peak performance of big data systems is dependent on applications and data volumes

Conclusions (3)

[Diagram labels: Architectural events, Stability, Data Volume]

Big data imposes challenges on experiment methodologies in architecture research

Future Work (1)

Release BigDataBench 2.0

Consider data variety!
• Structured, semi-structured, and unstructured data

Include multimedia data and applications

Include online data analysis applications

2013/7/10

Future Work (2)

Include workloads in other important Internet service domains
• Electronic commerce
• Social network

Internet services take up only a small part of big data applications
• A long way to go!!!

Thank you! Any questions?

Backup 

Verifying  the validity of data generation methodology

A black-box observation approach
• System metrics
• Micro-architecture events

Data set
• 146 MB of real data from Wikipedia
• 146 MB of synthetic data generated using our tool

Hadoop configuration
• 1 master and 1 slave node

Data Processing Rate

The data processing rates are close for the synthetic and real data

Cache and TLB Behavior

Similar cache and TLB behaviors for the synthetic and real data
