Treasure Data Cloud Strategy

Treasure Data Cloud Strategy Masahiro Nakagawa July Tech Festa: Jul 14, 2013 Sunday, July 14, 13

Upload: treasure-data-inc

Post on 08-May-2015


DESCRIPTION

Keynote slide on July Tech Festa 2013

TRANSCRIPT

Page 1: Treasure Data Cloud Strategy

Treasure Data Cloud Strategy

Masahiro Nakagawa

July Tech Festa: Jul 14, 2013

Sunday, July 14, 13

Page 2: Treasure Data Cloud Strategy

Who are you?

§ Masahiro Nakagawa
• @repeatedly / [email protected]

§ Treasure Data, Inc.
• Senior Software Engineer, since 2012/11

§ Open Source projects
• D Programming Language
• MessagePack: D, Python, etc.
• Fluentd: Core, Mongo, Logger, etc.
• etc.

Page 3: Treasure Data Cloud Strategy

Treasure Data overview


Page 4: Treasure Data Cloud Strategy

Company Overview

§ Silicon Valley-based company
• All founders are Japanese
  • Hironobu Yoshikawa
  • Kazuki Ohta
  • Sadayuki Furuhashi
• About 20 people
• Over 3.5 million jobs

§ OSS enthusiasts
• MessagePack, Fluentd, etc.

Page 5: Treasure Data Cloud Strategy

Investors

§ Bill Tai
§ Othman Laraki - Former VP Growth at Twitter
§ James Lindenbaum, Adam Wiggins, Orion Henry - Heroku Founders
§ Anand Babu Periasamy, Hitesh Chellani - Gluster Founders
§ Yukihiro “Matz” Matsumoto - Creator of Ruby
§ Dan Scheinman - Director of Arista Networks
§ Jerry Yang - Founder of Yahoo!

Page 6: Treasure Data Cloud Strategy

Treasure Data = Cloud + Big Data

[Figure: market map by data volume, comparing On-Premise and Cloud. Labels: Lightweight RDBMS, Enterprise RDBMS, DB2, Traditional Data Warehouse, "1 Bil entry or 10 TB", Database-as-a-service, Big Data-as-a-Service, "$10B market", "$34B market". Source: © 2012 Forrester Research, Inc. Reproduction Prohibited]

Page 7: Treasure Data Cloud Strategy

The Problem with Other Solutions

[Figure: customer value over time, starting at sign-up or PO]

On-Premise Solutions
• Obsolescence over time
• Need upgrade

AWS (or hosted Hadoops)
• EC2, EMR, RedShift, S3
• Step-by-step manual integrations
• Complex Solutions + Data Collection + Maintain
• NO specialists, TOO LONG to get live

Treasure Data
• Fully integrated Big Data full-stack service with simple interface, low-friction initial engagement & continuous technical upgrade

Page 8: Treasure Data Cloud Strategy

Big Data Adoption Stages

[Figure: stages ordered along the axes "Intelligence" and "Sophistication"]

Reporting
• Standard Reports: What happened?
• Ad-hoc Reports: Where?
• Drill Down Query: Where exactly?
• Alerts: Error?

Analytics
• Statistical Analysis: Why?
• Predictive Analysis: What’s a trend?
• Optimization: What’s the best?

Page 9: Treasure Data Cloud Strategy

Big Data Adoption Stages

[Figure: the same chart as on the previous page, from Standard Reports to Optimization, with one highlighted region]

Treasure Data’s FOCUS (80% of needs)

Page 10: Treasure Data Cloud Strategy


Full Stack Support for Big Data Reporting

Our best-in-class architecture and operations team ensure the integrity and availability of your data.

Data from almost any source can be securely and reliably uploaded using td-agent in streaming or batch mode.

Our SQL, REST, JDBC, ODBC and command-line interfaces support all major query tools and approaches.

You can store gigabytes to petabytes of data efficiently and securely in our cloud-based columnar datastore.


Page 11: Treasure Data Cloud Strategy

We are...

Big Data as a Service, not Hadoop on Cloud

Page 12: Treasure Data Cloud Strategy

Product

Data Collection
• Sources: Web Log, App Log, Sensor, RDBMS, CRM, ERP
• Streaming Upload: Open-Source Log Collector, 2,500+ companies (incl. LinkedIn, etc.), 60 billion / month
• Bulk Upload, Parallel Upload: Bulk Loader for CSV / TSV, MySQL, Postgres, Oracle, etc.

Data Warehouse
• Columnar Storage + Hadoop MapReduce
• 600 bil+ records, 3.5 mil+ jobs

Data Analysis
• SQL (HiveQL) or Pig
• REST, JDBC / ODBC
• BI Tools: Tableau, QlikView, Pentaho, Excel, etc.
• Dashboard
• Result push: Custom App, RDBMS, FTP, etc.

Value Proposition: “Time-to-Answer”
• 20bil+, 2 weeks, UK/Austria
• 3bil+, 3 weeks, Singapore
• 2 weeks, US
• 2 weeks, US
• 3 weeks, Japan

Multi-Tenant: Single Code for Everyone - Improving the Platform Faster (e.g. SFDC, Heroku)

Page 13: Treasure Data Cloud Strategy

Our Customers – 80 companies

http://docs.treasure-data.com/categories/success-stories

Page 14: Treasure Data Cloud Strategy

A case: “14 Days” from Signup to Success

1. Europe’s largest mobile ad exchange.

2. Serving >20 billion impressions/month for >15,000 mobile apps (Q1 2013)

3. Immediate need for analytics infrastructure: ASAP!

4. With TD, MobFox got into production in only 14 days, with one engineer.

"Time is the most precious asset in our fast-moving business, and Treasure Data saved us a lot of it."

Julian Zehetmayr, CEO & Founder

td-agent = fluentd rpm/deb

Page 15: Treasure Data Cloud Strategy

A case: “Replace” in-house Hadoop with TD

1. Global “Hulu”: online video service with millions of users

2. Video content is distributed in over 150 languages.

3. Had a hard time maintaining their Hadoop cluster

4. With TD, Viki deprecated their in-house Hadoop cluster and now uses its engineers for core business.

[Figure: Before / After]

“Treasure Data has always given us thorough and timely support peppered with insightful tips to make the best use of their service.”

Huy Nguyen, Software Engineer

Page 16: Treasure Data Cloud Strategy

A case: Treasure Data with BI Tool (Tableau)

1. World’s largest Android application market

2. Serving >3 billion app downloads for >100 million users

3. Only one engineer managing the data infrastructure

4. With TD, the data engineer can focus on analyzing data with an existing BI tool

"I will recommend Treasure Data to my friends in a heartbeat because it benefits all three stakeholders: Operations, Engineering and Business."

Simon Dong, Principal Architect - Data Engineering

Page 17: Treasure Data Cloud Strategy

Vision: Single Analytics Platform for the World

http://www.chisite.org/initiatives/WGII

Page 18: Treasure Data Cloud Strategy

Treasure Data’s Service Architecture

Page 19: Treasure Data Cloud Strategy

Treasure Data = Collect + Store + Query

Page 20: Treasure Data Cloud Strategy

Architecture Breakdown

Data Collection
• Increasing variety of data sources
• No single data schema
• Lack of streaming data collection method
• 60% of Big Data project resources consumed

Data Store/Analytics
• Remaining complexity in both traditional DWH and Hadoop (very slow time to market)
• Challenges in scaling data volume and expanding cost

Connectivity
• Required to ensure connectivity with existing BI/visualization/apps by JDBC, ODBC and REST
• Output to other services, e.g. S3, RDBMS, etc.

Page 21: Treasure Data Cloud Strategy

Product Philosophy

§ Data first, Schema later
• “Schema-on-Read”
• Both Batch and Query processing

§ Simple APIs
• Easy to use and powerful

§ Easy integration
• Log collection, BI tools, etc.

Page 22: Treasure Data Cloud Strategy

Our technology stack

§ td-agent
• ETL part of Treasure Data

§ Plazma
• Big data processing infrastructure
• Column-oriented storage
• Reliable data handling

§ Multi-tenant scheduler
• Robust distributed queue and scheduler

Page 23: Treasure Data Cloud Strategy

1) Data Collection

§ 60% of BI project resources are consumed here
§ Most ‘underestimated’ and ‘unsexy’ but MOST important
§ Fluentd: OSS lightweight but robust Log Collector
• http://fluentd.org/

Page 24: Treasure Data Cloud Strategy

[Figure: service architecture. Apache, App, App, other data sources, td-agent, RDBMS → Treasure Data columnar data warehouse → Query Processing Cluster (HIVE, PIG) → Query API (JDBC, REST) → User (td-command, BI apps). A "This!" callout marks the part discussed in this section.]

Page 25: Treasure Data Cloud Strategy

Fluentd: the missing log collector

fluentd.org

Page 26: Treasure Data Cloud Strategy

Data Processing

Collect Store Process Visualize

Data source

Reporting Monitoring


Page 27: Treasure Data Cloud Strategy

Related Products

• Collect: ???
• Store, Process: Cloudera, Hortonworks, Treasure Data
• Visualize: Tableau, Excel, R

easier & shorter time

Page 28: Treasure Data Cloud Strategy

In short

§ Open-source log collector written in Ruby
• Easy to use, reliable and performs well
• Like streaming event processing

§ Uses the RubyGems ecosystem for plugins

It’s like syslogd, but uses JSON for log messages

Page 29: Treasure Data Cloud Strategy

Example (Apache to MongoDB)

Web Server (apache.log) → tail → Fluentd (event buffering) → insert → MongoDB

apache.log:
127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:26:30] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:26:32] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:26:40] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:27:01] "GET / ...
...

Event (time, tag, record):
2012-12-11 07:26:27
apache.log
{ "host": "127.0.0.1", "method": "GET", ...}
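The tail step above can be sketched in Python. This is only an illustration of turning one access-log line into a (time, tag, record) event. It is not Fluentd's own parser (Fluentd is written in Ruby), and the regular expression, the function name parse_line and the extra "path" field are assumptions made for this sketch.

```python
import json
import re
from datetime import datetime

# Hypothetical pattern for the truncated sample lines on the slide:
# host, two ignored fields, a bracketed timestamp, then method and path.
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)'
)


def parse_line(line, tag="apache.log"):
    """Turn one access-log line into a (time, tag, record) event, or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # unparseable lines are simply skipped in this sketch
    time = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S")
    record = {
        "host": m.group("host"),
        "method": m.group("method"),
        "path": m.group("path"),
    }
    return time, tag, record


# The first sample line from the slide, including its trailing "...".
event = parse_line('127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / ...')
print(event[0], event[1], json.dumps(event[2]))
```

The record is a plain dictionary, so it serializes directly to the JSON form shown on the slide.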

Page 30: Treasure Data Cloud Strategy

Before Fluentd

[Figure: applications on Server1, Server2 and Server3 send their logs to a Log Server]

High latency! Must wait for a day...

Page 31: Treasure Data Cloud Strategy

After Fluentd

[Figure: applications on Server1, Server2 and Server3 each have a Fluentd instance, which sends to two further Fluentd instances]

In streaming!

Page 32: Treasure Data Cloud Strategy

Pluggable architecture

[Figure: Input → Engine → Buffer → Output, with "Pluggable" labels]

• Input: Forward, HTTP, File tail, dstat, ...
• Buffer: File, Memory
• Output: Forward, File, MongoDB, ...
• Output: rewrite, ...
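The input / buffer / output split can be illustrated with a toy pipeline. This is a sketch of the general idea only, not Fluentd's plugin API. Every class and function name below (ListInput, MemoryBuffer, ListOutput, run_engine) is hypothetical.

```python
class ListInput:
    """Input stand-in: yields (tag, record) events from a list."""

    def __init__(self, events):
        self.events = events

    def read(self):
        yield from self.events


class MemoryBuffer:
    """Buffer stand-in: collects events into a chunk of limited size."""

    def __init__(self, chunk_limit):
        self.chunk_limit = chunk_limit
        self.chunk = []

    def append(self, event):
        """Store the event and report whether the chunk is full."""
        self.chunk.append(event)
        return len(self.chunk) >= self.chunk_limit

    def drain(self):
        """Hand over the current chunk and start a new one."""
        chunk, self.chunk = self.chunk, []
        return chunk


class ListOutput:
    """Output stand-in: remembers every chunk it was asked to write."""

    def __init__(self):
        self.written = []

    def write(self, chunk):
        self.written.append(chunk)


def run_engine(source, buffer, sink):
    """Move events from the input through the buffer to the output."""
    for event in source.read():
        if buffer.append(event):
            sink.write(buffer.drain())
    rest = buffer.drain()
    if rest:  # flush the final, partially filled chunk
        sink.write(rest)


sink = ListOutput()
events = [("apache.log", {"n": i}) for i in range(5)]
run_engine(ListInput(events), MemoryBuffer(chunk_limit=2), sink)
print([len(chunk) for chunk in sink.written])
```

Because the engine only calls read, append, drain and write, any of the three parts can be swapped for another implementation with the same methods.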

Page 33: Treasure Data Cloud Strategy

[Figure: Fluentd in the middle (buffer / filter / routing). Sources: access logs (Apache), app logs (Frontend, Backend), system logs (syslogd), databases. Destinations: alerting (Nagios), analysis (MongoDB, MySQL, Hadoop), archiving (Amazon S3).]

Page 34: Treasure Data Cloud Strategy

td-agent

§ Open-source distribution package of Fluentd
• ETL part of Treasure Data
• rpm, deb and homebrew

§ Includes useful components
• ruby, jemalloc, fluentd
• 3rd party gems: td, mongo, webhdfs, etc.
• td plugin is for Treasure Data

§ http://packages.treasure-data.com/

Page 35: Treasure Data Cloud Strategy

2) Data Store / Analytics

§ Remaining complexity in both DWH and Hadoop
§ Challenges in scaling data volume and expanding cost
§ Plazma: Hadoop ecosystem and own projects

Page 36: Treasure Data Cloud Strategy

[Figure: service architecture. Apache, App, App, other data sources, td-agent, RDBMS → Treasure Data columnar data warehouse → Query Processing Cluster (HIVE, PIG) → Query API (JDBC, REST) → User (td-command, BI apps). A "This!" callout marks the part discussed in this section.]

Page 37: Treasure Data Cloud Strategy

AWS Component Dependencies (1)

§ RDS
• Store user information, job status, etc.
• Store metadata of our columnar database
• Queue worker / Scheduler

§ EC2
• API servers (Ruby on Rails 3)
• Hadoop clusters
• Job workers
• Using Chef to deploy

Page 38: Treasure Data Cloud Strategy

AWS Component Dependencies (2)

§ ELB
• Load balancing of API servers
• Load balancing of td-agents

§ S3
• Columnar storage built on top of S3
• MessagePack columnar format
• Realtime / Archive storage
• Our Result feature supports S3 output.

No EBS, EMR, SQS or other products!

Page 39: Treasure Data Cloud Strategy

Treasure Data Service Processing Flow

[Figure: Frontend, Queue, Worker and Hadoop components → Fluentd → Librato Metrics and Treasure Data]

• Applications push metrics to Fluentd (via local Fluentd)
• Fluentd sums up data per minute (partial aggregation)
• Librato Metrics for realtime analysis
• Treasure Data for historical analysis
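Partial aggregation per minute can be sketched as follows. The sample format (epoch seconds, metric name, value) and the function name aggregate_by_minute are assumptions for illustration. The slides do not show how the aggregation is actually configured in Fluentd.

```python
from collections import defaultdict


def aggregate_by_minute(samples):
    """Sum (epoch_seconds, name, value) samples into one total per minute."""
    totals = defaultdict(int)
    for ts, name, value in samples:
        minute = ts - ts % 60  # start of the minute the sample falls in
        totals[(minute, name)] += value
    return dict(totals)


# Hypothetical metric samples: two in the first minute, two in the second.
samples = [(0, "jobs", 1), (59, "jobs", 2), (60, "jobs", 4), (61, "errors", 1)]
print(aggregate_by_minute(samples))
```

Only one row per metric and minute is forwarded, so the downstream services receive far fewer records than the applications emit.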

Page 40: Treasure Data Cloud Strategy

Data Processing Flow

Page 41: Treasure Data Cloud Strategy

Structure of Columnar Storages

[Figure: Data import → Realtime Storage → merge (every 1 hour) → Archive Storage; SELECT ... queries read the stored data]

Sample entries shown in the figure:

2013-07-12 00:23:00 912ec80
2013-07-13 00:01:00 277a259
2013-07-14 00:02:00 d52c831
...

23c82b0ba3405d4c15aa85d2190e6d7b1482412ab14f0332b8aee1198a7bc848b2791b8fd603c719e54f0e3d402b17638477c9a7977e7dab
...

Page 42: Treasure Data Cloud Strategy

Query Language

Query Execution

Columnar Data

Object Storage


Page 43: Treasure Data Cloud Strategy

1/4: Compile SQL into MapReduce

SQL Statement:

SELECT COUNT(DISTINCT ip) FROM tbl;

Hive + TD UDFs: SQL-to-MapReduce
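To make the idea concrete, here is a hand-written map / shuffle / reduce version of the query in Python. It shows the shape of such a job only. It is not the plan Hive generates, and the table contents and function names are made up for this sketch.

```python
from itertools import groupby


def map_phase(rows):
    """Emit the ip column as the key; the value is not used."""
    for row in rows:
        yield row["ip"], 1


def shuffle(pairs):
    """Sort and group by key, as a framework does between map and reduce."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, [value for _, value in group]


def reduce_phase(grouped):
    """Each distinct key arrives once, so counting keys is COUNT(DISTINCT)."""
    return sum(1 for _key, _values in grouped)


# Hypothetical table contents.
tbl = [{"ip": "10.0.0.1"}, {"ip": "10.0.0.2"}, {"ip": "10.0.0.1"}]
distinct_ips = reduce_phase(shuffle(map_phase(tbl)))
print(distinct_ips)
```

In a real cluster the map calls run on many machines at once and the shuffle moves equal keys to the same reducer. Here everything runs in one process.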

Page 44: Treasure Data Cloud Strategy

2/4: MapReduce is executed in parallel

SELECT COUNT(DISTINCT ip) FROM tbl;


Page 45: Treasure Data Cloud Strategy

3/4: Columnar Data Access

Read ONLY the Required Part of Data

SELECT COUNT(DISTINCT ip) FROM tbl;

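A small Python sketch of why a columnar layout helps this query: once rows are stored column by column, the query only has to touch the ip column. The in-memory dictionaries below are an assumption for illustration. The slides describe the real storage as a MessagePack columnar format on S3, which this does not reproduce.

```python
def to_columns(rows):
    """Pivot row dictionaries into one list of values per column name."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Hypothetical rows; every row is assumed to carry the same keys.
rows = [
    {"ip": "10.0.0.1", "path": "/", "status": 200},
    {"ip": "10.0.0.2", "path": "/a", "status": 404},
    {"ip": "10.0.0.1", "path": "/b", "status": 200},
]

columns = to_columns(rows)

# SELECT COUNT(DISTINCT ip) reads the "ip" column and nothing else.
distinct_ip_count = len(set(columns["ip"]))
print(distinct_ip_count)
```

With a row layout the same query would have to read path and status too, even though it never uses them.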

Page 46: Treasure Data Cloud Strategy

4/4: Object-based Storage


Page 47: Treasure Data Cloud Strategy

Apply Schema

Raw data (JSON):
{"user":54, "name":"test", "value":"120", "host":"local"}

Schema:
user:int name:string value:int host:int

SELECT result:
54 (int), "test" (string), 120 (int), NULL
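The schema-on-read behavior in this example can be mimicked in a few lines of Python: the raw JSON is stored as-is, and the schema is applied only when a row is read. The function name apply_schema and the cast table are hypothetical, and returning None for a failed cast is an assumption that matches the NULL shown for host.

```python
import json

# Hypothetical mapping from schema type names to Python casts.
CASTS = {"int": int, "string": str}


def apply_schema(raw_json, schema):
    """Cast each schema column at read time; failed casts become None."""
    record = json.loads(raw_json)
    row = []
    for column, type_name in schema:
        try:
            row.append(CASTS[type_name](record[column]))
        except (KeyError, ValueError, TypeError):
            row.append(None)  # missing column or value that cannot be cast
    return row


raw = '{"user":54, "name":"test", "value":"120", "host":"local"}'
schema = [("user", "int"), ("name", "string"), ("value", "int"), ("host", "int")]
print(apply_schema(raw, schema))
```

The string "120" casts cleanly to the integer 120, while "local" cannot be read as an integer and comes back as None.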

Page 48: Treasure Data Cloud Strategy

Multi-Tenancy

§ All customers share the Hadoop clusters (Multi Data Centers)
§ Resource Sharing (Burst Cores), Rapid Improvement, Ease of Upgrade

[Figure: Job Submission + Plan Change → Global Scheduler (On-Demand Resource Allocation) → a Local FairScheduler in each of datacenter A, B, C and D]

Page 49: Treasure Data Cloud Strategy

Trial and error on Cloud

§ Rapid development
• Change hardware
  • New architecture testing
  • Performance testing
• Change software
  • Hadoop parameters
  • etc.

§ Use git and chef for these purposes
• Easy to deploy and apply changes
• git for change history

Page 50: Treasure Data Cloud Strategy

Our Operation Stack: Full Use of SaaS

§ Services
• CopperEgg
• Librato Metrics
• Logentries
• NewRelic
• PagerDuty
• Desk.com
• Olark
• HipChat
• Alerting

§ Tools
• Hosted Chef (Opscode)
• Jenkins
  • including integration test

Page 51: Treasure Data Cloud Strategy

Page 52: Treasure Data Cloud Strategy

Page 53: Treasure Data Cloud Strategy

Page 54: Treasure Data Cloud Strategy

3) Connectivity

§ Need to visualize the query result
§ Use metrics / graph for interactive comparison
§ Result: Export result and use existing tools

Page 55: Treasure Data Cloud Strategy

[Figure: service architecture. Apache, App, App, other data sources, td-agent, RDBMS → Treasure Data columnar data warehouse → Query Processing Cluster (HIVE, PIG) → Query API (JDBC, REST) → User (td-command, BI apps). A "This!" callout marks the part discussed in this section.]

Page 56: Treasure Data Cloud Strategy

Pull and Push approaches

[Figure: Treasure Data (Columnar Storage, Query Processing Cluster, Query API)]

• Query (Pull): REST API, JDBC / ODBC Driver, td-command, BI apps
• Result (Push): Web App, MySQL, S3

Page 57: Treasure Data Cloud Strategy

Support list

§ Result
• Treasure Data
• MySQL
• PostgreSQL
• Google SpreadSheet
• REST API
• S3
• etc.

http://docs.treasure-data.com/categories/result

§ BI tool
• Pentaho
• Tableau
• JasperSoft
• Indicee
• Dr. Sum
• Metric Insight
• etc.

http://docs.treasure-data.com/categories/3rd-party-tools-overview

Page 58: Treasure Data Cloud Strategy

Conclusion

§ Treasure Data
• Cloud-based Big Data analytics platform
• Provide Machete for Big Data reporting

§ Big Data processing
• Collect / Store / Analytics / Visualization

§ Consider trade-offs
• Cloud reinforces the idea but is not the differentiator
• What is the strong point?
• Should focus on own vision!

Our focus!

Page 59: Treasure Data Cloud Strategy

Big Data for the Rest of Us

www.treasure-data.com | @TreasureData
