B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama
![Page 1: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/1.jpg)
Extremely Tuned Hadoop Cluster
Daisuke Hirama, Insight Technology, Inc.
Copyright © 2013 Insight Technology, Inc. All Rights Reserved.
~ How should those of us who love RDBMS come to love Hadoop? ~
![Page 2: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/2.jpg)
"Big Data", "Big Data", "Big Data"!
![Page 3: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/3.jpg)
When it comes to big data, it's Hadoop... but why?
PB
![Page 4: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/4.jpg)
The core of Hadoop is HDFS and MapReduce
![Page 5: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/5.jpg)
This is MapReduce! (What a hassle...)
From "MapReduce: Simplified Data Processing on Large Clusters"
"I like both dogs and cats."
Key=dog value=1
Key=cat value=1
Key=dog value=10
Key=cat value=12
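The key/value flow on this slide can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce phases, not Hadoop's API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `animal_mapper`, `sum_reducer`) and the sample sentences are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_phase(pairs):
    """Group all values that share a key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}

def animal_mapper(sentence):
    # Hypothetical mapper: emit (animal, 1) for each animal word found in the sentence.
    return [(word, 1) for word in ("dog", "cat") if word in sentence]

def sum_reducer(key, values):
    return sum(values)

# Made-up input sentences for the sketch.
sentences = ["I like both dogs and cats.", "My cat sleeps all day.", "The dog barks."]
pairs = map_phase(sentences, animal_mapper)
counts = reduce_phase(shuffle_phase(pairs), sum_reducer)
print(counts)
```

In a real cluster the three phases run on different machines and the shuffle moves data over the network; here they are ordinary function calls so the data flow is easy to follow.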
![Page 6: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/6.jpg)
A concrete MapReduce example
114 million records, 260 GB
(1% of one month's worth)
![Page 7: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/7.jpg)
When one tweet is received... (JSON data)
{"text":"\u81ea\u5206\u304c\u4fe1\u3058\u3089\u308c\u308b\u3060\u3051\u3058\u3083\u306a\u304f\u3066\u81ea\u5206\u306e\u3053\u3068\u3092\u4fe1\u3058\u3066\u304f\u308c\u308b\u4eba\u305f\u3061\u306e\u3053\u3068\u306f\u5927\u5207\u306b\u3002\/\u4ec1","contributors":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"retweet_count":0,"in_reply_to_screen_name":null,"in_reply_to_user_id_str":null,"retweeted":false,"source":"web","entities":{"urls":[],"hashtags":[],"user_mentions":[]},"place":null,"in_reply_to_status_id":null,"id_str":"241415049216925697","coordinates":null,"user":{"statuses_count":1432,"geo_enabled":false,"profile_link_color":"0084B4","verified":false,"profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/603245248\/obluhsv93jc29erghpt1.gif","default_profile_image":false,"friends_count":378,"profile_background_color":"C0DEED","location":"","is_translator":false,"profile_background_tile":true,"favourites_count":0,"description":"\u5175\u5eabJK2\u3002\u5143\u7532\u6b66\u3002\r\n\u3059\u304d\u306a\u3082\u306e\u3002\u4ec1\u304f\u3093\/\u4e80\u3061\u3083\u3093\/KAT-TUN\/\u3084\u307e\u3074\u30fc\/\u4eae\u3061\u3083\u3093\/NEWS\/\u9234\u6728\u3048\u307f\/\u5927\u77f3\u53c2\u6708\/Taylor\u30fbMomsen\/Taylor\u30fbSwift\/Bruno\u30fbMars\/\u52a0\u85e4\u30df\u30ea\u30e4\/\u963f\u90e8\u771f\u592e\r\n\u30a2\u30e1\u30d6\u30ed\u3057\u3066\u308b\u3002\u3075\u3049\u308d\u30fc\u307f\u30fc\u3002:) hyphen\u3001\uff71\uff76\uff86\uff7c\uff6c\uff70\u3001\uff81\uff6c\uff9d\uff76\uff8a\uff9f\uff70\uff85\u304b\u3082\u3093\u304b\u3082\u3093\u2606\u5f61","profile_sidebar_fill_color":"DDEEF6","follow_request_sent":null,"contributors_enabled":false,"lang":"ja","profile_sidebar_border_color":"C0DEED","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/2393527472\/expe7e9aiw04iu3iijb0_normal.jpeg","screen_name":"manatsu5","id_str":"585589997","listed_count":3,"protected":false,"show_all_inline_media":false,"following":null,"notifications":null,"profile_use_background_image":true,"followers_count":315,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/2393527472\/expe7e9aiw04iu3iijb0_normal.jpeg","name":"ma-natsu","default_profile":false,"created_at":"Sun May 20 11:25:20 +0000 2012","profile_text_color":"333333","id":585589997,"profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/603245248\/obluhsv93jc29erghpt1.gif","time_zone":null,"utc_offset":null,"url":"http:\/\/ameblo.jp\/kaaaaaat-tun6\/"},"favorited":false,"id":241415049216925697,"created_at":"Fri Aug 31 06:00:07 +0000 2012","geo":null,"truncated":false}
![Page 8: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/8.jpg)
Let's count the words in tweets
Input
key: 123456 value: {"text":"\u81ea\u5206\u304c…}
Parse the JSON
"吾輩は猫である。" ("I am a cat.")
Split the Japanese text into words (morphological analysis)
{"吾輩", "は", "猫", "で", "ある", "。"}
Map
key:"吾輩" value:1
key:"猫" value:1
key:"ある" value:1
Shuffle
key:"ある" value:{1,1,1,3,2,1,1}
key:"吾輩" value:{1,2,1}
key:"猫" value:{1,3,2,1,1,5,2}
Reduce
key:10 value:"ある"
key:4 value:"吾輩"
key:15 value:"猫"
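The tweet pipeline above (parse the JSON, split the text into words, emit one pair per word, then sum per word and emit the count as the key) might look roughly like this in plain Python. The tokenizer is a toy longest-match splitter over a hard-coded vocabulary and only stands in for real morphological analysis; `VOCAB`, `tokenize`, `map_tweet`, `shuffle`, `reduce_word`, and the two sample tweets are all assumptions made for the sketch.

```python
import json
from collections import defaultdict

# Toy vocabulary, assumed for illustration; a real job would call a morphological analyzer.
VOCAB = ["吾輩", "は", "猫", "で", "ある", "。"]
_BY_LENGTH = sorted(VOCAB, key=len, reverse=True)

def tokenize(text):
    """Greedy longest-match segmentation; unknown characters become one-character tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        word = next((w for w in _BY_LENGTH if text.startswith(w, pos)), text[pos])
        tokens.append(word)
        pos += len(word)
    return tokens

def map_tweet(tweet_id, raw_json):
    """Map: parse one tweet's JSON and emit (word, 1) for every token in its text."""
    text = json.loads(raw_json)["text"]
    return [(word, 1) for word in tokenize(text)]

def shuffle(pairs):
    """Shuffle: gather the values emitted for each word."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return grouped

def reduce_word(word, values):
    """Reduce: sum the values and emit (count, word), so the count becomes the key."""
    return (sum(values), word)

# Two made-up tweets; json.dumps escapes non-ASCII as \uXXXX, like the raw data on the slide.
tweets = {
    1: json.dumps({"text": "吾輩は猫である。"}),
    2: json.dumps({"text": "猫は猫である。"}),
}
pairs = [pair for tweet_id, raw in tweets.items() for pair in map_tweet(tweet_id, raw)]
results = [reduce_word(word, values) for word, values in shuffle(pairs).items()]
print(sorted(results, reverse=True))
```

Emitting the count as the key is what makes a later sort-by-count job possible, because MapReduce orders records by key.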
![Page 9: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/9.jpg)
Grind out the processing by hand in Java!
![Page 10: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/10.jpg)
Once it's built, let's run it!
![Page 11: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/11.jpg)
Write MapReduce again!?
Result of the previous job
key:11 value:"ゴミ" (garbage)
key:10 value:"ある"
key:15 value:"猫" (cat)
key:4 value:"吾輩"
key:11 value:"人間" (human)
key:13 value:"犬" (dog)
Sorting requires one more MapReduce job
Create a custom class for ordering (extends IntWritable)
Change the shuffle step: HashPartitioner → TotalOrderPartitioner
Sample the keys beforehand
Shuffle
Reducer1
key:15 value:"猫"
key:13 value:"犬"
key:11 value:"ゴミ"
key:11 value:"人間"
Reducer2
key:10 value:"ある"
key:4 value:"吾輩"
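The idea behind swapping the partitioner can be sketched as follows: if each reducer receives a contiguous key range, and the boundaries come from a sample of the keys, then concatenating the reducers' sorted outputs yields a globally sorted result. This is a plain-Python illustration under that assumption, not Hadoop's TotalOrderPartitioner itself; `choose_split_points` and `range_partition` are hypothetical helpers, and for this tiny data set the "sample" is simply every key.

```python
def choose_split_points(sampled_keys, num_reducers):
    """Pick num_reducers - 1 boundaries from the sampled keys, in descending order."""
    ordered = sorted(sampled_keys, reverse=True)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]

def range_partition(key, split_points):
    """Send keys >= the first boundary to reducer 0, the next range to reducer 1, and so on."""
    for reducer_index, boundary in enumerate(split_points):
        if key >= boundary:
            return reducer_index
    return len(split_points)

# (count, word) records taken from the slide.
records = [(11, "ゴミ"), (10, "ある"), (15, "猫"), (4, "吾輩"), (11, "人間"), (13, "犬")]
num_reducers = 2

split_points = choose_split_points([key for key, _ in records], num_reducers)
partitions = [[] for _ in range(num_reducers)]
for key, value in records:
    partitions[range_partition(key, split_points)].append((key, value))

# Each reducer sorts only its own partition (descending, like the custom ordering class).
sorted_partitions = [sorted(p, key=lambda kv: kv[0], reverse=True) for p in partitions]
merged_keys = [key for part in sorted_partitions for key, _ in part]
print(split_points, merged_keys)
```

With a hash-based partitioner each reducer's output is sorted only within itself, which is why the slide replaces it when a total order is needed.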
![Page 12: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/12.jpg)
RDBMS integration, part 1: plain text
RDBMS X
![Page 13: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/13.jpg)
RDBMS integration, part 2: Sqoop
[Figure labels: RDBMS X; three JDBC connections]
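The several JDBC connections in the figure suggest a parallel transfer. A common way to parallelize such a transfer, and roughly how Sqoop imports are usually described, is to split the range of a key column into one slice per map task and let each task fetch its own slice. The sketch below assumes that approach and uses an in-memory SQLite database in place of "RDBMS X"; the `customer` table, its columns, and `split_ranges` are made up for the example, and the slices are fetched one after another rather than in parallel.

```python
import sqlite3

def split_ranges(low, high, num_mappers):
    """Divide the inclusive integer range [low, high] into contiguous, non-overlapping slices."""
    total = high - low + 1
    ranges, start = [], low
    for i in range(num_mappers):
        # Spread any remainder over the first few slices.
        length = total // num_mappers + (1 if i < total % num_mappers else 0)
        if length > 0:
            ranges.append((start, start + length - 1))
            start += length
    return ranges

# In-memory SQLite stands in for the source RDBMS (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(i, f"name{i}") for i in range(1, 11)],
)

# Find the key range, then let each "mapper" read only its own slice.
low, high = conn.execute("SELECT MIN(id), MAX(id) FROM customer").fetchone()
ranges = split_ranges(low, high, 3)
chunks = [
    conn.execute(
        "SELECT id, name FROM customer WHERE id BETWEEN ? AND ?", key_range
    ).fetchall()
    for key_range in ranges
]
print(ranges, [len(chunk) for chunk in chunks])
```

In an actual cluster each slice would be read by a separate task over its own database connection, so the source database has to tolerate that many concurrent readers.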
![Page 14: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/14.jpg)
RDBMS integration, part 3: vendor connectors

| RDBMS | Connector name | Overview |
| --- | --- | --- |
| Oracle | Quest Data Connector for Oracle and Hadoop | A Sqoop plug-in from Quest Software (now Dell). Removes several limitations that apply when importing into Oracle with Sqoop alone. |
| Oracle | Oracle Loader for Hadoop | A MapReduce application from Oracle for loading data. Included in Oracle Big Data Connectors. Features include direct path load and offline output of Data Pump format files. |
| Oracle | Oracle SQL Connector for HDFS | A connector from Oracle that lets files on HDFS be treated as Oracle external tables. Included in Oracle Big Data Connectors. Besides ordinary text files, it can also handle Hive tables and Data Pump format files. |
| SQL Server | Microsoft SQL Server Connector for Apache Hadoop | A Sqoop-based connector from Microsoft. Integrated into Sqoop itself as of Sqoop 1.4. |
| Vertica | HP Vertica HDFS Connector | A connector from HP. Supports parallel loading into Vertica and treating files on HDFS as external tables. |
| InfiniDB | InfiniDB-Hadoop Data Connector | A connector from Calpont. Provides a dedicated file format to enable parallel loading. |
| Vectorwise | Vectorwise Hadoop Connector | A connector from Actian. A MapReduce application that enables parallel loading. |
![Page 15: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/15.jpg)
But... can't we use SQL on Hadoop?
SELECT *
FROM CUSTOMER WHERE ~ ;
![Page 16: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/16.jpg)
Go with Hive!
CSV
Metastore database (PostgreSQL, etc.)
SELECT *
FROM CUSTOMER WHERE ~ ;
01,HIRAMA,DAISUKE
02,YAMDA,TARO
...
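The slide pairs plain CSV data with a separate metastore database, which suggests that the table definition lives apart from the data and is applied when a query reads the file. A minimal sketch of that arrangement follows; the `METASTORE` dictionary, the column names (`id`, `last_name`, `first_name`), and the `scan` and `select` helpers are assumptions for illustration and do not reflect Hive's actual interfaces.

```python
import csv
import io

# Hypothetical "metastore": table name -> column names plus where the raw data lives.
# The data is kept as an in-memory string here; in Hive it would be files on HDFS.
METASTORE = {
    "customer": {
        "columns": ["id", "last_name", "first_name"],
        "data": "01,HIRAMA,DAISUKE\n02,YAMDA,TARO\n",
    }
}

def scan(table):
    """Read the raw CSV and attach column names from the metastore at read time."""
    meta = METASTORE[table]
    for fields in csv.reader(io.StringIO(meta["data"])):
        yield dict(zip(meta["columns"], fields))

def select(table, predicate):
    """Rough equivalent of SELECT * FROM table WHERE predicate."""
    return [row for row in scan(table) if predicate(row)]

rows = select("customer", lambda row: row["last_name"] == "HIRAMA")
print(rows)
```

Because the schema is applied only while reading, the CSV files themselves stay untouched and can be shared with other tools.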
![Page 17: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/17.jpg)
Let's run SQL on Hive
select
l_orderkey,
sum (l_extendedprice * (1 - l_discount)) as revenue,
o_orderdate,
o_shippriority
from
customer c join
orders o
on c.c_mktsegment = 'BUILDING'
and c.c_custkey = o.o_custkey join
lineitem l
on l.l_orderkey = o.o_orderkey
where
o_orderdate < '1995-03-28'
and l_shipdate > '1995-03-28'
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate limit 10
;
TPC-H Q3
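To see what this query computes, it can be run against a handful of rows. The sketch below uses SQLite through Python's standard library instead of Hive, with made-up rows and only the columns the query touches, so it shows the query's semantics and says nothing about Hive's performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (c_custkey INTEGER, c_mktsegment TEXT);
CREATE TABLE orders   (o_orderkey INTEGER, o_custkey INTEGER,
                       o_orderdate TEXT, o_shippriority INTEGER);
CREATE TABLE lineitem (l_orderkey INTEGER, l_extendedprice REAL,
                       l_discount REAL, l_shipdate TEXT);
""")
# Made-up rows, chosen so that each filter in the query removes something.
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(1, "BUILDING"), (2, "AUTOMOBILE")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (100, 1, "1995-03-01", 0),   # qualifies
    (101, 1, "1995-04-01", 0),   # ordered too late
    (102, 2, "1995-03-01", 0),   # wrong market segment
])
conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?, ?)", [
    (100, 1000.0, 0.5, "1995-04-01"),   # counts: 1000 * (1 - 0.5)
    (100, 200.0, 0.0, "1995-04-02"),    # counts: 200
    (100, 500.0, 0.0, "1995-03-01"),    # shipped too early
    (101, 300.0, 0.0, "1995-04-10"),
    (102, 400.0, 0.0, "1995-04-01"),
])

QUERY = """
select l_orderkey,
       sum(l_extendedprice * (1 - l_discount)) as revenue,
       o_orderdate,
       o_shippriority
from customer c
join orders o on c.c_mktsegment = 'BUILDING' and c.c_custkey = o.o_custkey
join lineitem l on l.l_orderkey = o.o_orderkey
where o_orderdate < '1995-03-28' and l_shipdate > '1995-03-28'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate
limit 10
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```

Only order 100 survives all three filters, and its revenue is the sum of the two line items shipped after the cutoff date.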
![Page 18: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/18.jpg)
Let's run SQL on Hive
![Page 19: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/19.jpg)
The overhead is quite large
[Chart, TPC-H Q3 at SF=10 (GB): elapsed time in seconds (axis 0 to 120) for "MapReduce only" versus "entire processing", broken down into MR1, MR2, MR3, MR4, and Hive processing, with an "Overhead" annotation.]
![Page 20: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/20.jpg)
With lots of data, you won't notice it!
[Chart, TPC-H Q3 at SF=100 (GB): elapsed time in seconds (axis 0 to 350) for "MapReduce only" versus "entire processing", broken down into MR1, MR2, MR3, MR4, and Hive processing.]
![Page 21: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/21.jpg)
Even with 10 times the data, processing time only doubles
[Chart: total elapsed time in seconds (axis 0 to 8000) at SF=10 (GB) versus SF=100 (GB).]
Ran the TPC-H queries (22 in all), partly modified for Hive
![Page 22: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/22.jpg)
But... watch out for hardware bottlenecks
[Chart: total elapsed time in seconds (axis 0 to 20000) at SF=10 (GB) versus SF=100 (GB).]
![Page 23: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/23.jpg)
But... watch out for hardware bottlenecks
• Green: User CPU
• Blue: I/O Wait
![Page 24: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/24.jpg)
Things have gotten easier, but...
Easy
Fast
I want to run ad hoc queries snappily!
![Page 25: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/25.jpg)
Real-time query: that's Impala
From the Cloudera website
No MapReduce!
![Page 26: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/26.jpg)
Now let's run it!
![Page 27: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/27.jpg)
Super fast!
[Chart, TPC-H Q3 at SF=10 (GB): elapsed time in seconds (axis 0 to 120), Hive versus Impala.]
Less than 1/5!
![Page 28: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/28.jpg)
Super fast!
[Chart, TPC-H Q2 at SF=10 (GB): elapsed time in seconds (axis 0 to 180), Hive versus Impala, broken down into Insert1, Insert2, and Select.]
Less than 1/20!
![Page 29: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/29.jpg)
Huh? But...
[Chart, TPC-H Q3 at SF=100 (GB): elapsed time in seconds (axis 0 to 450), Hive versus Impala.]
![Page 30: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/30.jpg)
Memory-hungry, and not good at writes...
![Page 31: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/31.jpg)
At this data volume, the dedicated database still wins!
[Chart, TPC-H Q3 at SF=100 (GB): elapsed time in seconds (axis 0 to 450) for Hive, Impala, and Vectorwise, with a "2.7 seconds" annotation.]
![Page 32: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/32.jpg)
To love Hadoop
1. Don't shy away from the hassle: start with MapReduce.
2. The tried-and-true flow is to clean up messy data, then hand it to an RDBMS for analysis.
3. Hive is slow! But it has its uses.
4. Impala's sweet spot is limited. But when it fits, it is powerful.
![Page 33: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/33.jpg)
To deepen your love for Hadoop even further...
I wrote a series of articles for DBOnline. I also tried machine learning (Mahout).
http://enterprisezine.jp/dbonline/
![Page 34: B34 Extremely Tuned Hadoop Cluster by Daisuke Hirama](https://reader033.vdocuments.site/reader033/viewer/2022061204/547fe61eb4af9fee3b8b4913/html5/thumbnails/34.jpg)
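• Unauthorized reproduction is prohibited.
• This document is provided for reference only, and the information in it is subject to change without notice.
• Insight Technology, Inc. makes no warranty of any kind regarding the contents of this document and accepts no liability for any damages related to its contents.
• Product and service names used in this document are trademarks or registered trademarks of their respective companies.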