
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.

Is SQL the Best Development Language for Big Data? Exploring the Analytical Power of SQL in Oracle Database 12c


Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract.

It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Keith Laker, Senior Principal Product Manager

Andrew Witkowski, Architect

Sankar Subramanian, Senior Director of Development


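Finding Patterns in Big Data

Typical use cases in today's world of fast exploration of big data

Financial services

• Money laundering

• Fraud

• Tracking the stock market

Law and order

• Monitoring suspicious activities

Retail

• Returns fraud

• Buying patterns

• Sessionization

Telecommunications

• Money laundering

• SIM card fraud

• Call quality

Utilities

• Network analysis

• Fraud

• Unusual usage

Presenter notes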

With big data comes the concept of discovery: you have to process large amounts of data to uncover the really interesting stories hidden inside. The examples of industry-specific discovery workflows listed here are all based on finding patterns. In this session we are going to look at two specific examples, tracking the stock market and sessionization, which are two of the classic big data problems. What we find is that although everything starts with pattern discovery, this initial process then drives other analytical workflows, such as our other SQL analytics features (cube, rollup or the other reporting aggregates), spatial analytics (the telco and law and order examples) or data mining operations.


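The Continuing Evolution of SQL

• Introduction of window functions

• Enhanced window functions (percentile, etc.)

• Rollup, grouping sets, cube

• Statistical functions

• SQL Model clause

• Partition outer join

• Data mining I

• Data mining II

• SQL Pivot

• Recursive WITH

• ListAgg, Nth value window

• Pattern matching

• Top N clause

• Data mining III

Presenter notes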


Oracle has a long history of embedding sophisticated SQL-based analytics within the database. Window functions, which were first introduced in 9i, are now used extensively by many developers to manage complex business requirements. 10g introduced the SQL Model clause, which provides a spreadsheet-like modeling approach to SQL for business users. With 12c we have now added pattern matching and a simplified way to select the top-N values within a result set.


Pattern Matching Using SQL Analytics

Java vs. SQL: Stock Market - Searching for "W" Patterns in Trade Data

SQL: 1/20th of the code, 5x faster

SELECT first_x, last_z
FROM ticker MATCH_RECOGNIZE (
  PARTITION BY name ORDER BY time
  MEASURES FIRST(x.time) AS first_x,
           LAST(z.time) AS last_z
  ONE ROW PER MATCH
  PATTERN (X+ Y+ W+ Z+)
  DEFINE X AS (price < PREV(price)),
         Y AS (price > PREV(price)),
         W AS (price < PREV(price)),
         Z AS (price > PREV(price) AND
               z.time - FIRST(x.time) <= 7))

12 lines of SQL

250+ lines of Java and Pig

package pigstuff;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import org.apache.pig.EvalFunc;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

/**
 * @author nbayliss
 */

private class V0Line {
    String state = null;
    String[] attributes;
    String prev = "";
    String next = "";

    public V0Line(String[] atts) {
        attributes = atts;
    }

    public String[] getAttributes() {
        return attributes;
    }

    public void setState(String state) {
        this.state = state;
    }

    public String setState(V0Line linePrev, V0Line lineNext) {

private boolean eq(String a, String b) {

private boolean gt(String a, String b) {

public Tuple exec(Tuple input) throws IOException {

@Override
public Schema outputSchema(Schema input) {
    Schema.FieldSchema linenumber = new Schema.FieldSchema("linenumber", DataType.CHARARRAY);
    Schema.FieldSchema pbykey = new Schema.FieldSchema("pbykey", DataType.CHARARRAY);
    Schema.FieldSchema count = new Schema.FieldSchema("count", DataType.LONG);

    Schema tupleSchema = new Schema();
    tupleSchema.add(linenumber);
    tupleSchema.add(pbykey);
    tupleSchema.add(count);
    return new Schema(tupleSchema);
}

}

Presenter notes


Let's look at an example based on trading data. Assume we need to look for “W-shaped” trading patterns where the duration of the pattern is less than or equal to 7 days. How much code would that require? How much code do you want to write? Probably as little as possible. Coding this business use case in Java would require 250+ lines of Java and Pig code. Writing a SQL statement for it requires 12 lines of code. For this type of problem SQL is also faster: for our W-shape it is 5x faster than MapReduce, and the code is simpler to understand, simpler to maintain and much simpler to enhance.


Pattern Matching Using SQL Analytics

11g vs. 12c: Call Quality Analysis - Searching for Dropped Calls

50% less code: easier to understand, test, deploy and manage

Over 24 lines of advanced SQL using multiple SELECT statements with window functions

With Sessionized_Call_Details as
 (select Caller, Callee, Start_Time, End_Time,
         Sum(case when Inter_Call_Intrvl < 60 then 0 else 1 end)
           over(partition by Caller, Callee order by Start_Time) Session_ID
  from (select Caller, Callee, Start_Time, End_Time,
               (Start_Time - Lag(End_Time)
                  over(partition by Caller, Callee order by Start_Time)) Inter_Call_Intrvl
        from Call_Details)),
Inter_Subcall_Intrvls as
 (select Caller, Callee, Start_Time, End_Time,
         Start_Time - Lag(End_Time)
           over(partition by Caller, Callee, Session_ID order by Start_Time) Inter_Subcall_Intrvl,
         Session_ID
  from Sessionized_Call_Details)
Select Caller, Callee,
       Min(Start_Time) Start_Time,
       sum(End_Time - Start_Time) Eff_Call_Dur,
       Nvl(Sum(Inter_Subcall_Intrvl), 0) Tot_Duration,
       (Count(*) - 1) No_Of_Restarts,
       Session_ID
from Inter_Subcall_Intrvls
group by Caller, Callee, Session_ID;

SELECT Caller, Callee, Start_Time, Eff_Call_Dur,
       (End_Time - Start_Time) - Eff_Call_Dur AS Tot_Duration,
       No_Of_Restarts, Session_ID
FROM call_details MATCH_RECOGNIZE
 ( PARTITION BY Caller, Callee ORDER BY Start_Time
   MEASURES A.Start_Time AS Start_Time,
            End_Time AS End_Time,
            SUM(End_Time - Start_Time) AS Eff_Call_Dur,
            COUNT(B.*) AS No_Of_Restarts,
            MATCH_NUMBER() AS Session_ID
   PATTERN (A B*)
   DEFINE B AS B.Start_Time - PREV(B.End_Time) < 60);

12 lines of simple SQL

Presenter notes


Can I use the pattern matching feature to simplify existing SQL? Yes, you can. Here is an example where we are looking for dropped calls. It is possible to do this with multiple SELECT statements along with window functions, but that requires over 24 lines of code. Using the 12c pattern matching code we can reduce this to 14 lines of code, but the point here is that 12c SQL is simpler to code and understand. That makes it easier to test, deploy and manage. So the new SQL analytics that we have added to 12c can make a big difference to existing SQL-based projects.


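SQL Pattern Matching

Key Concepts


Pattern recognition in sequences of rows

Recognize patterns in sequences of events using SQL

– A sequence is a stream of rows

– An event equates to a row in the stream

New SQL construct: MATCH_RECOGNIZE

– Logically partition and order the data

ORDER BY is mandatory (PARTITION BY is optional)

– The pattern is defined using a regular expression over variables

– The regular expression is matched against a sequence of rows

– Each pattern variable is defined using conditions on rows and aggregates

"SQL Pattern Matching" - Concepts

Presenter notes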


Let's look at our new SQL pattern matching feature in a little more detail. The new SQL pattern matching feature allows you to search for patterns in sequences of events using SQL. When we talk about sequences we mean a stream of rows: we are allowing you to look across row boundaries to search for an event, and you can think of an event as a single row within the stream. The new SQL construct is called MATCH_RECOGNIZE. It allows you to logically partition your data stream and then order that data stream within each partition. The pattern is defined using the familiar syntax of regular expressions, so if you have used languages like Perl this will look very familiar. Pattern variables are used to describe the specific pattern, and each can be based on conditions for a row or on aggregates; we will look at this in more detail later in this presentation. Essentially we are now providing the ability to use SQL to search for patterns.


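SQL Pattern Matching in Action: Finding a Double-Bottom (W-Shape) Pattern in a Ticker Stream

Find a W-shape pattern in a ticker stream

• Output the beginning and ending dates of the pattern

• Calculate the average price in each W-shape

• Find only patterns that lasted less than a week

[Chart: stock price by day]

Presenter notes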


Let's start with a simple example to help explain how you can build up a MATCH_RECOGNIZE clause. For this example we have three objectives: output the beginning and ending dates for each W-shape pattern, calculate the average price during the W-shape, and find W-shapes that lasted less than 7 days.


SQL Pattern Matching in Action: Finding a W-Shape

SELECT ...
FROM ticker MATCH_RECOGNIZE (
...
)

[Chart: stock price by day]

New syntax for discovering patterns using SQL:

MATCH_RECOGNIZE ( )

Presenter notes


Step 1 is to add our MATCH_RECOGNIZE clause after the FROM clause of our SELECT statement. The table TICKER will act as the input (or data stream) to our MATCH_RECOGNIZE clause.



SELECT …
FROM ticker MATCH_RECOGNIZE (
  PARTITION BY name ORDER BY time

[Chart: stock price by day]

• Set the PARTITION BY and ORDER BY clauses

From here on, we look only at the stock shown in black

Presenter notes


Step 2: partition and order the data. In this example we only have one ticker symbol, so the PARTITION BY clause is not strictly necessary. To make the pattern visible we need to order our data set by time. Note that we could have simplified this example and used the same variable multiple times in the pattern description (DOWN+ UP+ DOWN+ UP+). We also omitted an 'always-true' start event to keep the example simple. See the Data Warehousing Guide for more comprehensive examples and explanations.



[Chart: stock price by day]

PATTERN (X+ Y+ W+ Z+)

• Define the pattern: the "W-shape"

Presenter notes


Now we define the pattern. Here we are looking for one or more instances of X, followed by one or more instances of Y, followed by one or more instances of W, followed by one or more instances of Z. The obvious question is: what are X, Y, W and Z?





PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),

[Chart: stock price by day, leg X labeled]

• Define the pattern: the first down leg of the "W-shape"

Presenter notes


We use the DEFINE clause to describe each of the events listed in the PATTERN clause. X is defined as being an event where the current price is less than the previous price, so we are in the first down leg of our W-shape.






Y AS (price > PREV(price)),

[Chart: stock price by day, legs X and Y labeled]

• Define the pattern: the first up leg of the "W-shape"

Presenter notes


Y is defined as being an event where the current price is greater than the previous price – so we are in the first up leg of our W-shape



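[Chart: stock price by day, legs X, Y, W and Z labeled]

Y AS (price > PREV(price)),
W AS (price < PREV(price)),
Z AS (price > PREV(price)))

• Define the pattern: the second down leg (W) and second up leg (Z) of the "W-shape"

Presenter notes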


W is defined as being an event where the current price is less than the previous price, so we are in the second down leg of our W-shape. Z is defined as being an event where the current price is greater than the previous price, so we are in the second up leg of our W-shape.
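The four DEFINE conditions and the PATTERN clause can be mimicked outside the database to see what the match does. The following is a minimal Python sketch, not Oracle's implementation: it labels each row as down (D) or up (U) relative to its predecessor and then applies an ordinary regular expression. The function name `find_w_shapes` and the price series are made up for illustration, and the 7-day condition on Z is not covered here.

```python
import re


def find_w_shapes(ticks):
    """Return (first_x, last_z) time pairs for each W-shape in `ticks`.

    `ticks` is a list of (time, price) tuples already ordered by time,
    which plays the role of ORDER BY time within one partition.
    """
    # The first row has no PREV(price), so it cannot satisfy any variable.
    symbols = ["-"]
    for (_, prev_price), (_, price) in zip(ticks, ticks[1:]):
        if price < prev_price:
            symbols.append("D")   # satisfies X or W: price < PREV(price)
        elif price > prev_price:
            symbols.append("U")   # satisfies Y or Z: price > PREV(price)
        else:
            symbols.append("=")   # unchanged price matches no variable
    encoded = "".join(symbols)

    matches = []
    # X+ Y+ W+ Z+ becomes D+U+D+U+; finditer yields non-overlapping matches.
    for m in re.finditer(r"D+U+D+U+", encoded):
        first_x = ticks[m.start()][0]      # FIRST(x.time)
        last_z = ticks[m.end() - 1][0]     # LAST(z.time)
        matches.append((first_x, last_z))
    return matches


# Hypothetical daily prices; day numbers start at 1.
prices = [50, 48, 46, 49, 52, 50, 47, 51, 55, 55, 53, 56]
ticks = list(enumerate(prices, start=1))
print(find_w_shapes(ticks))  # [(2, 9)]
```

The sketch returns one tuple per match, which corresponds to the summary style of output rather than one row per input row.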



[Chart: stock price by day, legs X and Z labeled]

LAST(z.time) AS last_z

PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),

• Define the output measures once a pattern is matched

• FIRST: start date

• LAST: end date

Presenter notes


The next step is to determine what we are going to output. In this case it will be the start of our W-shape, FIRST(x.time), and the end of our W-shape, LAST(z.time).



• Output one row each time a match for the pattern is found

[Chart: stock price by day, with day markers at 1, 9, 13 and 19]

LAST(z.time) AS last_z

ONE ROW PER MATCH
PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),

First_x  Last_z
1        9
13       19

Presenter notes


At this point we need to determine the type of output we want to generate, and we have two options: detailed, where we use ALL ROWS PER MATCH, or summary, where we use ONE ROW PER MATCH. The box in the top right shows the output from our statement, and you can see that we have discovered two W-shapes.



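SQL Pattern Matching in Action: Finding W-Shapes That Lasted Less Than 7 Days

• Extend the pattern to find W-shapes that lasted less than a week

[Chart: stock price by day, with day markers at 1, 9, 13 and 19; legs X and Z labeled]

LAST(z.time) AS last_z
ONE ROW PER MATCH
PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),

z.time - FIRST(x.time) <= 7 ))

Can refer to previous variables

Presenter notes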


Now we need to refine our DEFINE statement to only find w-shapes that last less than or equal to 7 days. And you can see that we have added an additional statement to capture this requirement.



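SQL Pattern Matching in Action: Finding the Average Stock Price in the W-Shape

• Calculate the average stock price during the second up leg

[Chart: stock price by day, with day markers at 1, 9, 13 and 19]

SELECT first_x, last_z, avg_price
FROM ticker MATCH_RECOGNIZE (

LAST(z.time) AS last_z,
AVG(z.price) AS avg_price

ONE ROW PER MATCH
PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),

z.time - FIRST(x.time) <= 7 ))

Average stock price: $52.00

Presenter notes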


The last step is to return an average price for our W-shape, and we can use the standard SQL AVG function. Note that we pointed to z.price, so we get the average price during the second-up phase of our pattern.


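SQL Pattern Matching in Action

Example: Sessionization of User Logs

Define a session as a sequence of one or more events with the same partition key where the gap between timestamps does not exceed a specified threshold

Example: user log analysis

– Partition key: user ID; gap between timestamps: 10 (seconds)

– Detect the sessions

– Assign a surrogate Session_ID to each session within a partition (per user)

– Annotate each input tuple with its Session_ID

Presenter notes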


Now let's look at a classic example from web log analysis, which is sessionization. Here we are going to treat a sequence of events as a single session if the gap between consecutive events is no more than 10 seconds. As we identify each session we are going to tag it with a surrogate session ID to make it easier to do further analysis.


SQL Pattern Matching in Action: Sessionization of User Logs (Example)

Raw log

Time  User ID
1     Mary
2     Sam
11    Mary
12    Sam
22    Sam
23    Mary
32    Sam
34    Mary
43    Sam
44    Mary
47    Sam
48    Sam
53    Mary
59    Sam
60    Sam
63    Mary
68    Sam

Identify the sessions

Time  User ID
1     Mary
11    Mary

23    Mary

34    Mary
44    Mary
53    Mary
63    Mary

2     Sam
12    Sam
22    Sam
32    Sam

43    Sam
47    Sam
48    Sam

59    Sam
60    Sam
68    Sam

Assign a session number per user

Time  User ID  Session
1     Mary     1
11    Mary     1
23    Mary     2
34    Mary     3
44    Mary     3
53    Mary     3
63    Mary     3
2     Sam      1
12    Sam      1
22    Sam      1
32    Sam      1
43    Sam      2
47    Sam      2
48    Sam      2
59    Sam      3
60    Sam      3
68    Sam      3

Presenter notes


As you can see here our log file contains the time and the user name so the first step is to group the data by user id and order it by time, which then allows us to identify each individual session.


SQL Pattern Matching in Action: Sessionization of User Logs (Example): MATCH_RECOGNIZE

... FROM Events MATCH_RECOGNIZE
 (PARTITION BY user_ID ORDER BY time
  MEASURES match_number() as session_id
  ALL ROWS PER MATCH
  PATTERN (b s*)
  DEFINE s as (s.time - prev(s.time) <= 10));

Presenter notes


The main difference here is the use of one of our built-in functions, MATCH_NUMBER, to identify the individual matches. Matches are numbered sequentially, starting with 1, within each partition.
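To make the effect of PATTERN (b s*) with ALL ROWS PER MATCH concrete, here is a small Python sketch that reproduces the session numbering on the log data shown on the earlier slide. It is an illustration of the logic only, not how the database evaluates the clause; the names `EVENTS`, `MAX_GAP` and `sessionize` are assumptions introduced for this example.

```python
from itertools import groupby
from operator import itemgetter

# (time, user) pairs copied from the user log slide.
EVENTS = [
    (1, "Mary"), (2, "Sam"), (11, "Mary"), (12, "Sam"), (22, "Sam"),
    (23, "Mary"), (32, "Sam"), (34, "Mary"), (43, "Sam"), (44, "Mary"),
    (47, "Sam"), (48, "Sam"), (53, "Mary"), (59, "Sam"), (60, "Sam"),
    (63, "Mary"), (68, "Sam"),
]
MAX_GAP = 10  # seconds, as in: s.time - prev(s.time) <= 10


def sessionize(events, max_gap=MAX_GAP):
    """Return (time, user, session_id) for every input event."""
    # PARTITION BY user_ID ORDER BY time
    ordered = sorted(events, key=lambda e: (e[1], e[0]))
    annotated = []
    for user, rows in groupby(ordered, key=itemgetter(1)):
        session_id = 0
        prev_time = None
        for time, _ in rows:
            # A row that cannot continue the current session plays the
            # role of `b` and starts a new match (a new session).
            if prev_time is None or time - prev_time > max_gap:
                session_id += 1
            annotated.append((time, user, session_id))
            prev_time = time
    return annotated


for row in sessionize(EVENTS):
    print(row)
```

Every input row is returned with its session number attached, which is the behaviour ALL ROWS PER MATCH gives in the SQL version.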


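SQL Pattern Matching in Action

Sessionization Example: Aggregation of Sessionized Data

Basic sessionization is only the foundation for analysis

– It is mandatory to logically identify and group related events

Aggregate for first insights into the data

– How many "events" happened within an individual session?

– What was the total duration of an individual session?

Presenter notes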


Once we have created a sessionized data set we can move on to the next step in the analysis, which is where you move into the area of clickstream analysis.


SQL Pattern Matching in Action: Sessionization Example, Aggregation of Sessionized Data

Time  User ID  Session
1     Mary     1
11    Mary     1
23    Mary     2
34    Mary     3
44    Mary     3
53    Mary     3
63    Mary     3
2     Sam      1
12    Sam      1
22    Sam      1
32    Sam      1
43    Sam      2
47    Sam      2
48    Sam      2
59    Sam      3
60    Sam      3
68    Sam      3

Aggregate the sessions per user

User ID  Session_ID  Start_Time  No. of events  Duration
Mary     1           1           2              10
Mary     2           23          1              0
Mary     3           34          4              29
Sam      1           2           4              30
Sam      2           43          3              5
Sam      3           59          3              9

Presenter notes


So using our SQL pattern matching we can start to determine how many events occurred during each session, the duration of each session, and so on.


SQL Pattern Matching: Sessionization Example, Aggregation with ONE ROW PER MATCH

... FROM Events MATCH_RECOGNIZE
 (PARTITION BY user_ID ORDER BY time
  MEASURES match_number() session_id,
           count(*) as no_of_events,
           first(time) start_time,
           last(time) - first(time) duration
  ONE ROW PER MATCH
  PATTERN (b s*)
  DEFINE s as (s.time - prev(time) <= 10) )
ORDER BY user_id, session_id;

Presenter notes


And this is the code we would use to generate that result set. We PARTITION BY user ID and ORDER BY time, as we discussed earlier. We need summary results, so we use ONE ROW PER MATCH. The MEASURES clause contains a number of interesting analytical functions: COUNT(*) counts the number of events in each session, FIRST() retrieves the start time of the session and LAST() retrieves its end time; using these two data points we can calculate the duration of the session. The pattern is that we are looking for event b (which is not defined, so it is always true and simply marks the start of a session) followed by zero or more instances of s, where s requires that the current time minus the previous time is less than or equal to 10 (seconds in this case).
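The summary output can be checked against the aggregate table on the earlier slide with a short Python sketch. This is a plain re-implementation of the logic for illustration, under the same 10-second gap rule; the names `EVENTS` and `summarize_sessions` are hypothetical and are not part of any Oracle API.

```python
# (time, user) pairs copied from the user log slide.
EVENTS = [
    (1, "Mary"), (2, "Sam"), (11, "Mary"), (12, "Sam"), (22, "Sam"),
    (23, "Mary"), (32, "Sam"), (34, "Mary"), (43, "Sam"), (44, "Mary"),
    (47, "Sam"), (48, "Sam"), (53, "Mary"), (59, "Sam"), (60, "Sam"),
    (63, "Mary"), (68, "Sam"),
]


def summarize_sessions(events, max_gap=10):
    """Return one (user, session_id, start_time, no_of_events, duration)
    tuple per session, mirroring ONE ROW PER MATCH."""
    # Working rows: [user, session_id, start_time, no_of_events, last_time]
    summaries = []
    # PARTITION BY user_ID ORDER BY time
    for time, user in sorted(events, key=lambda e: (e[1], e[0])):
        last = summaries[-1] if summaries else None
        if last and last[0] == user and time - last[4] <= max_gap:
            last[3] += 1      # count(*): one more event in this session
            last[4] = time    # last(time) moves forward
        else:
            # match_number(): restart at 1 for each new user
            session_id = last[1] + 1 if last and last[0] == user else 1
            summaries.append([user, session_id, time, 1, time])
    # duration = last(time) - first(time)
    return [(u, sid, start, n, end - start)
            for u, sid, start, n, end in summaries]


for row in summarize_sessions(EVENTS):
    print(row)
# ('Mary', 1, 1, 2, 10)
# ('Mary', 2, 23, 1, 0)
# ('Mary', 3, 34, 4, 29)
# ('Sam', 1, 2, 4, 30)
# ('Sam', 2, 43, 3, 5)
# ('Sam', 3, 59, 3, 9)
```

The six rows agree with the per-user session summary shown on the slide.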


Native Support for Top-N

Presenter notes


As we mentioned at the start of this presentation one of the benefits of being able to use SQL for pattern matching is that you can then apply further levels of analysis using other SQL features and functions. In this case we might want to select the top 10 sessions from the previous output based on duration. In 12c we have added native support for Top-N selections.


Native Support for Top-N Queries

Natively identify the top N in SQL

Significantly simplifies code development

ANSI SQL:2008

"Who are the top 5 earners in my company?"

SELECT empno, ename, deptno
FROM (SELECT empno, ename, deptno, sal, comm,
             row_number() OVER (ORDER BY sal DESC, comm DESC) rn
      FROM emp)
WHERE rn <= 5
ORDER BY sal DESC, comm DESC;

versus

SELECT empno, ename, deptno
FROM emp
ORDER BY sal DESC, comm DESC
FETCH FIRST 5 ROWS ONLY;

Presenter notes


Taking a different, more general example: if I needed to write a statement that selected the top 5 sales reps within my company, then traditionally I would have written something like the box at the top. We are using a window function, row_number, to allow us to filter the result set to the first five rows. It looks cool, and many DBAs and developers could write this type of SQL, but some people might struggle. It would be nice if we could simplify the code, and that is what we have done in 12c.


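Native Support for Top-N Queries

New OFFSET and FETCH FIRST clauses

ANSI 2008/2011 compliant, with some additional extensions

Specify an offset and the number or percentage of rows to return

Provision to return additional rows with the same sort key as the last row (WITH TIES option)

Syntax:

OFFSET <offset> [ROW | ROWS]
FETCH [FIRST | NEXT]
      [<rowcount> | <percent> PERCENT] [ROW | ROWS]
      [ONLY | WITH TIES]

Presenter notes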


This is useful because it simplifies SQL code so that it is easier to understand and modify.
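The interaction of OFFSET, the row count and WITH TIES is easy to misread from the syntax diagram alone, so here is a minimal Python sketch of the behaviour described on the slide. It simulates the semantics on an in-memory list and is not Oracle's implementation; `fetch_rows`, `by_salary_desc` and the sample salaries are assumptions made for this illustration, and the PERCENT option is left out.

```python
def fetch_rows(rows, sort_key, row_count, offset=0, with_ties=False):
    """Simulate ORDER BY ... OFFSET offset ROWS FETCH NEXT row_count ROWS.

    with_ties=False corresponds to ONLY; with_ties=True to WITH TIES.
    """
    ordered = sorted(rows, key=sort_key)             # ORDER BY (stable sort)
    window = ordered[offset:offset + row_count]      # OFFSET + FETCH
    if with_ties and window:
        last_key = sort_key(window[-1])
        # Keep adding rows that share the sort key of the last fetched row.
        for row in ordered[offset + row_count:]:
            if sort_key(row) != last_key:
                break
            window.append(row)
    return window


def by_salary_desc(row):
    """Sort key for ORDER BY sal DESC on (ename, sal) tuples."""
    return -row[1]


# Hypothetical (ename, sal) rows with a tie at 3000.
EMPLOYEES = [
    ("KING", 5000), ("SCOTT", 3000), ("FORD", 3000),
    ("JONES", 2975), ("BLAKE", 2850), ("CLARK", 2450),
]

print(fetch_rows(EMPLOYEES, by_salary_desc, 2))
# [('KING', 5000), ('SCOTT', 3000)]
print(fetch_rows(EMPLOYEES, by_salary_desc, 2, with_ties=True))
# [('KING', 5000), ('SCOTT', 3000), ('FORD', 3000)]
print(fetch_rows(EMPLOYEES, by_salary_desc, 2, offset=2))
# [('FORD', 3000), ('JONES', 2975)]
```

With ONLY, which of the two tied rows is returned depends on how ties happen to be ordered; WITH TIES removes that ambiguity by returning both.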


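Summary

New Database 12c SQL Analytics

ANSI compliant, with some additional extensions

Common syntax reduces the learning curve

Comprehensive support for SQL-based pattern matching

– Supports a wide range of use cases

– Simplifies application development

– Simplifies existing SQL code

New Top-N feature

– Simplifies existing SQL code


Is SQL the best development language for big data?

Yes, because SQL delivers:

• Simplicity

• Speed

• Rich functionality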




Other Sessions

Session                                       Date                 Location
Hands-on Lab: Pattern Matching                Tuesday, noon        Marriott Salon 3-4
Key Tips for Mastering Oracle Partitioning    Tuesday, 3:45 PM     Moscone South 103
Oracle Optimizer Boot Camp                    Tuesday, 5:15 PM     Moscone South 102
In-Database MapReduce Using SQL               Wednesday, 10:15 AM  Marriott Salon 7
Programming with Big Data Connectors          Wednesday, 3:30 PM   Marriott Salon 7
Data Warehousing and Big Data: Customer Panel Wednesday, 3:30 PM   Moscone South 300
The Data Is Talking: Customer Panel           Wednesday, 5:00 PM   Moscone South 300


Where to Get More Information

SQL Analytics home page on OTN

– http://www.oracle.com/technetwork/database/bi-datawarehousing/sql-analytics-index-1984365.html

– Oracle By Example: Pattern Matching

– Podcasts on pattern matching and SQL analytics

– Data sheet

– White paper

Patterns Everywhere - Find then fast! (Apple iBook)

Data Warehousing and SQL Analytics blog

– http://oracle-big-data.blogspot.co.uk/


Thank you for attending today

Enjoy OPENWORLD
