mongodb days germany: data processing with mongodb


Page 1: MongoDB Days Germany: Data Processing with MongoDB

MongoDB and Apache Flink / Spark
“How to do Data Processing?”

Marc Schwering
Sr. Solution Architect – EMEA
[email protected]
@m4rcsch

Page 2: Agenda For This Session

• Data Processing Architectural Overview
• The Life of an Application
• Separation of Concerns / Real World Architecture
• Apache Spark and Flink Data Processing Projects
• Clustering with Apache Flink
• Next Steps

Page 3: Data Processing Architectural Overview

1. Profile created
2. Enrich with public data
3. Capture activity
4. Clustering analysis
5. Define Personas
6. Tag with personas
7. Personalize interactions

Common technologies for the batch analytics step (enriching profiles with public data):

• R
• Hadoop
• Spark
• Python
• Java
• Many other options

Personas change much less often than tagging.

Page 4: Evolution of a Profile (1)

{
    "_id" : ObjectId("553ea57b588ac9ef066428e1"),
    "ipAddress" : "216.58.219.238",
    "referrer" : "kay.com",
    "firstName" : "John",
    "lastName" : "Doe",
    "email" : "[email protected]"
}

Page 5: Evolution of a Profile (n+1)

{
    "_id" : ObjectId("553e7dca588ac9ef066428e0"),
    "firstName" : "John",
    "lastName" : "Doe",
    "address" : "229 W. 43rd St.",
    "city" : "New York",
    "state" : "NY",
    "zipCode" : "10036",
    "age" : 30,
    "email" : "[email protected]",
    "twitterHandle" : "johndoe",
    "gender" : "male",
    "interests" : [
        "electronics",
        "basketball",
        "weightlifting",
        "ultimate frisbee",
        "traveling",
        "technology"
    ],
    "visitedCounts" : {
        "watches" : 3,
        "shirts" : 1,
        "sunglasses" : 1,
        "bags" : 2
    },
    "purchases" : [
        {
            "id" : 1,
            "desc" : "Power Oxford Dress Shoe",
            "category" : "Mens shoes"
        },
        {
            "id" : 2,
            "desc" : "Striped Sportshirt",
            "category" : "Mens shirts"
        }
    ],
    "persona" : "shoe-fanatic"
}
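The `persona` field in this document is produced by the analytics side. As a purely hypothetical sketch (the talk derives personas via clustering, shown later, not via a hand-written rule), a naive tagger could simply pick the dominant entry of `visitedCounts`:

```java
import java.util.Map;

// Hypothetical persona tagging: return the most-visited product category.
// The pipeline in this talk derives personas from clustering instead.
public class PersonaTagger {
    public static String tagPersona(Map<String, Integer> visitedCounts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : visitedCounts.entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

For the `visitedCounts` above this would yield "watches"; the clustering approach later in the talk replaces this kind of hard-coded rule.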

Page 6: One size/document fits all?

• Profile Data
    – Preferences
    – Personal information
        • Contact information
        • DOB, gender, ZIP...
• Customer Data
    – Purchase History
    – Marketing History
• "Session Data"
    – View History
    – Shopping Cart Data
    – Information Broker Data
• Personalisation Data
    – Persona Vectors
    – Product and Category recommendations

(Diagram: the application and the batch analytics layer both work on this single document.)

Page 7: Separation of Concerns

• Profile Data
    – Preferences
    – Personal information
        • Contact information
        • DOB, gender, ZIP...
• Customer Data
    – Purchase History
    – Marketing History
• "Session Data"
    – View History
    – Shopping Cart Data
    – Information Broker Data
• Personalisation Data
    – Persona Vectors
    – Product and Category recommendations

(Diagram: the frontend system talks to four services – Profile Service, Customer Service, Session Service, Persona Service – with a batch analytics layer behind them.)

Page 8: Benefits

• Code does less; documents and code stay focused
• Ability to split work
    – Different Teams
    – New Languages
    – Defined Dependencies

Page 9: Advice for Developers (1)

• Code does less; documents and code stay focused
• Ability to split work
    – Different Teams
    – New Languages
    – Defined Dependencies

KISS => Keep it simple and safe!
=> Clean Code <=

• Robert C. Martin: https://cleancoders.com/
• M. Fowler / B. Meyer et al.: Command Query Separation
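Command Query Separation, referenced above, says a method should either change state or answer a question, never both. A minimal sketch with illustrative names (not from the talk):

```java
// Command Query Separation: commands mutate state and return nothing,
// queries return data and have no side effects.
public class ShoppingCart {
    private final java.util.List<String> items = new java.util.ArrayList<>();

    // Command: changes state, returns void.
    public void addItem(String sku) {
        items.add(sku);
    }

    // Query: answers a question, leaves state untouched.
    public int itemCount() {
        return items.size();
    }
}
```

Keeping the two kinds of methods apart is what makes the per-concern services on the previous slides easy to reason about and to split across teams.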

Page 10: Analytics and Personalization – From Query to Clustering

Page 11: Separation of Concerns

• Profile Data
    – Preferences
    – Personal information
        • Contact information
        • DOB, gender, ZIP...
• Customer Data
    – Purchase History
    – Marketing History
• "Session Data"
    – View History
    – Shopping Cart Data
    – Information Broker Data
• Personalisation Data
    – Persona Vectors
    – Product and Category recommendations

(Diagram: the frontend system, the four services – Profile Service, Customer Service, Session Service, Persona Service – and the batch analytics layer.)

Page 12: Separation of Concerns (repeat of the previous slide)

Page 13: Architecture revised

(Diagram: the frontend system talks to the Profile, Customer, Session, and Persona Services; backend systems connect to them through a separate data processing layer.)

Page 14: Advice for Developers (2)

• OWN YOUR DATA! (but only relevant data)
• Say no! (to direct data access, i.e. raw DB access from other services)
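"Own your data" in practice means consumers depend on your service's interface, never on your collections. A hypothetical sketch of such a boundary (names invented for illustration; the backing store could just as well be MongoDB):

```java
// Hypothetical service boundary: callers program against this interface
// and never touch the underlying MongoDB collection directly.
interface ProfileService {
    String emailFor(String profileId);
}

// One possible implementation; an in-memory map stands in for the real store.
class InMemoryProfileService implements ProfileService {
    private final java.util.Map<String, String> emails = new java.util.HashMap<>();

    void register(String profileId, String email) {
        emails.put(profileId, email);
    }

    @Override
    public String emailFor(String profileId) {
        return emails.get(profileId);
    }
}
```

Because only the owning team sees the storage, the document schema can evolve without breaking other services.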

Page 15: Data Processing Solutions

Page 16: Hadoop in a Nutshell

• An open-source framework for distributed storage and distributed, batch-oriented processing
• Hadoop Distributed File System (HDFS) to store data on commodity hardware
• YARN as the resource management platform
• MapReduce as the programming model working on top of HDFS
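The MapReduce programming model can be illustrated without any Hadoop dependency: a map phase emits key/value pairs, the framework groups them by key, and a reduce phase folds each group. A word-count sketch in plain Java (this simplification skips partitioning and distribution entirely):

```java
import java.util.*;

// Word count in MapReduce style: a map phase emitting (word, 1) pairs
// and a shuffle/reduce phase summing the counts per key.
public class WordCount {
    public static Map<String, Integer> run(List<String> lines) {
        // Map phase: emit (word, 1) for every word in every line.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                pairs.add(Map.entry(word, 1));
            }
        }
        // Shuffle + reduce phase: group pairs by key and sum the values.
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }
}
```

In real Hadoop the map and reduce phases run on different machines and the pairs are shuffled over the network; the logical model is the same.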

Page 17: Spark in a Nutshell

• Spark is a top-level Apache project
• Can run on top of YARN and can read any Hadoop API data source, including HDFS or MongoDB
• A fast and general engine for large-scale data processing and analytics
• Advanced DAG execution engine with support for data locality and in-memory computing

Page 18: Flink in a Nutshell

• Flink is a top-level Apache project
• Can run on top of YARN and can read any Hadoop API data source, including HDFS or MongoDB
• A distributed streaming dataflow engine
• Handles both streaming and batch workloads
• Iterative in-memory execution and handling
• Cost-based optimizer

Page 19: Latency of query operations

(Chart: time/latency of operation types – Query, Aggregation, MapReduce, Cluster Algorithms – as handled by MongoDB, Hadoop, and Spark/Flink.)

Page 20: Iterative Algorithms / Clustering

Page 21: K-Means in Pictures

• Source: Wikipedia, k-means clustering

Page 22: K-Means as a Process
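The k-means process pictured here alternates two steps: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points. A self-contained sketch of one such iteration in plain Java (no Flink APIs; 2-D points for brevity):

```java
// One k-means iteration: assign points to their nearest centroid, then
// recompute each centroid as the mean of its assigned points.
public class KMeansStep {
    // Index of the centroid closest to point p (squared Euclidean distance).
    public static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            double dx = p[0] - centroids[i][0], dy = p[1] - centroids[i][1];
            double d = dx * dx + dy * dy;
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
    }

    // One assignment + update pass; returns the recomputed centroids.
    public static double[][] update(double[][] points, double[][] centroids) {
        double[][] sums = new double[centroids.length][2];
        int[] counts = new int[centroids.length];
        for (double[] p : points) {
            int c = nearest(p, centroids);
            sums[c][0] += p[0];
            sums[c][1] += p[1];
            counts[c]++;
        }
        for (int i = 0; i < centroids.length; i++) {
            if (counts[i] > 0) {
                sums[i][0] /= counts[i];
                sums[i][1] /= counts[i];
            } else {
                sums[i] = centroids[i]; // keep an empty cluster where it is
            }
        }
        return sums;
    }
}
```

Running `update` repeatedly until the centroids stop moving is exactly the loop that the following slides map onto Hadoop, Spark, and Flink.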

Page 23: Iterations in Hadoop and Spark

Page 24: Iterations in Flink

• Dedicated iteration operators
• Tasks keep running across iterations rather than being redeployed for each step
• Caching and optimizations are done automatically
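With Hadoop- or Spark-style iteration, the driver program runs the loop itself, effectively submitting one job per pass; Flink's iteration operators express the same loop inside a single deployed dataflow plan. The driver-side shape looks roughly like this (the step function is a stand-in for one job, not real job-submission code):

```java
// Driver-side iteration, as typically written for Hadoop/Spark-style jobs:
// every pass through the while loop corresponds to launching one more job.
// Flink's iterate() operator expresses this loop inside one dataflow plan,
// so tasks stay deployed across iterations.
public class DriverLoop {
    // Stand-in for one "job": halves the distance to a fixed point,
    // much like one k-means refinement pass shrinks centroid movement.
    static double step(double x) {
        return (x + 10.0) / 2.0;
    }

    public static int iterationsUntilConverged(double x, double epsilon) {
        int iterations = 0;
        while (Math.abs(step(x) - x) > epsilon) {
            x = step(x); // Hadoop/Spark: write results, launch the next job
            iterations++;
        }
        return iterations;
    }
}
```

The per-iteration job launch and result materialization is exactly the overhead that Flink's dedicated iteration operators avoid.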

Page 25: MongoDB Days Germany: Data Processing with MongoDB

Examplecode

Page 26: Reader / Writer Config

// reader config
public static DataSet<Tuple2<BSONWritable, BSONWritable>> readFromMongo(ExecutionEnvironment env, String uri) {
    JobConf conf = new JobConf();
    conf.set("mongo.input.uri", uri);
    MongoInputFormat mongoInputFormat = new MongoInputFormat();
    return env.createHadoopInput(mongoInputFormat, BSONWritable.class, BSONWritable.class, conf);
}

// writer config
public static void writeToMongo(DataSet<Tuple2<BSONWritable, BSONWritable>> result, String uri) {
    JobConf conf = new JobConf();
    conf.set("mongo.output.uri", uri);
    MongoOutputFormat<BSONWritable, BSONWritable> mongoOutputFormat = new MongoOutputFormat<>();
    result.output(new HadoopOutputFormat<BSONWritable, BSONWritable>(mongoOutputFormat, conf));
}

Page 27: Import data

// points
DataSet<Tuple2<BSONWritable, BSONWritable>> inPoints = readFromMongo(env, mongoInputUri + pointsSource);
// centers
DataSet<Tuple2<BSONWritable, BSONWritable>> inCenters = readFromMongo(env, mongoInputUri + centerSource);

DataSet<Point> points = convertToPointSet(inPoints);
DataSet<Centroid> centroids = convertToCentroidSet(inCenters);

Page 28: Converting

public Tuple2<BSONWritable, BSONWritable> map(Tuple2<Integer, Point> integerPointTuple2) throws Exception {
    Integer id = integerPointTuple2.f0;
    Point point = integerPointTuple2.f1;

    BasicDBObject idDoc = new BasicDBObject();
    idDoc.put("_id", id);
    BSONWritable bsonId = new BSONWritable();
    bsonId.setDoc(idDoc);

    BasicDBObject doc = new BasicDBObject();
    doc.put("_id", id);
    doc.put("x", point.x);
    doc.put("y", point.y);
    BSONWritable bsonDoc = new BSONWritable();
    bsonDoc.setDoc(doc);

    return new Tuple2<>(bsonId, bsonDoc);
}

Page 29: Result

Page 30: More…?

Page 31: Takeaways

• Evolution is amazing and exciting!
    – Be ready to learn new things; ask questions across silos!
• Stay focused => start and stay small
    – Evaluate with big documents, but do a PoC focused on the topic
• Extending functionality can be challenging
    – Evolution is outpacing help channels
    – A lot of options (Spark, Flink, Storm, Hadoop, ...)
    – More than just a binary
• Extending functionality is easy
    – Aggregation, MapReduce
    – Connectors opening up a new variety of use cases

Page 32: Next Steps

• Try out Flink
    – http://flink.apache.org/
    – https://github.com/mongodb/mongo-hadoop
    – https://github.com/m4rcsch/flink-mongodb-example
    – http://sparkbigdata.com
• Participate and ask questions!
    – @m4rcsch
    – [email protected]

Page 33: Thank you!

Marc Schwering
Sr. Solutions Architect – EMEA
[email protected]
@m4rcsch