MongoDB Days Germany: Data Processing with MongoDB


MongoDB and Apache Flink / Spark

“How to do Data Processing?”

Marc Schwering

Sr. Solutions Architect – EMEA

marc@mongodb.com

@m4rcsch

2

Agenda For This Session

• Data Processing Architectural Overview

• The Life of an Application

• Separation of Concerns / Real World Architecture

• Apache Spark and Flink Data Processing Projects

• Clustering with Apache Flink

• Next Steps

3

Data Processing Architectural Overview

1. Profile created

2. Enrich with public data

3. Capture activity

4. Clustering analysis

5. Define Personas

6. Tag with personas

7. Personalize interactions

(Diagram: batch analytics over the profile data, enriched with public data.)

Common technologies: R, Hadoop, Spark, Python, Java, and many other options.

Personas change much less often than tagging.

4

Evolution of a Profile (1)

{
  "_id" : ObjectId("553ea57b588ac9ef066428e1"),
  "ipAddress" : "216.58.219.238",
  "referrer" : "kay.com",
  "firstName" : "John",
  "lastName" : "Doe",
  "email" : "johndoe@gmail.com"
}

5

Evolution of a Profile (n+1)

{
  "_id" : ObjectId("553e7dca588ac9ef066428e0"),
  "firstName" : "John",
  "lastName" : "Doe",
  "address" : "229 W. 43rd St.",
  "city" : "New York",
  "state" : "NY",
  "zipCode" : "10036",
  "age" : 30,
  "email" : "john.doe@mongodb.com",
  "twitterHandle" : "johndoe",
  "gender" : "male",
  "interests" : [
    "electronics",
    "basketball",
    "weightlifting",
    "ultimate frisbee",
    "traveling",
    "technology"
  ],
  "visitedCounts" : {
    "watches" : 3,
    "shirts" : 1,
    "sunglasses" : 1,
    "bags" : 2
  },
  "purchases" : [
    { "id" : 1, "desc" : "Power Oxford Dress Shoe", "category" : "Mens shoes" },
    { "id" : 2, "desc" : "Striped Sportshirt", "category" : "Mens shirts" }
  ],
  "persona" : "shoe-fanatic"
}

6

One size/document fits all?

• Profile Data

– Preferences

– Personal information

• Contact information

• DOB, gender, ZIP...

• Customer Data

– Purchase History

– Marketing History

• "Session Data"

– View History

– Shopping Cart Data

– Information Broker Data

• Personalisation Data

– Persona Vectors

– Product and Category recommendations

(Diagram: the application and the batch analytics layer both work against the single document.)

7

Separation of Concerns

• Profile Data

– Preferences

– Personal information

• Contact information

• DOB, gender, ZIP...

• Customer Data

– Purchase History

– Marketing History

• "Session Data"

– View History

– Shopping Cart Data

– Information Broker Data

• Personalisation Data

– Persona Vectors

– Product and Category recommendations

(Diagram: the frontend system split into a Profile Service, Customer Service, Session Service, and Persona Service, with a batch analytics layer behind them.)
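A minimal sketch of what this split can look like in MongoDB, assuming one collection per service. The database, collection, and field names are illustrative, and the sketch uses the modern Java driver (MongoClients), which postdates this talk:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

import java.util.Arrays;

public class ProfileSplitExample {
    public static void main(String[] args) {
        // Hypothetical database and collection names; each service owns exactly one.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("shop");

            // Profile Service: preferences and personal information only.
            db.getCollection("profiles").insertOne(new Document("_id", "u1")
                    .append("firstName", "John").append("lastName", "Doe")
                    .append("email", "john.doe@mongodb.com"));

            // Customer Service: purchase and marketing history.
            db.getCollection("purchases").insertOne(new Document("profileId", "u1")
                    .append("desc", "Power Oxford Dress Shoe").append("category", "Mens shoes"));

            // Session Service: volatile view history and cart data.
            db.getCollection("sessions").insertOne(new Document("profileId", "u1")
                    .append("viewed", Arrays.asList("watches", "bags")));

            // Persona Service: output of the batch analytics layer.
            db.getCollection("personas").insertOne(new Document("profileId", "u1")
                    .append("persona", "shoe-fanatic"));
        }
    }
}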

8

Benefits

• Code does less; documents and code stay focused

• Ability to split
– Different teams
– New languages
– Defined dependencies

9

Advice for Developers (1)

• Code does less; documents and code stay focused

• Ability to split
– Different teams
– New languages
– Defined dependencies

KISS

=> Keep it simple and safe!

=> Clean Code <=

• Robert C. Martin: https://cleancoders.com/

• M. Fowler / B. Meyer et al.: Command Query Separation

Analytics and Personalization

From Query to Clustering


13

Architecture revised

(Diagram: Profile Service, Customer Service, Session Service, and Persona Service sit between the frontend system and the backend systems; data processing attaches on the backend side.)

14

Advice for Developers (2)

• OWN YOUR DATA! (but only the data relevant to you)

• Say no! (to direct data access, i.e. other systems reading your DB directly)
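As a hypothetical illustration of both points: expose a narrow service API and keep the collections private to the owning team. The interface and type names below are invented for the sketch; the query/command split follows the Command Query Separation idea cited earlier:

import java.util.List;
import java.util.Optional;

// Hypothetical contract: consumers call the Profile Service's API;
// only the service itself ever touches its MongoDB collections.
public interface ProfileService {

    // Query side: read-only, returns data, changes nothing.
    Optional<Profile> findByEmail(String email);

    // Command side: changes state, returns nothing.
    void updateInterests(String profileId, List<String> interests);

    // Illustrative value object owned by this service (Java 16+ record).
    record Profile(String id, String firstName, String lastName, List<String> interests) {}
}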

Data Processing Solutions

16

Hadoop in a Nutshell

• An open-source framework for distributed storage and distributed, batch-oriented processing

• Hadoop Distributed File System (HDFS) to store data on commodity hardware

• YARN as the resource management platform

• MapReduce as the programming model working on top of HDFS

17

Spark in a Nutshell

• Spark is a top-level Apache project

• Can be run on top of YARN and can read any Hadoop API data, including HDFS or MongoDB

• A fast and general engine for large-scale data processing and analytics

• Advanced DAG execution engine with support for data locality and in-memory computing

18

Flink in a Nutshell

• Flink is a top-level Apache project

• Can be run on top of YARN and can read any Hadoop API data, including HDFS or MongoDB

• A distributed streaming dataflow engine

• Supports both streaming and batch processing

• Iterative in-memory execution and handling

• Cost-based optimizer

19

Latency of query operations

(Chart: latency over time for query operations, from simple queries through aggregation and MapReduce to cluster algorithms, comparing MongoDB, Hadoop, and Spark/Flink.)

Iterative Algorithms / Clustering

21

K-Means in Pictures

• Source: Wikipedia, k-means clustering

22

K-Means as a Process
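In code, one round of that process, assignment followed by the centroid update, looks roughly like this plain-Java sketch (2D points, squared Euclidean distance; all names are illustrative):

import java.util.ArrayList;
import java.util.List;

public class KMeansStep {
    record Point(double x, double y) {}

    // One k-means iteration: assign each point to its nearest centroid,
    // then recompute each centroid as the mean of its assigned points.
    static List<Point> step(List<Point> points, List<Point> centroids) {
        double[] sumX = new double[centroids.size()];
        double[] sumY = new double[centroids.size()];
        int[] count = new int[centroids.size()];

        for (Point p : points) {
            int nearest = 0;
            double best = Double.MAX_VALUE;
            for (int c = 0; c < centroids.size(); c++) {
                double dx = p.x() - centroids.get(c).x();
                double dy = p.y() - centroids.get(c).y();
                double d = dx * dx + dy * dy; // squared Euclidean distance
                if (d < best) { best = d; nearest = c; }
            }
            sumX[nearest] += p.x();
            sumY[nearest] += p.y();
            count[nearest]++;
        }

        List<Point> updated = new ArrayList<>();
        for (int c = 0; c < centroids.size(); c++) {
            updated.add(count[c] == 0
                    ? centroids.get(c) // keep an empty cluster's centroid unchanged
                    : new Point(sumX[c] / count[c], sumY[c] / count[c]));
        }
        return updated;
    }
}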

23

Iterations in Hadoop and Spark

24

Iterations in Flink

• Dedicated iteration operators

• Tasks keep running for the iterations, not redeployed for each step

• Caching and optimizations done automatically
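A minimal, runnable sketch of these operators using Flink's DataSet API (the same API the example code below uses). The step function here is a toy update standing in for the k-means step; names are illustrative:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class FlinkIterationSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Toy stand-in for a clustering loop: repeatedly refine one value.
        DataSet<Double> initial = env.fromElements(0.0);

        // iterate(n) opens a bulk iteration with at most n supersteps;
        // the tasks stay deployed instead of being redeployed per step.
        IterativeDataSet<Double> loop = initial.iterate(10);

        // The step function; in k-means this would be "assign points,
        // recompute centroids". Here it just moves the value towards 1.0.
        DataSet<Double> step = loop.map(new MapFunction<Double, Double>() {
            @Override
            public Double map(Double x) {
                return x + (1.0 - x) / 2.0;
            }
        });

        // closeWith() feeds each step's result back into the next iteration.
        DataSet<Double> result = loop.closeWith(step);

        result.print(); // ~0.999 after 10 supersteps
    }
}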

Example Code

26

Reader / Writer Config

// Imports (added for completeness): flink-java, flink-hadoop-compatibility,
// and the mongo-hadoop connector.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapred.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.mapred.JobConf;
import com.mongodb.hadoop.io.BSONWritable;
import com.mongodb.hadoop.mapred.MongoInputFormat;
import com.mongodb.hadoop.mapred.MongoOutputFormat;

// Reader config: stream a collection out of MongoDB as (key, value) BSON tuples.
public static DataSet<Tuple2<BSONWritable, BSONWritable>> readFromMongo(ExecutionEnvironment env, String uri) {
    JobConf conf = new JobConf();
    conf.set("mongo.input.uri", uri);
    MongoInputFormat mongoInputFormat = new MongoInputFormat();
    return env.createHadoopInput(mongoInputFormat, BSONWritable.class, BSONWritable.class, conf);
}

// Writer config: emit (key, value) BSON tuples into a MongoDB collection.
public static void writeToMongo(DataSet<Tuple2<BSONWritable, BSONWritable>> result, String uri) {
    JobConf conf = new JobConf();
    conf.set("mongo.output.uri", uri);
    MongoOutputFormat<BSONWritable, BSONWritable> mongoOutputFormat =
            new MongoOutputFormat<BSONWritable, BSONWritable>();
    result.output(new HadoopOutputFormat<BSONWritable, BSONWritable>(mongoOutputFormat, conf));
}

27

Import data

// points
DataSet<Tuple2<BSONWritable, BSONWritable>> inPoints = readFromMongo(env, mongoInputUri + pointsSource);

// centers
DataSet<Tuple2<BSONWritable, BSONWritable>> inCenters = readFromMongo(env, mongoInputUri + centerSource);

DataSet<Point> points = convertToPointSet(inPoints);
DataSet<Centroid> centroids = convertToCentroidSet(inCenters);
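convertToPointSet and convertToCentroidSet are not shown in the deck; a plausible sketch of the first one, assuming the value documents carry numeric x and y fields (mirroring the writer on slide 29) and a Point type like the one in Flink's k-means example:

// Hypothetical helper (not on the slide). Needs
// org.apache.flink.api.common.functions.MapFunction and org.bson.BSONObject
// in addition to the imports shown on slide 26.
public static DataSet<Point> convertToPointSet(DataSet<Tuple2<BSONWritable, BSONWritable>> input) {
    return input.map(new MapFunction<Tuple2<BSONWritable, BSONWritable>, Point>() {
        @Override
        public Point map(Tuple2<BSONWritable, BSONWritable> kv) throws Exception {
            BSONObject doc = kv.f1.getDoc();
            return new Point(((Number) doc.get("x")).doubleValue(),
                             ((Number) doc.get("y")).doubleValue());
        }
    });
}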

28

Converting

// Maps a (clusterId, point) tuple back into a pair of BSON documents:
// the key document carries the _id, the value document the full record.
public Tuple2<BSONWritable, BSONWritable> map(Tuple2<Integer, Point> integerPointTuple2) throws Exception {
    Integer id = integerPointTuple2.f0;
    Point point = integerPointTuple2.f1;

    BasicDBObject idDoc = new BasicDBObject();
    idDoc.put("_id", id);
    BSONWritable bsonId = new BSONWritable();
    bsonId.setDoc(idDoc);

    BasicDBObject doc = new BasicDBObject();
    doc.put("_id", id);
    doc.put("x", point.x);
    doc.put("y", point.y);
    BSONWritable bsonDoc = new BSONWritable();
    bsonDoc.setDoc(doc);

    return new Tuple2<>(bsonId, bsonDoc);
}
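To close the loop, a hypothetical wiring of the converter into the writer from slide 26 (clusteredPoints, PointToBSONConverter, and the URI variables are invented names):

// Wrap the map() above in a class implementing
// MapFunction<Tuple2<Integer, Point>, Tuple2<BSONWritable, BSONWritable>>,
// then write the converted result back to MongoDB and run the job.
DataSet<Tuple2<BSONWritable, BSONWritable>> out =
        clusteredPoints.map(new PointToBSONConverter());
writeToMongo(out, mongoOutputUri + resultCollection);
env.execute("Flink k-means with MongoDB");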

29

Result

30

More…?

31

Takeaways

• Evolution is amazing and exciting!
– Be ready to learn new things; ask questions across silos!

• Stay focused => start and stay small
– Evaluate with big documents, but do a PoC focused on the topic

• Extending functionality can be challenging
– Evolution is outpacing help channels
– A lot of options (Spark, Flink, Storm, Hadoop, ...)
– More than just a binary

• Extending functionality is easy
– Aggregation, MapReduce
– Connectors opening up a new variety of use cases

32

Next Steps

• Try out Flink

– http://flink.apache.org/

– https://github.com/mongodb/mongo-hadoop

– https://github.com/m4rcsch/flink-mongodb-example

– http://sparkbigdata.com

• Participate and ask Questions!

– @m4rcsch

– marc@mongodb.com

Thank you!

Marc Schwering

Sr. Solutions Architect – EMEA

marc@mongodb.com

@m4rcsch
