Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink

BigPetStore-Flink: A Comprehensive Blueprint for Apache Flink. Suneel Marthi, Flink Forward 2015, Berlin



About Me
• Senior Principal Engineer, Office of Technology, Red Hat
• Committer and PMC member on Apache Mahout
• Contributor to DeepLearning4J and Oryx 2.0
• Co-Organizer of Washington DC Apache Flink Meetup
• Founder of Boston Apache Flink Meetup

Outline of Talk
• What is BigPetStore?
• Why BigPetStore?
• Synthetic Data
• BigPetStore - MapReduce, Spark
• BigPetStore - Flink
• Future possibilities

What is BigPetStore?
• Blueprints for Big Data applications
• Consists of:
  – Data generators
  – Examples using tools in the Big Data ecosystem to process data
  – Build system and tests for integrating tools and multiple JVM languages
• Part of Apache Bigtop
• Used for:
  – Templates for infrastructure (build, integration, testing)
  – Educational examples
  – Testing
  – Demos
  – Benchmarking

Why BigPetStore? (1)
As a developer, I want an application blueprint that…
• scales to a size approximating my data domain
• includes idiomatic unit and integration testing
• demonstrates ETL as well as analytics
In other words…
Word count was great for MapReduce, but we need something more to demonstrate the advanced capabilities of newer processing engines.

Why BigPetStore? (2)
PetStores have been around for a while to showcase different technologies, starting with Sun’s Web Petstore in the early days of J2EE.

Everyone knows what a PetStore is, hence it’s intuitive to non-developers

What about a Big Data PetStore?

Vision
• Bigtop Data Generators - a resource for all Apache projects!
• To build more sophisticated blueprints for users and developers
• Useful for smoke testing infrastructure and applications!

Case for Synthetic Data
• Most company data is private and confidential
• Licensing concerns with sharing the data
• Secure data cannot be moved out of production
• Enable more realistic example applications
• Enable more comprehensive testing than regular wordcount or TeraSort

Bigtop Data Generators
• BigPetStore Data Generator
• Bigtop Weatherman
• Bigtop Bazaar
• Locations Library
• Sampler Library
• Name Generator
• Product Generator

BigPetStore-MapReduce (BIGTOP-1270)

• Originally a MapReduce application for demonstrating MapReduce, Pig, and Mahout.
• Primitive “hierarchical” data generator for generating fake pet store transactions (at any scale).
• Part of ASF Bigtop; used at Red Hat and other companies for testing the Hadoop ecosystem.

New Data Generator for BigPetStore

• Motivation: realistic ML/analytics examples
• Goal: more complex patterns embedded in data
• Mathematical modeling and simulation
  – Sampling from PDFs (probability density functions)
  – (Hidden) Markov models
  – Poisson processes
  – Stochastic differential equations

Next step: a platform-independent data generator.

Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in 2014 IEEE Fourth International Conference on Big Data and Cloud Computing (BDCloud), pp. 49-55, 3-5 Dec. 2014.
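As a rough illustration of one technique listed above, the sketch below samples customer visit times from a homogeneous Poisson process by accumulating exponentially distributed gaps. It is not taken from the BigPetStore generator; the function name, the rate, and the one-year horizon are assumptions made for this example.

```python
import random


def poisson_arrival_times(rate_per_day, horizon_days, rng):
    """Return sorted event times (in days) of a homogeneous Poisson
    process with the given rate on the interval [0, horizon_days)."""
    times = []
    # Gaps between consecutive events are exponentially distributed.
    t = rng.expovariate(rate_per_day)
    while t < horizon_days:
        times.append(t)
        t += rng.expovariate(rate_per_day)
    return times


# Hypothetical customer who visits the store about once every five days.
rng = random.Random(42)  # fixed seed so the run is reproducible
visits = poisson_arrival_times(rate_per_day=0.2, horizon_days=365.0, rng=rng)
print(len(visits), "simulated visits in one year")
```

Passing the random number generator in explicitly keeps the simulation reproducible, which matters when synthetic data is used for testing.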

BigPetStore Data Model
• Generative model leveraging well-known mathematical modeling techniques to simulate factors influencing customers’ purchasing habits.
• In several cases, real data is used to parameterize the model.
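To make the idea of a generative purchasing model more concrete, here is a minimal sketch of a first-order Markov chain over product categories. The categories and transition probabilities are invented for illustration; they are not the parameters of the actual BigPetStore model, which the talk says are fitted from real data in several cases.

```python
import random

# Hypothetical transition probabilities: given the category bought last
# time, how likely is each category to be bought next? Each row sums to 1.
TRANSITIONS = {
    "dog food":   {"dog food": 0.6, "cat litter": 0.1, "toys": 0.3},
    "cat litter": {"dog food": 0.1, "cat litter": 0.7, "toys": 0.2},
    "toys":       {"dog food": 0.4, "cat litter": 0.3, "toys": 0.3},
}


def simulate_purchases(start, n_steps, rng):
    """Walk the Markov chain from `start` and return the visited states."""
    state = start
    purchases = [state]
    for _ in range(n_steps):
        options = TRANSITIONS[state]
        # Draw the next category according to the current row's weights.
        state = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        purchases.append(state)
    return purchases


history = simulate_purchases("dog food", 10, random.Random(7))
print(history)
```

Replacing the hand-written table with probabilities estimated from real transaction data is what parameterizing the model would look like in this toy setting.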

BigPetStore Data Model

BigPetStore-TransactionQueue
• No need for API calls, just use Docker
• Generate load for any app, not just JVM apps
• docker run -t -i smarthi/bigpetstore-transaction-queue

BigPetStore-Spark (BIGTOP-1535)

• RJ Nowling rewrote the BigPetStore data generator components to generate more complex data sets, with patterns varying in many dimensions.
• BigPetStore-Spark was then added to ASF Bigtop, demonstrating that the data generator could be used in a distributed context.

BigPetStore-Flink (BIGTOP-1927 & BIGTOP-1928)

• A Flink application blueprint.
• Generates data at any scale.
• Uses Flink streams to write generated data to disk.
• Uses Flink DataStream transformations to transform data sets for analytics.
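The kind of transformation mentioned above can be pictured as a map step, a grouping by key, and a sum. The sketch below shows that shape in plain Python only; it does not use the Flink API, and the record fields and the revenue_per_store helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical generated transactions; the real generator's schema may differ.
transactions = [
    {"store": "S1", "customer": "C1", "product": "dog food", "price": 20.0},
    {"store": "S2", "customer": "C2", "product": "cat litter", "price": 12.0},
    {"store": "S1", "customer": "C3", "product": "toys", "price": 7.5},
]


def revenue_per_store(records):
    """Map each record to (store, price), group by store, and sum prices."""
    totals = defaultdict(float)
    for store, price in ((r["store"], r["price"]) for r in records):
        totals[store] += price
    return dict(totals)


print(revenue_per_store(transactions))
```

In a streaming engine the same three steps would run continuously over incoming records instead of over a finished list.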

BigPetStore Flink

Future Endeavors
• How to help users build their own models?
• How to use the Bigtop Data Generators for load testing?
• How to produce synthetic copies from real datasets?
• Better libraries and abstractions to reduce boilerplate
• Research: investigating probabilistic programming languages, which provide advanced sampling and inference algorithms combined with high-level DSLs for model specifications

Future: BigPetStore - Flink
A BigPetStore blueprint for:
• Flink Batch
• Flink Table API
• Flink ML algorithms

Resources
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in 2014 IEEE Fourth International Conference on Big Data and Cloud Computing (BDCloud), pp. 49-55, 3-5 Dec. 2014.

https://github.com/apache/bigtop/tree/master/bigtop-data-generators

https://github.com/apache/bigtop/tree/master/bigtop-bigpetstore

Bigtop Data Generators available as a library: http://dl.bintray.com/rnowling/bigpetstore

TL;DR
• Bigtop Data Generators - a resource for all Apache Big Data projects
  – Comprehensive blueprints
  – Smoke and integration testing
  – Load testing
• Flink BigPetStore soon to be part of Apache Bigtop (BIGTOP-1927 & BIGTOP-1928)
• Future endeavors
  – Expand BigPetStore Flink as new Flink features become available
  – Make models easier to build
  – Easier ways to generate synthetic data from models built on real data