paytm labs soyouwanttodatascience

So you want to data science. Adam Muise Chief Architect

Upload: adam-muise

Post on 17-Jul-2015


TRANSCRIPT

Page 1: Paytm labs soyouwanttodatascience

So you want to data science.

Adam Muise

Chief Architect

Page 2: Paytm labs soyouwanttodatascience

Who am I?

•  Chief Architect at Paytm Labs

•  Paytm Labs is a data-driven lab founded to take on the really hard problems of scaling up Fraud, Recommendation, Rating, and Platform at Paytm

•  Paytm is an Indian Payments/Wallet company, has 50 Million wallets already, adds almost 1 Million wallets a day, and will be greater than 100 Million customers by the end of the year. Alibaba recently invested in us, perhaps you heard.

•  I’ve also worked with Data Science teams at IBM, Cloudera, and Hortonworks

Page 3: Paytm labs soyouwanttodatascience

Paytm

Page 4: Paytm labs soyouwanttodatascience

This presentation is short so that you can ask a lot of questions.

Page 5: Paytm labs soyouwanttodatascience

Wisdom Nuggets…

Page 6: Paytm labs soyouwanttodatascience

The Leadership

Page 7: Paytm labs soyouwanttodatascience

The Leadership

If you are creating a data science team, chances are that you are not a Data Scientist. Data Scientists are best applied to the problems of data, not management.

Page 8: Paytm labs soyouwanttodatascience

The Leadership

Your boss (should ask): Why do you even need data science to solve the problem?

You (should) answer: The problem is too complex to solve without machine learning. Here’s why.

You (should not) answer: Big data and data science are on the roadmap.

Page 9: Paytm labs soyouwanttodatascience

The Leadership

You have your budget for a team of 2 data scientists. That’s a good start, right? Get ready to ask for more money.

Page 10: Paytm labs soyouwanttodatascience

The Leadership

You need to ask your management for:

-  Budget for 2 data engineers for every data scientist you hire

-  Access to the data lake or, failing that, access to the data warehouse

-  DevOps

-  Time to gain domain expertise before producing results

-  Exec-level cooperation from those teams who own the data and tools you need and those who understand the data you need

-  A budget for servers/tools/additional storage based on a TCO calculation you already did (right?)

-  A dedicated place for your team to work

Page 11: Paytm labs soyouwanttodatascience

The Leadership

Got DataLake?

No? Depending on your problem space, chances are you are building one unless you can pull what you need from an existing Data Warehouse.

Page 12: Paytm labs soyouwanttodatascience

The Leadership

You didn’t do a TCO (Total Cost of Ownership) calculation? Ok, here you go:

1.  Internal/External cloud instances that can run Spark/Hadoop/etc

2.  Storage costs (S3, internal, etc) for your analytical data sets

3.  Lead time to get started, something like 1-2 months depending on the complexity of the problem (Fraud might take 3 months whereas Recommendation Engines might be 1 month)

4.  Training time and costs for tools you didn’t know you needed

What | How much

24-32 medium to large instances on AWS each month | $15,000 to $45,000 per month

Storage costs for S3 (400TB to 2PB) | $12,000 to $57,000 per month

Salaries & Operating Expenses | 2 x $xxxxx your operating costs including salaries for yourself and 3 people

Training (Courses for Tools and perhaps a conference trip for hiring) | $5,000 to $15,000
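To make the arithmetic concrete, here is a rough sketch that adds up the infrastructure ranges quoted above. It assumes training is a one-time cost (the slide gives no period for it) and leaves salaries out entirely, because the slide only gives a placeholder figure for them. The function and variable names are illustrative.

```python
# Monthly cost ranges in USD, copied from the slide's table.
MONTHLY_RANGES = {
    "aws_instances": (15_000, 45_000),
    "s3_storage": (12_000, 57_000),
}

# Assumption: training is paid once, not monthly.
TRAINING_ONE_TIME = (5_000, 15_000)


def monthly_infrastructure_range(ranges):
    """Sum the low ends and the high ends of every monthly range."""
    low = sum(lo for lo, _ in ranges.values())
    high = sum(hi for _, hi in ranges.values())
    return low, high


def first_year_range(ranges, one_time, months=12):
    """Monthly range times the number of months, plus the one-time cost."""
    low_m, high_m = monthly_infrastructure_range(ranges)
    return low_m * months + one_time[0], high_m * months + one_time[1]


if __name__ == "__main__":
    print(monthly_infrastructure_range(MONTHLY_RANGES))
    print(first_year_range(MONTHLY_RANGES, TRAINING_ONE_TIME))
```

Under these assumptions the infrastructure alone lands somewhere between $27,000 and $102,000 per month before any salaries are counted.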

Page 13: Paytm labs soyouwanttodatascience

The Team

Page 14: Paytm labs soyouwanttodatascience

The Team

So you have permission, resources, and a corner in an office. How do you start?

Page 15: Paytm labs soyouwanttodatascience

The Team

Assemble your team in the following order:

1. Get a Data Engineer with a good analytical mind. Have him beg, borrow, or steal whatever data sets that might be applicable to the problem. Without data, no data sciencey stuff can happen.

Page 16: Paytm labs soyouwanttodatascience

The Team

Assemble your team in the following order:

2. While you are getting your data, hire or recruit an internal Data Scientist.

Easy, right?

Page 17: Paytm labs soyouwanttodatascience

!!!!!!WARNING!!!!!!

Data Science is not a mystical art form handed down by monks and taught over 50 years. You just need:

•  a good math background

•  academic or job experience with machine learning

•  business context

•  an understanding of how to code

That can be easier to find than you think.

That being said, everybody seems to think they are data scientists these days, from the guy who writes the monthly SQL reports to your office manager who is a whiz at Excel.

Page 18: Paytm labs soyouwanttodatascience

The Team

Assemble your team in the following order:

3. More Data Engineers.

4. DevOps support (if you don’t have a common resource pool to draw from).

Page 19: Paytm labs soyouwanttodatascience

The Team

Keep your data science team innovative, keep them away from bureaucracy, keep them cool. Don’t discount the cool factor.

They are supposed to solve hard problems, not deal with the everyday business issues. To be objective, they need to be decoupled from the emergencies and the mediocre.

If that sounds elitist then I challenge you to create a scaling fraud detection system with your existing data warehouse team. No really, try it.

Page 20: Paytm labs soyouwanttodatascience

The Team

What will they do?

The Data Engineer

Your data engineer is the heart and soul of your data science team and will get almost none of the credit in the end. They will help build your data pipeline, perform data transformations, optimize training, automate validation, and take the results into production.

If you are lucky, you have Data Scientists that respect this role and will often take some of these roles on to help ensure their vision reaches production. Instead of relying on luck, you can hire this way too.

Page 21: Paytm labs soyouwanttodatascience

The Team

What will they do?

The Data Scientist

Your Data Scientist will explore the data, create models, validate, explore the data again, go in a different direction, clarify requirements, model again, validate, retract, and then produce a good model. The process is not deterministic and is a mix of research and implementation. A good Data Scientist will be able to code in the tools that you intend to implement production code with, something like Scala in Spark.

Your Data Scientist will have or at least learn the business context required to solve your problem. They will need to communicate with business experts to validate that their solutions actually solve the problem or to help drive them in a new direction.

Page 22: Paytm labs soyouwanttodatascience

The Team

What will they do?

DevOps

Developer Operations will help build that data pipeline for you. If you have to build a Data Lake from scratch, you are going to really rely on these folks. They should be elite, understand distributed systems, ride a motorcycle, and be someone you feel uncomfortable standing next to in an elevator.

Page 23: Paytm labs soyouwanttodatascience

Managing The Team

If your Data Scientists are not stellar coders, put a Data Engineer in their grill and make them produce code. They can’t contribute if they can’t get their hands dirty. Data Science is not an ivory tower.

Page 24: Paytm labs soyouwanttodatascience

Managing The Team

Introduce your team to the business team that knows the data or business processes better than anyone else. Often that’s not the CIO-favored DWH team, but rather the Customer Service Representatives*

*This was especially true in fighting Fraud.

Page 25: Paytm labs soyouwanttodatascience

Managing The Team

Ways to make your team hate you:

Data Scientists:

•  Don’t provide the data they need to create their models

•  Suggest that they create their own training data, from scratch

•  Provide ambiguous goals for the accuracy and precision of their models

•  Tell them to mine the data / don’t have a plan

•  Don’t respect the time it takes to create a model

Data Engineers:

•  Let the Data Scientists use whatever tool they want without respect to parallel processing or implementation

•  Have no management control over your data sources

DevOps:

•  Use anything by IBM, Microsoft, SAS, or Oracle in your pipeline

•  Let the Data Engineers decide on the infrastructure

Page 26: Paytm labs soyouwanttodatascience

The Work

Page 27: Paytm labs soyouwanttodatascience

The Work

Start out with a clear goal that is unambiguous.

“I want to detect and prevent 50% of Fraud in my payments system”

“I want to increase conversion rates in my eCommerce platform by 20%”

Page 28: Paytm labs soyouwanttodatascience

The Work

Get as much of the raw data as soon as you can and as fast as you can. Don’t have a Data Lake? Get your Hadoop on ASAP.

Page 29: Paytm labs soyouwanttodatascience

The Work

Give the team time to research the data, gain context and become experts.

Page 30: Paytm labs soyouwanttodatascience

The Work

Data without context == a complete lack of direction in research.

Research needs constant checks to ensure that the primary problem is being solved.

Page 31: Paytm labs soyouwanttodatascience

The Work

Data Science Development != Engineering Software Development.

You will have to separate your research process from the engineering process that delivers the models to production.

Page 32: Paytm labs soyouwanttodatascience

The Work

Data Engineering is an ongoing process. You will need to maintain pipelines, adapt to schema changes, implement data cleansing, maintain metadata in the data lake, optimize processing workflows, etc. You will never outgrow the need for your Data Engineers.

Page 33: Paytm labs soyouwanttodatascience

The Architecture

Page 34: Paytm labs soyouwanttodatascience

The Architecture

Start with the cloud. You need to get your infrastructure up as quickly as possible. At the beginning, this is cheaper than you think compared to the time and startup costs for creating an on-premise data lake, even/especially if you have an existing IT Team*

*If you are a big corporation, your IT team is often the biggest barrier to your success in creating an independent Data Science team.

Page 35: Paytm labs soyouwanttodatascience

The Architecture

We had to build a data lake. It looks like this:

Page 36: Paytm labs soyouwanttodatascience

The Architecture

Lambda Architecture

Batch Ingest:

•  SQOOP from MySQL instances

•  Keep as much in HDFS as you can, offload to S3 for DR/Archive and when you have colder data

•  Spark and other Hadoop processing tools can run natively over S3 data so it’s never really gone (don’t use Glacier in a processing workflow)

Realtime Ingest:

•  Mypipe to get events from binary log data and push into Kafka topics (under construction)

•  VoltDB connector to get events from DB and push to Kafka (under construction)

•  Streaming data piped through Kafka

•  All Realtime data processed with Spark Streaming or Storm from Kafka

Page 37: Paytm labs soyouwanttodatascience

The Architecture

As you grow, your processing and storage needs will likely mature. Consider moving to an on-premise solution for your Hadoop/Processing architecture. You can always archive to S3 if you need DR and don’t have the appetite to create two clusters.

Page 38: Paytm labs soyouwanttodatascience

The Architecture

With an on-premise architecture, you can interact with existing on-premise production systems quickly. For us, that means real-time Fraud detection and action. You may find yourself maintaining both in the long run.

Page 39: Paytm labs soyouwanttodatascience

What Actual Data Science looks like…

Page 40: Paytm labs soyouwanttodatascience

[email protected] - @jabenitez

Supervised learning vs Anomaly detection

Anomaly detection:

๏  Very small number of positive examples

๏  Large number of negative examples.

๏  Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.

Supervised learning:

๏  Ideally large number of positive and negative examples.

๏  Enough positive examples for algorithm to get a sense of what positive examples are like, future positive examples likely to be similar to ones in training set.

* Anomaly Detection - Andrew Ng - Coursera ML Course

Page 41: Paytm labs soyouwanttodatascience

What approach to follow?

๏  Not so good: One model to rule them all

๏  Better:

๏  Many models competing against each other

๏  100s or 1000s of rules running in parallel

๏  Know thy customer
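As a rough illustration of "many rules" rather than one model, the sketch below evaluates a set of independent rules against a single transaction and reports which ones fired. Everything here is hypothetical: the rule names, field names, and thresholds are invented for the example, and a real system would distribute the evaluation (the deck mentions Storm and Spark Streaming) rather than loop in one process.

```python
# Hypothetical rules: each takes a transaction dict and returns True if it fires.
# Field names and thresholds are illustrative assumptions, not from the slides.
RULES = {
    "large_amount": lambda txn: txn["amount"] > 10_000,
    "high_velocity": lambda txn: txn["txns_last_hour"] > 20,
    "new_account": lambda txn: txn["account_age_days"] < 1,
}


def fired_rules(txn, rules=RULES):
    """Evaluate every rule independently and return the names that fired."""
    return sorted(name for name, rule in rules.items() if rule(txn))


if __name__ == "__main__":
    example = {"amount": 15_000, "txns_last_hour": 3, "account_age_days": 0.5}
    print(fired_rules(example))
```

Because each rule only looks at the transaction it is given, rules can be added, removed, or run on separate workers without touching the others.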


Page 42: Paytm labs soyouwanttodatascience

Feature Selection

๏  Want p(x) large (small) for normal examples, p(x) small (large) for anomalous examples

๏  Most common problem: comparable distributions for both normal and anomalous examples

๏  Possible solutions:

๏  Apply transformation and variable combinations:

๏  $x_{n+1} = (x_1 + x_4)^2 / x_3$

๏  Focus on variable ratios and transaction velocity

๏  Use deep learning for feature extraction

๏  Dimensionality reduction

๏  your solution here
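A small sketch of the first two ideas in the list: a combined variable of the form (x1 + x4)^2 / x3, and a transaction-velocity count over a trailing time window. The function names, the window definition, and the choice to reject a zero denominator are assumptions made for illustration.

```python
def combined_feature(x1, x3, x4):
    """Derived feature (x1 + x4)^2 / x3, as in the slide's example."""
    if x3 == 0:
        # Assumption: a zero denominator is treated as invalid input.
        raise ValueError("x3 must be non-zero")
    return (x1 + x4) ** 2 / x3


def transaction_velocity(timestamps, window_seconds, now):
    """Count transactions whose timestamp falls in the trailing window ending at `now`."""
    return sum(1 for t in timestamps if 0 <= now - t <= window_seconds)


if __name__ == "__main__":
    print(combined_feature(1.0, 2.0, 3.0))
    print(transaction_velocity([0, 50, 90, 100], window_seconds=60, now=100))
```

Ratios and velocities like these are cheap to compute per transaction, which matters if the same features also have to be produced in the near-realtime path.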


Page 43: Paytm labs soyouwanttodatascience

Feature Selection

Page 44: Paytm labs soyouwanttodatascience

Feature Selection

[Figure: counts versus Variable X, with BKG and SIG distributions]

Page 45: Paytm labs soyouwanttodatascience

What we have tried

๏  Density estimator

๏  2D Profiles

๏  Anomaly detection

๏  Clustering

๏  Model ensemble (Random forest)

๏  Deep learning (RBM)

๏  Logistic Regression

Combine

Page 46: Paytm labs soyouwanttodatascience

Gaussian distribution

Page 47: Paytm labs soyouwanttodatascience

Anomaly Detection* - Example

๏  Choose features, $x_i$, that are indicative of anomalous examples.

๏  Fit parameters $\mu_i, \sigma_i^2$ to a normal distribution

๏  Given new example, compute: $p(x) = \prod_{i=1}^{n} p(x_i; \mu_i, \sigma_i^2)$

๏  Anomaly if $p(x) < \varepsilon$

* Anomaly Detection - Andrew Ng - Coursera ML Course
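The steps on this slide can be sketched in a few lines of plain Python. This assumes the formulation from the cited course: each feature is modeled by its own univariate Gaussian, the density of an example is the product of the per-feature densities, and an example is flagged when that density falls below a threshold. The function names and the threshold value are illustrative, and the sketch assumes every feature has non-zero variance.

```python
import math


def fit_gaussian(rows):
    """Estimate the per-feature mean and variance from normal (non-anomalous) rows."""
    n = len(rows)
    dims = len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(dims)]
    var = [sum((r[j] - mu[j]) ** 2 for r in rows) / n for j in range(dims)]
    return mu, var


def density(x, mu, var):
    """Product of univariate Gaussian densities, one factor per feature."""
    p = 1.0
    for xj, m, v in zip(x, mu, var):
        p *= math.exp(-((xj - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
    return p


def is_anomaly(x, mu, var, epsilon):
    """Flag the example when its density is below the chosen threshold."""
    return density(x, mu, var) < epsilon


if __name__ == "__main__":
    normal_rows = [(1.0, 10.0), (2.0, 12.0), (3.0, 11.0), (2.0, 11.0)]
    mu, var = fit_gaussian(normal_rows)
    print(is_anomaly((10.0, 30.0), mu, var, epsilon=1e-3))
```

The product form treats features as independent, which is one reason the feature engineering on the earlier slides matters so much.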

Page 48: Paytm labs soyouwanttodatascience

Algorithm Evaluation

๏  Fit model on training set

๏  On a cross validation/test example, predict

๏  Possible evaluation metrics:

๏  True positive, false positive, false negative, true negative

๏  Precision/Recall

๏  F1-score
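For reference, a minimal sketch of these metrics computed from binary labels, where 1 means anomalous and 0 means normal. The function names are illustrative, and returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1; each falls back to 0.0 if its denominator is zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    labels = [1, 0, 0, 1, 0, 0, 0, 1]
    predictions = [1, 0, 1, 0, 0, 0, 0, 1]
    print(confusion_counts(labels, predictions))
    print(precision_recall_f1(labels, predictions))
```

With so few positive examples, plain accuracy says very little; a model that never flags anything can still look accurate, which is why the slide lists precision, recall, and F1 instead.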


Page 49: Paytm labs soyouwanttodatascience

Implementation

Page 50: Paytm labs soyouwanttodatascience

Anomaly Detection*

Assume we have some labeled data, of anomalous and non-anomalous examples: y = 0 if standard behaviour, y = 1 if anomalous.

Training set: (assume normal examples/not anomalous)

Cross validation set:

Test set:

* Anomaly Detection - Andrew Ng - Coursera ML Course
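One common use of the labeled cross validation set is to pick the density threshold. The sketch below assumes densities have already been computed for each cross validation example, tries a list of candidate thresholds, and keeps the one with the best F1. The function names, the candidate values, and the tie-breaking (first best wins) are assumptions for illustration.

```python
def f1_from_predictions(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_epsilon(cv_densities, cv_labels, candidates):
    """Return (threshold, F1) for the candidate with the highest F1 on the CV set."""
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        # Flag an example as anomalous (1) when its density is below the threshold.
        preds = [1 if p < eps else 0 for p in cv_densities]
        score = f1_from_predictions(cv_labels, preds)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1


if __name__ == "__main__":
    densities = [0.30, 0.25, 0.001, 0.20, 0.0005, 0.02]
    labels = [0, 0, 1, 0, 1, 0]
    print(select_epsilon(densities, labels, [0.0001, 0.005, 0.05]))
```

The test set stays untouched during this search so that the final precision and recall numbers are not tuned to the same examples.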

Page 51: Paytm labs soyouwanttodatascience

Transform, Normalize, Calculate

Page 52: Paytm labs soyouwanttodatascience

Scala

Page 53: Paytm labs soyouwanttodatascience

Creating Scalable Architecture

Futures

Page 54: Paytm labs soyouwanttodatascience

The lake again

[Diagram labels: Lake Simcoe; Lake Superior; Classic Lambda Architecture; Various Processing Frameworks; Near-Realtime Scoring/Alerting*]

Page 55: Paytm labs soyouwanttodatascience

Fraud Capabilities and Technology

Capabilities:

A.  Batch Ingest and Analysis of transaction data from Database

B.  Batch Behavioural and Portfolio heuristic fraud detection

C.  Near-realtime anomaly and heuristic fraud detection

D.  Online Model Scoring

Technology:

A.  Traditional ETL tools for transfer, HDFS/S3 for storage, Spark for processing

B.  Model analysis with iPython/Scala Notebook, Spark for processing, HDFS/HBase/Cassandra for storage

C.  Kafka real-time ingest, introduce Storm/Spark Streaming for near-realtime interception of data, HBase for model/rule storage and lookup

D.  JPMML/Spark Streaming for realtime model scoring

Page 56: Paytm labs soyouwanttodatascience

Our framework shopping list

Stages: Explore & Train; Ingest, Store, Score, & Act

๏  iPython & Scala Notebooks

๏  Spark ::Core ::MLLib ::Streaming ::GraphX?

๏  Intercept with Storm? Spark Streaming?

๏  Kafka, Hadoop, HBase, Cassandra, SolrCloud, & S3

๏  OpenScoring? JPMML? R?

Page 57: Paytm labs soyouwanttodatascience


Fin