Hadoop Graph Processing with Apache Giraph
Post on 22-Sep-2014
DESCRIPTION
PayPal provides an online money transfer network. Each payment flow connects senders and receivers into a giant network where each sender/receiver is a node and each transaction is an edge. Traditionally, the risk score of a transaction is computed based on the characteristics of the involved sender/receiver/transaction. In this talk, we will describe a novel network inference approach to calculate a transaction risk score that also includes the risk profile of neighboring senders and receivers using Apache Giraph. The approach reveals additional risk insights not possible with the traditional method. We leverage Hadoop to support a graph computation involving hundreds of millions of nodes and edges.
TRANSCRIPT
June, 2013
Jay Tang
GRAPH MINING WITH APACHE GIRAPH
Confidential and Proprietary2
• Introduction
• Big Data problem
• Graph mining platform
• Use case
• Lessons
• Future work
AGENDA
• Director of Big Data Platform & Analytics, PayPal
− Hadoop, Graph mining, Real-time analytics, ML, text mining
• 20 years of software experience in the valley focused on data
• Member of original Hadoop team @Yahoo
• Built data warehouse, relational database, OLAP product @Yahoo, Oracle/Hyperion, IBM Informix, DB2
ABOUT ME
BIG DATA PROBLEM
• Enable Online, Offline, and Mobile payment
• 128M customers worldwide
• $160B payment volume processed annually
• Major retail locations accepting PayPal: 20K today, 2M by end of 2013
• PayPal Here launching in US and international markets
Petabyte Data Problem & Growing
BIG DATA PROBLEM @ PAYPAL
• Detect and prevent fraud
• Assess credit risk
• Deliver relevant offers to our customers
• Improve user experience
• Provide better insights to our merchants
BIG DATA POWERS PAYPAL ANALYTICS
GRAPH MINING PLATFORM
BIG DATA STACK
DataCloud
Traditional data processing abstraction -- TABLE
• Rows
• Columns
• Data Types
DATA ABSTRACTION
• Internet & WWW
• Social network
• PayPal payment network – accounts & transactions
GRAPH IS EVERYWHERE
• Think like a vertex
• Two basic operations
− Fusion: aggregate information from neighbors to a set of entities
− Diffusion: propagate information from a vertex to neighbors
GRAPH COMPUTING
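The two operations on this slide can be illustrated with a minimal sketch. The toy graph, the values, and the function names `fusion` and `diffusion` are hypothetical, and summing is only one possible aggregation; the slides do not say which one is used.

```python
from collections import defaultdict

# Toy undirected graph as an adjacency list (hypothetical data).
graph = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a"],
}
values = {"a": 1.0, "b": 2.0, "c": 3.0}


def fusion(graph, values):
    """Each vertex aggregates (here: sums) the values of its neighbors."""
    return {v: sum(values[n] for n in nbrs) for v, nbrs in graph.items()}


def diffusion(graph, values, source):
    """One vertex pushes its own value out to each of its neighbors."""
    inbox = defaultdict(list)
    for n in graph[source]:
        inbox[n].append(values[source])
    return dict(inbox)


fused = fusion(graph, values)           # pull: neighbors -> vertex
spread = diffusion(graph, values, "a")  # push: vertex -> neighbors
```

Fusion pulls information toward a vertex, while diffusion pushes it away; an iterative graph algorithm typically alternates the two.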
THINK LIKE A VERTEX - FUSION
THINK LIKE A VERTEX - DIFFUSION
• Which graph mining engine to use?
− GraphLab
− Apache Giraph
− Apache Hama
• Hadoop compatible
− Data is on Hadoop
− Leverage existing cluster infrastructure
− Integration with Hadoop
• Ease of deployment and update
• Community
GRAPH MINING ENGINE
• Apache open-source implementation of Google Pregel on Hadoop
• Send messages from a vertex to any other vertex
• In-memory scalable system
− Map-only jobs, Zookeeper, Netty
BSP & GIRAPH
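To make the BSP (bulk synchronous parallel) model concrete, here is a single-process sketch of Pregel-style supersteps. It is not Giraph's API: `run_bsp` and `max_value` are hypothetical names, and the example is the maximum-value propagation commonly used to explain Pregel. Messages sent in one superstep are only visible in the next, and the run ends when no messages remain in flight.

```python
def run_bsp(graph, values, compute, max_supersteps=50):
    """Run synchronous supersteps until no messages are in flight."""
    inbox = {v: [] for v in graph}
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in graph}
        for v in graph:
            # After superstep 0, a vertex with no incoming messages stays halted.
            if superstep > 0 and not inbox[v]:
                continue
            values[v], outgoing = compute(superstep, v, values[v], graph[v], inbox[v])
            for target, msg in outgoing:
                # Messages are addressed by vertex id, not only to neighbors.
                outbox[target].append(msg)
        # Barrier: messages sent in this superstep are delivered in the next.
        inbox = outbox
        if not any(inbox.values()):
            return superstep + 1
    return max_supersteps


def max_value(superstep, vertex, value, neighbors, messages):
    """Every vertex ends up holding the largest value in the graph."""
    new_value = max([value] + messages)
    if superstep == 0 or new_value > value:
        return new_value, [(n, new_value) for n in neighbors]
    return new_value, []  # nothing changed: send nothing


graph = {1: [2], 2: [1, 3], 3: [2]}
values = {1: 3, 2: 6, 3: 1}
supersteps = run_bsp(graph, values, max_value)
```

In a real deployment the loop over vertices is partitioned across workers, and the barrier is where those workers synchronize.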
GRAPH MINING USE CASE
• Stop fraudsters from stealing money from PayPal payment network
• Sophisticated risk models running in real time, based on
− Online data
− Offline data
• Risk profile traditionally based on a variety of data
− Account
− Transaction -- frequency, amount, history
− IP
− Email domain
RISK DETECTION & MITIGATION
RISK COMPUTATION
Current TX Details
Risk Models
Approve
Decline
History Data
• PayPal data are connected
• Form multiple communities that have hidden inferences
• Discover the inferences via a graph approach
• Build a system to extract the inferences
GRAPH MINING CONNECTED DATA
GRAPH VIEW OF DATA
User1
User2
Merchant
BUY
BUY
P2P Money Transfer
GRAPH VIEW OF DATA
Account 1
IP1 IP2
Account 2
IP3
GRAPH MINING DATA PIPELINE
Pre Processing
Graph Processing
Post Processing
Giraph
MapReduce
MapReduce
• Input data is raw transaction data
• Custom MapReduce jobs to pre-process data into graph model
• Output is a JSON-formatted adjacency list
− Easy to consume in Java and by humans
− Use gson library
• Post processing – output format conversion
GRAPH DATA PIPELINE
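The pre-processing step can be sketched as follows. This is a single-process stand-in for the MapReduce job, using Python's `json` module where the slides use gson in Java. The record layout `(sender, receiver, amount)` and the per-vertex line layout `[vertex_id, initial_value, [[target_id, edge_value], ...]]` are assumptions for illustration; the slides do not specify the exact schema.

```python
import json
from collections import defaultdict

# Hypothetical raw transaction records: (sender, receiver, amount).
transactions = [
    (1, 2, 25.0),
    (1, 3, 10.0),
    (2, 3, 5.0),
]


def to_adjacency_lines(transactions):
    """Group transactions by sender and emit one JSON line per vertex."""
    edges = defaultdict(list)
    vertices = set()
    for sender, receiver, amount in transactions:
        edges[sender].append([receiver, amount])
        vertices.update((sender, receiver))
    # Assumed layout: [vertex_id, initial_value, [[target_id, edge_value], ...]]
    return [json.dumps([v, 0.0, edges[v]]) for v in sorted(vertices)]


lines = to_adjacency_lines(transactions)
```

Vertices that only receive payments still get a line with an empty edge list, so every node exists when graph processing starts.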
• Customers/Accounts linked via transactions
• Compute risk = intrinsic risk + risk propagated from peers
• Send risk message to peers
• Iterate until convergence
GRAPH PROCESSING
Cus1
Cus2
Transaction T1
Transaction T0
Transaction T2
Transaction T3
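The iteration described on this slide might look like the sketch below. The slides give only the shape of the rule (intrinsic risk plus risk propagated from peers), so the damping factor, the use of the mean over peers, the tolerance, and the sample data are all assumptions chosen so that the iteration settles.

```python
def propagate_risk(graph, intrinsic, damping=0.5, tol=1e-6, max_iters=100):
    """Iterate risk = intrinsic + damping * mean(peer risk) until it settles."""
    risk = dict(intrinsic)
    for iteration in range(1, max_iters + 1):
        updated = {}
        for v, peers in graph.items():
            # "Messages" from peers: their risk from the previous iteration.
            peer_risk = sum(risk[p] for p in peers) / len(peers) if peers else 0.0
            updated[v] = intrinsic[v] + damping * peer_risk
        delta = max(abs(updated[v] - risk[v]) for v in graph)
        risk = updated
        if delta < tol:  # converged: no vertex changed by more than tol
            return risk, iteration
    return risk, max_iters


# Hypothetical customers linked by transactions; cus1 has high intrinsic risk.
graph = {"cus1": ["cus2"], "cus2": ["cus1", "cus3"], "cus3": ["cus2"]}
intrinsic = {"cus1": 0.9, "cus2": 0.1, "cus3": 0.1}
risk, iterations = propagate_risk(graph, intrinsic)
```

Here `cus3` never transacts with `cus1`, yet its score rises above its intrinsic value because risk reaches it through `cus2`. That indirect signal is what the network approach adds over scoring each account in isolation.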
IP3
IP2
GRAPH PROCESSING
Account 1
IP1 IP2
Account 2
IP3 IP1
LESSONS LEARNED
• Giraph is an emerging technology
− Incubation in 2012
− Rapidly evolving
− 0.1 and 0.2 are not compatible
− Lack of knowledge & documentation
• Build internal git repo
• Read code and join mailing list
• Port code from 0.1 to 0.2
• Use Giraph 1.0, released on May 6, 2013
GIRAPH
• Must guarantee a minimum number of mappers
• Capacity scheduler
− Set the queue's minimum mapper count above what the Giraph job needs
• Fair scheduler
− Set the queue's minimum mapper count above what the Giraph job needs
− Turn on pre-emption
− Set the pre-emption wait time to a small interval – 20 sec
HADOOP ENVIRONMENT INTEGRATION
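The reason for the guarantee is that a Giraph job runs as map-only tasks that must all be live at once to pass each superstep barrier; if only some of them get slots, the job stalls. A small sanity check, under the assumption that the job needs one map slot per worker plus one for a separate master, could be sketched as below. The function names are hypothetical and this is not a Hadoop or Giraph API.

```python
def mappers_required(workers, separate_master=True):
    """Map slots the job must hold at once: its workers, plus a master task."""
    return workers + (1 if separate_master else 0)


def queue_can_host(queue_min_mappers, workers):
    """Apply the slide's rule: queue minimum must exceed what the job needs."""
    # The strict '>' follows the slide's rule of thumb and leaves a little headroom.
    return queue_min_mappers > mappers_required(workers)


needed = mappers_required(workers=40)
ok = queue_can_host(queue_min_mappers=50, workers=40)
too_small = queue_can_host(queue_min_mappers=41, workers=40)
```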
• Memory constraint in a shared Hadoop environment
− 1.2B edges and 300M nodes
− Single purpose POC cluster mapper memory = 10 GB
− Shared R&D cluster mapper memory = 3 GB
• Reducing memory consumption is key
− Convert String to long for graph processing
− Convert back to String in post-processing for downstream application
− Cap the number of messages passed
− distance from current vertex
− message payload data values
MEMORY SCALABILITY
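The String-to-long conversion can be pictured as a dictionary encoding: each distinct identifier gets a compact integer for graph processing, and the mapping is reversed in post-processing. The `IdDictionary` class and the sample identifiers are hypothetical. This in-memory version only illustrates the idea; at the scale quoted on the slide the mapping would have to be built inside the pre-processing jobs instead.

```python
class IdDictionary:
    """Assign each distinct string a compact integer id, and map it back."""

    def __init__(self):
        self._to_id = {}
        self._to_name = []

    def encode(self, name):
        # First sighting of a name gets the next free integer id.
        if name not in self._to_id:
            self._to_id[name] = len(self._to_name)
            self._to_name.append(name)
        return self._to_id[name]

    def decode(self, vertex_id):
        return self._to_name[vertex_id]


ids = IdDictionary()
raw_edges = [("acct-alice", "ip-10.0.0.1"), ("acct-bob", "ip-10.0.0.1")]

# Pre-processing: strings -> integers for the graph engine.
encoded = [(ids.encode(s), ids.encode(t)) for s, t in raw_edges]

# Post-processing: integers -> strings for downstream applications.
decoded = [(ids.decode(s), ids.decode(t)) for s, t in encoded]
```

A fixed-width integer id costs a few bytes per vertex reference, whereas a string id is repeated in full on every edge and message that mentions it, which matters when mapper memory is capped at a few gigabytes.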
• Giraph-based data engine to produce enriched data set
• Leverage Giraph on YARN
• Scalability in the number of workers
FUTURE WORK
Q&A
WE ARE HIRING