data distribution patterns with apache nifi
TRANSCRIPT
Page 1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Data Distribution Patterns with Apache NiFi
March 2016
Bryan Bende – Member of Technical Staff
Page 2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Overview
• Frequently Asked Question: “How do I distribute data across my NiFi cluster?”
• Answer: “It depends…”
• Typically based on whether the data is being pushed or pulled
Page 3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Pushing to Listeners
NCM
Node 1
ListenHTTP
Node 2
ListenHTTP
Load Balancer
Data Producer
Data Producer
Data Producer
• Processors listening for incoming data on each node in the cluster
• Load balancer sitting in front of the cluster pointing at listeners
• Data producers push data to load balancer which distributes data across the cluster
• Same approach works for ListenSyslog, ListenUDP, and HandleHttpRequest
Page 4 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Pulling on All Nodes
NCM
Node 1
GetKafka
Node 2
GetKafka
Kafka Topic
• Works when data source ensures each request gets a unique piece of data
• Kafka sees each GetKafka processor as the same client
• Each GetKafka processor gets different data
Page 5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Pulling With List & Fetch OperationsNCM
Node 1 (Primary)
ListHDFS
HDFS
FetchHDFS
RPG
Input Port
Node 2
ListHDFS
FetchHDFS
RPG
Input Port
• Perform “list” operation on primary node
• Send results to Remote Process Group pointing back to same cluster
• Redistributes results to all nodes to perform “fetch” in parallel
• Same approach for ListFile + FetchFile and ListSFTP + FetchSFTP
Page 6 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Site-To-Site Push
NCM
Node 1
Input Port
Node 2
Input Port
Standalone NiFi
RPG
• Site-To-Site makes a direct connection between NiFi instances
• Either side can be a cluster or standalone instance
• In a push scenario, the source connects a Remote Process Group to an Input Port on the destination
• If pushing to a cluster, Site-To-Site takes care of load balancing across the nodes in the cluster
Page 7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Site-To-Site Pull
NCM
Node 1
RPG
Node 2
RPG
Standalone NiFi
Output Port
• In a pull scenario, the destination connects a Remote Process Group to an Output Port on the source
• Each node will pull different data from the source
• If the source was a cluster, each node would pull from each node in the source cluster
Page 8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Summary
Several mechanism for distributing data across a cluster…• Push to Listeners with a load balancer in front
• Pull from a source that provides a queue
• Pull with a List + Fetch approach
• Push with Site-to-Site
• Pull with Site-to-Site
Contact Info• [email protected]
• Twitter - @bbende