Integrating Apache Spark and NiFi for Data Lakes
TRANSCRIPT
MAKING BIG DATA COME ALIVE
Integrating Apache Spark and NiFi for Data Lakes
Ron Bodkin, Founder & President · Scott Reisdorf, R&D Architect
Agenda
• Requirements
• Design
• Demo
Goals for a Data Lake
• A central repository with trusted, consistent data
• Reduce costs by offloading analytical systems and archiving cold data
• Derive value quickly with easier discovery and prototyping
• A laboratory for experimenting with new technologies and data
What’s Needed for a Hadoop Data Lake?
• Automation of pipelines with metadata and performance tracking
• Governance with clear distinction of roles and responsibilities
• SLA tracking with alerts on failures or violations
• Interactive data discovery and experimentation
Example Ingestion Project
• 4,000+ unique flat files and RDBMS tables, plus a few streaming data feeds
• Mix of incremental and snapshot data
• Ingest into Hadoop (minimally HDFS and Hive tables)
• Cleansing/encryption and data validation
• Metadata capture
Focus shifts over time from data ingestion to transformation, and then to analytics.
Design
Apache Spark Functions
• Cleanse
• Validate
• Profile
• Wrangle
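To make the "cleanse" step concrete, here is a minimal sketch in plain Python standing in for the record-level logic a Spark job would apply. The field names and rules are invented for illustration; the slides do not show the framework's actual cleansing code.

```python
def cleanse(record):
    """Standardize one record: trim strings, turn empty strings into
    NULLs, and normalize key names to lowercase."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip() or None   # empty strings become NULLs
        cleaned[key.strip().lower()] = value
    return cleaned

raw = {"Name ": "  Alice  ", "City": "", "Age": 34}
print(cleanse(raw))  # {'name': 'Alice', 'city': None, 'age': 34}
```

In Spark this kind of function would run as a map over a DataFrame or RDD partition, so the same per-record logic scales across the cluster.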
© 2016 Think Big, a Teradata Company
Pipeline Design with Apache NiFi
• Visual drag-and-drop
• Dozens of data connectors
• 150+ pre-built transforms
• Data lineage
• Batch and streaming
• Extensible
Role Separation

Apache NiFi:
• IT designers design models in NiFi
• Register with framework
• Integrated development process

Think Big framework:
• Users configure new feeds
• Based on common model
• Generated and executed in NiFi

[Diagram: models are registered with the framework; feeds are deployed to NiFi]
Design Approach
• User features around org. roles
• Visual design
• Streaming and batch
• Fully governed
• Integrated best practices
• Secure, modern architecture
• Will be open source (Apache license)
Ingest and Prepare
• UI-guided feed creation
• Data protection
• Data cleanse
• Data validation
• Data profiling
• Powered by Apache Spark
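The "data validation" bullet above can be sketched as metadata-driven rules routed over each record, with failures captured for review. The rule names and fields below are hypothetical; the real framework runs equivalent checks as a Spark job configured from feed metadata.

```python
import re

# A small registry of named validation rules (illustrative only).
RULES = {
    "not_null": lambda v: v is not None,
    "email": lambda v: v is not None
        and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}

def validate(record, field_rules):
    """Apply the named rules to each field.
    field_rules maps field name -> list of rule names.
    Returns (is_valid, list of failure reasons)."""
    reasons = []
    for field, rule_names in field_rules.items():
        for name in rule_names:
            if not RULES[name](record.get(field)):
                reasons.append(f"{field}: failed {name}")
    return (not reasons, reasons)

record = {"id": 1, "email": "bad-address"}
ok, why = validate(record, {"id": ["not_null"], "email": ["not_null", "email"]})
print(ok, why)  # False ['email: failed email']
```

Records that fail would typically land in an "invalid" table with their reasons, while valid records continue down the pipeline.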
Data Ingest Model

Metadata determines the behavior of individual components. Adds many Hadoop-specific, higher-level NiFi processors.

Sources:
• Extract Table (JDBC)
• Get File(s) (Filesystem)
• Message (JMS/Kafka)
• Other (HTTP/REST, etc.)

Pipeline:
• Unpack and/or merge small files
• Put file to HDFS
• Cleanse/Standardize (Spark)
• Validate (Spark)
• Data Profile (Spark)
• Merge / Dedupe (Hive)
• Index Text (Elasticsearch)
• Compress & Archive Originals (HDFS, S3)

Metadata and data policies drive each step.
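The idea that "metadata determines behavior of individual components" can be sketched as a feed descriptor selecting which steps run and with what options. The step names and options below are invented for illustration, not the framework's actual processor names.

```python
def build_pipeline(feed_metadata, registry):
    """Turn a feed's metadata into an ordered list of callables.
    Each step entry names a registered component plus its options."""
    steps = []
    for step in feed_metadata["steps"]:
        fn = registry[step["name"]]
        # Bind the function and its options now, so each step is self-contained.
        steps.append(lambda rec, fn=fn, opts=step.get("options", {}): fn(rec, **opts))
    return steps

# Hypothetical component registry.
registry = {
    "uppercase_field": lambda rec, field: {**rec, field: rec[field].upper()},
    "drop_field": lambda rec, field: {k: v for k, v in rec.items() if k != field},
}

# Hypothetical feed metadata: which components run, in order, with options.
meta = {"steps": [
    {"name": "uppercase_field", "options": {"field": "user"}},
    {"name": "drop_field", "options": {"field": "ssn"}},
]}

record = {"user": "alice", "ssn": "123-45-6789"}
for step in build_pipeline(meta, registry):
    record = step(record)
print(record)  # {'user': 'ALICE'}
```

Changing the feed's behavior then means editing metadata, not redeploying code, which is what lets non-IT users configure new feeds from a common model.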
Data self-service and “wrangle”
• Graphical SQL builder
• 100+ transform functions
• Machine learning
• Publish and schedule
• Powered by Apache Spark
Data Discovery
• Google-like searching
• Extensible metadata
• Data profile
• Data sampling
Operations
• Dashboard
• Health monitoring
• Data confidence
• SLA enforcement
• Alerts
• Performance reports
Elasticsearch – Full-Text Indexing
• Powerful search capabilities for users against data (think Google-like searching)
• NiFi processor extracts source data from a Hadoop table for indexing in Elasticsearch
• Incremental updates during ingest

Example: select id, user, tweet from twitter_feed in the data lake, extract JSON, and index it into Elasticsearch.
Demo