Integrating Apache Spark and NiFi for Data Lakes
TRANSCRIPT
MAKING BIG DATA COME ALIVE
Integrating Apache Spark and NiFi for Data Lakes
Ron Bodkin, Founder & President · Scott Reisdorf, R&D Architect
Agenda
• Requirements
• Design
• Demo
Goals for a Data Lake
• A central repository with trusted, consistent data
• Reduce costs by offloading analytical systems and archiving cold data
• Derive value quickly with easier discovery and prototyping
• A laboratory for experimenting with new technologies and data
What’s Needed for a Hadoop Data Lake?
• Automation of pipelines with metadata and performance tracking
• Governance with clear distinction of roles and responsibilities
• SLA tracking with alerts on failures or violations
• Interactive data discovery and experimentation
Example Ingestion Project
• 4,000+ unique flat files and RDBMS tables, plus a few streaming data feeds
• Mix of incremental and snapshot data
• Ingest into Hadoop (minimally HDFS and Hive tables)
• Cleansing/encryption and data validation
• Metadata capture
Focus shifts over time from data ingestion to transformation, and then to analytics.
Design
Apache Spark Functions
• Cleanse
• Validate
• Profile
• Wrangle
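To make the "cleanse" step concrete, here is a minimal sketch in plain Python standing in for the record-level logic a Spark job would apply. The field names and rules are invented for illustration; the slides do not show the framework's actual cleansing code.

```python
def cleanse(record):
    """Standardize one record: trim strings, turn empty strings into
    NULLs, and normalize key names to lowercase."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip() or None   # empty strings become NULLs
        cleaned[key.strip().lower()] = value
    return cleaned

raw = {"Name ": "  Alice  ", "City": "", "Age": 34}
print(cleanse(raw))  # {'name': 'Alice', 'city': None, 'age': 34}
```

In Spark this kind of function would run as a map over a DataFrame or RDD partition, so the same per-record logic scales across the cluster.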
© 2016 Think Big, a Teradata Company
Pipeline Design with Apache NiFi
• Visual drag-and-drop
• Dozens of data connectors
• 150+ pre-built transforms
• Data lineage
• Batch and streaming
• Extensible
Role Separation

Apache NiFi:
• IT designers design models in NiFi
• Register with framework
• Integrated development process

Think Big framework:
• Users configure new feeds
• Based on common model
• Generated and executed in NiFi

[Diagram: models are registered with the framework; feeds are deployed to NiFi]
Design Approach
• User features around org. roles
• Visual design
• Streaming and batch
• Fully governed
• Integrated best practices
• Secure, modern architecture
• Will be open source (Apache license)
Ingest and Prepare
• UI-guided feed creation
• Data protection
• Data cleanse
• Data validation
• Data profiling
• Powered by Apache Spark
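The "data validation" bullet above can be sketched as metadata-driven rules routed over each record, with failures captured for review. The rule names and fields below are hypothetical; the real framework runs equivalent checks as a Spark job configured from feed metadata.

```python
import re

# A small registry of named validation rules (illustrative only).
RULES = {
    "not_null": lambda v: v is not None,
    "email": lambda v: v is not None
        and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}

def validate(record, field_rules):
    """Apply the named rules to each field.
    field_rules maps field name -> list of rule names.
    Returns (is_valid, list of failure reasons)."""
    reasons = []
    for field, rule_names in field_rules.items():
        for name in rule_names:
            if not RULES[name](record.get(field)):
                reasons.append(f"{field}: failed {name}")
    return (not reasons, reasons)

record = {"id": 1, "email": "bad-address"}
ok, why = validate(record, {"id": ["not_null"], "email": ["not_null", "email"]})
print(ok, why)  # False ['email: failed email']
```

Records that fail would typically land in an "invalid" table with their reasons, while valid records continue down the pipeline.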
Data Ingest Model

Metadata determines the behavior of individual components. Adds many Hadoop-specific, higher-level NiFi processors.

Sources:
• Extract Table (JDBC)
• Get File(s) (Filesystem)
• Message (JMS/Kafka)
• Other (HTTP/REST, etc.)

Pipeline:
• Unpack and/or merge small files
• Put file to HDFS
• Cleanse/Standardize (Spark)
• Validate (Spark)
• Data Profile (Spark)
• Merge / Dedupe (Hive)
• Index Text (Elasticsearch)
• Compress & Archive Originals (HDFS, S3)

Metadata and data policies drive each step.
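The idea that "metadata determines behavior of individual components" can be sketched as a feed descriptor selecting which steps run and with what options. The step names and options below are invented for illustration, not the framework's actual processor names.

```python
def build_pipeline(feed_metadata, registry):
    """Turn a feed's metadata into an ordered list of callables.
    Each step entry names a registered component plus its options."""
    steps = []
    for step in feed_metadata["steps"]:
        fn = registry[step["name"]]
        # Bind the function and its options now, so each step is self-contained.
        steps.append(lambda rec, fn=fn, opts=step.get("options", {}): fn(rec, **opts))
    return steps

# Hypothetical component registry.
registry = {
    "uppercase_field": lambda rec, field: {**rec, field: rec[field].upper()},
    "drop_field": lambda rec, field: {k: v for k, v in rec.items() if k != field},
}

# Hypothetical feed metadata: which components run, in order, with options.
meta = {"steps": [
    {"name": "uppercase_field", "options": {"field": "user"}},
    {"name": "drop_field", "options": {"field": "ssn"}},
]}

record = {"user": "alice", "ssn": "123-45-6789"}
for step in build_pipeline(meta, registry):
    record = step(record)
print(record)  # {'user': 'ALICE'}
```

Changing the feed's behavior then means editing metadata, not redeploying code, which is what lets non-IT users configure new feeds from a common model.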
Data self-service and “wrangle”
• Graphical SQL builder
• 100+ transform functions
• Machine learning
• Publish and schedule
• Powered by Apache Spark
Data Discovery
• Google-like searching
• Extensible metadata
• Data profile
• Data sampling
Operations
• Dashboard
• Health monitoring
• Data confidence
• SLA enforcement
• Alerts
• Performance reports
Elasticsearch – Full-Text Indexing
• Powerful search capabilities for users against data (think Google-like searching)
• NiFi processor extracts source data from a Hadoop table for indexing in Elasticsearch
• Incremental updates during ingest

Example: select id, user, tweet from twitter_feed in the data lake, extract JSON, and index it into Elasticsearch.
Demo