Dealing with Drift: Building an Enterprise Data Lake
TRANSCRIPT
Speakers
Nathan Swetye
Sr. Manager of Platform Engineering
Cox Automotive
Michael Gay
Lead Technical Architect
Cox Automotive
Pat Patterson
Community Champion
StreamSets
25 (and growing) companies dealing with the automotive space
Spans the full vehicle ownership lifecycle
Data perceived as the integration point for all companies
Cox Automotive
Enterprise Data DNA
Commercial Customers Across Verticals
150,000 downloads
40 of the Fortune 100
Doubling each quarter
Strong Partner Ecosystem
Open Source Success
Mission: empower enterprises to harness their data in motion.
StreamSets Overview
StreamSets Data Collector™
Instrumented, open source UI and engine to build any-to-any dataflows.
StreamSets Dataflow Performance Manager (DPM™)
Cloud service to map, measure and master dataflow operations.
DATAFLOW LIFECYCLE
StreamSets Enterprise
DEVELOP: Developers, Scientists, Architects
OPERATE: Operators, Stewards, Architects
EVOLVE (Proactive)
REMEDIATE (Reactive)
EFFICIENCY: Intent Driven Flows, Batch & Streaming Ingest, In-stream Sanitization
CONTROL: Fine-grained Stage & Flow Metrics, Drift Handling, Lineage and Impact Analysis Capture
AGILITY: Flexible Deployment, Exception Handling, Seamless Evolution
StreamSets Data Collector is a complete IDE for building and executing any-to-any ingest pipelines.
StreamSets Data Collector
StreamSets DPM provides a single pane of glass to map, measure and master your dataflow operations.
MASTER: Availability & Accuracy, Proactive Remediation
MEASURE: Any Path, Any Time
MAP: Dataflow Lineage, Live Data Architecture
StreamSets Dataflow Performance Manager (DPM)
Data Drift: Change is the New Normal
The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data
Structure Drift
Semantic Drift
Infrastructure Drift
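Structure drift, the first of the three kinds listed above, can be illustrated with a minimal Python sketch. It is not StreamSets code: the function name, the expected schema and the sample record are all hypothetical, and the check assumes records arrive as flat dictionaries.

```python
from typing import Any, Dict, List


def detect_structure_drift(expected: Dict[str, type], record: Dict[str, Any]) -> Dict[str, List[str]]:
    """Compare one incoming record against the expected field names and types."""
    # Fields the source started sending that the schema does not know about.
    added = sorted(set(record) - set(expected))
    # Fields the schema expects that the source stopped sending.
    missing = sorted(set(expected) - set(record))
    # Fields still present but carrying a different Python type (None is ignored).
    retyped = sorted(
        name
        for name in set(expected) & set(record)
        if record[name] is not None and not isinstance(record[name], expected[name])
    )
    return {"added": added, "missing": missing, "retyped": retyped}


# Hypothetical schema and record, for illustration only.
expected_schema = {"vin": str, "price": float, "listed_at": str}
incoming = {"vin": "1HGCM82633A004352", "price": "18500", "dealer_id": 42}
drift = detect_structure_drift(expected_schema, incoming)
print(drift)
```

Here the source added `dealer_id`, dropped `listed_at`, and began sending `price` as a string, which is the kind of unannounced change the definition describes.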
SQL on Hadoop (Hive) Y/Y Click Through Rate
80% of analyst time is spent preparing and validating data, while the remaining 20% is actual data analysis
Example: Data Loss and Corrosion
Data Drift and Scale
At the micro level, data drift leads to breakage and errors
At the macro level, data drift brings your system to a grinding halt!
The Problem of Data Exchange at Scale
Everyone wants each other's data, but it is often difficult to acquire
A tangled mess of data flow
A source of anguish and sorrow
The Problem of Data Exchange at Scale
Enter the Data Lake
The central store for valuable data
Mission: Data Lake, not Data Swamp
Data Lake
Great. A Data Lake. But how do you Populate it?
Problem: $$ Cost – a Question of Scale
• 25 Companies
• 9+ Source Types, mostly DBs
• 1-Many Schemas per Database
• Many Tables per Schema
Example:
• AutoTrader -> Oracle -> ATM1: ~1600 Tables
We’ve ingested about that much
Cox Automotive’s StreamSets Architecture
Sources: Databases, Amazon S3, Files, FTP
Acquisition: StreamSets (multiple instances)
Targets: Hadoop Filesystem, Big Data SQL, Amazon S3
Ingestion: StreamSets (multiple instances)
Data Pipelines
Separates Acquisition from Ingestion
Dynamic Error Handling
Encrypted Data in Transit
Data standards applied automatically:
• Compression
• File Formats
• Partitioning Schemes
• Row-level Watermarks
• Time-stamping
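As a rough illustration of applying such standards in one place, the Python sketch below stamps each record, picks a date partition, and writes a compressed file. Everything in it is an assumption rather than the Cox Automotive implementation: the `_ingest_ts` and `_watermark` field names, the use of a SHA-256 hash as the row-level watermark, the `dt=` partition layout, and gzip-compressed JSON lines as the file format.

```python
import gzip
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone


def stamp(record: dict, source: str, now: datetime) -> dict:
    """Return a copy of the record with an ingest timestamp and a row-level watermark."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    stamped = dict(record)
    stamped["_ingest_ts"] = now.isoformat()
    # Assumed watermark: a hash of the source name plus the original row content.
    stamped["_watermark"] = hashlib.sha256(source.encode("utf-8") + b"|" + payload).hexdigest()
    return stamped


def partition_path(root: str, source: str, now: datetime) -> str:
    """Build a date-based partition directory, e.g. <root>/<source>/dt=2016-09-01."""
    return os.path.join(root, source, "dt=" + now.strftime("%Y-%m-%d"))


def write_batch(root: str, source: str, records: list, now: datetime) -> str:
    """Write one batch as gzip-compressed JSON lines into its partition directory."""
    directory = partition_path(root, source, now)
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "part-0000.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(stamp(record, source, now)) + "\n")
    return path


# Hypothetical batch written to a temporary directory, then read back.
now = datetime(2016, 9, 1, 12, 0, tzinfo=timezone.utc)
root = tempfile.mkdtemp()
path = write_batch(root, "atm1", [{"vin": "A"}, {"vin": "B"}], now)
with gzip.open(path, "rt", encoding="utf-8") as src:
    rows = [json.loads(line) for line in src]
```

Keeping the stamping and layout logic in one function is what lets every source pick up the same standards without per-pipeline work.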
Ingestion farm scales with demand
Auto-creates schemas en route
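One way to picture schema auto-creation is to infer column types from a sample record and emit Hive-style DDL. The Python sketch below is only an assumption about how this could work, not how StreamSets does it: the type mapping, the function names and the `listings` table are hypothetical.

```python
from typing import Any, Dict, Optional

# Assumed mapping from Python value types to Hive column types.
HIVE_TYPES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "STRING"}


def infer_columns(record: Dict[str, Any]) -> Dict[str, str]:
    """Map each field to a Hive type, falling back to STRING for anything unknown."""
    return {name: HIVE_TYPES.get(type(value), "STRING") for name, value in record.items()}


def create_table_ddl(table: str, record: Dict[str, Any]) -> str:
    """Emit a CREATE TABLE statement for a table seen for the first time."""
    columns = ", ".join(f"`{name}` {hive_type}" for name, hive_type in infer_columns(record).items())
    return f"CREATE TABLE IF NOT EXISTS {table} ({columns})"


def add_columns_ddl(table: str, known: Dict[str, str], record: Dict[str, Any]) -> Optional[str]:
    """Emit ALTER TABLE for fields not in the known schema, or None if nothing is new."""
    new = {name: t for name, t in infer_columns(record).items() if name not in known}
    if not new:
        return None
    columns = ", ".join(f"`{name}` {hive_type}" for name, hive_type in new.items())
    return f"ALTER TABLE {table} ADD COLUMNS ({columns})"


# Hypothetical records, for illustration only.
first = {"vin": "A", "price": 18500.0, "year": 2014}
create_stmt = create_table_ddl("listings", first)
alter_stmt = add_columns_ddl("listings", infer_columns(first), {"vin": "B", "dealer_id": 42})
print(create_stmt)
print(alter_stmt)
```

The second statement shows the same mechanism absorbing a new field from the source by widening the table instead of failing the pipeline.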
Data comes from a variety of sources
Pipelines are established for each source
Ingestion Back Pressure
Scaling, secure, load-balanced
Actual ingestion activities
On-premises and Cloud Big Data Systems
StreamSets RPC
Load Balancer
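The back pressure between acquisition and ingestion noted on this slide can be sketched with a bounded queue: when ingestion falls behind, the acquisition side blocks instead of piling up records. This Python example is a generic illustration using the standard library, not the StreamSets RPC mechanism; `run_pipeline`, its parameters and the delay are hypothetical.

```python
import queue
import threading
import time


def run_pipeline(records, capacity=2, ingest_delay=0.01):
    """Push records through a bounded buffer; return what was ingested and the peak depth."""
    buffer = queue.Queue(maxsize=capacity)
    ingested = []
    max_depth = 0

    def ingest():
        # Ingestion side: deliberately slower than acquisition.
        while True:
            item = buffer.get()
            if item is None:  # Sentinel: acquisition is finished (records must not be None).
                break
            time.sleep(ingest_delay)
            ingested.append(item)

    worker = threading.Thread(target=ingest)
    worker.start()
    for record in records:
        # Acquisition side: put() blocks while the buffer is full, which is the back pressure.
        buffer.put(record)
        max_depth = max(max_depth, buffer.qsize())
    buffer.put(None)
    worker.join()
    return ingested, max_depth


ingested, max_depth = run_pipeline(list(range(10)))
print(len(ingested), max_depth)
```

The buffer never holds more than `capacity` records, so a slow target throttles the source rather than exhausting memory in between.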
Acquisition Deployment Model
Ingest Form
StreamSets
Pipeline Deployment
Virtual Host Deployment
Ingestion Team Member
StreamSets
Acquisition Pipeline
Enterprise Data Lake
start workflow
submit form
start workflow
build virtual host
deploy data pipeline
Enterprise Data Sources
DevOps Team Member
Throughput!
[Chart: Monthly Ingestion Requests, Jan to Sept; y-axis 0 to 400; annotations: StreamSets, 7x]
Where do we go from here?
• Amazon Web Services
• StreamSets Dataflow Performance Manager
• Acquire/Ingest decision point: Centralized, Federated, or Democratized?
• Quality
• Streamline access to sources
• Change data capture
• Integration with enterprise data catalogs
• Ingestion post-processing