![Page 1: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/1.jpg)
Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read
Jason Pohl, Data Solutions Engineer
Denny Lee, Technology Evangelist
![Page 2: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/2.jpg)
About the speaker: Jason Pohl
Jason Pohl is a solutions engineer with Databricks, focused on helping customers become successful with their data initiatives. Jason has spent his career building data-driven products and solutions.
![Page 3: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/3.jpg)
About the moderator: Denny Lee
Denny Lee is a Technology Evangelist with Databricks; he is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).
![Page 4: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/4.jpg)
We are Databricks, the company behind Apache Spark
Founded by the creators of Apache Spark in 2013
Share of Spark code contributed by Databricks in 2014: 75%
Data Value
Created Databricks on top of Spark to make big data simple.
![Page 5: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/5.jpg)
…
Apache Spark Engine
Spark Core
Spark Streaming
Spark SQL MLlib GraphX
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R APIs
Standard libraries
![Page 6: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/6.jpg)
![Page 7: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/7.jpg)
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO
Source: Slide 5 of Spark Community Update
![Page 8: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/8.jpg)
![Page 9: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/9.jpg)
![Page 10: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/10.jpg)
Traditional Data Warehousing Pain Points: Inelasticity of compute and storage resources
• Burst workloads require capacity planning for maximum load
• Fixed-size DW = compute and storage must scale linearly together (these are orthogonal problems)
• Expensive conundrum:
  • If your DW is successful, you cannot easily expand
  • If there is overcapacity, resources sit idle
![Page 11: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/11.jpg)
Traditional Data Warehousing Pain Points: Rigid architecture that’s difficult to change
• Traditional DWs are schema-on-write, requiring schemas, partitions, and indexes to be pre-built
• Rigidity = maintaining costly ETL pipelines
• Finite resources are spent continually augmenting pipelines to absorb new data
![Page 12: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/12.jpg)
Traditional Data Warehousing Pain Points: Limited advanced analytics capabilities
• Want more than what business intelligence and data warehousing provide
• More than just counts, aggregates, and trends
• Desire forecasting using ML, segmentation, graph processing, etc.
![Page 13: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/13.jpg)
Just-in-Time Data Warehousing: Scale resources on demand
• Scale resources based on query load
• Separate compute and storage to scale either independently
• Easily set up multiple clusters against the same data sources
![Page 14: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/14.jpg)
Just-in-Time Data Warehousing: Direct access to data sources
• Scale resources based on query load
• Separate compute and storage to scale either independently
• Easily set up multiple clusters against the same data sources
![Page 15: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/15.jpg)
![Page 16: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/16.jpg)
Change Data Capture: What is it?
• A system that automatically captures changes in a source system (e.g. transactional database) and applies those changes to a target system (e.g. data warehouse).
• Important for data warehouses because it allows them to record (and ultimately report on) any changes, e.g.:
  • Customer A buys a pair of skis for $250 on 1/2/2015
  • On 1/5/2015, realize that the purchase was $350, not $250
![Page 17: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/17.jpg)
Change Data Capture: Source to Target
Source
ID Date Product Price
101 1/1/2016 Skates $80.00
102 1/2/2016 Skis $250.00
Target
ID Date Product Price
101 1/1/2016 Skates $80.00
102 1/2/2016 Skis $250.00
![Page 18: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/18.jpg)
Change Data Capture: Add new row
Source (before)
ID Date Product Price
101 1/1/2016 Skates $80.00
102 1/2/2016 Skis $250.00
Target (before)
ID Date Product Price
101 1/1/2016 Skates $80.00
102 1/2/2016 Skis $250.00
Source (after)
ID Date Product Price
101 1/1/2016 Skates $80.00
102 1/2/2016 Skis $250.00
103 1/3/2016 Disc $15.00
Target (after)
ID Date Product Price
101 1/1/2016 Skates $80.00
102 1/2/2016 Skis $250.00
103 1/3/2016 Disc $15.00
![Page 19: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/19.jpg)
Change Data Capture: Update an existing row
Source
ID Date Product Price
101 1/1/2016 Skates $80.00
102 1/2/2016 Skis $250.00
103 1/3/2016 Disc $15.00
Target
ID Date Product Price
101 1/1/2016 Skates $80.00
102 1/2/2016 Skis $250.00
103 1/3/2016 Disc $15.00
ID Date Product Price
101 1/1/2016 Skates $80.00
102 1/2/2016 Skis $350.00
103 1/3/2016 Disc $15.00
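The add-row and update-row cases above can be sketched as a single keyed merge. This is a minimal plain-Python illustration, not the webinar's notebook code: the function name `apply_changes` and the dictionary row layout are assumptions made for the example, and it assumes `ID` uniquely identifies a row.

```python
def apply_changes(target, changes):
    """Return a new target with change rows merged in by ID (hypothetical helper)."""
    merged = {row["ID"]: row for row in target}
    for row in changes:
        # A new ID is inserted; an existing ID is overwritten with the newer values.
        merged[row["ID"]] = row
    return [merged[key] for key in sorted(merged)]


target = [
    {"ID": 101, "Date": "1/1/2016", "Product": "Skates", "Price": 80.00},
    {"ID": 102, "Date": "1/2/2016", "Product": "Skis", "Price": 250.00},
]
changes = [
    {"ID": 102, "Date": "1/2/2016", "Product": "Skis", "Price": 350.00},  # update
    {"ID": 103, "Date": "1/3/2016", "Product": "Disc", "Price": 15.00},   # insert
]

result = apply_changes(target, changes)
for row in result:
    print(row["ID"], row["Product"], row["Price"])
```

Overwriting in place keeps the target small but discards the earlier $250.00 value, so the correction itself can no longer be reported on.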
![Page 20: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/20.jpg)
Change Data Capture: Update an existing row
Source Target
ID Date Product Price LastUpdated
101 1/1/2016 Skates $80.00 1/1/2016
102 1/2/2016 Skis $250.00 1/2/2016
103 1/3/2016 Disc $15.00 1/3/2016
ID Date Product Price LastUpdated
101 1/1/2016 Skates $80.00 1/1/2016
102 1/2/2016 Skis $350.00 1/5/2016
103 1/3/2016 Disc $15.00 1/3/2016
ID Date Product Price LastUpdated
101 1/1/2016 Skates $80.00 1/1/2016
102 1/2/2016 Skis $250.00 1/2/2016
103 1/3/2016 Disc $15.00 1/3/2016
ID Date Product Price LastUpdated
101 1/1/2016 Skates $80.00 1/1/2016
102 1/2/2016 Skis $250.00 1/2/2016
103 1/3/2016 Disc $15.00 1/3/2016
ID Date Product Price LastUpdated
101 1/1/2016 Skates $80.00 1/1/2016
102 1/2/2016 Skis $250.00 1/5/2016
103 1/3/2016 Disc $15.00 1/3/2016
102 1/2/2016 Skis $350.00 1/5/2016
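When the target keeps every version of a row and appends the change, the current state can be derived at read time by taking the newest LastUpdated per ID. The sketch below is a plain-Python illustration under that assumption; `current_view` and `parse_date` are hypothetical names, and the history rows are modeled on the slide rather than copied from the demo.

```python
from datetime import datetime


def parse_date(text):
    # Slide dates are M/D/YYYY strings; parse them so they compare chronologically.
    return datetime.strptime(text, "%m/%d/%Y")


def current_view(history):
    """Keep, for each ID, the row with the newest LastUpdated (hypothetical helper)."""
    latest = {}
    for row in history:
        seen = latest.get(row["ID"])
        # ">=" lets a later-appended row win when timestamps tie.
        if seen is None or parse_date(row["LastUpdated"]) >= parse_date(seen["LastUpdated"]):
            latest[row["ID"]] = row
    return [latest[key] for key in sorted(latest)]


history = [
    {"ID": 101, "Product": "Skates", "Price": 80.00, "LastUpdated": "1/1/2016"},
    {"ID": 102, "Product": "Skis", "Price": 250.00, "LastUpdated": "1/2/2016"},
    {"ID": 103, "Product": "Disc", "Price": 15.00, "LastUpdated": "1/3/2016"},
    {"ID": 102, "Product": "Skis", "Price": 350.00, "LastUpdated": "1/5/2016"},  # appended change
]

view = current_view(history)
for row in view:
    print(row["ID"], row["Price"], row["LastUpdated"])
```

Both versions of row 102 stay in `history`, so the original $250.00 entry remains available for auditing.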
![Page 21: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/21.jpg)
Demo: High Watermark with LastUpdatedDate
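A high-watermark pull can be sketched as follows. This is not the demo notebook: it uses Python's built-in `sqlite3` with two in-memory databases standing in for the source and target, and the table name `sales`, its columns, and the function `pull_changes` are assumptions for illustration. Dates are stored as ISO strings so that text comparison matches chronological order.

```python
import sqlite3

DDL = ("CREATE TABLE sales (id INTEGER, sale_date TEXT, product TEXT, "
       "price REAL, last_updated TEXT)")


def pull_changes(source, target):
    """Copy rows newer than the target's high watermark; return how many were copied."""
    # High watermark: the newest last_updated value already present in the target.
    (watermark,) = target.execute(
        "SELECT COALESCE(MAX(last_updated), '') FROM sales").fetchone()
    rows = source.execute(
        "SELECT id, sale_date, product, price, last_updated FROM sales "
        "WHERE last_updated > ? ORDER BY last_updated", (watermark,)).fetchall()
    target.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)
    target.commit()
    return len(rows)


source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute(DDL)

source.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", [
    (101, "2016-01-01", "Skates", 80.0, "2016-01-01"),
    (102, "2016-01-02", "Skis", 250.0, "2016-01-02"),
    (103, "2016-01-03", "Disc", 15.0, "2016-01-03"),
])

first = pull_changes(source, target)    # initial load: everything is new
source.execute(
    "UPDATE sales SET price = 350.0, last_updated = '2016-01-05' WHERE id = 102")
second = pull_changes(source, target)   # only the changed row is newer than the watermark
third = pull_changes(source, target)    # nothing has changed since the last pull
print(first, second, third)
```

The strict `>` comparison skips a row that arrives later with the same timestamp as the watermark, so a real job would likely need a finer-grained timestamp or a tie-breaking column.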
![Page 22: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/22.jpg)
Stage Data from Employee Database
![Page 23: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/23.jpg)
Update Records in Employee Source Database
UPDATE employees SET last_name = 'Spark' WHERE emp_no = 16894;
![Page 24: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/24.jpg)
Job to Automate CDC
Source Target
ID Date Product Tag Price LastUpdated
101 1/1/2016 Skates ice $80.00 1/1/2016
102 1/2/2016 Skis snow $250.00 1/2/2016
103 1/3/2016 Disc field $15.00 1/3/2016
ID Date Product Tag Price LastUpdated
101 1/1/2016 Skates ice $80.00 1/1/2016
102 1/2/2016 Skis snow $250.00 1/2/2016
103 1/3/2016 Disc field $15.00 1/3/2016
Jobs
ID Date Product Tag Price LastUpdated
101 1/1/2016 Skates ice $80.00 1/1/2016
102 1/2/2016 Skis snow $250.00 1/2/2016
103 1/3/2016 Disc field $15.00 1/3/2016
ID Date Product Price LastUpdated
101 1/1/2016 Skates $80.00 1/1/2016
102 1/2/2016 Skis $250.00 1/2/2016
103 1/3/2016 Disc $15.00 1/3/2016
ID Date Product Tag Price LastUpdated
101 1/1/2016 Skates ice $80.00 1/1/2016
102 1/2/2016 Skis snow $250.00 1/2/2016
103 1/3/2016 Disc field $15.00 1/3/2016
ID Date Product Price LastUpdated
101 1/1/2016 Skates $80.00 1/1/2016
102 1/2/2016 Skis $250.00 1/2/2016
103 1/3/2016 Disc $15.00 1/3/2016
![Page 25: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/25.jpg)
Add a column to the Departments table
ALTER TABLE departments ADD COLUMN dept_desc VARCHAR(50);
UPDATE departments SET dept_desc = dept_name;
![Page 26: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/26.jpg)
Job to Automate CDC
Source → Jobs → Target
Columns before the schema change: dept_no, dept_name
Columns after the schema change: dept_no, dept_name, dept_desc
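Schema on read means the column set is discovered when the staged data is read, so a column added at the source shows up without altering the target first. The sketch below illustrates the idea in plain Python over JSON lines. It is not the Spark code used in the demo, and `read_with_schema` and the sample department values are assumptions for the example.

```python
import json


def read_with_schema(lines):
    """Infer the column list from the records themselves (hypothetical helper)."""
    records = [json.loads(line) for line in lines]
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)  # first time this column is seen
    # Records written before a column existed get None for it.
    rows = [{col: record.get(col) for col in columns} for record in records]
    return columns, rows


staged = [
    '{"dept_no": "d001", "dept_name": "Marketing"}',                      # before ALTER TABLE
    '{"dept_no": "d002", "dept_name": "Finance", "dept_desc": "Finance"}',  # after ALTER TABLE
]

columns, rows = read_with_schema(staged)
print(columns)
print(rows[0])
```

Rows staged before the schema change simply carry an empty `dept_desc`, which is why the CDC job can keep running unchanged when the source table gains a column.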
![Page 27: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/27.jpg)
Notebooks
To access the notebooks, please reference the attachments in the Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read webinar.
• Stage Data From Employee Database:
  • Notebook that starts the process
  • Defines the ETL process
• Change Schema in Employee Source Database
• Update Records in Employee Source Database
• Validate Departments
![Page 28: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/28.jpg)
Resources
• Just-in-Time Data Warehousing Solution Brief
• Building a Turbo-fast Data Warehousing Platform with Databricks
• Spark DataFrames: Simple and Fast Analysis of Structured Data
• Transitioning from Traditional DW to Spark in OR Predictive Modeling
• Advertising Technology Sample Notebook (Part 1)
![Page 29: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/29.jpg)
More resources
• Databricks Guide
• Apache Spark User Guide
• Databricks Community Forum
• Training courses: public classes, MOOCs, & private training
• Databricks Community Edition: Free hosted Apache Spark. Join the waitlist for the beta release!
![Page 30: Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read](https://reader031.vdocuments.site/reader031/viewer/2022022414/587071931a28ab48378b7abb/html5/thumbnails/30.jpg)
Thanks!