Mark Rittman, Independent Analyst + Product Manager
BIGQUERY, LOOKER AND BIG DATA ANALYTICS
AT PETABYTE-SCALE
BUDAPEST DATA FORUM
May 2017
•Mark Rittman, Independent Analyst for Big Data Analytics
•Currently working with Qubit as Analytic Product Manager
•20 years in the BI, DW, ETL and now Big Data industry
•Implementor, CTO, company founder and author
•On Twitter at @markrittman
•LinkedIn at https://uk.linkedin.com/in/markrittman
•[email protected] and http://www.mjr-analytics.com
About the Presenter
•Responsible for building + managing an analytics
product on a personalization platform for marketers
•Operates in the same market as Adobe Marketing
Cloud, Google Analytics 360, Optimizely, Monetate
•Real-time ingest of 10TB+/day of web activity data,
used for personalization
•Built on Looker BI tool and Google Cloud Platform
Current Role - Analytics PM for Marketing Tech Startup
Big Data Analytics on Google Cloud Platform
Cloud Big Data Analytics 2.0
Also use as my personal dev platform
1TB of BigQuery query usage/month free
•Started back in 1996 on a bank Oracle DW project
•Our tools were Oracle 7.3.4, SQL*Plus, shell scripts
•Data warehouses provided a unified view of the business
•Single place to store key data and metrics
•Joined-up view of the business
•Aggregates and conformed dimensions
•ETL routines to load, cleanse and conform data
•BI tools for simple, guided access to information
Before This … 20 Years in Traditional DW Consulting
GOOGLE BIGQUERY
AND DISTRIBUTED, CLOUD COMPUTE + STORAGE
•New generation of big data platform services from Google, Amazon, Oracle
•Combines three key innovations from earlier technologies:
•Organising of data into tables and columns (from RDBMS DWs)
•Massively-scalable and distributed storage and query (from Big Data)
•Elastically-scalable Platform-as-a-Service (from Cloud)
Elastically-Scalable Data Warehouse-as-a-Service
•And things come full-circle … analytics
typically requires tabular data
•Google BigQuery is based on the DremelX
massively-parallel query engine
•But stores data in columnar format and provides a SQL
interface
•Solves the problem of providing DW-like
functionality at scale, as-a-service
BigQuery : Big Data Meets Data Warehousing
Google Cloud Platform
Cloud Dataflow - A fully managed, auto-scalable service for
pipeline data processing in batch or streaming mode
BigQuery - A fully managed, petabyte scale, low-cost enterprise
data warehouse for analytics
BigTable - A fully managed, petabyte scale, low-latency, high-
throughput wide column store for analytics
Cloud Pub/Sub - A fully managed, global and scalable publish
and subscribe service with guaranteed at-least-once message delivery
Google Cloud Platform Big Data Reference Architecture
All delivered as auto-scaling fully-managed services
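Because Cloud Pub/Sub guarantees at-least-once delivery, a downstream consumer may see the same message more than once. The sketch below shows one common way to cope with that: deduplicating on a message ID. It is a minimal illustration only; the message dictionaries, the `process_once` helper and the in-memory `seen_ids` set are assumptions made for this example and are not part of any Google client library.

```python
def process_once(messages, handler):
    """Apply handler to each distinct message ID, skipping redelivered duplicates."""
    seen_ids = set()  # assumption: unbounded in-memory state, fine for a toy example
    handled = []
    for message in messages:
        if message["id"] in seen_ids:
            continue  # duplicate redelivery; already processed
        seen_ids.add(message["id"])
        handled.append(handler(message["data"]))
    return handled


# Hypothetical stream in which message "m2" is delivered twice.
stream = [
    {"id": "m1", "data": "pageview"},
    {"id": "m2", "data": "click"},
    {"id": "m2", "data": "click"},
    {"id": "m3", "data": "purchase"},
]

events = process_once(stream, str.upper)
print(events)
```

A production pipeline would need bounded or persistent deduplication state rather than a plain set.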
Google BigQuery - Key Platform Technologies
Dremel - Massively-parallel real-time
query engine with SQL + REST API
Colossus - Distributed storage layer
for Dremel, successor to GFS (the model for HDFS)
Capacitor - Columnar nested +
compressed file format for Dremel,
inspiration for Parquet etc
Borg - Large-scale cluster management,
runs Dremel jobs on 10,000+ servers
Jupiter - Google’s high-capacity
low-latency internal network
•BigQuery is a distributed, column-store query engine
•Denormalize your tables where possible as joins are relatively expensive
•Optimal query is one that filters, aggregates and selects subsets of columns
•Use table partitioning so that full scans of whole columns are minimized
Data Modeling with BigQuery
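The advice to select subsets of columns and to partition tables can be illustrated with a toy cost model. The sketch below is not BigQuery code: it assumes a hypothetical table described only by made-up per-partition, per-column byte sizes, and the `bytes_scanned` helper is invented for this example. It shows why reading fewer columns, and fewer partitions, reduces the data a column store has to scan.

```python
# Toy model of a column store split into daily partitions.
# Sizes are made-up byte counts, not measurements from BigQuery.
TABLE = {
    "2017-05-01": {"user_id": 400, "url": 2000, "revenue": 800},
    "2017-05-02": {"user_id": 420, "url": 2100, "revenue": 840},
    "2017-05-03": {"user_id": 380, "url": 1900, "revenue": 760},
}


def bytes_scanned(table, columns, partitions=None):
    """Sum the sizes of only the referenced columns in the selected partitions."""
    selected = table if partitions is None else {p: table[p] for p in partitions}
    return sum(cols[c] for cols in selected.values() for c in columns)


all_columns = ["user_id", "url", "revenue"]
full_scan = bytes_scanned(TABLE, all_columns)                    # every column, every day
narrow_scan = bytes_scanned(TABLE, ["revenue"])                  # one column, every day
pruned_scan = bytes_scanned(TABLE, ["revenue"], ["2017-05-02"])  # one column, one day

print(full_scan, narrow_scan, pruned_scan)
```

In this toy model the narrow scan reads a quarter of the full scan, and adding the partition filter cuts it further.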
•BigQuery is a distributed, column-store query engine
•Denormalize your tables where possible as joins are relatively expensive
•Use nested repeated fields for dimension lookups to align with Capacitor storage
•Optimal query is one that filters, aggregates and selects subsets of columns
•Use table partitioning so that full scans of whole columns are minimized
BigQuery Storage Formats + Data Modeling
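A minimal sketch of what "nested repeated fields" means in practice, assuming hypothetical flat `trips` and `incidents` rows keyed by zipcode. The child rows are folded into one parent record per zipcode, so a later read needs no join. The sample values and the `nest_by_zipcode` helper are invented for illustration.

```python
from collections import defaultdict

# Hypothetical normalized inputs: one row per trip summary and per incident.
trips = [
    {"zipcode": "94103", "trip_count": 12},
    {"zipcode": "94103", "trip_count": 7},
    {"zipcode": "94107", "trip_count": 3},
]
incidents = [
    {"zipcode": "94103", "call_type": "Traffic Collision"},
    {"zipcode": "94107", "call_type": "Medical Incident"},
]


def nest_by_zipcode(trips, incidents):
    """Fold child rows into repeated fields on a single parent record per zipcode."""
    nested = defaultdict(lambda: {"trips": [], "incidents": []})
    for row in trips:
        nested[row["zipcode"]]["trips"].append({"trip_count": row["trip_count"]})
    for row in incidents:
        nested[row["zipcode"]]["incidents"].append({"call_type": row["call_type"]})
    return [{"zipcode": z, **fields} for z, fields in sorted(nested.items())]


records = nest_by_zipcode(trips, incidents)
for record in records:
    print(record)
```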
SELECT
  zipcode,
  COUNT(zipcode_trips.trip_count) AS total_trips,
  zipcode_incidents.call_type,
  COUNT(zipcode_incidents.call_type) AS total_calls
FROM
  `aerial-vehicle-148023.personal_metrics.sf_nested`
  -- Flatten the nested repeated fields so that each element becomes a row
  LEFT JOIN UNNEST(trips) AS zipcode_trips
  LEFT JOIN UNNEST(incidents) AS zipcode_incidents
WHERE
  -- Rows with no matching incident are dropped by this filter
  zipcode_incidents.call_type = 'Traffic Collision'
GROUP BY 1, 3
Query Results in Seconds with No Indexes or
Summary Tables
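To make the semantics of the query above easier to follow, here is a rough pure-Python emulation of its LEFT JOIN UNNEST, WHERE and GROUP BY steps. The sample rows are made up and only shaped like the nested table; `left_unnest` and `run_query` are hypothetical helpers, and nothing here calls BigQuery.

```python
from collections import defaultdict

# Made-up sample rows shaped like the nested table in the query.
rows = [
    {"zipcode": "94103",
     "trips": [{"trip_count": 12}, {"trip_count": 7}],
     "incidents": [{"call_type": "Traffic Collision"},
                   {"call_type": "Medical Incident"}]},
    {"zipcode": "94107",
     "trips": [{"trip_count": 3}],
     "incidents": []},
]


def left_unnest(items):
    """LEFT JOIN UNNEST: yield each element, or a single None for an empty array."""
    return items if items else [None]


def run_query(rows, call_type="Traffic Collision"):
    totals = defaultdict(lambda: {"total_trips": 0, "total_calls": 0})
    for row in rows:
        for trip in left_unnest(row["trips"]):
            for incident in left_unnest(row["incidents"]):
                # WHERE clause: rows with no incident (None) never match.
                if incident is None or incident["call_type"] != call_type:
                    continue
                key = (row["zipcode"], incident["call_type"])
                # COUNT(expr) counts non-NULL values only.
                if trip is not None and trip["trip_count"] is not None:
                    totals[key]["total_trips"] += 1
                totals[key]["total_calls"] += 1
    return dict(totals)


result = run_query(rows)
print(result)
```

Note that unnesting both arrays in the same query pairs every trip with every incident, so in this emulation a single matching incident is counted once per trip.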
BI + ANALYTICS TOOLING FOR GOOGLE BIGQUERY
AND DISTRIBUTED, CLOUD COMPUTE + STORAGE
BI and Analytics Tools for Google Cloud Platform
Looker for BI and Analytics - BI for Data Engineers, Semantic Models and LookML
LookML : BI Modeling for Data Engineers
• Query building using business semantic model
• Self-Service data analytics with agile dev model
• Dashboards, reports (“looks”), action links,
scheduling
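The idea of building queries from a business semantic model can be sketched in a few lines. The example below is not LookML and does not use any Looker API: the `MODEL` dictionary, its table and column names, and the `build_query` helper are all hypothetical. It only illustrates how a modeling layer might map business names (dimensions and measures) onto generated SQL.

```python
# Hypothetical semantic model: business names mapped to SQL expressions.
# This mimics the idea of a modeling layer; it is not LookML syntax.
MODEL = {
    "table": "web_events",
    "dimensions": {"country": "geo_country", "device": "device_type"},
    "measures": {"sessions": "COUNT(DISTINCT session_id)",
                 "revenue": "SUM(order_value)"},
}


def build_query(model, dimensions, measures):
    """Generate a GROUP BY query from the business names a user picked."""
    select = [f"{model['dimensions'][d]} AS {d}" for d in dimensions]
    select += [f"{model['measures'][m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select)} FROM {model['table']}"
    if dimensions:
        # Group by the ordinal positions of the selected dimensions.
        positions = ", ".join(str(i + 1) for i in range(len(dimensions)))
        sql += f" GROUP BY {positions}"
    return sql


sql = build_query(MODEL, ["country"], ["sessions", "revenue"])
print(sql)
```

The point of the design is that end users pick names such as `country` and `revenue`, while the SQL expressions behind them are defined once in the model.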