DATA WAREHOUSING AND ANALYTIC INFRASTRUCTURE AT FACEBOOK
By Adam Kuns
THE PURPOSE OF ANALYTICS AT FACEBOOK
Scalable analytics plays an important role at Facebook by providing the ability to analyze large data sets to numerous teams across the company.
Ad hoc analysis of data and the creation of business intelligence dashboards by analysts at Facebook are made possible by the analytics infrastructure.
It also allows the website to provide features like Insights for advertisers and also make friend recommendations to you.
THE AVERAGE WORKLOAD, ON AN AVERAGE WORK DAY, FOR THE NOT-SO-AVERAGE ANALYTICS INFRASTRUCTURE
Lots of people (and automated processes) across the company rely on running processes on Facebook's large data sets.
On average, 10,000 jobs are submitted by users at Facebook, each query with its own level of parallelism, resource needs, and deadlines.
All this querying is done on a dataset that is growing rapidly each day.
As of 2010, Facebook was loading 10-15 TB of compressed data a day into its data warehouse (60-90 TB uncompressed).
THE BUILDING BLOCKS OF THE FB DATA WAREHOUSE
Hadoop - Distributed file system and map-reduce platform.
Hive - Data warehouse infrastructure that provides traditional/familiar tools such as SQL, metadata, and partitioning on top of Hadoop.
Scribe - Aggregation service that collects logs from thousands of Facebook's web servers.
APACHE HIVE
Hive was developed at Facebook for the purpose of abstracting the lower-level programming of Map-Reduce into familiar SQL-like queries. While Hadoop provides scalability, with Map-Reduce providing the ability to express many kinds of jobs, the ability to create quick queries in a productive manner using Map-Reduce is not possible for many users at Facebook.
HIVEQL
So to make up for this bottleneck in the analytics pipeline, Hive provides a framework for users to express queries in something that many of the users are familiar with: an SQL-like query language, HiveQL. The concepts of tables, columns, and partitions are also part of Hive. This allows users to create queries in minutes, where the same task would take hours or days in Map-Reduce.
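To make the productivity gap concrete, here is a minimal sketch in Python (not Facebook's code; the table and column names are invented): the same per-user event count expressed as hand-written map, shuffle, and reduce phases next to the one-line HiveQL it replaces.

```python
from collections import defaultdict

# The one-line HiveQL a user would write (illustrative names):
HIVEQL = "SELECT userid, COUNT(*) FROM events GROUP BY userid"

def map_phase(records):
    # map: emit a (userid, 1) pair for every event record
    for rec in records:
        yield rec["userid"], 1

def shuffle(pairs):
    # shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each user
    return {user: sum(counts) for user, counts in groups.items()}

events = [{"userid": "a"}, {"userid": "b"}, {"userid": "a"}]
result = reduce_phase(shuffle(map_phase(events)))
print(result)  # {'a': 2, 'b': 1}
```

Even in this toy form, the hand-written version needs three phases and explicit grouping logic for what HiveQL states declaratively in one line.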
HIVE OVERVIEW
In order to provide a relational database abstraction on top of Hadoop, tables and partitions in Hive are stored as HDFS directories in Hadoop.
This structural information is then stored in the Hive Metastore.
With this information in the Metastore, the Hive Driver can compile the HiveQL into Map-Reduce jobs, applying optimizations (such as file pruning based on the query predicates, pushing down predicates, creating efficient execution plans, and re-arranging join orders).
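A minimal sketch of the file-pruning idea, assuming Hive's usual `key=value` partition-directory layout (the paths are made up): given a predicate on the partition column, only the matching directories are handed to the Map-Reduce job.

```python
# Hypothetical directory listing for a Hive table partitioned by date ("ds").
partitions = [
    "/warehouse/events/ds=2010-02-27",
    "/warehouse/events/ds=2010-02-28",
    "/warehouse/events/ds=2010-03-01",
]

def prune(partitions, key, value):
    """Keep only partition directories whose key=value spec matches the
    query predicate, so non-matching files are never read."""
    wanted = f"{key}={value}"
    return [p for p in partitions if p.rsplit("/", 1)[-1] == wanted]

# A query with WHERE ds = '2010-03-01' touches a single directory:
print(prune(partitions, "ds", "2010-03-01"))
```

Because a partition is just a directory, pruning happens before any data is read, which is what makes it such a cheap optimization.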
AD HOC QUERIES USING HIVE
Most of the ad hoc queries at Facebook are executed through either the Hive Command Line Interface (CLI) or the web interface (HiPal).
HiPal helps enable users who are not very familiar with SQL to create queries.
The results of any query are stored for 7 days, allowing them to be shared with other users.
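The 7-day figure is from the slide; everything else below is an illustrative sketch of how such a retention window could be enforced over stored query results.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # retention window stated on the slide

def expired(result_created_at, now):
    # Results older than the retention window are eligible for deletion;
    # anything younger stays available for sharing with other users.
    return now - result_created_at > RETENTION

now = datetime(2010, 3, 10)
print(expired(datetime(2010, 3, 1), now))   # True  (9 days old: purge)
print(expired(datetime(2010, 3, 5), now))   # False (5 days old: keep)
```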
DATA FLOW ARCHITECTURE
Data from the web servers is pushed out to a set of Scribe-Hadoop clusters.
These servers aggregate the logs coming from the web servers. Data coming from these web servers is uncompressed, and amounts to around 30 TB of data a day.
This first path of data is usually bottlenecked due to the sheer amount of uncompressed data coming from the web servers.
DATA FLOW ARCHITECTURE (CONT.)
Every 5-15 minutes, processes run to compress and copy the data in the form of HDFS files from the Scribe-Hadoop clusters to the Production Hadoop-Hive cluster.
This data is then published either hourly or daily in the corresponding Hive tables so that users/processes can consume it.
The Facebook site data contained in the federated MySQL tier gets loaded into the Production Hadoop-Hive cluster daily through a scrape process, which tries to retrieve the data from the different MySQL servers; after a certain number of retries, it will just provide the previous day's data.
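The retry-then-fall-back behavior of the scrape process can be sketched as follows (a minimal model, not the real scraper; the function names and retry count are assumptions):

```python
def scrape(fetch, previous_day_data, max_retries=3):
    """Try to pull today's snapshot from a MySQL server; after max_retries
    failures, fall back to the previous day's data, as the slide describes."""
    for _ in range(max_retries):
        try:
            return fetch()
        except ConnectionError:
            continue  # transient failure: retry against the MySQL tier
    return previous_day_data

# A server that always fails -> the scrape degrades gracefully.
def flaky_fetch():
    raise ConnectionError("mysql tier unreachable")

print(scrape(flaky_fetch, previous_day_data={"rows": "2010-03-01 snapshot"}))
```

The design choice here is availability over freshness: the warehouse load never blocks on a sick MySQL server, at the cost of serving day-old data.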
DATA FLOW ARCHITECTURE (CONT.)
While the Production Hadoop-Hive cluster is used to execute jobs that have strict delivery deadlines, possibly having users run inefficient queries could cause latency issues with the other jobs that have strict time constraints.
So to solve this issue, a separate Ad Hoc cluster is used to execute lower-priority batch jobs and user queries.
A Hive replication process checks for any changes made to the Hive tables in the Production cluster.
ISSUES: DATA DELIVERY LATENCY
As you can see on the board, there are varying latencies on the various data flows.
Some processes, like the Scribe-to-Hive transfer, happen every 5-15 minutes, but other processes, like the scrape process of the MySQL cluster, happen once a day.
Additionally, the HDFS data files coming from the Scribe-Hadoop servers are only being loaded into native Hive tables at the end of the day.
ISSUES: DATA DELIVERY LATENCY (CONT.)
So in order to provide users with the most recent data, Facebook uses Hive's external table feature.
This feature creates a schema in Hive for a particular table, but instead of having the table's data contained in Hive, the external table feature allows Hive to point to the raw HDFS files.
This allows immediate access to the data before the native Hive table has been created at the end of the day.
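In HiveQL this is expressed with `CREATE EXTERNAL TABLE ... LOCATION`; the sketch below just models the distinction the slide draws (the table name, columns, and HDFS path are invented for illustration):

```python
def external_table(name, columns, hdfs_location):
    """Model of a Hive external table: only a schema plus a pointer to raw
    HDFS files. No data is copied into the warehouse, so the files are
    queryable the moment Scribe delivers them."""
    return {"name": name, "columns": columns, "location": hdfs_location}

# Roughly the HiveQL equivalent (illustrative names and path):
DDL = (
    "CREATE EXTERNAL TABLE raw_clicks (userid STRING, ts BIGINT) "
    "LOCATION '/scribe/clicks/2010-03-01'"
)

t = external_table("raw_clicks", ["userid", "ts"], "/scribe/clicks/2010-03-01")
print(t["location"])  # queries read the raw files in place, with no load step
```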
ISSUES: STORAGE
While the Production Hadoop-Hive cluster only retains 1 month's worth of data, the Ad Hoc cluster has to support historical data for analysis.
This becomes a storage constraint as the incoming data sets get larger and the need for more historical data also gets larger.
To help alleviate the storage issue, all data is compressed using the gzip codec, allowing a compression factor of about 7.
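Python's standard library exposes the same gzip codec, so the effect is easy to demonstrate. The synthetic log line below is far more repetitive than real click logs, so it compresses much better than the roughly 7x the slide cites; the point is only the mechanism.

```python
import gzip

# Repetitive, log-like text (invented): highly compressible by design.
log_line = b"2010-03-01 12:00:00 GET /home userid=12345 status=200\n"
raw = log_line * 10_000

compressed = gzip.compress(raw)
factor = len(raw) / len(compressed)
print(f"{len(raw)} -> {len(compressed)} bytes, factor {factor:.1f}")
```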
ISSUES: SCALING HDFS NAMENODE
The NameNode is the master server that is responsible for the mappings of files to blocks for HDFS.
As of 2010, the Hive-Hadoop NameNode heap was configured with 48 GB of memory, and held around 100,000,000 mapped blocks.
Along with the files produced by Hive, the amount of memory required by the NameNode is increasing.
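A quick back-of-envelope check on those two numbers from the slide (the heap also holds other structures, so this is only an upper bound on the per-block cost):

```python
heap_bytes = 48 * 1024**3     # 48 GB NameNode heap, per the slide
mapped_blocks = 100_000_000   # ~100 million block mappings, per the slide

bytes_per_block = heap_bytes / mapped_blocks
print(f"~{bytes_per_block:.0f} bytes of heap per mapped block")  # ~515
```

Since every extra file and block consumes a fixed slice of a single server's heap, the NameNode scales with object count, not data volume, which is exactly why small files are the enemy here.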
ISSUES: SCALING HDFS NAMENODE (CONT.)
To help alleviate this, besides additional optimizations on the NameNode structure, Facebook is concatenating HDFS files to reduce the number of mappings in the NameNode.
Also, to reduce the number of files returned from HDFS on query execution, Hive generates execution plans that run a concatenation step, reducing pressure on the NameNode.
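The concatenation idea can be sketched as a greedy packing of small output files into fewer large ones (illustrative only; real HDFS concatenation works on blocks, and the 16 MB/64 MB sizes below are invented):

```python
def concatenate(files, target_size):
    """Greedily merge small files into larger ones so that fewer
    file-to-block mappings have to live in the NameNode's heap."""
    merged, current, size = [], [], 0
    for name, fsize in files:
        current.append(name)
        size += fsize
        if size >= target_size:  # current group is big enough: emit it
            merged.append(current)
            current, size = [], 0
    if current:
        merged.append(current)
    return merged

small_files = [("part-%04d" % i, 16) for i in range(8)]  # eight 16 MB files
print(concatenate(small_files, target_size=64))  # 8 files packed into 2 groups
```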
ISSUES: FEDERATION
With Facebook's growing data sets, it might eventually have to consider federating data across multiple clusters, across multiple data centers.
Two approaches would be to federate the data based on date, or to federate based on application (i.e. ads data sets contained in their own cluster).
Going the date route, though, the size of new data will continue to grow larger and larger, so separating older data into separate clusters would only end up saving 20-25% of space in the long run.
Going the application route would cause some overhead from replicating common data among clusters, but this overhead would be considered very low.
Federating on application would allow Facebook to balance out the storage and query workloads on different clusters more evenly.
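Application-based federation might be modeled as a routing table plus a small set of replicated common tables; everything below (table names, application names, cluster names) is a made-up sketch of the idea, not a description of Facebook's setup.

```python
# Which application owns which table, and which cluster hosts each application.
OWNER = {"ads_clicks": "ads", "friend_requests": "growth"}
ROUTING = {"ads": "cluster-ads", "growth": "cluster-general"}
SHARED = {"dim_users"}  # common data replicated to every cluster

def cluster_for(table, querying_app):
    """Route a query: shared tables are replicated everywhere, so they are
    read locally on the querying application's cluster; application-owned
    tables are read on their owner's cluster."""
    if table in SHARED:
        return ROUTING[querying_app]
    return ROUTING[OWNER[table]]

print(cluster_for("ads_clicks", "growth"))  # cluster-ads
print(cluster_for("dim_users", "growth"))   # cluster-general
```

Replicating the small shared tables is the low overhead the slide mentions; in exchange, most queries stay entirely within one cluster.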
ISSUES: DATA DISCOVERY
There are more than 20,000 tables in the Ad Hoc Hadoop-Hive cluster for users to query. This can create issues for almost all of the users (20,000 tables is a lot of tables to be familiar with).
To help with this issue, Facebook has created tools that allow users to create tags for tables, processes that gather information from query logs to provide additional information about tables, and processes that determine expert users of tables so that other users can ask them questions about certain data.
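One simple way to mine query logs for expert users is to count who queries a table most often; the sketch below assumes that heuristic (the log records and names are invented, and the real process may use a different signal).

```python
from collections import Counter

# Query-log records as (user, table) pairs -- made-up sample data.
query_log = [
    ("alice", "ads_clicks"), ("alice", "ads_clicks"),
    ("bob", "ads_clicks"), ("alice", "dim_users"),
]

def experts(log, table, top=1):
    """The heaviest queriers of a table are likely candidates to answer
    other users' questions about it."""
    counts = Counter(user for user, t in log if t == table)
    return [user for user, _ in counts.most_common(top)]

print(experts(query_log, "ads_clicks"))  # ['alice']
```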
ISSUES: RESOURCE SHARING
The Hadoop-Hive cluster supports both ad hoc queries and batch jobs, where the ad hoc users want quick response times, and the batch jobs have to adhere to strict deadlines.
In order to support this, Facebook has been contributing to the Hadoop Fair Share Scheduler.
The Fair Share Scheduler divides users into pools, where they are given a limit on how many concurrent jobs they can execute.
The Fair Share Scheduler is also aware of system resources such as CPU utilization and memory usage. This allows the scheduler to kill tasks on a particular node if that node exceeds the memory threshold that is set for it.
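The two mechanisms the slide names, per-pool concurrency limits and memory-threshold enforcement, can be sketched as follows. This is a toy model; in particular, the kill-biggest-first policy is an assumption, not necessarily what the real scheduler does.

```python
def admit(pool_running, pool_limit):
    # A pool may start another job only while below its concurrent-job limit.
    return pool_running < pool_limit

def tasks_to_kill(node_tasks_mem, node_threshold):
    """If a node's total task memory exceeds its threshold, kill the
    largest tasks until the node fits again (one possible policy)."""
    killed = []
    total = sum(node_tasks_mem.values())
    for task, mem in sorted(node_tasks_mem.items(), key=lambda kv: -kv[1]):
        if total <= node_threshold:
            break
        killed.append(task)
        total -= mem
    return killed

print(admit(3, 5))                                        # True
print(tasks_to_kill({"t1": 4, "t2": 10, "t3": 3}, 8))     # ['t2']
```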
ISSUES: RESOURCE SHARING (CONT.)
Even though separating the Production Hadoop-Hive cluster from the Ad Hoc Hadoop-Hive cluster provides stability for the Production cluster by ensuring ad hoc jobs don't hog CPU processing from more critical jobs, there are times when one server is idle and can take on more processing.
Facebook is experimenting with a system called "Dynamic Clouds", which would allow jobs to be moved from cluster to cluster based on the workloads of the servers.
OPERATIONS
Facebook has added a lot of monitoring to its infrastructure in order to maintain business-critical functions.
CPU usage, I/O activity, and memory usage on the nodes are tracked. An application called HTOP also provides CPU and memory usage across a cluster per job, which helps identify resource-hogging jobs.
Statistics are also collected from the Hadoop-specific pieces of the infrastructure (JobTracker and the NameNode):
avg. job submission rate
# of utilized map and reduce slots
JVM heap utilization
Frequency of garbage collection
These measures have helped Facebook identify issues with the jobs/hardware quickly and remediate the issues.
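A per-job, cluster-wide view of resource usage boils down to summing per-node samples by job id; the sketch below shows that aggregation (the job ids and CPU numbers are invented, and this is not the actual tool's implementation):

```python
from collections import defaultdict

# Per-node samples of (job_id, cpu_seconds) -- made-up monitoring data.
samples = [("job-42", 120.0), ("job-7", 15.0), ("job-42", 240.0), ("job-7", 30.0)]

def top_cpu_jobs(samples, top=1):
    """Sum CPU time per job across all nodes to surface resource hogs,
    roughly what a per-job cluster view gives operators."""
    totals = defaultdict(float)
    for job, cpu in samples:
        totals[job] += cpu
    return sorted(totals, key=totals.get, reverse=True)[:top]

print(top_cpu_jobs(samples))  # ['job-42']
```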
THE END