DATA WAREHOUSING AND ANALYTIC INFRASTRUCTURE AT FACEBOOK
By Adam Kuns
THE PURPOSE OF ANALYTICS AT FACEBOOK
Scalable analytics plays an important role at Facebook by providing the ability to analyze large data sets to numerous teams across the company.
Ad hoc analysis of data and the creation of business intelligence dashboards by analysts at Facebook are made possible by the analytics infrastructure.
It also allows the website to provide features like Insights for advertisers and also make friend recommendations to you.
THE AVERAGE WORKLOAD, ON AN AVERAGE WORK DAY, FOR THE NOT-SO-AVERAGE ANALYTICS INFRASTRUCTURE
Lots of people (and automated processes) across the company rely on running processes on Facebook's large data sets.
On average, 10,000 jobs are submitted by users at Facebook, each query with its own level of parallelism, resource needs, and deadlines.
All this querying is done on a dataset that is growing rapidly each day.
As of 2010, Facebook was loading 10-15 TB of compressed data a day into its data warehouse (60-90 TB uncompressed).
THE BUILDING BLOCKS OF THE FB DATA WAREHOUSE
Hadoop - Distributed file system and map-reduce platform.
Hive - Data warehouse infrastructure that provides traditional/familiar tools such as SQL, metadata, and partitioning on top of Hadoop.
Scribe - Aggregation service that collects logs from thousands of Facebook's web servers.
APACHE HIVE
Hive was developed at Facebook for the purpose of abstracting the lower-level programming of Map-Reduce into familiar SQL-like queries. While Hadoop provides scalability, with Map-Reduce providing the ability to express many kinds of jobs, the ability to create quick queries in a productive manner using Map-Reduce is not possible for many users at Facebook.
HIVEQL
So to make up for this bottleneck in the analytics pipeline, Hive provides a framework for users to express queries in something that many of the users are familiar with: an SQL-like query language, HiveQL. The concepts of tables, columns, and partitions are also part of Hive. This allows users to create queries in minutes, where the same task would take hours or days in Map-Reduce.
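To make the productivity gap concrete, here is a minimal sketch in Python (not Facebook's code; the table and column names are invented): the same per-user event count expressed as hand-written map, shuffle, and reduce phases next to the one-line HiveQL it replaces.

```python
from collections import defaultdict

# The one-line HiveQL a user would write (illustrative names):
HIVEQL = "SELECT userid, COUNT(*) FROM events GROUP BY userid"

def map_phase(records):
    # map: emit a (userid, 1) pair for every event record
    for rec in records:
        yield rec["userid"], 1

def shuffle(pairs):
    # shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each user
    return {user: sum(counts) for user, counts in groups.items()}

events = [{"userid": "a"}, {"userid": "b"}, {"userid": "a"}]
result = reduce_phase(shuffle(map_phase(events)))
print(result)  # {'a': 2, 'b': 1}
```

Even in this toy form, the hand-written version needs three phases and explicit grouping logic for what HiveQL states declaratively in one line.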
HIVE OVERVIEW
In order to provide a relational database abstraction on top of Hadoop, tables and partitions in Hive are stored as HDFS directories in Hadoop.
This structural information is then stored in the Hive Metastore.
With this information in the Metastore, the Hive Driver can compile the HiveQL into Map-Reduce jobs, applying optimizations (such as file pruning based on the query predicates, pushing down predicates, creating efficient execution plans, and re-arranging join orders).
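A minimal sketch of the file-pruning idea, assuming Hive's usual `key=value` partition-directory layout (the paths are made up): given a predicate on the partition column, only the matching directories are handed to the Map-Reduce job.

```python
# Hypothetical directory listing for a Hive table partitioned by date ("ds").
partitions = [
    "/warehouse/events/ds=2010-02-27",
    "/warehouse/events/ds=2010-02-28",
    "/warehouse/events/ds=2010-03-01",
]

def prune(partitions, key, value):
    """Keep only partition directories whose key=value spec matches the
    query predicate, so non-matching files are never read."""
    wanted = f"{key}={value}"
    return [p for p in partitions if p.rsplit("/", 1)[-1] == wanted]

# A query with WHERE ds = '2010-03-01' touches a single directory:
print(prune(partitions, "ds", "2010-03-01"))
```

Because a partition is just a directory, pruning happens before any data is read, which is what makes it such a cheap optimization.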
AD HOC QUERIES USING HIVE
Most of the ad hoc queries at Facebook are executed through either the Hive Command Line Interface (CLI) or the web interface (HiPal).
HiPal helps enable users who are not very familiar with SQL to create queries.
The results of any query are stored for 7 days, allowing them to be shared with other users.
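The 7-day figure is from the slide; everything else below is an illustrative sketch of how such a retention window could be enforced over stored query results.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # retention window stated on the slide

def expired(result_created_at, now):
    # Results older than the retention window are eligible for deletion;
    # anything younger stays available for sharing with other users.
    return now - result_created_at > RETENTION

now = datetime(2010, 3, 10)
print(expired(datetime(2010, 3, 1), now))   # True  (9 days old: purge)
print(expired(datetime(2010, 3, 5), now))   # False (5 days old: keep)
```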
DATA FLOW ARCHITECTURE
Data from the web servers is pushed out to a set of Scribe-Hadoop clusters.
These servers aggregate the logs coming from the web servers. Data coming from these web servers is uncompressed, and amounts to around 30 TB of data a day.
This first path of data is usually bottlenecked due to the sheer amount of uncompressed data coming from the web servers.
DATA FLOW ARCHITECTURE (CONT.)
Every 5-15 minutes, processes run to compress and copy the data in the form of HDFS files from the Scribe-Hadoop clusters to the Production Hadoop-Hive cluster.
This data is then published either hourly or daily in the corresponding Hive tables so that users/processes can consume it.
The Facebook site data contained in the federated MySQL tier gets loaded into the Production Hadoop-Hive cluster daily through a scrape process, which tries to retrieve the data from the different MySQL servers; after a certain number of retries, it will just provide the previous day's data.
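The retry-then-fall-back behavior of the scrape process can be sketched as follows (a minimal model, not the real scraper; the function names and retry count are assumptions):

```python
def scrape(fetch, previous_day_data, max_retries=3):
    """Try to pull today's snapshot from a MySQL server; after max_retries
    failures, fall back to the previous day's data, as the slide describes."""
    for _ in range(max_retries):
        try:
            return fetch()
        except ConnectionError:
            continue  # transient failure: retry against the MySQL tier
    return previous_day_data

# A server that always fails -> the scrape degrades gracefully.
def flaky_fetch():
    raise ConnectionError("mysql tier unreachable")

print(scrape(flaky_fetch, previous_day_data={"rows": "2010-03-01 snapshot"}))
```

The design choice here is availability over freshness: the warehouse load never blocks on a sick MySQL server, at the cost of serving day-old data.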
DATA FLOW ARCHITECTURE (CONT.)
While the Production Hadoop-Hive cluster is used to execute jobs that have strict delivery deadlines, possibly having users run inefficient queries could cause latency issues with the other jobs that have strict time constraints.
So to solve this issue, a separate Ad Hoc cluster is used to execute lower-priority batch jobs and user queries.
A Hive replication process checks for any changes made to the Hive tables in the Production cluster.
ISSUES: DATA DELIVERY LATENCY
As you can see on the board, there are varying latencies on the various data flows.
Some processes, like the Scribe-to-Hive transfer, happen every 5-15 minutes, but other processes, like the scrape process of the MySQL cluster, happen once a day.
Additionally, the HDFS data files coming from the Scribe-Hadoop servers are only being loaded into native Hive tables at the end of the day.
ISSUES: DATA DELIVERY LATENCY (CONT.)
So in order to provide users with the most recent data, Facebook uses Hive's external table feature.
This feature creates a schema in Hive for a particular table, but instead of having the table's data contained in Hive, the external table feature allows Hive to point to the raw HDFS files.
This allows immediate access to the data before the native Hive table has been created at the end of the day.
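In HiveQL this is expressed with `CREATE EXTERNAL TABLE ... LOCATION`; the sketch below just models the distinction the slide draws (the table name, columns, and HDFS path are invented for illustration):

```python
def external_table(name, columns, hdfs_location):
    """Model of a Hive external table: only a schema plus a pointer to raw
    HDFS files. No data is copied into the warehouse, so the files are
    queryable the moment Scribe delivers them."""
    return {"name": name, "columns": columns, "location": hdfs_location}

# Roughly the HiveQL equivalent (illustrative names and path):
DDL = (
    "CREATE EXTERNAL TABLE raw_clicks (userid STRING, ts BIGINT) "
    "LOCATION '/scribe/clicks/2010-03-01'"
)

t = external_table("raw_clicks", ["userid", "ts"], "/scribe/clicks/2010-03-01")
print(t["location"])  # queries read the raw files in place, with no load step
```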
ISSUES: STORAGE
While the Production Hadoop-Hive cluster only retains 1 month's worth of data, the Ad Hoc cluster has to support historical data for analysis.
This becomes a storage constraint as the incoming data sets get larger and the need for more historical data also gets larger.
To help alleviate the storage issue, all data is compressed using the gzip codec, allowing a compression factor of about 7.
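Python's standard library exposes the same gzip codec, so the effect is easy to demonstrate. The synthetic log line below is far more repetitive than real click logs, so it compresses much better than the roughly 7x the slide cites; the point is only the mechanism.

```python
import gzip

# Repetitive, log-like text (invented): highly compressible by design.
log_line = b"2010-03-01 12:00:00 GET /home userid=12345 status=200\n"
raw = log_line * 10_000

compressed = gzip.compress(raw)
factor = len(raw) / len(compressed)
print(f"{len(raw)} -> {len(compressed)} bytes, factor {factor:.1f}")
```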
ISSUES: SCALING HDFS NAMENODE
The NameNode is the master server that is responsible for the mappings of files to blocks for HDFS.
As of 2010, the Hive-Hadoop NameNode heap was configured with 48 GB of memory, and held around 100,000,000 mapped blocks.
Along with the files produced by Hive, the amount of memory required by the NameNode is increasing.
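A quick back-of-envelope check on those two numbers from the slide (the heap also holds other structures, so this is only an upper bound on the per-block cost):

```python
heap_bytes = 48 * 1024**3     # 48 GB NameNode heap, per the slide
mapped_blocks = 100_000_000   # ~100 million block mappings, per the slide

bytes_per_block = heap_bytes / mapped_blocks
print(f"~{bytes_per_block:.0f} bytes of heap per mapped block")  # ~515
```

Since every extra file and block consumes a fixed slice of a single server's heap, the NameNode scales with object count, not data volume, which is exactly why small files are the enemy here.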
ISSUES: SCALING HDFS NAMENODE (CONT.)
To help alleviate this, besides additional optimizations on the NameNode structure, Facebook is concatenating HDFS files to reduce the number of mappings in the NameNode.
Also, to reduce the number of files returned from HDFS on query execution, Hive generates execution plans that run a concatenation step, reducing pressure on the NameNode.
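The concatenation idea can be sketched as a greedy packing of small output files into fewer large ones (illustrative only; real HDFS concatenation works on blocks, and the 16 MB/64 MB sizes below are invented):

```python
def concatenate(files, target_size):
    """Greedily merge small files into larger ones so that fewer
    file-to-block mappings have to live in the NameNode's heap."""
    merged, current, size = [], [], 0
    for name, fsize in files:
        current.append(name)
        size += fsize
        if size >= target_size:  # current group is big enough: emit it
            merged.append(current)
            current, size = [], 0
    if current:
        merged.append(current)
    return merged

small_files = [("part-%04d" % i, 16) for i in range(8)]  # eight 16 MB files
print(concatenate(small_files, target_size=64))  # 8 files packed into 2 groups
```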
ISSUES: FEDERATION
With Facebook's growing data sets, it might eventually have to consider federating data across multiple clusters, across multiple data centers.
Two approaches would be to federate the data based on date, or to federate based on application (i.e. ads data sets contained in their own cluster).
Going the date route, though, the size of new data will continue to grow larger and larger, so separating older data into separate clusters would only end up saving 20-25% of space in the long run.
Going the application route would cause some overhead from replicating common data among clusters, but this overhead would be considered very low.
Federating on application would allow Facebook to balance out the storage and query workloads on different clusters more evenly.
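Application-based federation might be modeled as a routing table plus a small set of replicated common tables; everything below (table names, application names, cluster names) is a made-up sketch of the idea, not a description of Facebook's setup.

```python
# Which application owns which table, and which cluster hosts each application.
OWNER = {"ads_clicks": "ads", "friend_requests": "growth"}
ROUTING = {"ads": "cluster-ads", "growth": "cluster-general"}
SHARED = {"dim_users"}  # common data replicated to every cluster

def cluster_for(table, querying_app):
    """Route a query: shared tables are replicated everywhere, so they are
    read locally on the querying application's cluster; application-owned
    tables are read on their owner's cluster."""
    if table in SHARED:
        return ROUTING[querying_app]
    return ROUTING[OWNER[table]]

print(cluster_for("ads_clicks", "growth"))  # cluster-ads
print(cluster_for("dim_users", "growth"))   # cluster-general
```

Replicating the small shared tables is the low overhead the slide mentions; in exchange, most queries stay entirely within one cluster.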
ISSUES: DATA DISCOVERY
There are more than 20,000 tables in the Ad Hoc Hadoop-Hive cluster for users to query. This can create issues for almost all of the users (20,000 tables is a lot of tables to be familiar with).
To help with this issue, Facebook has created tools that allow users to create tags for tables, processes that gather information from query logs to provide additional information about tables, and processes that determine expert users of tables so that other users can ask them questions about certain data.
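One simple way to mine query logs for expert users is to count who queries a table most often; the sketch below assumes that heuristic (the log records and names are invented, and the real process may use a different signal).

```python
from collections import Counter

# Query-log records as (user, table) pairs -- made-up sample data.
query_log = [
    ("alice", "ads_clicks"), ("alice", "ads_clicks"),
    ("bob", "ads_clicks"), ("alice", "dim_users"),
]

def experts(log, table, top=1):
    """The heaviest queriers of a table are likely candidates to answer
    other users' questions about it."""
    counts = Counter(user for user, t in log if t == table)
    return [user for user, _ in counts.most_common(top)]

print(experts(query_log, "ads_clicks"))  # ['alice']
```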
ISSUES: RESOURCE SHARING
The Hadoop-Hive cluster supports both ad hoc queries and batch jobs, where the ad hoc users want quick response times, and the batch jobs have to adhere to strict deadlines.
In order to support this, Facebook has been contributing to the Hadoop Fair Share Scheduler.
The Fair Share Scheduler divides users into pools, where they are given a limit on how many concurrent jobs they can execute.
The Fair Share Scheduler is also aware of system resources such as CPU utilization and memory usage. This allows the scheduler to kill tasks on a particular node if that node exceeds the memory threshold that is set for it.
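The two mechanisms the slide names, per-pool concurrency limits and memory-threshold enforcement, can be sketched as follows. This is a toy model; in particular, the kill-biggest-first policy is an assumption, not necessarily what the real scheduler does.

```python
def admit(pool_running, pool_limit):
    # A pool may start another job only while below its concurrent-job limit.
    return pool_running < pool_limit

def tasks_to_kill(node_tasks_mem, node_threshold):
    """If a node's total task memory exceeds its threshold, kill the
    largest tasks until the node fits again (one possible policy)."""
    killed = []
    total = sum(node_tasks_mem.values())
    for task, mem in sorted(node_tasks_mem.items(), key=lambda kv: -kv[1]):
        if total <= node_threshold:
            break
        killed.append(task)
        total -= mem
    return killed

print(admit(3, 5))                                        # True
print(tasks_to_kill({"t1": 4, "t2": 10, "t3": 3}, 8))     # ['t2']
```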
ISSUES: RESOURCE SHARING (CONT.)
Even though separating the Production Hadoop-Hive cluster from the Ad Hoc Hadoop-Hive cluster provides stability for the Production cluster by ensuring ad hoc jobs don't hog CPU processing from more critical jobs, there are times when one server is idle and can take on more processing.
Facebook is experimenting with a system called "Dynamic Clouds", which would allow jobs to be moved from cluster to cluster based on the workloads of the servers.
OPERATIONS
Facebook has added a lot of monitoring to its infrastructure in order to maintain business-critical functions.
CPU usage, I/O activity, and memory usage on the nodes are tracked. An application called HTOP also provides CPU and memory usage across a cluster per job, which helps identify resource-hogging jobs.
Statistics are also collected from the Hadoop-specific pieces of the infrastructure (JobTracker and the NameNode):
avg. job submission rate
# of utilized map and reduce slots
JVM heap utilization
Frequency of garbage collection
These measures have helped Facebook identify issues with the jobs/hardware quickly and remediate the issues.
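A per-job, cluster-wide view of resource usage boils down to summing per-node samples by job id; the sketch below shows that aggregation (the job ids and CPU numbers are invented, and this is not the actual tool's implementation):

```python
from collections import defaultdict

# Per-node samples of (job_id, cpu_seconds) -- made-up monitoring data.
samples = [("job-42", 120.0), ("job-7", 15.0), ("job-42", 240.0), ("job-7", 30.0)]

def top_cpu_jobs(samples, top=1):
    """Sum CPU time per job across all nodes to surface resource hogs,
    roughly what a per-job cluster view gives operators."""
    totals = defaultdict(float)
    for job, cpu in samples:
        totals[job] += cpu
    return sorted(totals, key=totals.get, reverse=True)[:top]

print(top_cpu_jobs(samples))  # ['job-42']
```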
THE END