Hadoop at Ayasdi
Overview
- HDFS for storage
- YARN for integration into Hadoop data lake
- Parquet as the file format
- bigdf based on Spark for feature engineering, data wrangling
!! Audience Poll !!
1. How many data scientists?
2. How many backend engineers?
3. UI/frontend engineers?
4. Using Hadoop in production?
5. Using Spark in production?
6. Personally worked on data bigger than 100GB? 1TB? 10TB? 1PB?
HDFS - Motivation
- installed base, large community
- ecosystem to connect to most other data sources
- commodity cluster
- experiments with distributed NAS didn't show enough promise to justify the additional cost and complexity
YARN - Motivation
- Ayasdi scheduler
  - maximize throughput for batch jobs
  - minimize latency for interactive “tasklets”
- wanted to deploy in existing Hadoop data lakes
- integrated in-house scheduler with YARN
  - “tasklets” get a long-running container
  - batch jobs get a container on demand
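The container policy above can be sketched in a few lines of plain Python. This is only an illustration of the idea (interactive tasklets share one long-running container, each batch job gets a fresh container that is released afterwards). It does not call YARN, and `ContainerPool`, `run_tasklet`, and `run_batch` are hypothetical names, not Ayasdi's scheduler API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Container:
    container_id: int
    long_running: bool


@dataclass
class ContainerPool:
    """Hypothetical stand-in for a YARN-backed container allocator."""
    next_id: int = 0
    tasklet_container: Optional[Container] = None
    released: List[int] = field(default_factory=list)

    def _allocate(self, long_running: bool) -> Container:
        # In a real deployment this would be a container request to YARN.
        container = Container(self.next_id, long_running)
        self.next_id += 1
        return container

    def run_tasklet(self, fn: Callable[[], Any]) -> Tuple[int, Any]:
        # Tasklets reuse one long-running container, so they skip the
        # allocation delay and keep interactive latency low.
        if self.tasklet_container is None:
            self.tasklet_container = self._allocate(long_running=True)
        return self.tasklet_container.container_id, fn()

    def run_batch(self, fn: Callable[[], Any]) -> Tuple[int, Any]:
        # Batch jobs get a container on demand and release it when done.
        container = self._allocate(long_running=False)
        try:
            return container.container_id, fn()
        finally:
            self.released.append(container.container_id)
```

In this sketch, repeated `run_tasklet` calls report the same container id, while every `run_batch` call reports a new one and appends it to `released` once the job finishes.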
YARN - Challenges
- increased latency observable for small batch jobs
- early adopter pains
  - sparse documentation
  - not the best API design
Parquet - Motivation
- legacy: data stored in both row-major and column-major layouts
  - requires expensive transpose on ingestion
- were designing a “tiled file format” when we discovered Parquet
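To make the transpose cost concrete, here is a minimal pure-Python sketch of turning row-major records into a column-major layout, the layout a columnar format such as Parquet is organized around. It does not use any Parquet library, and `rows_to_columns` is a hypothetical helper written for this illustration.

```python
from typing import Any, Dict, List, Sequence


def rows_to_columns(
    header: Sequence[str], rows: Sequence[Sequence[Any]]
) -> Dict[str, List[Any]]:
    """Transpose row-major records into one value list per column."""
    columns: Dict[str, List[Any]] = {name: [] for name in header}
    for row in rows:
        if len(row) != len(header):
            raise ValueError("row length does not match header length")
        # Every cell is touched once, so the cost grows with rows x columns.
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns
```

Storing both layouts means paying for this pass over every cell at ingestion time and keeping two copies of the data.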
Parquet - Challenges
- early adopter challenges
  - sparse documentation
  - needed to access package-private APIs
bigdf - Motivation
- born out of experience using Spark for feature engineering
- creating classes for RDDs not reusable across projects
- SQL not expressive enough
bigdf - Details
- open source since Sep 2014
- precedes Spark DataFrame, so built on spark-core engine
- experimenting with Catalyst using Spark DataFrame APIs, looks promising
- Python and Scala APIs
- feature engineering library [not open source :-( ]
- fast CSV reader (and other features) contributed to spark-csv
bigdf - Future
- wrapper around Spark DF
  - to protect from API changes
  - to add features, e.g. “sparse column set”, as “round-trip time” for pull requests into large open source projects is high
Thanks!
www.ayasdi.com
http://engineering.ayasdi.com
https://github.com/AyasdiOpenSource/bigdf
http://www.ayasdi.com/company/careers/