Data Wrangling and Oracle Connectors for Hadoop


Page 1: Data Wrangling and Oracle Connectors for Hadoop

Wrangling Data With Oracle Connectors for Hadoop

Gwen Shapira, Solutions Architect | [email protected] | @gwenshap

Page 2: Data Wrangling and Oracle Connectors for Hadoop

Data Has Changed in the Last 30 Years

[Figure: data growth from 1980 to 2013, driven by end-user applications, the Internet, mobile devices, and sophisticated machines. Structured data is about 10% of the total; unstructured data is about 90%.]

Page 3: Data Wrangling and Oracle Connectors for Hadoop

Data is Messy

Page 4: Data Wrangling and Oracle Connectors for Hadoop

Data Wrangling (n): The process of converting "raw" data into a format that allows convenient consumption

Page 5: Data Wrangling and Oracle Connectors for Hadoop

Hadoop Is…

• HDFS – massive, redundant data storage
• MapReduce – batch-oriented data processing at scale (see the sketch below)

[Figure: core Hadoop system components. The Hadoop Distributed File System (HDFS) provides replicated, high-bandwidth, clustered storage; MapReduce is a distributed computing framework.]
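The MapReduce model is easiest to see in a tiny example. Below is a minimal word-count sketch for Hadoop Streaming written in Python (the use of Streaming and Python here is an assumption for illustration; the deck itself does not show one): the mapper emits tab-separated key/value pairs, Hadoop sorts them by key, and the reducer sums the counts per key.

# wordcount_mapper.py: emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)

# wordcount_reducer.py: input arrives sorted by word; sum the counts per word
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))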

Page 6: Data Wrangling and Oracle Connectors for Hadoop

Hadoop and Databases

Databases ("Schema-on-Write"):
• The schema must be created before any data can be loaded
• An explicit load operation transforms the data into the database's internal structure
• New columns must be added explicitly
• Pros: 1) reads are fast, 2) standards and governance

Hadoop ("Schema-on-Read"):
• Data is simply copied to the file store; no transformation is needed
• A serializer/deserializer is applied at read time to extract the required columns
• New data can start flowing at any time and will appear retroactively
• Pros: 1) loads are fast, 2) flexibility and agility

Page 7: Data Wrangling and Oracle Connectors for Hadoop

Hadoop Rocks Data Wrangling

• Cheap storage for messy data
• Tools to play with data: acquire, clean, transform
• Flexibility where you need it most

Page 8: Data Wrangling and Oracle Connectors for Hadoop

Got unstructured data?

• Data Warehouse: text, CSV, XLS, XML
• Hadoop: HTML; XML, RSS; JSON; Apache logs; Avro, Protocol Buffers, ORC, Parquet; compressed formats; Office, OpenDocument, iWork; PDF, EPUB, RTF; MIDI, MP3; JPEG, TIFF; Java classes; Mbox, RFC 822; AutoCAD; TrueType; HDF / NetCDF

Page 9: Data Wrangling and Oracle Connectors for Hadoop

But eventually, you need your data in your DWH

Oracle Connectors for Hadoop Rock Data Loading

Page 10: Data Wrangling and Oracle Connectors for Hadoop

What Does Data Wrangling Look Like?

Source → Acquire → Clean → Transform → Load

Page 11: Data Wrangling and Oracle Connectors for Hadoop

Data Sources

• Internal: OLTP, log files, documents, sensors / network events
• External: geo-location, demographics, public data sets, websites

Page 12: Data Wrangling and Oracle Connectors for Hadoop

Free External Data (name and URL):

U.S. Census Bureau http://factfinder2.census.gov/

U.S. Executive Branch http://www.data.gov/

U.K. Government http://data.gov.uk/

E.U. Government http://publicdata.eu/

The World Bank http://data.worldbank.org/

Freebase http://www.freebase.com/

Wikidata http://meta.wikimedia.org/wiki/Wikidata

Amazon Web Services http://aws.amazon.com/datasets

Page 13: Data Wrangling and Oracle Connectors for Hadoop

Data for Sale (source, type, and URL):

Gnip Social Media http://gnip.com/

AC Nielsen Media Usage http://www.nielsen.com/

Rapleaf Demographic http://www.rapleaf.com/

ESRI Geographic (GIS) http://www.esri.com/

eBay Auction https://developer.ebay.com/

D&B Business Entities http://www.dnb.com/

Trulia Real Estate http://www.trulia.com/

Standard & Poor’s Financial http://standardandpoors.com/

Page 14: Data Wrangling and Oracle Connectors for Hadoop

Source → Acquire → Clean → Transform → Load

Page 15: Data Wrangling and Oracle Connectors for Hadoop

Getting Data into Hadoop

• Sqoop
• Flume
• Copy
• Write
• Scraping
• Data APIs

Page 16: Data Wrangling and Oracle Connectors for Hadoop

Sqoop Import Examples

• sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username hr --table emp --where "start_date > '01-01-2012'"

• sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table shops --split-by shop_id --num-mappers 16

The split-by column must be indexed (or the table partitioned) to avoid 16 full table scans.

Page 17: Data Wrangling and Oracle Connectors for Hadoop

Or…

• hadoop fs -put myfile.txt /big/project/myfile.txt
• curl -i list_of_urls.txt
• curl https://api.twitter.com/1/users/show.json?screen_name=cloudera

{ "id":16134540, "name":"Cloudera", "screen_name":"cloudera", "location":"Palo Alto, CA", "url":"http://www.cloudera.com", "followers_count":11359 }

Page 18: Data Wrangling and Oracle Connectors for Hadoop

And even…

$ cat scraper.py
import urllib
from BeautifulSoup import BeautifulSoup

txt = urllib.urlopen("http://www.example.com/")
soup = BeautifulSoup(txt)
headings = soup.findAll("h2")
for heading in headings:
    print heading.string

Page 19: Data Wrangling and Oracle Connectors for Hadoop

Source → Acquire → Clean → Transform → Load

Page 20: Data Wrangling and Oracle Connectors for Hadoop

Data Quality Issues

• Given enough data, quality issues are inevitable
• Main issues:
  • Inconsistent: "99" instead of "1999"
  • Invalid: last_update of 2036
  • Corrupt: #$%&@*%@

Page 21: Data Wrangling and Oracle Connectors for Hadoop

Happy families are all alike. Each unhappy family is unhappy in its own way.

Page 22: Data Wrangling and Oracle Connectors for Hadoop

Endless Inconsistencies

• Upper vs. lower case
• Date formats
• Times, time zones, 24h
• Missing values: NULL vs. empty string vs. NA
• Variation in free-format input:
  • "1 PATCH EVERY 24 HOURS"
  • "Replace patches on skin daily"
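A minimal sketch of the kind of normalization these inconsistencies call for, assuming records arrive as Python dicts of strings; the field names and accepted date formats are illustrative, not from the deck:

from datetime import datetime

def normalize(record):
    # Case-fold free text so "PATCH" and "patch" compare equal
    instructions = record.get("instructions", "").strip().lower()
    # Treat empty string and "NA" the same as a true missing value
    dose = record.get("dose")
    if dose in (None, "", "NA"):
        dose = None
    # Accept a few common date formats; leave None if nothing matches
    parsed_date = None
    for fmt in ("%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y"):
        try:
            parsed_date = datetime.strptime(record.get("last_update", ""), fmt)
            break
        except ValueError:
            pass
    return {"instructions": instructions, "dose": dose, "last_update": parsed_date}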

Page 23: Data Wrangling and Oracle Connectors for Hadoop

Hadoop Strategies

• A validation script is ALWAYS the first step
• But not always enough
• We have known unknowns and unknown unknowns

Page 24: Data Wrangling and Oracle Connectors for Hadoop

Known Unknowns

• Script to (a sketch follows below):
  • Check number of columns per row
  • Validate not-null
  • Validate data type ("is number")
  • Date constraints
  • Other business logic
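A minimal sketch of such a validation script, assuming tab-separated records with three fields (id, amount, start_date); the field names, delimiter, and rules are illustrative:

import sys
from datetime import datetime

EXPECTED_COLUMNS = 3   # illustrative

def is_valid(line):
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_COLUMNS:           # check number of columns per row
        return False
    record_id, amount, start_date = fields
    if not record_id:                             # validate not-null
        return False
    try:
        float(amount)                             # validate data type ("is number")
    except ValueError:
        return False
    try:                                          # date constraint: not in the future
        if datetime.strptime(start_date, "%Y-%m-%d") > datetime.now():
            return False
    except ValueError:
        return False
    return True

for line in sys.stdin:
    if not is_valid(line):
        sys.stderr.write("bad record: " + line)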

Page 25: Data Wrangling and Oracle Connectors for Hadoop

Unknown Unknowns

• Bad records will happen
• Your job should move on
• Use counters in the Hadoop job to count bad records (see the sketch below)
• Log errors
• Write bad records to a re-loadable file
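A minimal sketch of what this can look like in a Hadoop Streaming mapper written in Python. The reporter:counter: protocol on stderr is standard Hadoop Streaming; the counter names, the parsing logic, and the local rejects file are illustrative assumptions:

import sys

def parse(line):
    # Hypothetical parse step; raises ValueError on a malformed record
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError("expected 3 columns, got %d" % len(fields))
    return fields

bad = open("bad_records.txt", "a")   # local to the task attempt; re-load it later
for line in sys.stdin:
    try:
        fields = parse(line)
    except ValueError as err:
        # Hadoop Streaming increments this counter when it sees the line on stderr
        sys.stderr.write("reporter:counter:DataQuality,BadRecords,1\n")
        sys.stderr.write("skipping bad record: %s\n" % err)   # log the error
        bad.write(line)                                        # keep it re-loadable
        continue                                               # and move on
    print("\t".join(fields))
bad.close()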

Page 26: Data Wrangling and Oracle Connectors for Hadoop

Solving Bad Data

• Can be done at many levels:
  • Fix at the source
  • Improve the acquisition process
  • Pre-process before analysis
  • Fix during analysis
• How many times will you analyze this data? 0, 1, many, lots

Page 27: Data Wrangling and Oracle Connectors for Hadoop

Source → Acquire → Clean → Transform → Load

Page 28: Data Wrangling and Oracle Connectors for Hadoop

Endless Possibilities

• MapReduce (in any language)
• Hive (i.e., SQL)
• Pig
• R
• Shell scripts
• Plain old Java

Page 29: Data Wrangling and Oracle Connectors for Hadoop

De-Identification

• Remove PII data: names, addresses, possibly more
• Remove columns
• Remove IDs *after* joins
• Hash (see the sketch below)
• Use partial data
• Create statistically similar fake data
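A minimal sketch of the hashing option, assuming tab-separated records whose first field is the identifier; the salt and the field position are illustrative:

import hashlib
import sys

SALT = b"replace-with-a-secret-salt"   # illustrative; a real salt should live outside the code

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    # Replace the raw ID with a salted hash: joins still work, the ID itself is not exposed.
    # Note: hashing alone may not be enough for low-entropy identifiers.
    fields[0] = hashlib.sha256(SALT + fields[0].encode("utf-8")).hexdigest()
    print("\t".join(fields))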

Page 30: Data Wrangling and Oracle Connectors for Hadoop

87% of the US population can be identified from gender, zip code, and date of birth

Page 31: Data Wrangling and Oracle Connectors for Hadoop

Joins

• Do at the source if possible
• Can be done with MapReduce
• Or with Hive (Hadoop SQL)
• Joins are expensive:
  • Do once and store the results
  • De-aggregate aggressively: everything a hospital knows about a patient

Page 32: Data Wrangling and Oracle Connectors for Hadoop


DataWrangler

Page 33: Data Wrangling and Oracle Connectors for Hadoop

Process Tips

• Keep track of data lineage
• Keep track of all changes to the data
• Use source control for code

Page 34: Data Wrangling and Oracle Connectors for Hadoop

Source → Acquire → Clean → Transform → Load

Page 35: Data Wrangling and Oracle Connectors for Hadoop


Sqoop

sqoop export --connect jdbc:mysql://db.example.com/foo --table bar --export-dir /results/bar_data

Page 36: Data Wrangling and Oracle Connectors for Hadoop

FUSE-DFS

• Mount HDFS on the Oracle server:
  sudo yum install hadoop-0.20-fuse
  hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point>
• Use external tables to load data into Oracle

Page 37: Data Wrangling and Oracle Connectors for Hadoop

That's nice. But can you load data FAST?

Page 38: Data Wrangling and Oracle Connectors for Hadoop

Oracle Connectors

• SQL Connector for Hadoop
• Oracle Loader for Hadoop
• ODI with Hadoop
• OBIEE with Hadoop
• R Connector for Hadoop

You don't need the BDA (Oracle Big Data Appliance) to use them.

Page 39: Data Wrangling and Oracle Connectors for Hadoop

Oracle Loader for Hadoop

• Kind of like SQL*Loader
• Data is on HDFS
• Runs as a MapReduce job
• Partitions, sorts, and converts the format to Oracle blocks
• Appended to database tables
• Or written to Data Pump files for a later load

Page 40: Data Wrangling and Oracle Connectors for Hadoop

Oracle SQL Connector for HDFS

• Data is in HDFS
• The connector creates an external table
• That automatically matches the Hadoop data
• Control the degree of parallelism

• You know external tables, right?

Page 41: Data Wrangling and Oracle Connectors for Hadoop

Data Types Supported

• Data Pump
• Delimited text
• Avro
• Regular expressions
• Custom formats

Page 42: Data Wrangling and Oracle Connectors for Hadoop

Main Benefit: Processing is done in Hadoop

Page 43: Data Wrangling and Oracle Connectors for Hadoop

Benefits

• High performance
• Reduced CPU usage on the database
• Automatic optimizations:
  • Partitioning
  • Sorting
  • Load balancing

Page 44: Data Wrangling and Oracle Connectors for Hadoop

Measuring Data Load

• Concerns: how much time? how much CPU?
• Bottlenecks: disk, CPU, network

Page 45: Data Wrangling and Oracle Connectors for Hadoop


I Know What This Means:

Page 46: Data Wrangling and Oracle Connectors for Hadoop


What does this mean?

Page 47: Data Wrangling and Oracle Connectors for Hadoop

Measuring Data Load

• Disks: ~300 MB/s each
• SSD: ~1.6 GB/s each
• Network:
  • ~100 MB/s (1 GbE)
  • ~1 GB/s (10 GbE)
  • ~4 GB/s (InfiniBand)
• CPU: 1 CPU-second per second per core
• Need to know: CPU-seconds per GB

Page 48: Data Wrangling and Oracle Connectors for Hadoop

Let's walk through this…

We have 5 TB to load. Each core provides 3,600 CPU-seconds per hour.

Loading 5,000 GB will take:
• With FUSE-DFS: 5,000 * 150 CPU-sec/GB = 750,000 CPU-sec ≈ 208 CPU-hours
• With the SQL Connector: 5,000 * 40 CPU-sec/GB = 200,000 CPU-sec ≈ 55 CPU-hours

Our X2-3 half rack has 84 cores, so around 40 minutes to load 5 TB at 100% CPU.

This assumes you use Exadata (InfiniBand + SSD ≈ 8 TB/h load rate) and use all CPUs for loading.
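The same back-of-the-envelope arithmetic as a small script, with the per-GB CPU costs from the slide treated as assumptions:

# Numbers from the slide; treat the per-GB CPU costs as assumptions
DATA_GB = 5000
CORES = 84                                   # X2-3 half rack
CPU_SEC_PER_GB = {"FUSE-DFS": 150, "SQL Connector": 40}

for method, cost in CPU_SEC_PER_GB.items():
    cpu_hours = DATA_GB * cost / 3600.0
    minutes = cpu_hours / CORES * 60         # assumes every core is busy loading
    print("%s: %.1f CPU-hours, ~%.0f minutes wall clock" % (method, cpu_hours, minutes))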

Page 49: Data Wrangling and Oracle Connectors for Hadoop

Given a fast enough network and disks, data loading will take all available CPU. This is a good thing.

Page 50: Data Wrangling and Oracle Connectors for Hadoop
