Source: yappidays.ru/wp-content/uploads/2017/09/hadoop-2017-yaroslavl.pdf


Hadoop

Alexey Zinovyev, Java/BigData Trainer in EPAM

About

With IT since 2007

With Java since 2009

With Hadoop since 2012

With Spark since 2014

With EPAM since 2015

3Training from Zinoviev Alexey

Contacts

E-mail : [email protected]

Twitter : @zaleslaw @BigDataRussia

vk.com/big_data_russia Big Data Russia

+ Telegram @bigdatarussia

vk.com/java_jvm Java & JVM langs

+ Telegram @javajvmlangs

4Hadoop Jungle

Main parts

• What is BIG DATA?

• Intro in Hadoop

• HDFS & YARN

• MapReduce Java API

• JOIN techniques*

• JVM Settings*

• File formats*


WHAT IS BIG DATA?


Joke about Excel


Every 60 seconds…


From Mobile Devices


From Industry


We started to keep and handle stupid new things!


Crazy Zoo

2012


10^6 rows

in MySQL


GB->TB->PB->?


Is BigData about PBs?


It’s hard to …

• .. store

• .. handle

• .. search in

• .. visualize

• .. send over the network


Just do it … in parallel


Parallel Computing vs Distributed Computing


You need to develop

• .. distributed on-disk storage

• .. in-memory storage (or shared memory buffer)

• .. thread pool to run hundreds of threads

• .. synchronize all components

• .. provide an API for reuse by other developers


We all love to reinvent the wheel, but…


HADOOP


Hadoop


Disks Performance


The main concept

Let’s read data in parallel


“Cheap” cluster


Das Ist Musst surviven!


Main components

• Hadoop Common

• Hadoop Clients

• HDFS

• YARN

• MapReduce

Hadoop frameworks

• Universal (MapReduce, Tez, RDD in Spark)

• Abstract (Pig, Spark Pipelines)

• SQL-like (Hive, Impala, Spark SQL)

• Graph processing (Giraph, GraphX)

• Machine Learning (Mahout, MLlib)

• Stream processing (Spark Streaming, Storm)

Hadoop Architecture


Key features

• Automatic parallelization and distribution

• Fault-tolerance

• Data Locality

• Writing the Map and Reduce functions only

• Single-threaded model

HDFS DAEMONS


The main idea

'Time to transfer' > 'Time to seek'


Main idea


HDFS node types

• NameNode

• DataNode

• Secondary NameNode

• Standby NameNode

• Checkpoint Node

• Backup Node

The main thought about HDFS

An HDFS node is a JVM daemon


With an HDFS node you can:

• monitor it with JMX

• use jmap, jps and so on

• configure the NameNode heap size

• use the power of JVM flags

YARN


From Hadoop 1 to Hadoop 2


Daemons in YARN


Different scheduling algorithms

MAPREDUCE THEORY


MapReduce in different languages


Think in Key-Value style

map (k1, v1) → list(k2, v2)

reduce (k2, list(v2*) )→ list(k3, v3)
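What the framework does between those two calls can be sketched in plain Java — no Hadoop here, and all names and types are illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyValueSketch {

    // map(k1, v1) -> list(k2, v2): line number + line -> (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(long lineNo, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // reduce(k2, list(v2)) -> (k3, v3): word + list of counts -> (word, sum)
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return Map.entry(word, sum);
    }

    // The framework's job: run map over the input, group ("shuffle")
    // the intermediate pairs by key, then call reduce once per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        long lineNo = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(lineNo++, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Integer> kv = reduce(e.getKey(), e.getValue());
            result.put(kv.getKey(), kv.getValue());
        }
        return result;
    }
}
```

The grouping step in run() is exactly what the shuffle does for you on a cluster.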


Typical MR tasks

• WordCount

• Log handling

• Filtering

• Report preparation

Why should we use MapReduce?


We try to reduce ‘time to act’ but keep BigData


Classic Batch


Do you like batches?


Fixed Windows


Filter all elements


MAPREDUCE FRAMEWORK

Pay attention!

Hadoop != MapReduce


Yet another YARN lover


Main steps

• Map

• Shuffle

• Reduce

Minimal Runner

public class MinimalMapReduce extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        job.setJarByClass(getClass());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MinimalMapReduce(), args);
        System.exit(exitCode);
    }
}


Job Config

job.setInputFormatClass(TextInputFormat.class);

job.setMapperClass(Mapper.class);

job.setMapOutputKeyClass(LongWritable.class);

job.setMapOutputValueClass(Text.class);

job.setPartitionerClass(HashPartitioner.class);

job.setNumReduceTasks(1);

job.setReducerClass(Reducer.class);

job.setOutputKeyClass(LongWritable.class);

job.setOutputValueClass(Text.class);

job.setOutputFormatClass(TextOutputFormat.class);


Would you like to configure it in Java?


WordCount
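WordCount in the Java MapReduce API — a sketch close to the canonical Hadoop tutorial example; the mapper is called WordMapper to match the MRUnit test shown later, but treat the exact code as an illustration:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class WordMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);  // emit (word, 1) per token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));  // emit (word, total)
        }
    }
}
```

Plugged into the Minimal Runner above via job.setMapperClass(...) and job.setReducerClass(...).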


TESTING & DEVELOPMENT


MR Unit idea


Do it separately


Simple Test

public class MRUnitHelloWorld {

    MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        WordMapper mapper = new WordMapper();
        mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
        mapDriver.setMapper(mapper);
    }

    @Test
    public void testMapper() {
        mapDriver.withInput(new LongWritable(1), new Text("cat dog"));
        mapDriver.withOutput(new Text("cat"), new IntWritable(1));
        mapDriver.withOutput(new Text("dog"), new IntWritable(1));
        mapDriver.runTest();
    }
}

Testing strategies

• First develop/test in local mode using a small amount of data

• Test in pseudo-distributed mode with more data

• Test in fully distributed mode with even more data

• Final execution: fully distributed mode & all data

MRUnit


FILE FORMATS


Input/Output Formats

• Text based (CSV, TSV, JSON, XML)

• Sequence Files

• Column based (Parquet, RCFile, ORC)

• Avro

• HBase formats

• Custom formats


ORC vs PARQUET


Codec Performance on the Wikipedia Text Corpus

[Chart: time in seconds (compress + decompress) vs. space savings — Bzip2: 71% savings, 60.0 s; Zlib (Deflate, Gzip): 64%, 32.3 s; LZO: 47%, 4.8 s; Snappy: 42%, 4.0 s; LZ4: 44%, 2.4 s. Bzip2 and Zlib give a high compression ratio; LZO, Snappy and LZ4 give high compression speed.]

* LEVEL


The main performance idea

Reduce shuffle time & resources


MAPREDUCE ADVANCED


Customize MapReduce!


MapReduce for WordCount


WordCount Combiner


Combine


Combiner
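Wiring a combiner into the job is one line. A sketch, assuming a sum-style reducer class (IntSumReducer is a placeholder name) whose function is commutative and associative:

```java
// The combiner runs reducer logic on each mapper's local output
// before the shuffle, so (word, 1) pairs collapse into partial sums
// and far less data crosses the network.
job.setCombinerClass(IntSumReducer.class);
```

If the reduce function is not commutative and associative (e.g. an average), the reducer cannot be reused as a combiner directly.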


Can we run something around the map() and reduce() calls?


Setup

/**
 * Called once at the start of the task.
 */
protected void setup(Context context) throws IOException, InterruptedException {
    // Prepare something for each Mapper or Reducer
    // Validate external sources
}

Cleanup

/**
 * Called once at the end of the task.
 */
protected void cleanup(Context context) throws IOException, InterruptedException {
    // Finish something after each Mapper or Reducer
    // Handle specific exceptions
}

Full control


Run Mapper

/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}

Run Reducer

/**
 * Advanced application writers can use the
 * {@link #run(*.Reducer.Context)} method to
 * control how the reduce task works.
 */
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
            // If a back up store is used, reset it
            Iterator<VALUEIN> iter = context.getValues().iterator();
            if (iter instanceof ReduceContext.ValueIterator) {
                ((ReduceContext.ValueIterator<VALUEIN>) iter).resetBackupStore();
            }
        }
    } finally {
        cleanup(context);
    }
}

REDUCER IS A PARTICIO OF PROBLEMO

Could we skip shuffle step?


Map-Only ‘MapReduce’ Jobs
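Skipping shuffle and reduce is a one-line job setting — with zero reducers, each mapper writes its output straight to HDFS:

```java
// Map-only job: no shuffle, no sort, no reducers.
job.setNumReduceTasks(0);
```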


May I customize Data Flow

before shuffling?


HashPartitioner just does it…

public class HashPartitioner<K2, V2> extends Partitioner<K2, V2> {

    public int getPartition(K2 key, V2 value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
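The `& Integer.MAX_VALUE` is there because `hashCode()` may be negative, and a negative `%` result would be an invalid partition index. The arithmetic can be checked in plain Java:

```java
public class PartitionMath {

    // Same arithmetic as HashPartitioner: clear the sign bit with
    // Integer.MAX_VALUE (0x7FFFFFFF), then take the modulo.
    static int partition(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```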


Partitioner’s Role in Shuffle and Sort


Full power


Custom Partitioner
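A custom partitioner encodes a domain rule. A hypothetical example (the names and key layout are invented for illustration): route records to reducers by the year prefix of the key, so one reducer handles one year; a real Hadoop class would extend Partitioner and delegate to logic like this:

```java
public class YearPartitioner {

    // Keys are assumed to start with a four-digit year, e.g.
    // "2017-09-30 GET /index". All records of one year land on the
    // same reducer; getPartition(...) in a Partitioner subclass
    // would delegate here.
    static int partition(String key, int numReduceTasks) {
        int year = Integer.parseInt(key.substring(0, 4));
        return year % numReduceTasks;
    }
}
```

Beware: a rule like this can reintroduce skew if one year dominates the data.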


JOINS


Reduce-side JOIN

First idea

Let’s skip Reduce + Shuffle


Second idea

Let’s copy small table on nodes


Map-side join
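The essence of a map-side join, sketched with plain Java collections: the small table is loaded into an in-memory map on every mapper (e.g. via the distributed cache), and each big-table record is joined during map() — the large side is never shuffled. Names and the CSV layout are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapSideJoinSketch {

    // smallTable: deptId -> deptName, small enough to sit in each
    // mapper's memory. bigRows: "employeeName,deptId" records that
    // would stream through map() one by one.
    static List<String> join(Map<String, String> smallTable, List<String> bigRows) {
        List<String> joined = new ArrayList<>();
        for (String row : bigRows) {
            String[] fields = row.split(",");
            String deptName = smallTable.get(fields[1]);
            if (deptName != null) {            // inner join: drop misses
                joined.add(fields[0] + "," + deptName);
            }
        }
        return joined;
    }
}
```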


Map-side join for large datasets


What about Really Large Tables?

SELECT Employees.Name, Employees.Age, Department.Name
FROM Employees
INNER JOIN Department ON Employees.Dept_Id = Department.Dept_Id

The main JOIN idea for large tables

Redis or Memcache cluster as Distributed Cache


PERFORMANCE

Let’s run it on the JVM!

Typical mistakes

• Collections are stored and sorted in memory

• Logging every input key-value pair

• JAR hell

• Skewed input: all records go to one reducer

• Forgetting that mappers and reducers run in different JVMs

Performance tips

• Correct data storage (on the JVM)

• Don’t forget about the combiner

• Use the appropriate Writable type

• Minimal required replication factor

• Tune your JVM

• Think in terms of Big-O

JVM tuning

• mapred.child.java.opts (heap for tasks)

• -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

• Low-latency GC collector: -XX:+UseConcMarkSweepGC, -XX:ParallelGCThreads

• Xmx == Xms (max and starting heap size)
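The same settings expressed through the Java Configuration API — the sizes here are examples, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;

// Heap and GC flags for child task JVMs (tune per workload).
Configuration conf = new Configuration();
conf.set("mapred.child.java.opts",
    "-Xms2g -Xmx2g -XX:+UseConcMarkSweepGC"
    + " -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");
```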


JVM Reusing: Uber Task


And we can DO IT!

[Architecture diagram: Real-time and Batch Data Ingestion from Internal, External and Social sources feed Real-Time ETL & CEP and a Batch ETL & Raw Area (all raw data backup is stored here; HDFS → CFS as an option), driven by a Scheduler. Outputs: Real-Time and Batch Data-Marts, Relations Graph, Ontology Metadata, Search Index, Time-Series Data (Titan & KairosDB store data in Cassandra), Real-time Dashboarding, and pushed Events & Alarms (Email, SNMP etc.).]

It reminds me …


MapReduce is not an ideal approach! But it works!


Hadoop 3: Roadmap

• Move to Java 8

• Support more than 2 NameNodes (multiple standby NameNodes)

• Derive heap size or mapreduce.*.memory.mb automatically

• Work with SSD, RAM, HDD, CPU as resources for YARN

• Support Docker containers

Would you like to know more?


Recommended Books


Contacts

E-mail : [email protected]

Twitter : @zaleslaw @BigDataRussia

vk.com/big_data_russia Big Data Russia

+ Telegram @bigdatarussia

vk.com/java_jvm Java & JVM langs

+ Telegram @javajvmlangs


Github

Spark Tutorial: Core, Streaming, Machine Learning

https://github.com/zaleslaw/Spark-Tutorial


Gitbook

Data Processing with Spark 2.2 and Kafka 0.10 (in Russian)

www.gitbook.com/book/zaleslaw/data-processing-book


Any questions?