Lecture IV
More on Distributed File Systems, Space Filling Curves and MapReduce
Apache Hadoop
● Initially an infrastructure centered around HDFS for running MapReduce jobs.
● However, it has grown into a general-purpose big data framework.
● For modern versions of Hadoop, the most important components for Hadoop/MapReduce are
– YARN
– HDFS
– MapReduce Services
Hadoop MapReduce
The aspects of a MapReduce invocation are split into
● YARN components
– One central ResourceManager
– One NodeManager per node
● HDFS components
– One central NameNode
– One DataNode per node
● MapReduce components
– One central JobTracker
– One TaskTracker per node
Hadoop Architecture
Example: Single Node Setup
● Running in Local Mode: just unpack the archive, preferably on Linux or Mac.
● Windows users might need to install Cygwin.
● Follow https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html
WordCount in Java
● WordCount consists of three components:
– Mapper taking the input and creating pairs <Word, 1>
– Reducer / Combiner summing up the second value of the pairs
– Main method setting up the infrastructure
Element 1: Map
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
} // Map
Element 2: Reduce
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
} // Reduce
Element 3: Main
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);        // Job
  conf.setJobName("wordcount");
  conf.setOutputKeyClass(Text.class);                 // Output
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(Map.class);                     // Computation
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);
  conf.setInputFormat(TextInputFormat.class);         // Input
  conf.setOutputFormat(TextOutputFormat.class);
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);                             // Run
}
Compilation & Invocation
● Classpath: Use bin/hadoop classpath
● Compile:
– In hadoop/wordcount (download from git)
– mkdir classes
– javac -cp $(../bin/hadoop classpath) WordCount.java -d classes
– jar -cvf wordcount.jar -C classes/ .
● Run:
– In hadoop:
– bin/hadoop jar wordcount/wordcount.jar de.uni_hannover.ikg.WordCount input output
● Note that Hadoop refuses to overwrite an existing output directory, so delete output before re-running your code.
Output
[...]
"Hell," 6
"Hell? 2
"Hell?" 1
"Hellas," 1
"Hellburner 2
[...]
Job Statistics
File System Counters
FILE: Number of bytes read=11819944640
FILE: Number of bytes written=5082579924
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Job Statistics
MapReduce Framework
Map input records=1525758
Map output records=12256351
Map output bytes=119763898
Map output materialized bytes=32703470
Input split bytes=26562
Combine input records=12256351
Combine output records=2265115
Reduce input groups=359638
Reduce shuffle bytes=32703470
Reduce input records=2265115
Reduce output records=359638
[...]
Running without Combine
Communication
● With Combine: Reduce shuffle bytes = 32,703,470
● Without Combine: Reduce shuffle bytes = 144,278,025
● This is 4.4 times more communication without the Combiner.

Job Complexity
● Combine input records = 12,256,351
● Combine output records = 2,265,115
● Without Combine: Reduce is invoked 12,256,351 times.
● With Combine: Reduce is invoked only 2,265,115 times.
● This is 5.4 times more invocations without the Combiner.
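The shuffle savings above can be illustrated with a toy simulation in plain Java (not Hadoop code; the word list is invented): a Combiner pre-aggregates map output per node, so far fewer records have to cross the network during the shuffle.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the combine step: collapse the repeated words of one
// map task's output into (word, count) pairs before they are shuffled.
public class CombinerDemo {
    static Map<String, Integer> combine(String[] mapOutput) {
        Map<String, Integer> combined = new HashMap<>();
        for (String word : mapOutput) {
            combined.merge(word, 1, Integer::sum); // sum up occurrences locally
        }
        return combined;
    }

    public static void main(String[] args) {
        String[] mapOutput = {"hell", "hello", "hell", "hell", "world", "hello"};
        Map<String, Integer> combined = combine(mapOutput);
        // 6 map output records shrink to 3 shuffle records.
        System.out.println(mapOutput.length + " -> " + combined.size());
    }
}
```

The real job shows the same effect at scale: 12,256,351 combine input records shrink to 2,265,115 shuffle records.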
Assignment
● Modify WordCount such that it
– removes all non-alphabetic characters during Map
– only stores words that occur more often than a threshold given on the command line
● Note that this case needs either a more flexible Reducer, or different Reducer and Combiner classes, as we can't reject words that do not occur often enough within a single Map result!
● Run WordCount
– in Standalone Mode
– in Pseudo-Distributed Mode
File Systems and Distributed File Systems
What is a file?
● A computer file is a block of data, typically on persistent storage.
● It is usually accessed via the operating system API with operations such as
– Create a new file (POSIX: creat)
– Open a file (POSIX: open)
– Read data from a file (POSIX: read)
– Write data to a file (POSIX: write)
– Close a file (POSIX: close)
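The five POSIX operations have direct counterparts in the Java standard library; here is a minimal, self-contained round trip (the file name prefix and content are arbitrary):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Create, write, read back and delete a temporary file, mirroring the
// POSIX creat/open/write/read/close life cycle via java.nio.file.Files.
public class FileOpsDemo {
    static String roundTrip(String text) {
        try {
            Path p = Files.createTempFile("demo", ".txt");               // creat
            Files.write(p, text.getBytes(StandardCharsets.UTF_8));       // open + write + close
            byte[] back = Files.readAllBytes(p);                         // open + read + close
            Files.delete(p);                                             // unlink
            return new String(back, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello")); // prints "hello"
    }
}
```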
File Systems
● File Systems organize files into directories and take care of ownership and security
● Typical concepts and operations– Directory Tree, Path, and Working Directory
– Links (Hard, Symbolic)
– Move, Delete, Rename files
– File Attributes
– Special Files (Device Nodes, Memory Mapping)
Distributed? No problem.
There are many DFS. Some remarkable examples:
● Microsoft Distributed File System (DFS): extends the MS infrastructure with consistent views of distributed directories. Low consistency.
● Andrew File System (AFS, Carnegie Mellon University): widely used by researchers and universities.
● GlusterFS (acquired by Red Hat): collects free space across servers into a new virtual file system.
● HDFS (Hadoop Distributed File System): used in the Hadoop big data platform for distributed storage.
● XtreemFS (German research highlight): POSIX-compatible, fault-tolerant, scalable, reliable.
Andrew File System
● Scales to tens of thousands of servers.
● Supports replication, though not real-time.
● Supports consistent and persistent caching.
● Assumes that the working set of each user fits into the cache. A pre-big-data assumption!
● Assumes non-database access (the majority of writes is non-conflicting).
● Whole-file assumption: On opening a file, the complete file is transferred to the client. To be fair: striping is under development.
Andrew FS Overview
● A program opens a file. If the file is in the cache and the cache is valid, it is served from the local drive.
● Otherwise, Venus (the client cache manager) asks Vice (the file server) for a copy of the file.
● Vice remembers to notify Venus in case the file changes on the server.
● After this communication, the program again works with the local cache only, committing changes back to Vice.
● For the program, the file looks like a file on the local drive; in fact, the program works with a UNIX file descriptor.
● Both Venus and Vice run on top of the Unix/Linux kernel.
Gluster FS
● Supported for Hadoop via a plugin (just a JAR)
● Eliminates the central NameNode
● Fault-tolerant on the file system level
● Works like a full FS (FUSE-mountable, writes actual files)
● Supports striping
● No changes to Hadoop/MapReduce code
● Allows running Hadoop over multiple namespaces
● Real data access through anything:
– FUSE (including Google Drive, Amazon Drive, AWS EBS, …)
– NFS
– SMB
– SWIFT
– and even HDFS ;-)
GlusterFS Deployment Example
● Management server: Ambari, SSH, GlusterFS console
● YARN master: YARN ResourceManager and Job History Server
● Twelve worker nodes, each running a YARN NodeManager and glusterd
HDFS (Recall...)
● File metadata and access are managed by a dedicated NameNode.
● File contents are split into pieces smaller than a predefined constant, e.g., 128 MB.
● Chunks are stored on a distributed system of DataNodes, with replication protecting from data loss due to expected node failures.
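The splitting rule amounts to simple arithmetic; a sketch in plain Java (block size and file size chosen for illustration, matching the 128 MB example above):

```java
// HDFS-style chunking: a file of n bytes becomes ceil(n / blockSize) chunks,
// and each chunk is then replicated onto several distinct DataNodes.
public class ChunkingDemo {
    static int numChunks(long fileBytes, long blockSize) {
        // Integer ceiling division: (n + b - 1) / b
        return (int) ((fileBytes + blockSize - 1) / blockSize);
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;       // 128 MB block size
        long fileBytes = 1024L * 1024 * 1024;      // a 1 GB file
        // 1 GB / 128 MB = 8 chunks; with replication factor 3,
        // 24 block copies are spread across the DataNodes.
        System.out.println(numChunks(fileBytes, blockSize)); // prints 8
    }
}
```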
XtreemFS
● The only DFS that handles all failure modes including network splits! A good candidate for the future...
● But still under development, try it out...
Beyond File Systems
● File systems often implicitly make some assumptions, which are valid in most cases:
– Files are read more often than they are written.
– Files are written without conflicts most of the time (e.g., file locking has to be done per application).
– Files have no structure beyond their size.
– Files are seldom used partially (though striping is a remarkable exception; remember the city night lights in R).
![Page 28: Lecture IV · Compilation & Invocation Classpath: Use bin/hadoop classpath Compile: – In hadoop/wordcount (download from git) – mkdir classes – javac -cp $(../bin/hadoop classpath)](https://reader034.vdocuments.site/reader034/viewer/2022050419/5f8ec21c674f6a35ee3a23f2/html5/thumbnails/28.jpg)
Big Data is not about Files
● However, Big Data has three properties:
– Volume
● Can easily be handled by files, though the whole-file assumption can become problematic.
● Counter-measure: a semantically sensible splitting mechanism.
– Variety
● Sometimes, only non-evenly spaced information in a file is relevant. However, this cannot be retrieved via the file read API.
– Velocity
● High-speed updates to data break the typical one-writer-per-file assumption, or lead to unmanageable amounts of small files.
Databases
In the era before Big Data, relational databases were invented to mitigate problems arising from working with files.
RDBMS are structured in tables and typically provide
● Atomicity
Either a whole transaction happens, or nothing.
● Consistency
Data can be constrained and linked between tables; for example, deleting a user can delete all associated data. Furthermore, anyone reading from the database at any time will see the same data.
● Isolation
Concurrent transactions are guaranteed not to influence each other.
● Durability
Transactions that have happened are persistent.
In short: RDBMS provide ACID semantics.
However...
● There is no scalable way of providing ACID semantics over distributed systems.
● The CAP theorem states that we can only have two out of
– Consistency
– Availability
– Partition Tolerance
and classical RDBMS choose Consistency over Availability. (In fact, you have to wait ;-)
NoSQL
● How can we go further?
● We want
– Availability
– Partition Tolerance
– Multi-Writer
and are willing to sacrifice consistency (at least a bit).
NoSQL by Example: Apache Cassandra

Classical Relational DBMS                  | Apache Cassandra
Handles moderate incoming data velocity    | Handles high incoming data velocity
Data arriving from one/few locations       | Data arriving from many locations
Manages primarily structured data          | Manages all types of data
Supports complex/nested transactions       | Supports simple transactions
Single points of failure with failover     | No single points of failure; constant uptime
Supports moderate data volumes             | Supports very high data volumes
NoSQL by Example: Apache Cassandra

Classical Relational DBMS                            | Apache Cassandra
Centralized deployments                              | Decentralized deployments
Data written in mostly one location                  | Data written in many locations
Supports read scalability (with limited consistency) | Supports read and write scalability
Deployed in vertical scale-up fashion                | Deployed in horizontal scale-out fashion
Cassandra Basic Structure
● The nodes of a Cassandra key-value store form a simple headless ring.
● All nodes have equal functionality.
● Ring communication is done via a gossip protocol.
Cassandra Writing
● Full data durability with high performance, via the commit log, the MemTable, and SSTables.
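The interplay of commit log, MemTable and SSTables can be sketched in a few lines of plain Java (class names and the flush threshold are illustrative, not Cassandra internals): every write is appended to a durable log, applied to a sorted in-memory buffer, and the buffer is flushed to an immutable sorted run once it grows past a threshold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy sketch of the Cassandra write path.
public class WritePathDemo {
    final List<String> commitLog = new ArrayList<>();                 // sequential, durable append
    final TreeMap<String, String> memTable = new TreeMap<>();         // sorted in-memory buffer
    final List<TreeMap<String, String>> ssTables = new ArrayList<>(); // immutable sorted runs
    final int flushThreshold;

    WritePathDemo(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void write(String key, String value) {
        commitLog.add(key + "=" + value);          // 1. append to the commit log
        memTable.put(key, value);                  // 2. update the MemTable
        if (memTable.size() >= flushThreshold) {
            ssTables.add(new TreeMap<>(memTable)); // 3. flush a sorted SSTable
            memTable.clear();
        }
    }

    public static void main(String[] args) {
        WritePathDemo db = new WritePathDemo(2);
        db.write("a", "1");
        db.write("b", "2"); // triggers a flush
        db.write("c", "3");
        System.out.println(db.ssTables.size() + " sstable(s), "
                + db.memTable.size() + " row(s) in memtable");
    }
}
```

Because both the commit log and the SSTable flushes are sequential writes, this layout gives durability without random disk I/O on the write path.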
Cassandra Reading
● Reads use a structure called a Bloom filter to avoid useless communication and disk I/O.
Cassandra Replication
Replication is a central element in Cassandra, allowing nodes to join or leave the cluster at any time. Replication is influenced by four components:
● Virtual Nodes
Assign data ownership to physical machines.
● Partitioner
Partitions the data.
● Replication Strategy
Defines to which points in the ring replicas go.
● Snitch
Provides additional topology information, for example, from a cloud provider (such as Amazon Availability Zones).
Partitioner
● All data in Cassandra is a key/value pair.
● In Cassandra tables, the key is derived from the table PRIMARY KEY, which can even be a compound key of multiple columns.
● A Token represents a range of keys, and every node is responsible for certain ranges of keys.
● The Partitioner takes the key and calculates the Token the key falls into. This can be done in various ways:
– Order-preserving (OrderPreservingPartitioner)
– Random (RandomPartitioner)
– Fast random (Murmur3Partitioner)
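A minimal sketch of RandomPartitioner-style tokening (the ring layout and key are invented, and real Cassandra token ranges differ in detail): the key is hashed to a 128-bit token, and the owning node is the first one on the ring whose token is not smaller, wrapping around at the end.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PartitionerDemo {
    // RandomPartitioner-style token: an MD5 hash of the key, read as a
    // non-negative 128-bit integer.
    static BigInteger token(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always available
        }
    }

    // The owner is the first node (in token order) whose token is >= the
    // key token; if there is none, the key wraps around to the first node.
    static int ownerIndex(BigInteger keyToken, BigInteger[] nodeTokens) {
        for (int i = 0; i < nodeTokens.length; i++) {
            if (keyToken.compareTo(nodeTokens[i]) <= 0) return i;
        }
        return 0; // wrap around the ring
    }

    public static void main(String[] args) {
        // Four nodes with evenly spaced tokens over the 2^128 token space.
        BigInteger[] ring = new BigInteger[4];
        for (int i = 0; i < 4; i++) {
            ring[i] = BigInteger.ONE.shiftLeft(126).multiply(BigInteger.valueOf(i + 1));
        }
        System.out.println("key 'row-key-1' -> node "
                + ownerIndex(token("row-key-1"), ring));
    }
}
```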
Replication
● Data replication in Cassandra is very simple, not to say minimalistic:
– Simple Strategy: Find the node holding the token where the Partitioner sent the current row, then replicate the row to the following nodes around the ring that do not yet have it, until ReplicationFactor nodes hold the data.
– Network Topology Strategy: Find the node holding the token, then continue to place replicas on hosts around the ring in different racks.
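Simple Strategy placement reduces to walking the ring clockwise; a sketch with invented node indices (real Cassandra also skips nodes that already hold a replica via virtual nodes):

```java
import java.util.Arrays;

// SimpleStrategy-style placement: starting at the node that owns the token,
// walk clockwise and place copies on the next distinct nodes until
// replicationFactor nodes hold the row.
public class ReplicationDemo {
    static int[] replicaNodes(int ownerIndex, int ringSize, int replicationFactor) {
        int[] replicas = new int[replicationFactor];
        for (int i = 0; i < replicationFactor; i++) {
            replicas[i] = (ownerIndex + i) % ringSize; // next node clockwise, wrapping
        }
        return replicas;
    }

    public static void main(String[] args) {
        // Owner is node 4 on a 6-node ring with replication factor 3:
        // copies land on nodes 4, 5 and 0 (wrapping around the ring).
        System.out.println(Arrays.toString(replicaNodes(4, 6, 3)));
    }
}
```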
A Cassandra Ring
Ring without Vnodes, Replication Factor 3.
● Each Token stands for a range of keys.
● Each node is responsible for a Token.
● Each row is replicated around the ring to the next nodes.
Introducing Virtual Nodes (Vnodes)
● More Tokens, as each node is split into several virtual nodes
● Flexible movement of Tokens in case of failures
● Efficient and simple load balancing
Bloom Filter
Storing Sets with Limited Memory
Bloom Filter
● Cassandra data is organized using Tokens calculated from the row key, which can be random:
– An MD5 hash of the key for RandomPartitioner
– A Murmur3 hash for Murmur3Partitioner (the default)
– The key itself for OrderPreservingPartitioner
● For Big Data, not all data can stay in the Memtable cache. Therefore, a query might need to visit SSTables.
● Apache Cassandra employs Bloom filters to find out which SSTables could contain a given key.
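A minimal Bloom filter sketch (the hash scheme is illustrative, not Cassandra's): k hash functions set k bits per key, and if any of the k bits is unset at query time, the key is provably absent, so the SSTable can be skipped without touching the disk.

```java
import java.util.BitSet;

// A Bloom filter stores a set approximately in a fixed-size bit array.
// It may return false positives, but never false negatives.
public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    BloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit index from two base hashes (double hashing);
    // forcing h2 odd avoids degenerate step sizes.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) bits.set(index(key, i));
    }

    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) return false; // provably absent
        }
        return true; // probably present
    }

    public static void main(String[] args) {
        BloomFilter f = new BloomFilter(1024, 3);
        f.add("row-key-1");
        System.out.println(f.mightContain("row-key-1")); // true, guaranteed
        System.out.println(f.mightContain("row-key-2")); // almost surely false
    }
}
```

The trade-off is tunable: more bits and more hash functions lower the false-positive rate at the cost of memory and CPU.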
Switch to BloomFilter.pdf
Hadoop / Cassandra / MapReduce
● As Cassandra is just a highly efficient, user-friendly, headless storage engine, it can be used as data source and data sink for Apache Hadoop MapReduce jobs.
● While the TaskTracker used to be able to exploit data locality in Cassandra, this feature is gone.
● However, you can still rely on the read-and-write-everywhere structure of Cassandra, or modify the TaskTracker to support your pattern of data locality.
Apache Hive (Overview)
● There is another large Apache project called Hive, which is optimized for data warehousing.
– In data warehouses, a lot of different data is collected into a single DBMS for running analytics on it.
– Typical workload: Extract, Transform, Load (ETL)
● Apache Hive is a data warehouse on top of Hadoop HDFS with the following key features:
– Large support for SQL
– ODBC/JDBC connections (e.g., Excel, Access)
– Queries are compiled to MapReduce jobs automatically, or are run against Spark (not yet fully optimized, but already faster than MR)
– Structure projection for unstructured data (no need to load the data in again):
● CSV / text
● Apache Parquet
● Apache ORC
Warning
● Though Hive supports many modern SQL features, it is not meant for heavily relational queries, for example, extensive joining or foreign keys.
● In fact, Hive should always be used in a denormalized way. Materializing query results is not unwanted redundancy; it is highly efficient processing.
● Tip: Use compression wherever supported, as I/O is the big data bottleneck, not CPU.
Apache Big Data: Low-Level Projects
● Apache HDFS: Distributed file system
● Apache Ignite: In-memory real-time processing
● Apache MapReduce: MapReduce framework
● Apache Pig: Data flow programming (similar to functional programming)
● Apache Spark: Solves similar problems as MapReduce, but faster, using a functional philosophy
● Apache Storm: Stream computation framework
● Apache Flink: Large computation graphs (DAG)
● Apache HBase: The Hadoop database
● Apache Cassandra: The central key-value store
● Apache Hive: SQL over Hadoop / MapReduce
● Apache Phoenix: SQL over HBase
● Apache Kafka: Publish-subscribe stream processing
● Apache Oozie: DAG scheduler for multiple MapReduce jobs (e.g., trigger on availability)
Apache Big Data: Application Level
● Apache Mahout: Scalable Machine Learning over MapReduce
● WEKA3: Allows for Data Mining using some Weka implementations
● Cloudera Oryx: Machine Learning with a Business Perspective
● Apache Spark MLLib: Machine Learning over Spark– Widest support: Java, Scala, Python, R
Spatial Data Distribution
Spatial Data Distribution
● Spatial data distribution is the central question of geospatial Big Data:
– How do I distribute my data between nodes in order to increase data locality?
● Two general strategies can be differentiated:
– Ring-based architectures: similar to Cassandra
– Block-based architectures: similar to HDFS
Ring vs. Block
● Ring-based Architectures:
– Data is distributed between nodes according to an ordering of the data
– Example: The ordering in Cassandra is given by the key.
● Block-based Architectures:
– Data is first split into meaningful blocks, which are then distributed between computing nodes.
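The ring-based placement can be sketched as follows (a simplified sketch of token-ring placement in the style of Cassandra; names are illustrative, and replication is omitted):

```python
import bisect

def build_ring(node_tokens):
    # Each node owns the range of keys up to (and including) its token.
    # Tokens and keys must come from the same ordered domain.
    return sorted(node_tokens)

def node_for_key(ring, key_token):
    # Walk clockwise: the first node token >= key owns the key;
    # wrap around to the first node when the key exceeds all tokens.
    i = bisect.bisect_left(ring, key_token)
    return ring[i % len(ring)]
```

With nodes at tokens 100, 200, 300, a key with token 150 lands on node 200, and a key with token 350 wraps around to node 100. With an order-preserving key (such as a Z-curve value), nearby keys land on the same or neighboring nodes.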
Ordering Spatial Data
● Central Idea: Space Filling Curves
● A space-filling curve is a curve, often defined as the limit of a sequence of curves, which provides a continuous map from the unit interval [0,1] to the unit square [0,1]x[0,1]
● In practice, we will most often use the finite curves from this sequence rather than the limit; they visit all cells of a cell decomposition of the unit square.
Peano Curve
Giuseppe Peano (1858-1932)
Hilbert Curve
David Hilbert (1862-1943)
Morton Order
Guy Macdonald Morton (1966)
Properties
● Peano
– Complex, medium locality
● Hilbert
– Very good locality
– Complex to project to and from
● Z-Curve
– Good locality
– Easy to project into and from
– Constant-time neighbors
– Spatially known: it is the same as a depth-first traversal of a quadtree
Geohash
● The Z-curve has been used to derive Geohash, which maps cells of the world to strings using the Z-order and a Base32 encoding of the bit sequences.
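A minimal Geohash encoder following this scheme might look as follows (a sketch, assuming the standard Geohash Base32 alphabet and the convention that the first bit refines the longitude range):

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (a, i, l, o omitted)

def geohash_encode(lat, lon, length=5):
    """Alternately halve the longitude and latitude intervals, emitting one
    bit per halving, then Base32-encode the bit stream, 5 bits per character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, nbits = [], 0, 0
    lon_turn = True  # by convention the first bit refines the longitude
    while len(chars) < length:
        if lon_turn:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = bits * 2 + 1, mid
            else:
                bits, lon_hi = bits * 2, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = bits * 2 + 1, mid
            else:
                bits, lat_hi = bits * 2, mid
        lon_turn = not lon_turn
        nbits += 1
        if nbits == 5:  # 5 bits fill one Base32 character
            chars.append(_BASE32[bits])
            bits, nbits = 0, 0
    return "".join(chars)
```

For example, geohash_encode(42.6, -5.6) yields "ezs42". Note the prefix property: truncating a Geohash yields the hash of the enclosing, coarser cell.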
Geohash Cells
High-Level Z Curve
● Encoding: Given a point P(X,Y) in some spatial reference system
– Discretizer: First, discretize X and Y into positive integer numbers of fixed length, e.g. 16 bit.
– Zip: Mix the bits of both alternatingly into a new integer.
– Encode (Optional): Encode the bit string to get a concise, human-readable representation.
● Decoding: Given a string or integer
– Decode (Optional): Decode Base32 to get an integer
– Dezip: Create two integers by dezipping the bits
– Reverse:
● Either return the center point of the cell represented by the given Z curve key
● Or return the bounding box.
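The Discretizer, Zip and Dezip steps above can be sketched in Python (function names are illustrative; a 16-bit discretization is assumed):

```python
def discretize(v, lo, hi, bits=16):
    # Map v in [lo, hi] to a positive integer of fixed length.
    cells = (1 << bits) - 1
    return min(cells, int((v - lo) / (hi - lo) * (1 << bits)))

def zip_bits(x, y, bits=16):
    # Mix the bits of both alternatingly: x in even, y in odd positions.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def dezip_bits(z, bits=16):
    # Inverse of zip_bits: recover the two interleaved integers.
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i)) & 1) << i
        y |= ((z >> (2 * i + 1)) & 1) << i
    return x, y
```

A full encoder would chain discretize for X and Y with zip_bits; decoding dezips the key and maps the two integers back to the cell's center point or bounding box.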
High-Level Z Curve
● Neighbors:
– Given a Base32 string, neighbors can be calculated by table lookup on a per-character basis. See the various implementations of Geohash for details.
– Given an integer, use the formulas
● top = ((z & 0b10101010) - 1 & 0b10101010) | (z & 0b01010101)
● bottom = ((z | 0b01010101) + 1 & 0b10101010) | (z & 0b01010101)
● left = ((z & 0b01010101) - 1 & 0b01010101) | (z & 0b10101010)
● right = ((z | 0b10101010) + 1 & 0b01010101) | (z & 0b10101010)
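These formulas can be written out for 16-bit keys as follows (a sketch; the masks must be widened to the key length, and cells on the grid boundary wrap around, which a real implementation would guard against):

```python
XMASK = 0x5555  # x bits live in the even positions
YMASK = 0xAAAA  # y bits live in the odd positions

# Filling the other dimension's bit positions with 1s (for +1) or 0s
# (for -1) lets the carry/borrow propagate across them correctly.
def top(z):    return (((z & YMASK) - 1) & YMASK) | (z & XMASK)
def bottom(z): return (((z | XMASK) + 1) & YMASK) | (z & XMASK)
def left(z):   return (((z & XMASK) - 1) & XMASK) | (z & YMASK)
def right(z):  return (((z | YMASK) + 1) & XMASK) | (z & YMASK)

def interleave(x, y):
    # Helper building a Z key from cell coordinates, to check the formulas.
    z = 0
    for i in range(8):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z
```

With this convention, top decrements and bottom increments the y part: right(interleave(2, 2)) equals interleave(3, 2), and top(interleave(2, 2)) equals interleave(2, 1).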
Assignment
● Implement or Download
– A Z-curve encoder / decoder, possibly Geohash.
● Think about what would happen if we:
– Use the Geohash as the key for ordering points
– Use the order-preserving partitioning strategy of Apache Cassandra
– Calculate the number of points within 1 km distance for each point using MapReduce
● Measure
– Whether the Z-curve gains speed for this query on a real cluster
Block Distribution Strategies
Spatial Indexing Structures
● Spatial Indexing Structures enable fast spatial access to points, typically supporting
– Range Queries: Retrieve all geometries that are within a specific range, often a rectangle or sphere
– kNN Queries: Retrieve the k nearest neighbors to a point.
Grid Index
● Overlay the data with a grid and collect all points falling into specific cells of the grid.
– Specific queries only need to access neighboring grid cells
– Tradeoff between
● Number of empty cells
● Number of points in each grid cell
Good baseline index for (nearly) uniform data
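A grid index along these lines can be sketched as follows (class and parameter names are illustrative):

```python
from collections import defaultdict

class GridIndex:
    """Bucket points by integer grid cell; a range query only needs to
    touch the cells overlapping the query rectangle."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, x, y):
        self.cells[self._cell(x, y)].append((x, y))

    def range_query(self, xmin, ymin, xmax, ymax):
        cx0, cy0 = self._cell(xmin, ymin)
        cx1, cy1 = self._cell(xmax, ymax)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                # Candidate cells may stick out of the query rectangle,
                # so each point is re-checked exactly.
                for (x, y) in self.cells.get((cx, cy), ()):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append((x, y))
        return hits
```

The cell_size parameter is exactly the tradeoff from the slide: small cells mean many empty cells, large cells mean many points per cell to re-check.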
Example: BSP
● Binary Space Partitioning Tree (BSP)
– For a given tree node, split all geometries at a central hyperplane (a line in 2D) and create two new nodes containing the geometries from the two sides of the hyperplane.
● "Central" can be defined in various ways.
– Recurse until a given tree node contains only few points.
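The recursion can be sketched as follows (a kd-tree-style sketch that takes the median along alternating axes as the "central" hyperplane, one of the possible definitions):

```python
def build_bsp(points, depth=0, leaf_size=4):
    """Recursively split points at the median of alternating axes until a
    node contains only a few points."""
    if len(points) <= leaf_size:
        return {"leaf": True, "points": points}
    axis = depth % 2                   # alternate between x and y splits
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "leaf": False,
        "axis": axis,
        "split": pts[mid][axis],       # the "central" hyperplane
        "left": build_bsp(pts[:mid], depth + 1, leaf_size),
        "right": build_bsp(pts[mid:], depth + 1, leaf_size),
    }
```

Axis-aligned median splits are just one choice; arbitrary hyperplanes or other notions of "central" (e.g., the mean) fit the same recursion.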
Example: Ball Trees
● Ball Tree
– Each node in the tree represents a ball (a circle in 2D) and all points falling into this ball.
– If there are too many points inside some ball, split the ball into two or more balls covering all points and build subtree nodes according to this splitting.
R trees and R* trees
● R tree
– Each node in the tree represents the Minimum Bounding Rectangle (MBR) of all points inside (or below) this node.
– If a node contains too many points, split it into two MBRs with similarly many points.
● R* tree
– Advanced insertion algorithm making the tree one of the most efficient and widely used spatial indices for nearest neighbor queries.
Lifting those Indices to Hadoop?
● A matter of much work and open research
– Hotspots can easily arise for simple indices such as grids
– Non-uniform access patterns and varying chunk sizes can result
– Efficient spatial replication and data locality have not yet been fully thought through...
Open research topic, possibly worth some Master's theses, if you are interested.
SpatialHadoop is one approach...
Spatial Block Distribution
SpatialHadoop
● Uses spatial indices for distributing blocks in HDFS; hence, jobs can be placed on nodes that already have the data locally.
● MapReduce Components
– SpatialFileSplitter
– SpatialRecordReader
● Query Components
– Range Queries
– kNN Queries
– Spatial Joins
Distributed Index Model
Data Mining from Location
Classification and Clustering Based on Point Distance
Data Mining
● Data Mining is the process of extracting structures and information from large bodies of data.
● Data Mining can be split into
– Supervised Approaches
– Unsupervised Approaches
Supervised Data Mining
● In Supervised Data Mining, a given dataset contains the intended result.
● This dataset can be used as a training dataset in order to extract a structural representation of the data.
● This structural representation is called a model and can be applied to unknown data, assigning what the model would predict.
Example: Classification
● Given a table with k attributes and a nominal class variable, learn to infer the class from the attributes.
● Classical Example: Iris Dataset
– Question: Can we infer from four measurements of the sizes of a flower its species?
The IRIS Dataset
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188.
1NN Classification
A very simple classification scheme
● Nearest Neighbor Classification is a very simple though spatially quite useful classification algorithm.
● It takes all training data and assigns to a location the class of the nearest training instance.
● Let us look at how it works:
kNN in R
library(class)  # provides knn()
data(iris)
# Select 10 random training rows
train = iris[sample(1:nrow(iris), 10), ]
# Apply 1nn on training set to full iris set
res = knn(train[, 1:4], iris[, 1:4], train$Species, 1)
summary(res == iris$Species)
   Mode   FALSE    TRUE    NA's
logical      16     134       0
plot(iris$Petal.Length, iris$Petal.Width, col=iris$Species)
points(Petal.Width ~ Petal.Length,
       data=iris[which(iris$Species != res), ],
       pch=4, cex=2, col="red", lwd=2)