Lecture IV
More on Distributed File Systems, Space Filling Curves and MapReduce
Apache Hadoop
● Initially an infrastructure centered around HDFS for running MapReduce jobs.
● However, it has grown into a general-purpose big data framework.
● For modern versions of Hadoop, the most important components for Hadoop/MapReduce are
– YARN
– HDFS
– MapReduce Services
Hadoop MapReduce
The aspects of a MapReduce invocation are split into
● YARN components
– One central ResourceManager
– One NodeManager per node
● HDFS components
– One central NameNode
– One DataNode per node
● MapReduce components
– One central JobTracker
– One TaskTracker per node
Hadoop Architecture
Example: Single Node Setup
● Running in Local Mode: just unpack the archive, preferably on Linux or Mac.
● Windows users might need to install Cygwin.
● Follow https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html
WordCount in Java
● WordCount consists of three components:
– Mapper taking the input and creating pairs <Word, 1>
– Reducer / Combiner summing up the second value of the pairs
– Main method setting up the infrastructure
Element 1: Map
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
} // Map
Element 2: Reduce
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
} // Reduce
Element 3: Main
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);        // Job
  conf.setJobName("wordcount");
  conf.setOutputKeyClass(Text.class);                 // Output
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(Map.class);                     // Computation
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);
  conf.setInputFormat(TextInputFormat.class);         // Input
  conf.setOutputFormat(TextOutputFormat.class);
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);                             // Run
}
Compilation & Invocation
● Classpath: Use bin/hadoop classpath
● Compile:
– In hadoop/wordcount (download from git)
– mkdir classes
– javac -cp $(../bin/hadoop classpath) WordCount.java -d classes
– jar -cvf wordcount.jar -C classes/ .
● Run:
– In hadoop:
– bin/hadoop jar wordcount/wordcount.jar de.uni_hannover.ikg.WordCount input output
● Note that Hadoop refuses to overwrite an existing output directory, so delete output before re-running your code.
Output
[...]
"Hell," 6
"Hell? 2
"Hell?" 1
"Hellas," 1
"Hellburner 2
[...]
Job Statistics
File System Counters
FILE: Number of bytes read=11819944640
FILE: Number of bytes written=5082579924
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Job Statistics
MapReduce Framework
Map input records=1525758
Map output records=12256351
Map output bytes=119763898
Map output materialized bytes=32703470
Input split bytes=26562
Combine input records=12256351
Combine output records=2265115
Reduce input groups=359638
Reduce shuffle bytes=32703470
Reduce input records=2265115
Reduce output records=359638
[...]
Running without Combine
Communication
● With Combine: Reduce shuffle bytes = 32,703,470
● Without Combine: Reduce shuffle bytes = 144,278,025
● This is 4.4 times more communication without the Combiner.

Job Complexity
● Combine input records = 12,256,351
● Combine output records = 2,265,115
● Without Combine: Reduce is invoked 12,256,351 times.
● With Combine: Reduce is invoked only 2,265,115 times.
● This is 5.4 times more invocations without the Combiner.
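The shuffle savings above can be illustrated with a toy simulation in plain Java (not Hadoop code; the word list is invented): a Combiner pre-aggregates map output per node, so far fewer records have to cross the network during the shuffle.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the combine step: collapse the repeated words of one
// map task's output into (word, count) pairs before they are shuffled.
public class CombinerDemo {
    static Map<String, Integer> combine(String[] mapOutput) {
        Map<String, Integer> combined = new HashMap<>();
        for (String word : mapOutput) {
            combined.merge(word, 1, Integer::sum); // sum up occurrences locally
        }
        return combined;
    }

    public static void main(String[] args) {
        String[] mapOutput = {"hell", "hello", "hell", "hell", "world", "hello"};
        Map<String, Integer> combined = combine(mapOutput);
        // 6 map output records shrink to 3 shuffle records.
        System.out.println(mapOutput.length + " -> " + combined.size());
    }
}
```

The real job shows the same effect at scale: 12,256,351 combine input records shrink to 2,265,115 shuffle records.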
Assignment
● Modify WordCount such that it
– removes all non-alphabetic characters during Map
– only stores words that occur more often than a threshold given on the command line
● Note that this case needs either a more flexible Reducer, or different Reducer and Combiner classes, as we can't reject words that do not occur often enough within a single Map result!
● Run WordCount
– in Standalone Mode
– in Pseudo-Distributed Mode
File Systems and Distributed File Systems
What is a file?
● A computer file is a block of data, typically on persistent storage.
● It is usually accessed via the operating system API with operations such as
– Create a new file (POSIX: creat)
– Open a file (POSIX: open)
– Read data from a file (POSIX: read)
– Write data to a file (POSIX: write)
– Close a file (POSIX: close)
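The five POSIX operations have direct counterparts in the Java standard library; here is a minimal, self-contained round trip (the file name prefix and content are arbitrary):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Create, write, read back and delete a temporary file, mirroring the
// POSIX creat/open/write/read/close life cycle via java.nio.file.Files.
public class FileOpsDemo {
    static String roundTrip(String text) {
        try {
            Path p = Files.createTempFile("demo", ".txt");               // creat
            Files.write(p, text.getBytes(StandardCharsets.UTF_8));       // open + write + close
            byte[] back = Files.readAllBytes(p);                         // open + read + close
            Files.delete(p);                                             // unlink
            return new String(back, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello")); // prints "hello"
    }
}
```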
File Systems
● File Systems organize files into directories and take care of ownership and security
● Typical concepts and operations– Directory Tree, Path, and Working Directory
– Links (Hard, Symbolic)
– Move, Delete, Rename files
– File Attributes
– Special Files (Device Nodes, Memory Mapping)
Distributed? No problem.
There are many DFS. Some remarkable examples:
● Microsoft Distributed File System (DFS): extends the MS infrastructure with consistent views of distributed directories. Low consistency.
● Andrew File System (AFS, Carnegie Mellon University): widely used by researchers and universities.
● GlusterFS (acquired by Red Hat): collects free space across servers into a new virtual file system.
● HDFS (Hadoop Distributed File System): used in the Hadoop big data platform for distributed storage.
● XtreemFS (German research highlight): POSIX-compatible, fault-tolerant, scalable, reliable.
Andrew File System
● Scales to tens of thousands of servers.
● Supports replication, though not real-time.
● Supports consistent and persistent caching.
● Assumes that the working set of each user fits into the cache. A pre-big-data assumption!
● Assumes non-database access (the majority of writes is non-conflicting).
● Whole-file assumption: On opening a file, the complete file is transferred to the client. To be fair: striping is under development.
Andrew FS Overview
● A program opens a file. If the file is in the cache and the cache is valid, it is served from the local drive.
● Otherwise, Venus (the client cache manager) asks Vice (the file server) for a copy of the file.
● Vice remembers to notify Venus in case the file changes on the server.
● After this communication, the program again works with the local cache only, committing changes back to Vice.
● For the program, the file looks like a file on the local drive; in fact, the program works with a UNIX file descriptor.
● Both Venus and Vice run on top of the Unix/Linux kernel.
Gluster FS
● Supported for Hadoop via a plugin (just a JAR)
● Eliminates the central NameNode
● Fault-tolerant on the file system level
● Works like a full FS (FUSE-mountable, writes actual files)
● Supports striping
● No changes to Hadoop/MapReduce code
● Allows running Hadoop over multiple namespaces
● Real data access through anything:
– FUSE (including Google Drive, Amazon Drive, AWS EBS, …)
– NFS
– SMB
– SWIFT
– and even HDFS ;-)
GlusterFS Deployment Example
● Management server: Ambari, SSH, GlusterFS console
● YARN master: YARN ResourceManager and Job History Server
● Twelve worker nodes, each running a YARN NodeManager and glusterd
HDFS (Recall...)
● File metadata and access are managed by a dedicated NameNode.
● File contents are split into pieces smaller than a predefined constant, e.g., 128 MB.
● Chunks are stored on a distributed system of DataNodes, with replication protecting from data loss due to expected node failures.
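The splitting rule amounts to simple arithmetic; a sketch in plain Java (block size and file size chosen for illustration, matching the 128 MB example above):

```java
// HDFS-style chunking: a file of n bytes becomes ceil(n / blockSize) chunks,
// and each chunk is then replicated onto several distinct DataNodes.
public class ChunkingDemo {
    static int numChunks(long fileBytes, long blockSize) {
        // Integer ceiling division: (n + b - 1) / b
        return (int) ((fileBytes + blockSize - 1) / blockSize);
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;       // 128 MB block size
        long fileBytes = 1024L * 1024 * 1024;      // a 1 GB file
        // 1 GB / 128 MB = 8 chunks; with replication factor 3,
        // 24 block copies are spread across the DataNodes.
        System.out.println(numChunks(fileBytes, blockSize)); // prints 8
    }
}
```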
XtreemFS
● The only DFS that handles all failure modes including network splits! A good candidate for the future...
● But still under development, try it out...
Beyond File Systems
● File systems often implicitly make some assumptions, which are valid in most cases:
– Files are read more often than they are written.
– Files are written without conflicts most of the time (e.g., file locking has to be done per application).
– Files have no structure beyond their size.
– Files are seldom used partially (though striping is a remarkable exception; remember the city night lights in R).
![Page 28: Lecture IV · Compilation & Invocation Classpath: Use bin/hadoop classpath Compile: – In hadoop/wordcount (download from git) – mkdir classes – javac -cp $(../bin/hadoop classpath)](https://reader034.vdocuments.site/reader034/viewer/2022050419/5f8ec21c674f6a35ee3a23f2/html5/thumbnails/28.jpg)
Big Data is not about Files
● However, Big Data has three properties:
– Volume
● Can easily be handled by files, though the whole-file assumption can become problematic.
● Counter-measure: a semantically sensible splitting mechanism.
– Variety
● Sometimes, only non-evenly spaced information in a file is relevant. However, this cannot be retrieved via the file read API.
– Velocity
● High-speed updates to data break the typical one-writer-per-file assumption, or lead to unmanageable amounts of small files.
Databases
In the era before Big Data, relational databases were invented to mitigate problems arising from working with files.
RDBMS are structured in tables and typically provide
● Atomicity
Either a whole transaction happens, or nothing.
● Consistency
Data can be constrained and linked between tables; for example, deleting a user can delete all associated data. Furthermore, anyone reading from the database at any time will see the same data.
● Isolation
Concurrent transactions are guaranteed not to influence each other.
● Durability
Transactions that have happened are persistent.
In short: RDBMS provide ACID semantics.
However...
● There is no scalable way of providing ACID semantics over distributed systems.
● The CAP theorem states that we can only have two out of
– Consistency
– Availability
– Partition Tolerance
and classical RDBMS choose Consistency over Availability. (In fact, you have to wait ;-)
NoSQL
● How can we go further?
● We want
– Availability
– Partition Tolerance
– Multi-Writer
and are willing to sacrifice consistency (at least a bit).
NoSQL by Example: Apache Cassandra

Classical Relational DBMS                  | Apache Cassandra
Handles moderate incoming data velocity    | Handles high incoming data velocity
Data arriving from one/few locations       | Data arriving from many locations
Manages primarily structured data          | Manages all types of data
Supports complex/nested transactions       | Supports simple transactions
Single points of failure with failover     | No single points of failure; constant uptime
Supports moderate data volumes             | Supports very high data volumes
NoSQL by Example: Apache Cassandra

Classical Relational DBMS                            | Apache Cassandra
Centralized deployments                              | Decentralized deployments
Data written in mostly one location                  | Data written in many locations
Supports read scalability (with limited consistency) | Supports read and write scalability
Deployed in vertical scale-up fashion                | Deployed in horizontal scale-out fashion
Cassandra Basic Structure
● The nodes of a Cassandra key-value store form a simple headless ring.
● All nodes have equal functionality.
● Ring communication is done via a gossip protocol.
Cassandra Writing
● Full data durability with high performance, via the commit log, the MemTable, and SSTables.
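The interplay of commit log, MemTable and SSTables can be sketched in a few lines of plain Java (class names and the flush threshold are illustrative, not Cassandra internals): every write is appended to a durable log, applied to a sorted in-memory buffer, and the buffer is flushed to an immutable sorted run once it grows past a threshold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy sketch of the Cassandra write path.
public class WritePathDemo {
    final List<String> commitLog = new ArrayList<>();                 // sequential, durable append
    final TreeMap<String, String> memTable = new TreeMap<>();         // sorted in-memory buffer
    final List<TreeMap<String, String>> ssTables = new ArrayList<>(); // immutable sorted runs
    final int flushThreshold;

    WritePathDemo(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void write(String key, String value) {
        commitLog.add(key + "=" + value);          // 1. append to the commit log
        memTable.put(key, value);                  // 2. update the MemTable
        if (memTable.size() >= flushThreshold) {
            ssTables.add(new TreeMap<>(memTable)); // 3. flush a sorted SSTable
            memTable.clear();
        }
    }

    public static void main(String[] args) {
        WritePathDemo db = new WritePathDemo(2);
        db.write("a", "1");
        db.write("b", "2"); // triggers a flush
        db.write("c", "3");
        System.out.println(db.ssTables.size() + " sstable(s), "
                + db.memTable.size() + " row(s) in memtable");
    }
}
```

Because both the commit log and the SSTable flushes are sequential writes, this layout gives durability without random disk I/O on the write path.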
Cassandra Reading
● Reads use a structure called a Bloom filter to avoid useless communication and disk I/O.
Cassandra Replication
Replication is a central element in Cassandra, allowing nodes to join or leave the cluster at any time. Replication is influenced by four components:
● Virtual Nodes
Assign data ownership to physical machines.
● Partitioner
Partitions the data.
● Replication Strategy
Defines to which points in the ring replicas go.
● Snitch
Provides additional topology information, for example, from a cloud provider (such as Amazon Availability Zones).
Partitioner
● All data in Cassandra is a key/value pair.
● In Cassandra tables, the key is derived from the table PRIMARY KEY, which can even be a compound key of multiple columns.
● A Token represents a range of keys, and every node is responsible for certain ranges of keys.
● The Partitioner takes the key and calculates the Token the key falls into. This can be done in various ways:
– Order-preserving (OrderPreservingPartitioner)
– Random (RandomPartitioner)
– Fast random (Murmur3Partitioner)
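A minimal sketch of RandomPartitioner-style tokening (the ring layout and key are invented, and real Cassandra token ranges differ in detail): the key is hashed to a 128-bit token, and the owning node is the first one on the ring whose token is not smaller, wrapping around at the end.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PartitionerDemo {
    // RandomPartitioner-style token: an MD5 hash of the key, read as a
    // non-negative 128-bit integer.
    static BigInteger token(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always available
        }
    }

    // The owner is the first node (in token order) whose token is >= the
    // key token; if there is none, the key wraps around to the first node.
    static int ownerIndex(BigInteger keyToken, BigInteger[] nodeTokens) {
        for (int i = 0; i < nodeTokens.length; i++) {
            if (keyToken.compareTo(nodeTokens[i]) <= 0) return i;
        }
        return 0; // wrap around the ring
    }

    public static void main(String[] args) {
        // Four nodes with evenly spaced tokens over the 2^128 token space.
        BigInteger[] ring = new BigInteger[4];
        for (int i = 0; i < 4; i++) {
            ring[i] = BigInteger.ONE.shiftLeft(126).multiply(BigInteger.valueOf(i + 1));
        }
        System.out.println("key 'row-key-1' -> node "
                + ownerIndex(token("row-key-1"), ring));
    }
}
```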
Replication
● Data replication in Cassandra is very simple, not to say minimalistic:
– Simple Strategy: Find the node holding the token where the Partitioner sent the current row, then replicate the row to the following nodes around the ring that do not yet have it, until ReplicationFactor nodes hold the data.
– Network Topology Strategy: Find the node holding the token, then continue to place replicas on hosts around the ring in different racks.
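Simple Strategy placement reduces to walking the ring clockwise; a sketch with invented node indices (real Cassandra also skips nodes that already hold a replica via virtual nodes):

```java
import java.util.Arrays;

// SimpleStrategy-style placement: starting at the node that owns the token,
// walk clockwise and place copies on the next distinct nodes until
// replicationFactor nodes hold the row.
public class ReplicationDemo {
    static int[] replicaNodes(int ownerIndex, int ringSize, int replicationFactor) {
        int[] replicas = new int[replicationFactor];
        for (int i = 0; i < replicationFactor; i++) {
            replicas[i] = (ownerIndex + i) % ringSize; // next node clockwise, wrapping
        }
        return replicas;
    }

    public static void main(String[] args) {
        // Owner is node 4 on a 6-node ring with replication factor 3:
        // copies land on nodes 4, 5 and 0 (wrapping around the ring).
        System.out.println(Arrays.toString(replicaNodes(4, 6, 3)));
    }
}
```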
A Cassandra Ring
Ring without Vnodes, Replication Factor 3.
● Each Token stands for a range of keys.
● Each node is responsible for a Token.
● Each row is replicated around the ring to the next nodes.
Introducing Virtual Nodes (Vnodes)
● More Tokens, as each node is split into several virtual nodes
● Flexible movement of Tokens in case of failures
● Efficient and simple load balancing
Bloom Filter
Storing Sets with Limited Memory
Bloom Filter
● Cassandra data is organized using Tokens calculated from the row key, which can be random:
– An MD5 hash of the key for RandomPartitioner
– A Murmur3 hash for Murmur3Partitioner (the default)
– The key itself for OrderPreservingPartitioner
● For Big Data, not all data can stay in the Memtable cache. Therefore, a query might need to visit SSTables.
● Apache Cassandra employs Bloom filters to find out which SSTables could contain a given key.
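A minimal Bloom filter sketch (the hash scheme is illustrative, not Cassandra's): k hash functions set k bits per key, and if any of the k bits is unset at query time, the key is provably absent, so the SSTable can be skipped without touching the disk.

```java
import java.util.BitSet;

// A Bloom filter stores a set approximately in a fixed-size bit array.
// It may return false positives, but never false negatives.
public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    BloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit index from two base hashes (double hashing);
    // forcing h2 odd avoids degenerate step sizes.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) bits.set(index(key, i));
    }

    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) return false; // provably absent
        }
        return true; // probably present
    }

    public static void main(String[] args) {
        BloomFilter f = new BloomFilter(1024, 3);
        f.add("row-key-1");
        System.out.println(f.mightContain("row-key-1")); // true, guaranteed
        System.out.println(f.mightContain("row-key-2")); // almost surely false
    }
}
```

The trade-off is tunable: more bits and more hash functions lower the false-positive rate at the cost of memory and CPU.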
Switch to BloomFilter.pdf
Hadoop / Cassandra / MapReduce
● As Cassandra is just a highly efficient, user-friendly, headless storage engine, it can be used as data source and data sink for Apache Hadoop MapReduce jobs.
● While the TaskTracker used to be able to exploit data locality in Cassandra, this feature is gone.
● However, you can still rely on the read-and-write-everywhere structure of Cassandra, or modify the TaskTracker to support your pattern of data locality.
Apache Hive (Overview)
● There is another large Apache project called Hive, which is optimized for data warehousing.
– In data warehouses, a lot of different data is collected into a single DBMS for running analytics on it.
– Typical workload: Extract, Transform, Load (ETL)
● Apache Hive is a data warehouse on top of Hadoop HDFS with the following key features:
– Large support for SQL
– ODBC/JDBC connections (e.g., Excel, Access)
– Queries are compiled to MapReduce jobs automatically, or are run against Spark (not yet fully optimized, but already faster than MR)
– Structure projection for unstructured data (no need to load the data in again):
● CSV / text
● Apache Parquet
● Apache ORC
Warning
● Though Hive supports many modern SQL features, it is not meant for heavily relational queries, for example, extensive joining or foreign keys.
● In fact, Hive should always be used in a denormalized way. Materializing query results is not unwanted redundancy; it is highly efficient processing.
● Tip: Use compression wherever supported, as I/O is the big data bottleneck, not CPU.
Apache Big Data: Low-Level Projects
● Apache HDFS: Distributed file system
● Apache Ignite: In-memory real-time processing
● Apache MapReduce: MapReduce framework
● Apache Pig: Data flow programming (similar to functional programming)
● Apache Spark: Solves similar problems as MapReduce, but faster, using a functional philosophy
● Apache Storm: Stream computation framework
● Apache Flink: Large computation graphs (DAG)
● Apache HBase: The Hadoop database
● Apache Cassandra: The central key-value store
● Apache Hive: SQL over Hadoop / MapReduce
● Apache Phoenix: SQL over HBase
● Apache Kafka: Publish-subscribe stream processing
● Apache Oozie: DAG scheduler for multiple MapReduce jobs (e.g., trigger on availability)
Apache Big Data: Application Level
● Apache Mahout: Scalable Machine Learning over MapReduce
● WEKA3: Allows for Data Mining using some Weka implementations
● Cloudera Oryx: Machine Learning with a Business Perspective
● Apache Spark MLLib: Machine Learning over Spark– Widest support: Java, Scala, Python, R
Spatial Data Distribution
Spatial Data Distribution
● Spatial data distribution is the central question of geospatial Big Data:
– How do I distribute my data between nodes in order to increase data locality?
● Two general strategies can be differentiated:
– Ring-based architectures: similar to Cassandra
– Block-based architectures: similar to HDFS
Ring vs. Block
● Ring-based Architectures:
– Data is distributed between nodes according to an ordering of the data
– Example: The ordering in Cassandra is given by the key.
● Block-based Architectures:
– Data is first split into meaningful blocks, which are then distributed between computing nodes.
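The ring-based placement can be sketched as follows (a simplified sketch of token-ring placement in the style of Cassandra; names are illustrative, and replication is omitted):

```python
import bisect

def build_ring(node_tokens):
    # Each node owns the range of keys up to (and including) its token.
    # Tokens and keys must come from the same ordered domain.
    return sorted(node_tokens)

def node_for_key(ring, key_token):
    # Walk clockwise: the first node token >= key owns the key;
    # wrap around to the first node when the key exceeds all tokens.
    i = bisect.bisect_left(ring, key_token)
    return ring[i % len(ring)]
```

With nodes at tokens 100, 200, 300, a key with token 150 lands on node 200, and a key with token 350 wraps around to node 100. With an order-preserving key (such as a Z-curve value), nearby keys land on the same or neighboring nodes.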
Ordering Spatial Data
● Central Idea: Space Filling Curves
● A space-filling curve is a curve, often defined as the limit of a sequence of curves, which provides a continuous map from the unit interval [0,1] to the unit square [0,1]x[0,1]
● In practice, we will most often use the finite curves from this sequence rather than the limit; they visit all cells of a cell decomposition of the unit square.
Peano Curve
Giuseppe Peano (1858-1932)
Hilbert Curve
David Hilbert (1862-1943)
Morton Order
Guy Macdonald Morton (1966)
Properties
● Peano
– Complex, medium locality
● Hilbert
– Very good locality
– Complex to project to and from
● Z-Curve
– Good locality
– Easy to project into and from
– Constant-time neighbors
– Spatially known: it is the same as a depth-first traversal of a quadtree
Geohash
● The Z-curve has been used to derive Geohash, which maps cells of the world to strings using the Z-order and a Base32 encoding of the bit sequences.
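A minimal Geohash encoder following this scheme might look as follows (a sketch, assuming the standard Geohash Base32 alphabet and the convention that the first bit refines the longitude range):

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (a, i, l, o omitted)

def geohash_encode(lat, lon, length=5):
    """Alternately halve the longitude and latitude intervals, emitting one
    bit per halving, then Base32-encode the bit stream, 5 bits per character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, nbits = [], 0, 0
    lon_turn = True  # by convention the first bit refines the longitude
    while len(chars) < length:
        if lon_turn:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = bits * 2 + 1, mid
            else:
                bits, lon_hi = bits * 2, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = bits * 2 + 1, mid
            else:
                bits, lat_hi = bits * 2, mid
        lon_turn = not lon_turn
        nbits += 1
        if nbits == 5:  # 5 bits fill one Base32 character
            chars.append(_BASE32[bits])
            bits, nbits = 0, 0
    return "".join(chars)
```

For example, geohash_encode(42.6, -5.6) yields "ezs42". Note the prefix property: truncating a Geohash yields the hash of the enclosing, coarser cell.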
Geohash Cells
High-Level Z Curve
● Encoding: Given a point P(X,Y) in some spatial reference system
– Discretizer: First, discretize X and Y into positive integer numbers of fixed length, e.g. 16 bit.
– Zip: Mix the bits of both alternatingly into a new integer.
– Encode (Optional): Encode the bit string to get a concise, human-readable representation.
● Decoding: Given a string or integer
– Decode (Optional): Decode Base32 to get an integer
– Dezip: Create two integers by dezipping the bits
– Reverse:
● Either return the center point of the cell represented by the given Z curve key
● Or return the bounding box.
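The Discretizer, Zip and Dezip steps above can be sketched in Python (function names are illustrative; a 16-bit discretization is assumed):

```python
def discretize(v, lo, hi, bits=16):
    # Map v in [lo, hi] to a positive integer of fixed length.
    cells = (1 << bits) - 1
    return min(cells, int((v - lo) / (hi - lo) * (1 << bits)))

def zip_bits(x, y, bits=16):
    # Mix the bits of both alternatingly: x in even, y in odd positions.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def dezip_bits(z, bits=16):
    # Inverse of zip_bits: recover the two interleaved integers.
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i)) & 1) << i
        y |= ((z >> (2 * i + 1)) & 1) << i
    return x, y
```

A full encoder would chain discretize for X and Y with zip_bits; decoding dezips the key and maps the two integers back to the cell's center point or bounding box.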
High-Level Z Curve
● Neighbors:
– Given a Base32 string, neighbors can be calculated by table lookup on a per-character basis. See the various implementations of Geohash for details.
– Given an integer, use the formulas
● top = ((z & 0b10101010) - 1 & 0b10101010) | (z & 0b01010101)
● bottom = ((z | 0b01010101) + 1 & 0b10101010) | (z & 0b01010101)
● left = ((z & 0b01010101) - 1 & 0b01010101) | (z & 0b10101010)
● right = ((z | 0b10101010) + 1 & 0b01010101) | (z & 0b10101010)
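These formulas can be written out for 16-bit keys as follows (a sketch; the masks must be widened to the key length, and cells on the grid boundary wrap around, which a real implementation would guard against):

```python
XMASK = 0x5555  # x bits live in the even positions
YMASK = 0xAAAA  # y bits live in the odd positions

# Filling the other dimension's bit positions with 1s (for +1) or 0s
# (for -1) lets the carry/borrow propagate across them correctly.
def top(z):    return (((z & YMASK) - 1) & YMASK) | (z & XMASK)
def bottom(z): return (((z | XMASK) + 1) & YMASK) | (z & XMASK)
def left(z):   return (((z & XMASK) - 1) & XMASK) | (z & YMASK)
def right(z):  return (((z | YMASK) + 1) & XMASK) | (z & YMASK)

def interleave(x, y):
    # Helper building a Z key from cell coordinates, to check the formulas.
    z = 0
    for i in range(8):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z
```

With this convention, top decrements and bottom increments the y part: right(interleave(2, 2)) equals interleave(3, 2), and top(interleave(2, 2)) equals interleave(2, 1).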
Assignment
● Implement or Download
– A Z-curve encoder / decoder, possibly Geohash.
● Think about what would happen if we:
– Use the Geohash as the key for ordering points
– Use the order-preserving partitioning strategy of Apache Cassandra
– Calculate the number of points within 1 km distance for each point using MapReduce
● Measure
– Whether the Z-curve gains speed for this query on a real cluster
Block Distribution Strategies
Spatial Indexing Structures
● Spatial Indexing Structures enable fast spatial access to points, typically supporting
– Range Queries: Retrieve all geometries that are within a specific range, often a rectangle or sphere
– kNN Queries: Retrieve the k nearest neighbors to a point.
Grid Index
● Overlay the data with a grid and collect all points falling into specific cells of the grid.
– Specific queries only need to access neighboring grid cells
– Tradeoff between
● Number of empty cells
● Number of points in each grid cell
Good baseline index for (nearly) uniform data
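A grid index along these lines can be sketched as follows (class and parameter names are illustrative):

```python
from collections import defaultdict

class GridIndex:
    """Bucket points by integer grid cell; a range query only needs to
    touch the cells overlapping the query rectangle."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, x, y):
        self.cells[self._cell(x, y)].append((x, y))

    def range_query(self, xmin, ymin, xmax, ymax):
        cx0, cy0 = self._cell(xmin, ymin)
        cx1, cy1 = self._cell(xmax, ymax)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                # Candidate cells may stick out of the query rectangle,
                # so each point is re-checked exactly.
                for (x, y) in self.cells.get((cx, cy), ()):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append((x, y))
        return hits
```

The cell_size parameter is exactly the tradeoff from the slide: small cells mean many empty cells, large cells mean many points per cell to re-check.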
Example: BSP
● Binary Space Partitioning Tree (BSP)
– For a given tree node, split all geometries at a central hyperplane (a line in 2D) and create two new nodes containing the geometries from the two sides of the hyperplane.
● "Central" can be defined in various ways.
– Recurse until a given tree node contains only few points.
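The recursion can be sketched as follows (a kd-tree-style sketch that takes the median along alternating axes as the "central" hyperplane, one of the possible definitions):

```python
def build_bsp(points, depth=0, leaf_size=4):
    """Recursively split points at the median of alternating axes until a
    node contains only a few points."""
    if len(points) <= leaf_size:
        return {"leaf": True, "points": points}
    axis = depth % 2                   # alternate between x and y splits
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "leaf": False,
        "axis": axis,
        "split": pts[mid][axis],       # the "central" hyperplane
        "left": build_bsp(pts[:mid], depth + 1, leaf_size),
        "right": build_bsp(pts[mid:], depth + 1, leaf_size),
    }
```

Axis-aligned median splits are just one choice; arbitrary hyperplanes or other notions of "central" (e.g., the mean) fit the same recursion.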
Example: Ball Trees
● Ball Tree
– Each node in the tree represents a ball (a circle in 2D) and all points falling into this ball.
– If there are too many points inside some ball, split the ball into two or more balls covering all points and build subtree nodes according to this splitting.
R trees and R* trees
● R tree
– Each node in the tree represents the Minimum Bounding Rectangle (MBR) of all points inside (or below) this node.
– If a node contains too many points, split it into two MBRs with similarly many points.
● R* tree
– Advanced insertion algorithm making the tree one of the most efficient and widely used spatial indices for nearest neighbor queries.
Lifting those Indices to Hadoop?
● A matter of much work and open research
– Hotspots can easily arise for simple indices such as grids
– Non-uniform access patterns and varying chunk sizes can result
– Efficient spatial replication and data locality have not yet been fully thought through...
Open research topic, possibly worth some Master's theses, if you are interested.
SpatialHadoop is one approach...
Spatial Block Distribution
SpatialHadoop
● Uses spatial indices for distributing blocks in HDFS; hence, jobs can be placed on nodes that already have the data locally.
● MapReduce Components
– SpatialFileSplitter
– SpatialRecordReader
● Query Components
– Range Queries
– kNN Queries
– Spatial Joins
Distributed Index Model
Data Mining from Location
Classification and Clustering Based on Point Distance
Data Mining
● Data Mining is the process of extracting structures and information from large bodies of data.
● Data Mining can be split into
– Supervised Approaches
– Unsupervised Approaches
Supervised Data Mining
● In Supervised Data Mining, a given dataset contains the intended result.
● This dataset can be used as a training dataset in order to extract a structural representation of the data.
● This structural representation is called a model and can be applied to unknown data, assigning what the model would predict.
Example: Classification
● Given a table with k attributes and a nominal class variable, learn to infer the class from the attributes.
● Classical Example: Iris Dataset
– Question: Can we infer from four measurements of the sizes of a flower its species?
The IRIS Dataset
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188.
1NN Classification
A very simple classification scheme
● Nearest Neighbor Classification is a very simple though spatially quite useful classification algorithm.
● It takes all training data and assigns to a location the class of the nearest training instance.
● Let us look at how it works:
kNN in R
library(class)  # provides knn()
data(iris)
# Select 10 random training rows
train = iris[sample(1:nrow(iris), 10), ]
# Apply 1nn on training set to full iris set
res = knn(train[, 1:4], iris[, 1:4], train$Species, 1)
summary(res == iris$Species)
   Mode   FALSE    TRUE    NA's
logical      16     134       0
plot(iris$Petal.Length, iris$Petal.Width, col=iris$Species)
points(Petal.Width ~ Petal.Length,
       data=iris[which(iris$Species != res), ],
       pch=4, cex=2, col="red", lwd=2)