CS435 Introduction to Big Data - Colorado State University
cs435/PA/PA0/Fall2019/Recitation0.pdf

GTAs: Paahuni Khandelwal and Mohamed Chaabane Email: [email protected]

August 30, 2019

[Recitation 0]

CS435 - Introduction to Big Data

Agenda

• Go over course introduction
• Configure Hadoop in standalone mode (personal laptop)
• Set up a Hadoop cluster
• Run a program from an IDE
• Run a simple word count program using Hadoop

CS 435: Introduction to Big Data [Fall 2019]

Assignment Schedule

• Assignment descriptions, due dates and test files are available at http://www.cs.colostate.edu/~cs435/Assignments.html
• 3 programming assignments, each worth 10 points
• Programming Assignment 1 comprises PA0 + PA1 (3 + 7 = 10 points)
• Submission through Canvas + demo (includes viva)
• Feel free to bring laptops to help sessions
• PA0 submission is due on 5th September by 5.00pm

Score Distribution

• PA0 is worth 3 points
– 1 point for setting up a Hadoop cluster
– 1 point for performing basic HDFS operations
– 1 point for running the word count example
  • 1 point for running in cluster mode
  • 0.5 point for running only in standalone mode

Help Sessions

• (Optional) Help sessions will be held every Friday from 4.00pm to 4.50pm in CS130
• Office hours
  • Paahuni M/W: 2.00 to 4.00
  • Mohamed Monday: 2.00 to 3.00 and Friday: 3.00 to 4.00 in CS120
• Please use Piazza for any technical problems
• Recordings are available at http://www.cs.colostate.edu/~cs435/Assignments.html

DEMO

Apache Hadoop

• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets
  » Computing takes place on nodes with data on local disks
  » Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster

• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data

• Hadoop YARN: A framework for job scheduling and cluster resource management
  » Tracks the total live nodes and resources on the cluster
  » Manages the allocation of these resources

More details: https://hadoop.apache.org/

Word Count problem using MapReduce
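
A minimal single-process sketch of how a MapReduce word count is organized: the map step emits a (word, 1) pair per token, the shuffle step groups pairs by word, and the reduce step sums each group. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the whitespace tokenization are assumptions made for illustration. This is not Hadoop's API; a real job would typically implement Mapper and Reducer classes in Java and let the framework do the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, imitating what happens between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    # Process keys in sorted order, as a reducer receives its keys sorted.
    return dict(reduce_phase(word, grouped[word]) for word in sorted(grouped))


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

In cluster mode the map calls would run in parallel on the nodes holding the input blocks; here everything runs in one process purely to show the data flow.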

MapReduce cluster mode

[Figure: MapReduce in cluster mode. Source: https://intellipaat.com/blog/tutorial/hadoop-tutorial/mapreduce-yarn/]

Important Points for Hadoop setup

• All HDFS paths start with / (think of HDFS as a separate disk)

• You don’t need to download the Hadoop binaries or Java on the CS120 machines. They are already set up there.

• Use unique nodes/ports while configuring your Hadoop cluster

• Make sure the output folder specified while running your job does not exist.

• Avoid formatting the NameNode every time. It deletes all the files in your HDFS.
– Recommended only when configuring HDFS for the first time, or if absolutely needed

• By default, the number of map tasks equals the number of input splits, and each split is 128MB by default (the size of an HDFS block)

• Don’t forget to export HADOOP_CLASSPATH.
– Or you can add export HADOOP_CLASSPATH="${JAVA_HOME}/lib/tools.jar" to your .bashrc file
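
The point above about map tasks and input splits can be checked with a little arithmetic. The sketch below assumes a simplified model: each input file is cut into 128MB splits independently, and one map task runs per split. The helper name `estimate_map_tasks` is hypothetical, and actual counts depend on the input format and job configuration.

```python
import math

BLOCK_SIZE_MB = 128  # default split size named on the slide (HDFS block size)


def estimate_map_tasks(file_sizes_mb, split_size_mb=BLOCK_SIZE_MB):
    """Estimate the number of map tasks as one per split, computed per file."""
    # Simplified model: a non-empty file needs ceil(size / split_size) splits,
    # and a split never spans two files.
    return sum(
        math.ceil(size / split_size_mb) for size in file_sizes_mb if size > 0
    )


if __name__ == "__main__":
    # A 300MB file needs 3 splits and a 10MB file needs 1.
    print(estimate_map_tasks([300, 10]))
```

Under this model, many small input files produce many map tasks even when the total data volume is small.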

HDFS basic commands

• To create a new directory (including parent directories) in HDFS:
– $HADOOP_HOME/bin/hadoop fs -mkdir -p /PA0/input

• Display a list of the contents of a directory in HDFS:
– Non-recursively: $HADOOP_HOME/bin/hadoop fs -ls /PA0/input
– Recursively: $HADOOP_HOME/bin/hadoop fs -ls -R /

• Copy a file from the local system into the HDFS directory created in the first step:
– $HADOOP_HOME/bin/hadoop fs -put <local_source_file_path> /PA0/input

• Copy files/folders from HDFS back to the local system:
– $HADOOP_HOME/bin/hadoop fs -get /PA0/output/part-r-00000 <local_destination>

HDFS basic commands – CONT.

• Display the contents of the file copied into HDFS on the console (stdout):
– $HADOOP_HOME/bin/hadoop fs -cat /PA0/input/input.txt

• Copy a file/directory from source to destination within HDFS:
– $HADOOP_HOME/bin/hadoop fs -cp <source_path> <destination_path>

• Move a file/directory from source to destination within HDFS:
– $HADOOP_HOME/bin/hadoop fs -mv <source_path> <destination_path>

More information: http://images.linoxide.com/hadoop-hdfs-commands-cheatsheet.pdf

Submission

• Tarball containing the Hadoop configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, workers and slaves

• Jar file containing the class files for the word count code
• Non-MapReduce word count program
• Output files
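
For the non-MapReduce word count program in the list above, a single-machine version only needs to tally tokens in one pass. The sketch below shows one possible shape for it, assuming whitespace tokenization and a tab-separated word/count output sorted by word so that it is easy to compare against the job's output. The function names and the sample text are made up for illustration, and the assignment may call for a different language or output format.

```python
from collections import Counter


def count_words(text):
    """Tally whitespace-separated tokens in a single pass."""
    return Counter(text.split())


def format_counts(counts):
    """Render one 'word<TAB>count' line per word, sorted by word."""
    return "\n".join(
        f"{word}\t{count}" for word, count in sorted(counts.items())
    )


if __name__ == "__main__":
    sample = "to be or not to be"
    print(format_counts(count_words(sample)))
```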

Thank you
