cs435 introduction to big data - colorado state...

GTAs: Paahuni Khandelwal and Mohamed Chaabane Email: [email protected]

August 30, 2019

[Recitation 0]

CS435 - Introduction to Big Data


Agenda

• Go over course introduction
• Configure Hadoop in standalone mode (personal laptop)
• Set up a Hadoop cluster
• Run a program from an IDE
• Run a simple word count program using Hadoop

CS 435: Introduction to Big Data [Fall 2019]

Assignment Schedule

• Assignment descriptions, due dates and test files are available at http://www.cs.colostate.edu/~cs435/Assignments.html
• 3 programming assignments, each worth 10 points
• Programming Assignment 1 comprises PA0 + PA1 (3 + 7 = 10 points)
• Submission through Canvas + demo (includes viva)
• Feel free to bring laptops to help sessions
• PA0 submission is due on 5th September by 5.00pm


Score Distribution

• PA0 is worth 3 points
  – 1 point for setting up the Hadoop cluster
  – 1 point for performing basic HDFS operations
  – 1 point for running the word count example
    • 1 point for running in cluster mode
    • 0.5 point for running only in standalone mode


Help Sessions

• (Optional) Held every Friday from 4.00pm to 4.50pm in CS130
• Office hours
  – Paahuni: M/W, 2.00 to 4.00
  – Mohamed: Monday, 2.00 to 3.00 and Friday, 3.00 to 4.00 in CS120
• Please use Piazza for any technical problems
• Recording available at http://www.cs.colostate.edu/~cs435/Assignments.html



DEMO


Apache Hadoop

• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets
  » Computing takes place on nodes with data on local disks
  » Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster

• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data

• Hadoop YARN: A framework for job scheduling and cluster resource management
  » Tracks the total live nodes and resources on the cluster
  » Manages the allocation of these resources

More details: https://hadoop.apache.org/



Word Count problem using MapReduce
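
The word count flow can be illustrated without a cluster. The following is a minimal in-memory sketch in plain Java, assuming the usual three phases (map emits a (word, 1) pair per word, shuffle groups the pairs by word, reduce sums each group). It uses no Hadoop classes; the class name WordCountSketch, its method names, the lower-casing of words and the sample lines are all assumptions made for illustration, not part of the course material.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map, shuffle and reduce phases for word count.
// No Hadoop classes are used; every name here is hypothetical.
final class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                // Lower-casing is an assumption so that "Car" and "car" match.
                pairs.add(Map.entry(word.toLowerCase(), 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, with keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Runs the three phases in sequence over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(emitted).forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Deer Bear River", "Car Car River", "Deer Car Bear");
        // Prints {bear=2, car=3, deer=2, river=2}
        System.out.println(wordCount(lines));
    }
}
```

In a real Hadoop job the map and reduce steps run as separate tasks on different nodes and the framework performs the shuffle; here everything runs in one JVM purely to show the data flow.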


Map Reduce Cluster mode


https://intellipaat.com/blog/tutorial/hadoop-tutorial/mapreduce-yarn/


Important Points for Hadoop setup

• All HDFS paths start with / (think of HDFS as a separate disk)

• You don’t need to download the Hadoop binary or Java on the CS120 machines. They are already set up there.

• Use unique nodes/ports while configuring Hadoop Cluster

• Make sure the output folder specified while running your job does not exist.

• Avoid formatting the NameNode every time. It deletes all the files in your HDFS.
  – Recommended only when configuring HDFS for the first time or if absolutely needed

• By default, the number of map tasks equals the number of input splits, and the default split size is 128 MB (the HDFS block size)

• Don’t forget to export HADOOP_CLASSPATH.
  – Or you can add export HADOOP_CLASSPATH="${JAVA_HOME}/lib/tools.jar" to your .bashrc file
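
The relationship between input size and map tasks noted above can be estimated with simple arithmetic. The sketch below assumes one map task per input split and a split size equal to the 128 MB block size, so the estimate is the file size divided by the split size, rounded up. The class and method names are hypothetical, and Hadoop's real split computation applies additional rules, so treat the result as an estimate rather than an exact task count.

```java
// Rough estimate of the number of map tasks for a single input file,
// assuming one map task per split and split size == HDFS block size.
// SplitEstimate and estimateMapTasks are hypothetical names for illustration.
final class SplitEstimate {

    static final long MB = 1024L * 1024L;
    static final long DEFAULT_BLOCK_SIZE = 128L * MB; // default stated on the slide

    // Ceiling division: a partially filled last split still needs its own task.
    static long estimateMapTasks(long fileSizeBytes, long splitSizeBytes) {
        if (fileSizeBytes < 0 || splitSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and split size positive");
        }
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        System.out.println(estimateMapTasks(1024L * MB, DEFAULT_BLOCK_SIZE)); // 8
        System.out.println(estimateMapTasks(300L * MB, DEFAULT_BLOCK_SIZE));  // 3
        System.out.println(estimateMapTasks(100L * MB, DEFAULT_BLOCK_SIZE));  // 1
    }
}
```

For example, a 300 MB input gives two full 128 MB splits plus one 44 MB split, so roughly three map tasks.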


HDFS basic commands

• Create a new directory (including parent directories) in HDFS:
  – $HADOOP_HOME/bin/hadoop fs -mkdir -p /PA0/input

• Display a list of the contents of a directory in HDFS:
  – Non-recursively: $HADOOP_HOME/bin/hadoop fs -ls /PA0/input
  – Recursively: $HADOOP_HOME/bin/hadoop fs -ls -R /

• Copy a file from the local system into the HDFS directory created in the first step:
  – $HADOOP_HOME/bin/hadoop fs -put <local_source_file_path> /PA0/input

• Copy files/folders from HDFS back to the local system:
  – $HADOOP_HOME/bin/hadoop fs -get /PA0/output/part-r-00000 <local_destination>


HDFS basic commands – CONT.

• Display the contents of a file in HDFS on the console (stdout):
  – $HADOOP_HOME/bin/hadoop fs -cat /PA0/input/input.txt

• Copy a file/directory from source to destination within HDFS:
  – $HADOOP_HOME/bin/hadoop fs -cp <source_path> <destination_path>

• Move a file/directory from source to destination within HDFS:
  – $HADOOP_HOME/bin/hadoop fs -mv <source_path> <destination_path>

More information - http://images.linoxide.com/hadoop-hdfs-commands-cheatsheet.pdf



Submission

• Tarball containing the Hadoop configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, workers and slaves

• Jar file containing the class file for the word count code
• Non-MapReduce word count program
• Output files


Thank you
