Overview of Bigdata (Hadoop & Spark)

APACHE HADOOP & SPARK ONLINE TRAINING by Venu A Positive www.bigdataanalyst.in

Upload: venu-katragadda

Post on 14-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Overview of Bigdata (Hadoop & Spark)

APACHE HADOOP & SPARK ONLINE TRAINING

by Venu A Positive

www.bigdataanalyst.in

Page 2: Overview of Bigdata (Hadoop & Spark)

OBJECTIVES OF THIS TRAINING

What is Big Data?

How do Hadoop & Spark solve Big Data problems?

Hadoop ecosystem

Spark ecosystem

Hadoop vs. Spark

Power of the Scala language

Page 3: Overview of Bigdata (Hadoop & Spark)

LEARN HADOOP & SPARK

Page 4: Overview of Bigdata (Hadoop & Spark)

WHAT IS BIG DATA?

A data problem that traditional processing systems cannot solve is called Big Data.

Big Data is also a buzzword for collecting large data sets in order to analyze them.

Traditional systems are unable to process such complex data sets.

Big Data tools ease such problems by using different ecosystems.

Page 5: Overview of Bigdata (Hadoop & Spark)

WHY BIG DATA?

Every day, many servers generate a lot of data.

Every day, users generate a large amount of mobile data.

Every day, Facebook and Twitter generate a lot of social media data.

Predicting the weather or business trends supports better agriculture and business decisions.

How can such data be analyzed?

Page 6: Overview of Bigdata (Hadoop & Spark)

COMMON BIG DATA PROBLEMS

Store the data reliably, process the data in parallel, and analyze the data quickly.

Hadoop and Spark are key tools for solving these common Big Data problems.
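One common way to store data reliably, and the general idea behind Hadoop's file system, is to split a file into blocks and keep several copies of each block on different machines. The sketch below is a toy illustration in plain Python, not Hadoop code: the function names, the node names, and the round-robin placement are assumptions made for the example (a real cluster uses a more careful placement policy and much larger blocks).

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes (toy round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement: Dict[int, List[str]], failed_node: str) -> bool:
    """True if every block still has at least one copy on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical 4-node cluster and a tiny block size, for illustration only.
nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), nodes)
```

With three copies of each block on different nodes, losing any single node still leaves every block readable, which is what "reliable" means in this context.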

Page 7: Overview of Bigdata (Hadoop & Spark)

WHAT IS HADOOP?

Hadoop is a batch processing system with its own file system: it executes a series of programs ("jobs") without manual intervention. However, batch jobs can take a long time to process a large amount of data.

Hadoop is a free tool to store data reliably and process large amounts of data in parallel.

It's open source, so you can store and process data for free, even on your laptop.

HDFS and MapReduce are the core components of Hadoop.

HDFS: to store data

MapReduce: to process data
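The MapReduce processing model can be sketched with a word count, the usual first example. This is a minimal single-machine illustration in plain Python, not the Hadoop API: the function names are made up for the example, and on a real cluster the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big Spark data"])
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread across machines, which is where the parallelism comes from.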

Page 8: Overview of Bigdata (Hadoop & Spark)

HADOOP ECOSYSTEMS

HDFS: used to store data.

MapReduce: runs Java programs and allows other tools to process data.

Hive: to process data warehouse applications.

Pig: to process large amounts of unstructured data.

Sqoop, Flume: used to get data from external resources (Sqoop for RDBMS data, Flume for log data).

HBase: used to update existing data and to process it quickly.

Ambari: an administration tool to monitor these tools.

Page 9: Overview of Bigdata (Hadoop & Spark)

WHAT IS SPARK?

Spark is a computing system that processes large amounts of data using different components such as Spark SQL, Spark Streaming, and MLlib.

Spark SQL is for structured data, Spark Core for unstructured data, and Spark Streaming for real-time data.

It provides Java, Scala, Python, and R APIs for user convenience. Spark is a unified (polyglot) system that can process different datasets at the same time.

You can process Hive, JSON, SQL, and Hadoop data at the same time.

Processing speed is the power of Spark.

You can handle any type of processing (batch processing, streaming data, iterative processing...).

Page 10: Overview of Bigdata (Hadoop & Spark)

SPARK ECOSYSTEMS

Page 11: Overview of Bigdata (Hadoop & Spark)

HADOOP VS SPARK

Many companies prefer Spark over Hadoop. For example, if you have 1 TB of data and run select * from tablename on Hadoop, a single query takes minutes. That is a frustrating process, and query performance is very important in a production environment.

Hadoop MapReduce handles only batch processing, not graph, real-time streaming, iterative, or interactive workloads. Spark handles all of these as well as batch processing, and it can read existing Hadoop data (for example, files in HDFS) and run on a Hadoop cluster, so data already stored in Hadoop can be used from Spark as it is. That is why many companies go directly to Spark instead of Hadoop MapReduce.
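One commonly cited reason for the speed difference on iterative workloads is that Spark can keep a dataset in memory between passes, while a chain of batch jobs typically rereads its input from storage on every pass. The toy sketch below only counts reads to make that difference visible; it is plain Python, not Hadoop or Spark code, and the class and function names are assumptions made for the example.

```python
from typing import Iterable, List


class CountingStore:
    """Stands in for on-disk storage and counts how often it is read."""

    def __init__(self, records: Iterable[int]) -> None:
        self._records = list(records)
        self.reads = 0

    def read(self) -> List[int]:
        self.reads += 1
        return list(self._records)


def iterate_rereading(store: CountingStore, passes: int) -> int:
    """Batch style: every pass loads the input from storage again."""
    total = 0
    for _ in range(passes):
        data = store.read()
        total += sum(data)
    return total


def iterate_cached(store: CountingStore, passes: int) -> int:
    """In-memory style: load the input once, then reuse it on every pass."""
    data = store.read()
    total = 0
    for _ in range(passes):
        total += sum(data)
    return total


batch_store = CountingStore([1, 2, 3])
cached_store = CountingStore([1, 2, 3])
batch_total = iterate_rereading(batch_store, passes=5)
cached_total = iterate_cached(cached_store, passes=5)
```

Both versions compute the same result, but the first touches storage five times and the second only once; on large data, those repeated reads are a large part of the run time.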

Page 12: Overview of Bigdata (Hadoop & Spark)

WHAT ARE THE PREREQUISITE SKILLS TO LEARN HADOOP/SPARK?

Core Java or OOP concepts are needed for Hadoop, and Scala or Python skills are mandatory to learn Spark.

A laptop with at least 4 GB of RAM and 50 GB of hard disk space is highly recommended.

If you want to learn Spark directly, you should at least be aware of the HDFS architecture (MapReduce or Hadoop experience is not required).

Page 13: Overview of Bigdata (Hadoop & Spark)

SCALA IS HIGHLY RECOMMENDED FOR LEARNING SPARK

Scala supports every type of Spark application. Spark also allows Python, R, Java, and more, but Scala is highly recommended because it supports every type of application.

In most cases Python also supports the libraries, such as Spark Streaming, Spark SQL, and MLlib, but in rare cases it does not; GraphX, for example, has no Python API. As of Spark 1.5, Python covers most applications.

Page 14: Overview of Bigdata (Hadoop & Spark)

IS SCALA HARD TO LEARN?

No, it's easier than Java. Scala builds on Java: it runs on the JVM and works with Java code, so if you know core Java it's easy to understand Scala.

If you practice one hour a day, within 40 days you can become proficient in Scala.

It's a functional language, but it fully supports OOP. The power of Scala lies in classes and traits. Python is also similar to Scala in many ways.

Practical experience is very important in Scala. Practice makes perfect in Big Data.

Code that takes 100 lines in Java can often be written in just a few lines of Scala. It reads like simplified Java code.

It's well suited to Big Data frameworks like Spark, Akka, Tachyon, etc.

If you plan to move into data science or machine learning, learn Python instead of Scala.

Page 15: Overview of Bigdata (Hadoop & Spark)

NOTES TO LEARN SPARK

If you think of Hadoop as the C language, Spark is the Java language.

No need to pay a single rupee: if you are eager to learn, you can learn online.

If you are a self-learner, I am sharing A to Z knowledge through my blog www.bigdataanalyst.in or www.apachespark.in and via YouTube. I also mail external resources.

If you want to learn MLlib, SparkR, or Spark optimization, wait a few months or switch to a paid service.

Page 16: Overview of Bigdata (Hadoop & Spark)

ANY QUERIES & QUESTIONS?

Contact me directly:

Facebook: https://www.facebook.com/BigDataAnalyst
Twitter: https://www.twitter.com/bigdataanalyst1
YouTube: http://www.youtube.com/bigdataanalystin
LinkedIn: https://in.linkedin.com/in/venukatragadda
SlideShare: https://www.slideshare.net/bigdataanalyst
Website: www.apachespark.in
Blog: www.bigdataanalyst.in
Email: [email protected] or [email protected]