Overview of Big Data (Hadoop & Spark)

APACHE HADOOP & SPARK ONLINE TRAINING

by Venu A Positive

www.bigdataanalyst.in

OBJECTIVES OF THIS TRAINING

What is Big Data?

How do Hadoop & Spark solve Big Data problems?

Hadoop ecosystem

Spark ecosystem

Hadoop vs. Spark

Power of the Scala language

LEARN HADOOP & SPARK

WHAT IS BIGDATA?

A data problem that traditional processing systems cannot handle is called Big Data.

Big Data is also a buzzword for collecting and analyzing very large data sets.

Traditional systems are unable to process such complex data sets.

Big Data tools ease such problems by using different ecosystems.

WHY BIGDATA?

Every day, many servers generate a lot of data.

Every day, users generate a large amount of mobile data.

Every day, Facebook and Twitter generate a lot of social media data.

Analyzing this data helps predict the weather and business trends, for better agriculture and business decisions.

How do we analyze such data?

COMMON BIGDATA PROBLEMS

Store the data reliably, process the data in parallel, and analyze the data quickly.

Hadoop and Spark are key tools for solving these common Big Data problems.
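To make "process the data in parallel" concrete, here is a minimal sketch in plain Python. It is not Hadoop or Spark code: threads on one machine stand in for the nodes of a cluster, and the names split_into_chunks, process_chunk, and parallel_total are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Cut the data set into fixed-size pieces, one piece per unit of work."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Work done independently on one chunk (here: a simple sum)."""
    return sum(chunk)


def parallel_total(records, chunk_size=4, workers=3):
    """Process every chunk with a pool of workers, then combine the partial results."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(process_chunk, chunks))
    return sum(partial_sums)


if __name__ == "__main__":
    print(parallel_total(list(range(1, 11))))
```

The pattern is the same at cluster scale: split the data, work on each piece independently, then merge the partial results.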

WHAT IS HADOOP?

Hadoop is a batch-processing system: it executes a series of programs ("jobs") without manual intervention. However, it takes a lot of time to process a large amount of data.

Hadoop is a free tool that stores data reliably and processes large amounts of data in parallel.

It's open source, so you can store and process data for free, even on your laptop.

HDFS and MapReduce are the core components of Hadoop. HDFS: to store data. MapReduce: to process data.
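The MapReduce idea can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce steps using the classic word count; it does not use the Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "spark processes data"]))
```

In real Hadoop the map and reduce steps run on many machines, and the input lines come from files stored in HDFS.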

HADOOP ECOSYSTEMS

HDFS: Used to store data.

MapReduce: Runs Java programs and allows other tools to process data.

Hive: To process data warehouse applications.

Pig: To process large amounts of unstructured data.

Sqoop: Used to get RDBMS data.

Sqoop and Flume (log data): Used to get data from external resources.

HBase: Used to update existing data and to process it quickly.

Ambari: Administration tool to monitor these tools.

WHAT IS SPARK?

A computing system that processes large amounts of data using different ecosystem components such as Spark SQL, Spark Streaming, and MLlib.

Spark SQL for structured data, Spark Core for unstructured data, Spark Streaming for real-time data.

Provides Java, Scala, Python, and R APIs for user convenience. Spark is a unified (polyglot) system that processes different datasets at the same time.

You can process Hive, JSON, SQL, and Hadoop data at the same time. Processing speed is the power of Spark. You can process any type of data (batch processing, streaming data, iterative data...).

SPARK ECOSYSTEMS

HADOOP VS SPARK

Most companies prefer Spark to Hadoop. For example, if you have 1 TB of data and run select * from tablename on Hadoop, a single query takes minutes. That is too slow, because query processing speed is very important in a production environment.

Hadoop handles only batch processing, not graph, real-time streaming, iterative, or interactive workloads. Spark supports all of these in addition to batch processing, and it works with existing Hadoop data and storage (HDFS, Hive), so existing Hadoop workloads can move to Spark with little or no modification. That is why most companies go directly to Spark instead of Hadoop.
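One reason iterative workloads favor Spark is that Spark can keep a data set in memory between passes, while a chain of MapReduce jobs goes back to disk each time. The sketch below imitates that difference in plain Python. SlowStorage and both functions are hypothetical names invented for this illustration, and the read counter only stands in for real disk or HDFS access.

```python
class SlowStorage:
    """Stands in for disk or HDFS and counts how often the data set is read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._records)


def iterate_rereading(storage, iterations):
    """Batch style: every pass loads the data from storage again."""
    total = 0
    for _ in range(iterations):
        total += sum(storage.load())
    return total


def iterate_cached(storage, iterations):
    """In-memory style: load once, then reuse the cached copy on every pass."""
    cached = storage.load()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total


if __name__ == "__main__":
    batch, memory = SlowStorage([1, 2, 3]), SlowStorage([1, 2, 3])
    print(iterate_rereading(batch, 5), batch.reads)
    print(iterate_cached(memory, 5), memory.reads)
```

Both functions compute the same answer; only the number of storage reads differs, and that gap grows with the number of iterations.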

WHAT ARE THE PREREQUISITE SKILLS TO LEARN HADOOP/SPARK?

Core Java or OOP concepts are required for Hadoop, and Scala or Python language skills are mandatory to learn Spark.

A laptop with at least 4 GB of RAM and a 50 GB hard disk is highly recommended.

If you want to learn Spark directly, you should at least be aware of the HDFS architecture (MapReduce or Hadoop experience is not required).

SCALA HIGHLY RECOMMENDED TO LEARN SPARK

Scala supports every type of Spark application. Spark also allows Python, R, Java, and more, but Scala is highly recommended because it supports every type of application. In most cases Python also supports the libraries, such as Spark Streaming, GraphX, Spark SQL, and MLlib, but in very rare cases it does not. In Spark 1.5, Python supports all applications.

IS SCALA HARD TO LEARN?

No, it's easier than Java. Scala runs on the JVM and builds on Java, so if you know core Java it's easy to understand Scala. If you spend one hour a day practicing, within 40 days you will become an expert in Scala.

It's a functional language, but it fully supports OOP. The power of Scala lies in classes and traits. Python is also about 90% similar to Scala.

Practical experience is very important in Scala. Practice makes perfect in Big Data. Code that takes 100 lines in Java can often be written in just a few lines of Scala. It's simplified Java code. It's best suited to Big Data frameworks like Spark, Akka, Tachyon, etc.
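The conciseness argument comes from the functional style rather than from Scala alone. As a rough illustration, here is the same word count written twice in Python (not Scala): once with explicit loops and once as a single functional expression. Both function names are made up for this example.

```python
from collections import Counter


def word_count_loop(lines):
    """Imperative style: explicit loops and manual bookkeeping."""
    counts = {}
    for line in lines:
        for word in line.split():
            if word in counts:
                counts[word] += 1
            else:
                counts[word] = 1
    return counts


def word_count_functional(lines):
    """Functional style: describe the result as one expression."""
    return dict(Counter(word for line in lines for word in line.split()))


if __name__ == "__main__":
    sample = ["spark scala spark", "scala java"]
    print(word_count_loop(sample) == word_count_functional(sample))
```

Spark programs written in Scala or Python tend to look like the second version: a short chain of transformations over a collection.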

If you plan to move into data science or machine learning, learn Python instead of Scala.

NOTES TO LEARN SPARK

If you consider Hadoop as the C language, Spark is the Java language.

There is no need to pay a single rupee: if you are eager to learn, you can learn online.

If you are a self-learner, I am sharing A-to-Z knowledge through my blog www.bigdataanalyst.in or www.apachespark.in and via YouTube. I also share external resources by email.

If you want to learn MLlib, SparkR, or the Spark optimization process, wait a few months or switch to a paid service.

http://www.bigdataanalyst.in/

http://www.apachespark.in/

https://www.youtube.com/channel/UCCGoM_sk2UGIiaTdtG3tHBw

ANY QUERIES & QUESTIONS?

Contact me directly:

Facebook: https://www.facebook.com/BigDataAnalyst

Twitter: https://www.twitter.com/bigdataanalyst1

YouTube: http://www.youtube.com/bigdataanalystin

LinkedIn: https://in.linkedin.com/in/venukatragadda

SlideShare: https://www.slideshare.net/bigdataanalyst

Website: www.apachespark.in

Blog: www.bigdataanalyst.in

Email: [email protected] or [email protected]
