Social Computing and Big Data Analytics
1052SCBDA03, MIS MBA (M2226) (8606)
Wed, 8, 9 (15:10-17:00) (B505)
Min-Yuh Day (戴敏育)
Assistant Professor
Dept. of Information Management, Tamkang University
http://mail.tku.edu.tw/myday/
2017-03-01
Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem
Syllabus
Week  Date        Subject/Topics
1     2017/02/15  Course Orientation for Social Computing and Big Data Analytics
2     2017/02/22  Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data
3     2017/03/01  Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem
4     2017/03/08  Big Data Processing Platforms with SMACK: Spark, Mesos, Akka, Cassandra and Kafka
5     2017/03/15  Big Data Analytics with NumPy in Python
6     2017/03/22  Finance Big Data Analytics with Pandas in Python
7     2017/03/29  Text Mining Techniques and Natural Language Processing
8     2017/04/05  Off-campus study
9     2017/04/12  Social Media Marketing Analytics
10    2017/04/19  Midterm Project Report
11    2017/04/26  Deep Learning with Theano and Keras in Python
12    2017/05/03  Deep Learning with Google TensorFlow
13    2017/05/10  Sentiment Analysis on Social Media with Deep Learning
14    2017/05/17  Social Network Analysis
15    2017/05/24  Measurements of Social Network
16    2017/05/31  Tools of Social Network Analysis
17    2017/06/07  Final Project Presentation I
18    2017/06/14  Final Project Presentation II
2017/03/01 Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem
Architecture of Big Data Analytics
Big Data Sources: Internal; External; Multiple formats; Multiple locations; Multiple applications
Big Data Transformation: Middleware; Extract, Transform, Load; Data Warehouse; Traditional Format (CSV, Tables)
Big Data Platforms & Tools: Hadoop, MapReduce, Pig, Hive, Jaql, Zookeeper, HBase, Cassandra, Oozie, Avro, Mahout, Others
Big Data Analytics Applications: Queries, Reports, OLAP, Data Mining
Arrows between stages: Raw Data, Transformed Data, Big Data Analytics
Source: Stephan Kudyba (2014), Big Data, Mining, and Analytics: Components of Strategic Decision Making, Auerbach Publications
Business Intelligence (BI) Infrastructure
Source: Kenneth C. Laudon & Jane P. Laudon (2014), Management Information Systems: Managing the Digital Firm, Thirteenth Edition, Pearson.
SAS® Within the Hadoop Ecosystem
Layers: User Interface, Metadata, Data Access, Data Processing, File System
Users: SAS® User, Next-Gen SAS® User
SAS components: SAS® Enterprise Guide® (EG), SAS® Enterprise Miner™ (EM), SAS® Visual Analytics (VA), SAS® Data Integration, SAS® In-Memory Statistics for Hadoop, SAS Metadata, Base SAS & SAS/ACCESS® to Hadoop™, In-Memory Data Access, SAS Embedded Process Accelerators, SAS® LASR™ Analytic Server, SAS® High-Performance Analytic Procedures
Other labels: MPI Based, Pig, Hive, MapReduce, Impala, HDFS
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics
Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem
Source: https://www.thalesgroup.com/en/worldwide/big-data/big-data-big-analytics-visual-analytics-what-does-it-all-mean
MapReduce Paradigm
Big Data -> Map (Map0, Map1, Map2, Map3) -> Reduce (Reduce0, Reduce1, Reduce2, Reduce3) -> Output Data
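The figure shows four map tasks feeding four reduce tasks, but not how an intermediate key reaches a particular reduce task. A minimal sketch of one common answer, assuming hash partitioning (the idea behind Hadoop's default HashPartitioner). The names `partition` and `route`, and the use of `zlib.crc32` as a stable hash, are illustrative assumptions and do not come from the slides:

```python
import zlib

NUM_REDUCERS = 4  # matches Reduce0..Reduce3 in the figure


def partition(key: str, num_reducers: int = NUM_REDUCERS) -> int:
    """Return the index of the reduce task responsible for `key`.

    zlib.crc32 is an assumed stand-in for the framework's hash function;
    it is used here only because it is stable across Python runs.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers: int = NUM_REDUCERS):
    """Distribute (key, value) pairs emitted by map tasks into reducer buckets."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


if __name__ == "__main__":
    # Hypothetical map output, one (word, 1) pair per word
    map_output = [("Dog", 1), ("Love", 1), ("Cat", 1), ("Bird", 1), ("Love", 1)]
    for index, bucket in enumerate(route(map_output)):
        print(f"Reduce{index}: {bucket}")
```

Because the reducer index depends only on the key, every value for a given key arrives at the same reduce task, which is what lets each reducer aggregate independently.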
MapReduce Word Count
Input:  Dog Love Cat Bird Love Bird Dog Bird Cat
Output: Bird, 3; Cat, 2; Dog, 2; Love, 2
MapReduce Word Count: Input -> Split -> Map -> Shuffle -> Reduce -> Output
Input:   Dog Love Cat Bird Love Bird Dog Bird Cat
Split:   [Dog Love Cat] [Bird Love Bird] [Dog Bird Cat]
Map:     [Dog, 1; Love, 1; Cat, 1] [Bird, 1; Love, 1; Bird, 1] [Dog, 1; Bird, 1; Cat, 1]
Shuffle: Bird, (1, 1, 1); Cat, (1, 1); Dog, (1, 1); Love, (1, 1)
Reduce:  Bird, 3; Cat, 2; Dog, 2; Love, 2
Output:  Bird, 3; Cat, 2; Dog, 2; Love, 2
Source: https://www.edureka.co/blog/mapreduce-tutorial/
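The Split, Map, Shuffle and Reduce stages of the word count can be traced in a few lines of plain Python. This is a single-machine sketch for illustration only, not Hadoop code. The function names and the fixed split size of three words are assumptions chosen to reproduce the slide's example:

```python
from collections import defaultdict


def split_input(text, words_per_split=3):
    """Split: cut the input into chunks of words (3 per chunk, as on the slide)."""
    words = text.split()
    return [words[i:i + words_per_split] for i in range(0, len(words), words_per_split)]


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one split."""
    return [(word, 1) for word in split]


def shuffle_phase(mapped_splits):
    """Shuffle: group the values of all map outputs by key, sorted by key."""
    grouped = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reduce: sum the list of ones collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(text):
    splits = split_input(text)
    mapped = [map_phase(split) for split in splits]
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count("Dog Love Cat Bird Love Bird Dog Bird Cat"))
    # {'Bird': 3, 'Cat': 2, 'Dog': 2, 'Love': 2}
```

In a real cluster each call to `map_phase` would run as a separate task on a different node, and the shuffle would move data across the network. Here everything runs in one process so the intermediate values can be inspected.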
Hadoop Ecosystem
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
Source: http://hadoop.apache.org/
HDFS: Storage
MapReduce: Processing
Source: http://hadoop.apache.org/
Big Data with Hadoop Architecture
Big Data with Hadoop Architecture: Logical Architecture, Processing: MapReduce
Big Data with Hadoop Architecture: Logical Architecture, Storage: HDFS
Big Data with Hadoop Architecture: Process Flow
Big Data with Hadoop Architecture: Hadoop Cluster
Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
Hadoop Ecosystem
Source: https://savvycomsoftware.com/what-you-need-to-know-about-hadoop-and-its-ecosystem/
Hadoop Ecosystem
Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing
HDP (Hortonworks Data Platform): A Complete Enterprise Hadoop Data Platform
Source: http://hortonworks.com/hdp/
Apache Hadoop Hortonworks Data Platform
Source: http://hortonworks.com/hdp/
Hadoop and Data Analytics Tools
Source: http://hortonworks.com/hdp/
Hadoop 1 -> Hadoop 2
Source: http://hortonworks.com/hadoop/tez/
Big Data Solution
Source: http://www.newera-technologies.com/big-data-solution.html
Traditional ETL Architecture
Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
Offload ETL with Hadoop (Big Data Architecture)
Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
Spark Ecosystem
Apache Spark: Lightning-fast cluster computing
Apache Spark is a fast and general engine for large-scale data processing.
Source: http://spark.apache.org/
Logistic regression in Hadoop and Spark
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Source: http://spark.apache.org/
Ease of Use
• Write applications quickly in Java, Scala, Python, R.
Source: http://spark.apache.org/
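Word count in Spark's Python API
    # `spark` is assumed to be an existing SparkContext; the "hdfs://..." path is elided in the source
    text_file = spark.textFile("hdfs://...")
    counts = (text_file.flatMap(lambda line: line.split())
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b))
Source: http://spark.apache.org/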
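To see what each transformation in the Spark word count contributes, the same chain can be mimicked on a plain Python list. This is a sketch and not PySpark: `flat_map` and `reduce_by_key` are hypothetical stand-ins that run eagerly in one process, whereas Spark evaluates lazily across partitions. The sample lines are taken from the MapReduce word count example earlier in the deck:

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to every item and flatten the resulting lists into one list."""
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    """Merge the values of pairs that share a key using the binary function func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Stand-in for the lines of the text file
lines = ["Dog Love Cat", "Bird Love Bird", "Dog Bird Cat"]

# flatMap: one list of words instead of one list per line
words = flat_map(lambda line: line.split(), lines)

# map: pair every word with the count 1
pairs = list(map(lambda word: (word, 1), words))

# reduceByKey: add up the counts of identical words
counts = reduce_by_key(lambda a, b: a + b, pairs)

if __name__ == "__main__":
    print(words)
    print(pairs)
    print(counts)
```

The lambdas passed in are the same ones used in the Spark snippet. Only the container (a Python list instead of a distributed dataset) differs.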
Spark and Hadoop
Source: http://spark.apache.org/
Spark Ecosystem
Source: http://spark.apache.org/
Spark Ecosystem
Source: https://databricks.com/spark/about
Spark Ecosystem
Spark modules: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph)
Related systems shown: Kafka, Flume, H2O, Hive, Cassandra, Titan, HBase, HDFS
Source: Mike Frampton (2015), Mastering Apache Spark, Packt Publishing
SMACK Stack
• Spark – fast and general engine for distributed, large-scale data processing
• Mesos – cluster resource management system that provides efficient resource isolation and sharing across distributed applications
• Akka – a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM
• Cassandra – distributed, highly available database designed to handle large amounts of data across multiple datacenters
• Kafka – a high-throughput, low-latency distributed messaging system designed for handling real-time data feeds
Source: Anton Kirillov (2015), Data processing platforms architectures with Spark, Mesos, Akka, Cassandra and Kafka, Big Data AW Meetup
Hadoop vs. Spark
Hadoop: Input -> HDFS read -> Iter. 1 -> HDFS write -> HDFS read -> Iter. 2 -> HDFS write
Spark:  Input -> Iter. 1 -> Iter. 2
Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing
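The difference between the two pipelines can be made concrete by counting storage round trips in a toy model. This is only a sketch under stated assumptions: a Python dict stands in for HDFS, the function names are hypothetical, and the in-memory variant is assumed to load its input from storage exactly once. No real Hadoop or Spark behaviour is measured:

```python
def mapreduce_style(data, step, iterations):
    """Each iteration reads its input from storage and writes its result back."""
    storage = {"input": list(data)}  # toy stand-in for HDFS
    reads = writes = 0
    key = "input"
    for i in range(iterations):
        current = storage[key]       # HDFS read
        reads += 1
        result = [step(x) for x in current]
        key = f"iter{i + 1}"
        storage[key] = result        # HDFS write
        writes += 1
    return storage[key], reads, writes


def in_memory_style(data, step, iterations):
    """The input is loaded once; intermediate results stay in memory."""
    current = list(data)             # single initial load (assumption)
    reads, writes = 1, 0
    for _ in range(iterations):
        current = [step(x) for x in current]
    return current, reads, writes


if __name__ == "__main__":
    def double(x):
        return 2 * x

    print(mapreduce_style([1, 2, 3], double, iterations=2))
    print(in_memory_style([1, 2, 3], double, iterations=2))
```

Both functions compute the same result. The first one's read and write counts grow with the number of iterations, which is why iterative algorithms such as logistic regression are the usual example for this comparison.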
Hadoop Distributions
• Apache Hadoop – http://hadoop.apache.org/
• Amazon Elastic MapReduce (EMR) – https://aws.amazon.com/emr/
• Cloudera CDH – https://www.cloudera.com/downloads.html
• Hortonworks Sandbox – https://hortonworks.com/products/sandbox/
Steps to Install Hadoop on a Personal Computer (Windows / OS X)
Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5
Hadoop: Linux-Based Software
Figure labels: LINUX (x4)
Appliance
Hadoop on Linux, inside a Virtual Machine (VirtualBox / VMware), on a Personal Computer (Windows / OS X)
Connection to Hadoop
Hadoop on Linux, inside a Virtual Machine (VirtualBox / VMware), on a Personal Computer (Windows / OS X); a Browser on the Personal Computer provides access from the host
Steps to Install Hadoop on a Personal Computer (Windows / OS X)
Step 1. Download and Install VirtualBox
Step 2. Download Appliance
Step 3. Import Appliance
Step 4. Configure Virtual Machine (VM)
Step 5. Start Virtual Machine (VM)
Step 6. Test Connection From Host
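Step 6 can also be checked from a script instead of a browser. A minimal sketch using only the Python standard library: the helper name `can_connect`, the host `127.0.0.1` and the port `8888` are assumptions for illustration, so substitute whatever address and port the virtual machine actually forwards to the host:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False


if __name__ == "__main__":
    HOST = "127.0.0.1"  # assumption: the VM forwards its ports to localhost
    PORT = 8888         # assumption: hypothetical forwarded web UI port
    status = "reachable" if can_connect(HOST, PORT) else "not reachable"
    print(f"{HOST}:{PORT} is {status}")
```

A successful TCP connection only shows that something is listening on that port. Opening the page in a browser, as on the slide, remains the real test that the Hadoop services are up.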
VirtualBox
https://www.virtualbox.org/
Hortonworks Sandbox
Hortonworks Sandbox: The easiest way to get started with Enterprise Hadoop
http://hortonworks.com/products/hortonworks-sandbox/#install
Get started on Hadoop with these tutorials based on the Hortonworks Sandbox
http://hortonworks.com/tutorials/
Apache Hadoop
http://hadoop.apache.org/
Apache Hadoop
http://hadoop.apache.org/releases.html#Download
Apache Hadoop YARN
Source: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
Apache Spark
http://spark.apache.org/
Source: http://mattturck.com/2016/02/01/big-data-landscape/
References
• EMC Education Services (2015), Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley
• Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing
• Mike Frampton (2015), Mastering Apache Spark, Packt Publishing
• Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics, http://www.slideshare.net/deepakramanathan/sas-modernization-architectures-big-data-analytics