Post on 31-May-2020

Big Data Analytics using Spark

CSE255/DSE230

What is “Big Data”?

• 1 GB?
• 1 TB?
• 1 PB?
• ….

• We need a definition that does not change over time.
• More data than can fit on a single workstation.
• Communication dominates computation.

“Data Science” vs. “Computer Science”

• Computer science focuses on the algorithm.
• Requirements specify the input-to-output relationship (find shortest path).
• The algorithm should be correct and efficient.
• Input (data) can be anything that conforms to the input format.

• Data science focuses on the data.
• The goal is to understand/model/control the physical process generating the data.
• Algorithms are used by the data scientist to identify patterns in the data.
• Data is assumed to conform to a statistical model.

What is a data scientist?
From: Doing Data Science: Straight Talk from the Frontline, Rachel Schutt & Cathy O’Neil

& Communication skills

There are many good jobs in data science

• Data Scientist: One of the ten top jobs in 2016 according to Forbes and Glassdoor.
• There are currently 8446 data science openings in the US (LinkedIn).
• 7000 openings in India (naukuri.com).
• Median base salary is around $116,000 per year (Glassdoor).

Halicioglu graduated with a bachelor’s degree in computer science in 1996

Nick Woodman, Founder of GoPro
Woodman graduated from UCSD in June 1997 with a B.A. in visual arts and a minor in creative writing.

The output of a single GoPro

• GoPro Hero Black 5: $400.
• 120 FPS, 1080p (1920 x 1080)
• = 250 Mpixel/sec; each pixel 3*8 bits = 6 Gbit/sec
• Max compressed output bit rate: 60 Mbit/sec
• Compression by a factor of 100.
• 2:14 minutes = 1 GB compressed.
• Image processing requires uncompressed
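The figures on this slide can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation using the numbers quoted above; it assumes 1 GB = 10^9 bytes, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the GoPro numbers quoted on the slide.
# Assumption: 1 GB = 1e9 bytes (decimal gigabyte).
WIDTH, HEIGHT = 1920, 1080          # 1080p frame size
FPS = 120                           # frames per second
BITS_PER_PIXEL = 3 * 8              # three color channels, 8 bits each
COMPRESSED_BITS_PER_SEC = 60e6      # max compressed output, 60 Mbit/sec

pixels_per_sec = WIDTH * HEIGHT * FPS                    # about 250 Mpixel/sec
raw_bits_per_sec = pixels_per_sec * BITS_PER_PIXEL       # about 6 Gbit/sec
compression_factor = raw_bits_per_sec / COMPRESSED_BITS_PER_SEC  # about 100
seconds_per_gb = 8e9 / COMPRESSED_BITS_PER_SEC           # a little over 2 minutes
raw_gb_per_minute = raw_bits_per_sec / 8 / 1e9 * 60      # uncompressed volume

print(f"{pixels_per_sec / 1e6:.0f} Mpixel/sec")
print(f"{raw_bits_per_sec / 1e9:.2f} Gbit/sec uncompressed")
print(f"compression factor ~{compression_factor:.0f}")
print(f"{seconds_per_gb:.0f} sec of video per compressed GB")
print(f"{raw_gb_per_minute:.1f} GB per minute uncompressed")
```

The last quantity is the uncompressed data volume per minute, which is what any image-processing step downstream of the camera would have to handle.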

Processing at the source

• Suppose you wanted to use a GoPro to monitor your front door.
• The GoPro uses sophisticated lossy compression to reduce data by a factor of 100.
• However, to perform analysis, your PC would have to uncompress the data and then process >40 GB per minute.
• You would need a beefy computer.
• But most of the time there is very little change from frame to frame, so if a change detector is implemented on the camera, there is, most of the time, nothing to communicate.
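One way to picture the change-detector idea is simple frame differencing: compare each frame with the last one that was sent, and transmit only when the difference is large. The sketch below is a minimal illustration, not how any particular camera works; representing a frame as a flat list of grayscale values, the `threshold` value, and the function names are all assumptions made for the example.

```python
def mean_abs_diff(prev_frame, frame):
    """Average absolute per-pixel difference between two equal-sized frames."""
    return sum(abs(a - b) for a, b in zip(prev_frame, frame)) / len(frame)


def frames_to_send(frames, threshold=5.0):
    """Return the indices of frames that differ enough from the last sent frame.

    The first frame is always sent; after that a frame is sent only when its
    mean absolute difference from the last sent frame exceeds `threshold`.
    """
    if not frames:
        return []
    sent = [0]
    reference = frames[0]
    for i, frame in enumerate(frames[1:], start=1):
        if mean_abs_diff(reference, frame) > threshold:
            sent.append(i)
            reference = frame  # new baseline for later comparisons
    return sent


# Toy 16-pixel frames: nothing happens for three frames, then half the image changes.
static = [10] * 16
moved = [10] * 8 + [200] * 8
frames = [static, static, static, moved, moved]

print(frames_to_send(frames))  # [0, 3]
```

In this toy run only two of the five frames would be communicated, which is the point of processing at the source: when the scene is static, almost nothing needs to leave the camera.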

Scaling up: Sensor networks & Smart cities

MatchPoint: https://datascience.sdsc.edu/matchpoint

CSE255/DSE230

• A fun course.
• Not an easy course.
• Weekly HW, from Friday to Friday; expect to spend ~10 hours on each HW.
• You are expected to figure out things on your own.

• Consult documentation of Python, Spark, etc.
• Brush up on your linear algebra: eigenvectors, eigenvalues, eigendecomposition.
• See linear algebra material on website.
• Wikipedia

• You are expected to participate in class and on Piazza.

What will you learn?
From: Doing Data Science: Straight Talk from the Frontline, Rachel Schutt & Cathy O’Neil

& Communication skills

Python, Spark

Linear Algebra, PCA, Regression, Classification

Jupyter Notebooks, Visualization, Interpretation, Breakdown Problems

Jupyter Notebooks

• Pull them from the GitHub repository.
• They are your main resource:
• Class slides are derived from the notebooks
• Code
• Explanations
• Pointers to additional resources
• Exercises

Grading

• HW: 50%
• There will be 9 HW assignments; the one with the lowest grade will be dropped from the average.

• Quiz: 10%
• Each Thursday. Lowest grade dropped from average.

• Breakdown Problems: 10%
• Explained on class web page.

• Final: 30%
• Yet to decide whether in-class or take-home.
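The weighting above can be written out as a small calculation. This is only a sketch of how the stated percentages and the drop-the-lowest rule might combine; it assumes every component is scored on a 0 to 100 scale and that the Breakdown Problems component is already a single number, neither of which is specified on the slide. The function names and sample scores are made up for illustration.

```python
def mean_drop_lowest(scores):
    """Average of the scores after dropping the single lowest one."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to drop one")
    return (sum(scores) - min(scores)) / (len(scores) - 1)


def course_grade(hw, quizzes, breakdown, final):
    """Weighted grade: HW 50%, Quiz 10%, Breakdown Problems 10%, Final 30%.

    Assumption: all inputs are on a 0-100 scale.
    """
    return (0.50 * mean_drop_lowest(hw)
            + 0.10 * mean_drop_lowest(quizzes)
            + 0.10 * breakdown
            + 0.30 * final)


# Hypothetical student: one missed homework and one bad quiz are both dropped.
hw_scores = [90] * 8 + [0]      # 9 assignments, lowest dropped
quiz_scores = [80, 80, 40]      # lowest dropped
grade = course_grade(hw_scores, quiz_scores, breakdown=100, final=70)
print(round(grade, 2))          # about 84
```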

More details on the website

• Go to https://mas-dse.github.io/DSE230/
