CSE 427 – CLOUD COMPUTING WITH BIG DATA APPLICATIONS Fall 2016 Marion Neumann COURSE SYLLABUS


CSE 427 – CLOUD COMPUTING WITH BIG DATA APPLICATIONS

Fall 2016
Marion Neumann

COURSE SYLLABUS


ABOUT

• Marion Neumann
  • email: m dot neumann at wustl dot edu
  • office: Jolley Hall 222
  • office hours: MON 4-5pm

• Course website: http://sites.wustl.edu/neumann/courses/fall-2016/cse-427s/

• Piazza
  • use it for any questions and suggestions about the course! Sign up here: piazza.com/wustl/fall2016/cse427s
  • no anonymous posts

You are a real person!


LECTURES AND LABS

• Monday & Wednesday 2:30-4pm
  • Lectures in Louderman 458
  • Lab sessions in Eads 016
    – ca. 7-8 labs replace respective lectures
    – will be announced in class, on Piazza, and on the course webpage

• Lecture participation is beneficial
  • Black/white board notes
  • Demos/practical examples
  • Quizzes

• Lab participation is beneficial
  • VM debugging with fellow students, TAs, and instructor
  • data preparation for homework assignments
  • Quizzes


HOMEWORK ASSIGNMENTS

• Homework assignments (40%)
  • weekly!
    – assigned on MON or as announced in class (after the lecture)
    – due following MON (2:30pm, before the lecture) or as indicated on the assignment
  • work in groups of 2 (you can use Piazza to find a partner)
  • use SVN repository for submissions → find instructions on how to use SVN on the course webpage

• Final Project (20%)
  • implementation component and conceptual component
  • due 14th of December

• TA office hours
  • TBA


LATE POLICY – NO MAKE UPS

• homework assignments must be turned in on time
  • it is your responsibility to commit your work to your SVN repo

I am NOT taking late submissions!

• you get an automatic 1 class extension on every homework → use this with caution:

There is NO extension to this extension, for any reason (at all).

• no makeup quizzes or assignments for any reason (this includes grade improvements, failed SVN commits, misinterpreted due dates, …)


COLLABORATION AND ACADEMIC DISHONESTY

• Collaboration Policy
  You are encouraged to discuss the course material with other students. Discussing the material, and the general form of solutions to the labs, is a key part of the class. Since, for many of the assignments, there is no single “right” answer, talking to other students and to the TAs is a good thing. However, everything that you turn in should be your own work, unless we tell you otherwise. If you talk about assignments with another student, then you need to explicitly tell us on the hand-in. You are not allowed to copy answers or parts of answers from anyone else, or from material you find on the Internet. This will be considered willful cheating, and will be dealt with according to the official collaboration policy.

Your solutions will be compared to the solutions of other students and solutions available ONLINE!

• Academic Dishonesty
  Unless explicitly instructed otherwise, everything that you turn in for this course must be your own work. If you willfully misrepresent someone else’s work as your own, you are guilty of cheating. Cheating, in any form, will not be tolerated in this class. There is zero tolerance for Academic Dishonesty. I will be actively searching for academic dishonesty on all homework assignments, quizzes, and exams. If you are guilty of cheating on any assignment or exam, you will receive an F in the course and be referred to the School of Engineering Discipline Committee. In severe cases, this can lead to expulsion from the University, as well as possible deportation for international students. If you copy from anyone in the class, both parties will be penalized, regardless of which direction the information flowed.

This is your only warning.


IN-CLASS EXAMS AND QUIZZES

• 2 in-class exams
  • Count for 20% of total class performance each
  • Dates:
    • Final: 7 Dec 2016
    • Midterm: TBA

• Quizzes
  • will be given in lectures and labs
  • need WiFi-enabled device (laptop, tablet, smartphone, …)
  • completion and results will be recorded (via student ID)
  • will be used to decide letter grades for borderline scores (less than 1% away from cutoff)
  • >60% quiz participation is required for “grade bump”
  • no makeup for quizzes


GRADING POLICY

• Grading Summary
  • 40% homework assignments
  • 20% final project
  • 20% midterm
  • 20% final
  • if borderline: > 60% completed quizzes allow for better grade

• This is only half a “Systems” class!
  • exams test your conceptual knowledge
  • exams count for 50% of the course performance
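As a rough illustration of how the grading summary combines, the following Python sketch computes a weighted course score and checks the quiz-based borderline rule. The weights and the 60% quiz threshold come from the slides; the function names, the example scores, and the cutoff of 90 are hypothetical, and reading "less than 1% away from cutoff" as "less than one percentage point below the cutoff" is an assumption.

```python
# Weights taken from the grading summary (they sum to 100%).
WEIGHTS = {"homework": 0.40, "project": 0.20, "midterm": 0.20, "final": 0.20}


def course_score(scores):
    """Return the weighted course score; each component score is in 0..100."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def qualifies_for_bump(score, cutoff, quiz_participation):
    """Check the borderline rule for the next-higher letter grade.

    Assumption: "borderline" means less than one percentage point below
    `cutoff`. The cutoff itself is hypothetical; the slides do not list
    letter-grade cutoffs. `quiz_participation` is a fraction in 0..1.
    """
    is_borderline = 0 < cutoff - score < 1.0
    return is_borderline and quiz_participation > 0.60


# Made-up example scores, for illustration only.
scores = {"homework": 92.0, "project": 88.0, "midterm": 85.0, "final": 90.0}
total = course_score(scores)
print(round(total, 1))
print(qualifies_for_bump(total, cutoff=90.0, quiz_participation=0.75))
```

With these made-up numbers the score lands just under the hypothetical cutoff, so quiz participation decides whether the higher grade applies.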

[Figure labels: implementation skills; conceptual understanding / critical thinking; CSE 427 S²]


WARM-UP QUIZ

• go to: https://b.socrative.com/login/student/

• room name will be announced in class

• enter your student ID (6-digit number)
  • NOT your name


COURSE OBJECTIVE

• Understand conceptually
  • what Big Data is
  • what large-scale data management and analysis means

• Understand specifically
  • how MapReduce implements distributed data analysis
  • how a Hadoop cluster achieves parallel computing and data storage
  • the development process to tackle Big Data analysis tasks
  • which Hadoop Big Data tools are useful for which application

• Hands-on practice
  • using Hadoop
  • implementing algorithms in MapReduce (Java) and Spark (Python or Scala)
  • data analysis with Hadoop tools (Pig, Hive, Impala)
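To give a first feel for the MapReduce model named in the objectives, here is a minimal single-process sketch of word count in plain Python. It only simulates the map, shuffle, and reduce steps in memory; it does not use the Hadoop or Spark APIs, and all function names are hypothetical. On a real cluster the map and reduce calls would run in parallel on different nodes, and the framework would perform the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return (key, sum(values))


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big clusters", "data tools"])
print(counts)
```

The point of the split is that each map call sees only one line and each reduce call sees only one key, which is what lets a framework distribute the work.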


TOPICS TO BE COVERED (SYLLABUS)

PART I: MapReduce
• Distributed File Systems & MapReduce
  • HDFS
  • Hadoop MapReduce
• Developing Programs in Hadoop MapReduce
• MapReducing Algorithms
• Introduction to Apache Spark

PART II: Big Data Analysis
• Application: Recommendation engines
• Data Analysis
  • Hadoop Pig, Hive, and Impala
• Data Management
  • Hadoop tools (Sqoop)

Contents may be subject to changes!

TOPICS TO BE COVERED (SYLLABUS)

PART III: (More) Big Data Applications
• Large-scale Machine Learning
  • Classification using MapReduce
  • Clustering in Spark

Optional: Structured and High-dimensional Data
• Graph Data
  • Link Analysis using PageRank
  • Social network analysis
• Information Retrieval/Finding Similar Items
  • Big feature spaces
  • Document retrieval
  • Locality-sensitive hashing


BACKGROUND & PREREQUISITES

• Programming
  • Java (← mainly)
  • Python [or Perl, Scala] (some)
  • SQL (← very useful) and relational databases (RDBMS)

• Algorithms
  • sorting
  • hashing
  • CSE 247

• Maths
  • matrices, some linear algebra
  • probabilities
  • graphs
  • machine learning (supervised learning, classification, training/testing)


COURSE MATERIALS

• The content of this class is derived largely from the Cloudera Developer Training for Apache Hadoop, Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop, and the Cloudera Developer Training for Apache Spark, which are made available to Washington University through the Cloudera Academic Partnership program.

• Cloudera Course VM → install before WED 7th of Sept!

• Further materials are adapted from the “Mining of Massive Datasets” book (http://www.mmds.org/) and the class taught at Stanford by Jure Leskovec


READING

• Required books
  • Hadoop: The Definitive Guide (4th edition) by Tom White
  • Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, Jeff Ullman (available for free online: http://mmds.org)

• Optional book
  • Data Algorithms: Recipes for Scaling Up with Hadoop and Spark by Mahmoud Parsian

• Reading and further materials will be posted on the course webpage.
• All readings are considered course material and are exam relevant!

Use CLDR14 to save 40% on O’Reilly books & 50% on ebooks!

BEFORE next lecture


SUMMARY

• All relevant information can be found on the course webpage: http://sites.wustl.edu/neumann/courses/fall-2016/cse-427s/

• Ask all questions on Piazza!

Do you have any questions?
