python for data analysis and visualiza[on

34
Python for Data Analysis and Visualiza4on Fang (Cherry) Liu, Ph.D [email protected] PACE Gatech July 2013

Upload: trinhdat

Post on 02-Jan-2017

236 views

Category:

Documents


5 download

TRANSCRIPT

Page 1: Python for Data Analysis and Visualiza[on

Python  for  Data  Analysis  and  Visualiza4on  Fang  (Cherry)  Liu,  Ph.D  [email protected]  

PACE  Gatech  July  2013  

Page 2: Python for Data Analysis and Visualiza[on

Outline    

•  System  requirements  and  IPython  •  Why  use  python  for  data  analysis  and  visula4on  

•  Data  set  –  US  baby  names  1880-­‐2012  – Data  Loading  – Data  Processing  using  Lists  – Data  Aggreg4on  and  Group    

•  PloTng  and  visualiza4on  

[email protected]   2  

Page 3: Python for Data Analysis and Visualiza[on

System  Setup  •  Op4on  1  :    (Preferred)  Download  and  Install  the  Enthought  Canopy  product:  

hWps://www.enthought.com/products/canopy/academic/  Enthought  Canopy  is  free  for  Academic  Users.  This  will  install  a  full  Python  distribu4on  onto  your  computer.    

•  Op4on  2:  Download  and  Install  Python(x,y)  (This  is  for  Windows  only)  hWps://code.google.com/p/pythonxy/wiki/Downloads  This  will  install  a  full  Python  distribu4on  onto  your  computer.    Note1:  Op4ons  1  and  2  are  mutually  exclusive.  Please  do  not  install  both  Canopy  and  Python(x,y)  on  your  computer.  Note2:  Downloading  and  installing  either  Canopy  or  Python(x,y)  will  take  a  long  4me.    Note3:  During  this  course,  Canopy  will  be  used  to  type  and  execute  all  commands  (op4on  1).    

•  Op4on  3:  Use  the  Python  installed  on  PACE  clusters.  (You  need  a  PACE  account  for  this  to  work)  If  you  choose  this  op4on,  let  me  know  and  I'll  send  instruc4ons  that  will  help  ensure  that  your  environment  is  setup  properly  for  the  tutorials.    

•  Op4on  4:  Use  the  Python  already  installed  on  your  laptop.  As  long  as  Numpy,  SciPy,  Matplotlib,  IPython,  and  Pandas  are  installed  on  your  laptop,  you  will  be  able  follow  both  courses  (Scien4fic  Compu4ng  and  Data  Analysis  and  Visualiza4on).    

•     

[email protected]   3  

Page 4: Python for Data Analysis and Visualiza[on

IPython  –  An  Interac4ve  Compu4ng  and  Development  Environment  

•  It  provides  an  execute-­‐explore  workflow  instead  of  typical  edit-­‐compile-­‐run  workflow  of  many  other  programming  languages  

•  It  provides  very  4ght  integra4on  with  the  opera4ng  system’s  shell  and  file  system  

•  It  also  includes:  – A  rich  GUI  console  with  inline  ploTng  – A  web-­‐based  interac4ve  notebook  format  – A  lightweight,  fast  parallel  compu4ng  engine  

[email protected]   4  

Page 5: Python for Data Analysis and Visualiza[on

Why  use  Python  for  Data  Analysis  

•  The  Python  language  is  easy  to  fall  in  love  with  •  Python  is  dis4nguished  by  its  large  and  ac4ve  scien4fic  compu4ng  community  

•  Adop4on  of  Python  for  scien4fic  compu4ng  in  both  industry  applica4ons  and  academic  research  has  increased  significantly  since  the  early  2000s  

•  Python’s  improved  library  support  (pandas)  made  it  a  strong  tool  for  data  manipula4on  tasks  

[email protected]   5  

Page 6: Python for Data Analysis and Visualiza[on

Example:  US  Baby  Names  1880-­‐2012  

•  The  United  States  Social  Security  Administra4on  (SSA)  has  mad  available  data  on  the  frequency  of  baby  names  from  1880  through  2012,  this  data  set  is  ofen  used  in  illustra4ng  data  manipula4on  in  R,  Python,  etc.  The  data  can  be  obtained  at:  hWp://www.ssa.gov/oact/babynames/limits.html  

 •  Things  can  be  done  with  this  data  set  –  Visualize  the  propor4on  of  babies  given  a  par4cular  name  –  Determine  the  naming  trend    –  Determine  the  most  popular  names  in  each  year       [email protected]   6  

Page 7: Python for Data Analysis and Visualiza[on

Check  the  Data  •  In  IPython,    – MacOS  or  Linux:  use  the  UNIX  head  to  look  at  the  first  10  lines  of  the  one  of  the  files.      

– Windows:  download  the  files,  and  click  to  open  the  files  –  This  is  nicely  comma-­‐separated  form.  

       

[email protected]   7  

Page 8: Python for Data Analysis and Visualiza[on

Load  Data  •  Using  csv  module  from  the  standard  library,  CSV  means  Comma  Separated  Values,  and  any  delimiter  can  be  chosen.    

•  The  variable  table  contains  records  list  in  which  each  record  has  three  fields  :  name,  sex,  count    

[email protected]   8  

Page 9: Python for Data Analysis and Visualiza[on

Grouping  the  data  based  on  sex  

•  To  find  the  total  births  by  sex,  the  groupby  func4on  is  used:  –  It  returns  an  iterator  for  each  group  based  on  the  key  value  which  is  extracted  from  x[1]  (sex)  

–  Then  traverses  the  group  and  get  the  total  counts  –   Be  sure  to  do  “from  itertools  import  groupby”  first    

[email protected]   9  

Page 10: Python for Data Analysis and Visualiza[on

Anonymous  (lamda)  Func4ons  •  Anonymous  or  lambda  func4ons  are  simple  func4ons  

consis4ng  of  a  single  statement,  the  result  is  the  return  value.  

•  Lamda  func4ons  are  convenient  in  data  analysis  since  there  are  many  cases  where  data  transforma4on  func4ons  will  take  func4ons  as  arguments.  

[email protected]   10  

Page 11: Python for Data Analysis and Visualiza[on

Aggregate  the  data  at  the  year  and  sex  level  •  Since  the  

data  set  is  split  into  files  by  year,  one  need  to  traverse  all  the  files  to  get  the  total  number  of  births  per  year  per  sex  

 [email protected]   11  

Page 12: Python for Data Analysis and Visualiza[on

The  result  list  

•  (Lef)  first  10  records  in  pieces  list  •  (Right)  last  10  records  in  pieces  list  

[email protected]   12  

Page 13: Python for Data Analysis and Visualiza[on

Matplotlib  review  •  Before  we  start  ploTng  the  result,  let’s  review  the  plot  first  

 

[email protected]   13  

Page 14: Python for Data Analysis and Visualiza[on

Prepare  the  data  for  plot  •  Currently,  the  result  is  a  list  of  list,    each  internal  list  include  three  values,  [year,  female  births,  male  births],    to  plot  the  births  according  to  year  and  sex,  the  plot  needs  to  have  year  as  x-­‐axis,  and  births  as  y-­‐axis,  while  two  lines  will  be  showing  to  represent  female  and  male  birth.    

[email protected]   14  

Page 15: Python for Data Analysis and Visualiza[on

Plot  the  total  births  by  sex  and  year  

•  Plot    

[email protected]   15  

Page 16: Python for Data Analysis and Visualiza[on

Reorganize  the  data  

•  Concatenate  the  all  files  together  to  prepare  the  further  analysis.  

   

[email protected]   16  

Page 17: Python for Data Analysis and Visualiza[on

Extract  a  subset  of  the  data  

•  Find  the  top  1000  names  for  each  sex/year  combina4on,  further  narrow  down  the  data  set  to  facilitate  further  analysis,  the  sor4ng  is  ignored  here  since  the  input  files  are  already  in  descending  order  

[email protected]   17  

Page 18: Python for Data Analysis and Visualiza[on

Compare  the  subset  data  with  original  data  

•  The  subset  data  has  much  less  records  than  the  original  data  set,  but  represents  the  majority  informa4on  

[email protected]   18  

Page 19: Python for Data Analysis and Visualiza[on

Analyzing  Naming  Trends  

•  With  the  full  data  set  and  Top  1,000  data  set  in  hand,  we  can  start  analyzing  various  naming  trends  of  interest.  SpliTng  the  Top  1,000  names  into  the  boy  and  girl  por4ons:  

[email protected]   19  

Page 20: Python for Data Analysis and Visualiza[on

Analyzing  Naming  Trends  (Cont.)  •  Plot  for  a  handful  of  names  in  a  subplot,  John,  Harry,  Marry,  to  compare  their  trends  over  the  years,  first  prepare  data  set  for  each  chosen  name.  

 [email protected]   20  

Page 21: Python for Data Analysis and Visualiza[on

Analyzing  Naming  Trends  (Cont.)  •  Plot  three  curves  

ver4cally,  with  x-­‐axis  as  years,  y-­‐axis  as  births,  the  result  shows  that  those  names  have  grown  out  of  favor  with  American  popula4on  

 

[email protected]   21  

Page 22: Python for Data Analysis and Visualiza[on

Measuring  the  increase  in  naming  diversity  

•  To  explain  why  there  is  a  decrease  in  the  previous  plots,  we  can  measure  the  propor4on  of  births  represented  by  the  top  1000  most  popular  names  by  year  and  sex  –  Step  1:  find  total  of  birth  per  year  for  each  sex    

[email protected]   22  

Page 23: Python for Data Analysis and Visualiza[on

Measuring  the  increase  in  naming  diversity  (Cont.)  

– Step  2:  compute  the  propor4on  of  top  1000  births    to  the  total  births  per  year  per  sex    

For  boys:    

[email protected]   23  

Page 24: Python for Data Analysis and Visualiza[on

Measuring  the  increase  in  naming  diversity  (Cont.)  

For  girls:    

[email protected]   24  

Page 25: Python for Data Analysis and Visualiza[on

Measuring  the  increase  in  naming  diversity  (Cont.)  

•  Plot  the  result  shows  that  fewer  parents  are  choosing  the  popular  names  for  their  children  over  the  years    

 

[email protected]   25  

Page 26: Python for Data Analysis and Visualiza[on

Measuring  the  increase  in  naming  diversity  (Cont.)  

•  Another  interest  metric  is  the  number  of  dis4nct  popular  names,  taken  in  order  of  popularity  from  highest  to  lowest  in  the  top  50%  of  births.  –  Step  1:  Add  the  fourth  column  to  girls1000  and  boys1000  list,  to  represent  the  birth  propor4on  to  the  total  birth  of  the  given  year,  then  sort  the  list  in  descending  order  on  propor4on,  sort  the  list  again  in  ascending  order  on  years.  The  result  list  will  have  each  years  records  in  a  chunk  with  propor4on  number  in  decreasing  order.  

[email protected]   26  

Page 27: Python for Data Analysis and Visualiza[on

Measuring  the  increase  in  naming  diversity  (Cont.)  

•  For  girls:  

[email protected]   27  

Page 28: Python for Data Analysis and Visualiza[on

Measuring  the  increase  in  naming  diversity  (Cont.)  

•  For  boys:  

[email protected]   28  

Page 29: Python for Data Analysis and Visualiza[on

Measuring  the  increase  in  naming  diversity  (Cont.)  

•  Step  2:  Adding  the  propor4on  for  each  year  from  highest  un4l  the  total  propor4on  reaches  50%,  recording  the  number  of  individual  names    

[email protected]   29  

Page 30: Python for Data Analysis and Visualiza[on

Measuring  the  increase  in  naming  diversity  (Cont.)  

•  For  girls:  

[email protected]   30  

Page 31: Python for Data Analysis and Visualiza[on

Measuring  the  increase  in  naming  diversity  (Cont.)  

•  For  boys:  

[email protected]   31  

Page 32: Python for Data Analysis and Visualiza[on

Measuring  the  increase  in  naming  diversity  (Cont.)  

•  Step  3:  Plot  the  result,  as  you  can  see,  girl  names  has  always  been  more  diverse  than  boy  names,  and  the  dis4nguished  names  become  more  over  4me.    

[email protected]   32  

Page 33: Python for Data Analysis and Visualiza[on

Python  Library  for  Data  Analysis  •  Pandas  wriWen  by  Wes  McKinney  hWp://pandas.pydata.org/  –  provides  rich  data  structures  and  func4ons    working  with  structured  data  

–  It  is  one  of  the  cri4cal  ingredients  enabling  Python  to  be  a  powerful  and  produc4ve  data  analysis  environment.  

–  The  primary  object  in  pandas  is  called  DataFrame  –  a  two-­‐dimensional  tabular,  column-­‐oriented  data  structure  with  both  row  and  column  labels    

–  Pandas  combines  the  features  of  NumPy,  spreadsheets  and  rela4onal  databases  

[email protected]   33  

Page 34: Python for Data Analysis and Visualiza[on

Useful  Links  

•  Python  Scien4fic  Lecture  Notes  hWp://scipy-­‐lectures.github.io/  

•  Matplotlib  hWp://matplotlib.org/  •  Documenta4on  hWp://docs.python.org  

[email protected]   34