
• Big Data: big opportunities and big problems

  Paul Clough
  Information School
  University of Sheffield

• Outline

  • Introduction
  • Notions of Big Data
  • Opportunities of Big Data
  • Challenges of Big Data
  • Summary

• http://www.oreilly.com/data/free/bigdatanow2013.csp

• http://www.domo.com/blog/2014/04/data-never-sleeps-2-0/

• What is 'Big Data'?

  • "Simply put, it's about data sets so large – in volume, velocity and variety – that they're impossible to manage with conventional database tools." (Michael Friedenberg, Network World)

  • "Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it." (Dumbill, 2012)

  • "Every day of the week, we create 2.5 quintillion bytes of data. This data comes from everywhere: from sensors used to gather climate information, posts to social media sites, digital pictures and videos posted online, transaction records of online purchases, and from cell phone GPS signals – to name a few. In the 11 years between 2009 and 2020, the size of the 'Digital Universe' will increase 44 fold. That's a 41% increase in capacity every year. In addition, only 5% of this data being created is structured and the remaining 95% is largely unstructured, or at best semi-structured. This is Big Data." (Burlingame & Nielsen, 2013)
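  As a quick check of the quoted figures: a 44-fold increase over 11 years corresponds to a compound annual growth rate of

  $$44^{1/11} \approx 1.41,$$

  i.e. roughly the 41% per year the quote cites.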

• Big Data 'revolution'

  • Cukier & Mayer-Schoenberger (2013) argue that the Big Data revolution consists of
    – Collecting large amounts of data rather than smaller samples (from some to all, i.e. N=All)
    – Tolerating inaccuracies in larger amounts of data compared to higher quality smaller amounts (from clean to messy)
    – Giving up on knowing the causes and accepting only associations: "Using big data will sometimes mean forgoing the quest for why in return for knowing what"

  http://www.foreignaffairs.com/articles/139104/kenneth-neil-cukier-and-viktor-mayer-schoenberger/the-rise-of-big-data

• Evolution of data analysis

  "The idea of analysing data to make sense of what's happening in our businesses has been with us for a long time (in corporations since at least 1954, when UPS started an analytics group), so why do we have to keep coming up with new names to describe it?" (Davenport, 2014:10)

  Term                                 | Time frame   | Specific meaning
  Decision support                     | 1970-1985    | Use of data analysis to support decision making
  Executive support                    | 1980-1990    | Focus on data analysis for decisions by senior executives
  Online Analytical Processing (OLAP)  | 1990-2000    | Software for analysing multidimensional data tables
  Business Intelligence                | 1989-2005    | Tools to support data-driven decisions, with emphasis on reporting
  Analytics                            | 2005-2010    | Focus on statistical and mathematical analysis for decisions
  Big Data                             | 2010-present | Focus on very large, unstructured, fast-moving data

• Does Big Data = traditional data analytics?

                     | Big Data                   | Traditional analytics
    Type of data     | Unstructured formats       | Formatted in rows and columns
    Volume of data   | 100 terabytes to petabytes | Tens of terabytes or less
    Flow of data     | Constant flow of data      | Static pool of data
    Analysis methods | Machine learning           | Hypothesis-based
    Primary purpose  | Data-based products        | Internal decision support and services

  Source: (Davenport, 2014:4)

• 'Dimensions' of Big Data

  • Analysis from IBM identified the main dimensions or characteristics of Big Data
    – Volume (amount of data): the large amount of data being generated and stored (normally in the order of TBs or PBs)
    – Variety (forms of data): the range of data types and sources being used, including unstructured data
    – Velocity (speed of data): the rate at which data is collected, shared and analysed - often real-time streaming data (e.g., from social media)
    – Veracity (reliability of data): uncertainty in data quality (accuracy, provenance, relevance and consistency)

  The Vs debate – Gartner got there first! http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiming-gartners-volume-velocity-variety-construct-for-big-data/

• Name      | Equal to        | Size in bytes
  Bit       | 1 bit           | 1/8
  Nibble    | 4 bits          | 1/2
  Byte      | 8 bits          | 1
  Kilobyte  | 1024 bytes      | 1,024
  Megabyte  | 1024 kilobytes  | 1,048,576
  Gigabyte  | 1024 megabytes  | 1,073,741,824
  Terabyte  | 1024 gigabytes  | 1,099,511,627,776
  Petabyte  | 1024 terabytes  | 1,125,899,906,842,624
  Exabyte   | 1024 petabytes  | 1,152,921,504,606,846,976
  Zettabyte | 1024 exabytes   | 1,180,591,620,717,411,303,424
  Yottabyte | 1024 zettabytes | 1,208,925,819,614,629,174,706,176

  "There was 5 Exabytes of information created between the dawn of civilisation through 2003, but that much information is now created every 2 days, and the pace is increasing." Eric Schmidt, Google
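  A minimal Python sketch (added for reference) that regenerates the binary, 1024-based values in the table above:

```python
# Minimal sketch: regenerate the 1024-based unit sizes from the table above.
UNITS = ["Byte", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]

for power, name in enumerate(UNITS):
    # Each unit is 1024 times the previous one, i.e. 1024**power bytes.
    print(f"{name:>9}: {1024 ** power:,} bytes")
```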

• Volume

  • Large Hadron Collider at CERN
    – Generates around 25 petabytes per year (600 million collisions taking place every second)
  • Walmart
    – Handles more than 1 million customer transactions every hour
    – Transactions imported into databases estimated to contain more than 2.5 petabytes of data
  • Facebook (back in 2010)
    – 500 million active users
    – 100 billion hits per day
    – 50 billion photos
    – 2 trillion objects cached, with hundreds of millions of requests per second
    – 130TB of logs every day
    – https://www.facebook.com/notes/facebook-engineering/scaling-facebook-to-500-million-users-and-beyond/409881258919

  SAP blog post: "even with rapid growth of data 95% of enterprises use between 0.5TB-40TB of data today."

  http://www.slideshare.net/BernardMarr/big-data-25-facts

• Variability

  • Event or transaction logs
  • Social media
  • Sensors
  • Internet of Things
    – ~50 billion sensors connected to the Internet by 2025
  • Smartphones
  • Network traffic
  • Images, videos and sounds
  • Emails
  • Blog posts
  • ……

  Datafication: "… taking all aspects of life and turning them into data. Google's augmented-reality glasses datafy the gaze. Twitter datafies stray thoughts. LinkedIn datafies professional networks."

  "A 2012 survey by NewVantage Partners of over fifty executives in large organisations suggests that for large companies, the lack of structure of data is more salient than addressing its size." (Davenport, 2014)

  Estimated that 95% of Big Data is unstructured

• What's in a name?

  • Davenport (2014) highlights a number of problems with the name 'Big Data' (not the idea), including
    – 'Big' is only one aspect of what's distinctive about new forms of data (structure is often the bigger problem)
    – 'Big' is relative and will change
    – If the data doesn't fit all the Vs, is it still Big Data?
    – The term 'Big Data' is being misused by vendors and marketing companies to refer to any analytics and reporting

  "The point is not to be dazzled by the volume of data, but rather to analyse it – to convert it into insights, innovations and business value." (Davenport, 2014)

• Benefits of Big Data

  • Economic benefit: gains in productivity, competitive advantage, and efficiency
  • Increased demand for a highly-skilled, data-literate workforce
  • Promoting awareness of data and access to large open datasets (civic engagement)
  • Enabling better understanding in various domains (e.g. climate trends)
  • …..

  "The main difference between big data and the standard data analytics that we've always done in the past is that big [data] allows us to predict behaviour. Also, predict events based upon lots of sources of data that we can now combine in ways that we weren't able to before." Paul Malyon, Experian

• Uses of Big Data in business

  http://www.atkearney.com/strategic-it/ideas-insights/article/-/asset_publisher/LCcgOeS4t85g/content/big-data-and-the-creative-destruction-of-today-s-business-models/10192

• Big Data opportunities

  http://www.forbes.com/sites/louiscolumbus/2012/08/16/roundup-of-big-data-forecasts-and-market-estimates-2012/

• The value of Big Data

  • At least three classes of value (Davenport, 2014)
    – Cost reductions (e.g., use of Big Data technologies)
    – Improvements in decision-making
    – Improvements in products and services (e.g., the People You May Know, or PYMK, feature in LinkedIn)

  "In his lecture 'The Unreasonable Effectiveness of Data', Peter Norvig, Director of Research at Google, highlights that you gain much better insight from running relatively simple algorithms on large datasets than you do from running complex algorithms on smaller datasets. Simply put, greater volumes of data can provide much better insights."

  Source: http://www.garycrawford.co.uk/big-data-part-1-big-what/

• Big Data opportunities

  James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers (2012) Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute. Available online: http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

• Technical challenges with Big Data

  • Specialised technologies needed to store and manage Big Data
    – Dealing with large datasets (TBs to ZBs)
    – Dealing with data of an unstructured nature
    – Dealing with real-time streaming data (e.g., Twitter)
  • Computational processing and speed of feedback loop (velocity)
  • Problems with analysing large datasets
  • Sources of Big Data often require significant pre-processing

  http://www.ibm.com/developerworks/library/bd-streamsintro/

• Further challenges with Big Data

  • Alex Pentland (2012) identifies obstacles for Big Data
    – The correlation problem: with larger amounts of data, everything becomes statistically significant and you end up discovering meaningless patterns (Bonferroni's Principle); see the sketch after this list
    – The "human understanding" problem: large datasets make understanding underlying data properties difficult (e.g., overfitting and hidden biases)
    – The provenance problem: collecting data from reliable sources and tracking subsequent data use (and reuse)
    – The privacy problem: consumers are asking about their rights to prevent collection and analysis of data they leave behind

  http://blogs.hbr.org/2012/10/big-datas-biggest-obstacles/
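  A minimal numpy sketch of the correlation problem (synthetic data, added for illustration, not from the slides): screen ten thousand pure-noise variables against a pure-noise outcome, and the best match comfortably clears a conventional significance threshold.

```python
# Illustrative sketch: with enough candidate variables, some will correlate
# "significantly" with pure noise by chance alone (Bonferroni's Principle).
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 10_000

target = rng.normal(size=n_samples)                   # pure-noise "outcome"
features = rng.normal(size=(n_features, n_samples))   # pure-noise "predictors"

# Pearson correlation of every feature with the target, via z-scores.
fz = (features - features.mean(axis=1, keepdims=True)) / features.std(axis=1, keepdims=True)
tz = (target - target.mean()) / target.std()
corrs = fz @ tz / n_samples

best = np.argmax(np.abs(corrs))
print(f"Best of {n_features} random features: r = {corrs[best]:+.3f}")
# Typically |r| ~ 0.4 here, well above the ~0.2 needed for p < 0.05 at
# n = 100, even though every variable in this example is random noise.
```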

• http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3FMtJew6E

• Challenges: sample error and bias

  • In 1936, the Republican Alfred Landon stood for election against President Franklin Delano Roosevelt
  • Two different surveys predicted the outcome:
    – The Literary Digest conducted a postal opinion poll with the aim of reaching 10 million people, a quarter of the electorate. After tabulating an astonishing 2.4 million returns as they flowed in over two months, The Literary Digest announced its conclusions: Landon would win by a convincing 55% to 41%, with a few voters favouring a third candidate
    – George Gallup conducted a far smaller survey (3,000 people) and forecast a comfortable victory for Roosevelt
  • The election result: Roosevelt crushed Landon by 61% to 37%
  • George Gallup understood something that The Literary Digest did not: when it comes to data, size isn't everything
  • Opinion pollsters need to deal with two issues: sample error and sample bias

  Source: http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3FMtJew6E

• What went wrong?

  • Sample error: the risk that, by chance, the sample does not reflect the true views of the population
    – The margin of error reflects this risk, and a larger sample reduces it
    – Why did 3,000 work better than 2.4 million?
    – Answer: sample bias
  • Sample bias: when the sample is not chosen randomly
    – The Literary Digest mailed out forms to people on a list it had compiled from automobile registrations and telephone directories – a sample that, at least in 1936, was disproportionately prosperous
    – To compound the problem, Landon supporters turned out to be more likely to mail back their answers
    – George Gallup took pains to find an unbiased sample because he knew that was far more important than finding a big one (see the simulation below)
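  A small Python simulation of the Digest-vs-Gallup story (the 3x over-representation factor is an assumption chosen for illustration, not a historical estimate): a quarter-million biased responses call the election the wrong way, while 3,000 random ones land near the truth.

```python
# Illustrative sketch: a large biased sample mis-predicts a 1936-style
# election, while a small random sample does fine.
import numpy as np

rng = np.random.default_rng(1936)
population = rng.random(1_000_000) < 0.61     # True = Roosevelt voter (61%)

# Biased frame: assume Landon voters are 3x as likely to be on the mailing
# list and reply -- loosely mimicking the Digest's car/phone-directory lists.
weights = np.where(population, 1.0, 3.0)
weights /= weights.sum()

big_biased = rng.choice(population, size=240_000, p=weights)
small_random = rng.choice(population, size=3_000)

print(f"Truth:            {population.mean():.1%} for Roosevelt")
print(f"Biased n=240,000: {big_biased.mean():.1%}")   # ~34%: wrong winner
print(f"Random n=3,000:   {small_random.mean():.1%}")  # ~61% +/- ~1%
```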


• Challenges: N=All?

  "Professor Viktor Mayer-Schönberger of Oxford's Internet Institute, co-author of Big Data, told me that his favoured definition of a big data set is one where 'N = All' – where we no longer have to sample, but we have the entire background population. Returning officers do not estimate an election result with a representative tally: they count the votes – all the votes. And when 'N = All' there is indeed no issue of sampling bias because the sample includes everyone."

  "But is 'N = All' really a good description of most of the found data sets we are considering? Probably not. 'I would challenge the notion that one could ever have all the data,' says Patrick Wolfe, a computer scientist and professor of statistics at University College London."

  "An example is Twitter. It is in principle possible to record and analyse every message on Twitter and use it to draw conclusions about the public mood. (In practice, most researchers use a subset of that vast 'fire hose' of data.) But while we can look at all the tweets, Twitter users are not representative of the population as a whole. (According to the Pew Research Internet Project, in 2013, US-based Twitter users were disproportionately young, urban or suburban, and black.)"

• Challenges: hidden biases

  "Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves."

  "Data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities."

  "Social science methodologies may make the challenge of understanding big data more complex, but they also bring context-awareness to our research to address serious signal problems. Then we can move from the focus on merely 'big' data towards something more three-dimensional: data with depth."

  Kate Crawford, 2013, Harvard Business Review blog, http://blogs.hbr.org/2013/04/the-hidden-biases-in-big-data/

• Things can go wrong: Google Flu Trends

  "We've found that certain search terms are good indicators of flu activity. Google Flu Trends uses aggregated Google search data to estimate current flu activity around the world in near real-time"

  Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, et al. (2009) Detecting influenza epidemics using search engine query data. Nature 457: 1012-1014

• Things can go wrong: Google Flu Trends

  • When people are sick with flu they may search for flu-related information on Google
    – Aggregated over lots of people, this data could be used to predict flu outbreaks (collective intelligence)
  • Google took the 50 million most commonly searched terms between 2003 and 2008 and compared them against historical influenza data from the Centers for Disease Control and Prevention (CDC)
  • Looked at temporal patterns of searches to see whether occurrences correlated with outbreaks of flu in certain areas compared to the CDC's data
    – 45 terms were found to correlate with influenza (e.g., "headache" and "runny nose")
  • Google could produce accurate estimates 2 weeks earlier than the CDC, offering life-saving insights

• Things can go wrong: Google Flu Trends

  • Search terms correlated by pure chance due to millions of search terms being fitted to the CDC's data
    – e.g., "high school basketball"
  • Changes in users' search behaviour
    – Google's autosuggest
    – Media influences

  Google's estimates of the spread of flu-like illnesses were overstated by almost a factor of two in Feb 2013 (a small simulation of the chance-correlation trap follows below)

  Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. The Parable of Google Flu: Traps in Big Data Analysis, Science, 343, no. 14 March: 1203-1205. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf
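  A minimal Python sketch of that chance-correlation trap (entirely synthetic data and numbers, not Google's actual method): screen many random "query" series against a seasonal "flu" curve, keep the best in-sample match, and watch it fail out of sample.

```python
# Illustrative sketch with synthetic data: screening many candidate "query"
# series against a short flu history finds chance matches that look good
# in-sample and collapse out of sample (overfitting, as in the GFT story).
import numpy as np

rng = np.random.default_rng(7)
weeks = 260                                        # five years of weekly data
flu = np.sin(np.arange(weeks) * 2 * np.pi / 52) + rng.normal(0, 0.3, weeks)

queries = rng.normal(size=(20_000, weeks))         # pure-noise "search terms"
train, test = slice(0, 156), slice(156, weeks)     # fit on 3 years, test on 2

def corr(a, b):
    """Pearson correlation between two 1-D series."""
    return np.corrcoef(a, b)[0, 1]

train_r = np.array([corr(q[train], flu[train]) for q in queries])
best = int(np.argmax(np.abs(train_r)))             # best-looking noise series

print(f"in-sample  r = {train_r[best]:+.2f}")      # spuriously strong
print(f"out-sample r = {corr(queries[best][test], flu[test]):+.2f}")  # near 0
```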

• Summary

  • Big Data is an ill-defined term with many definitions – find a definition you can work with
  • Big Data is becoming an obsession with scientists, businesses, governments and the media
  • Much value can be gained from Big Data, but it presents challenges
    – "We have a new resource here [Big Data]," says Professor David Hand of Imperial College London. "But nobody wants 'data'. What they want are the answers."
    – "Data analysis in ignorance of the context can quickly become meaningless or even dangerous." Kate Crawford (2013)
    – "There are a lot of small data problems that occur in big data," says Spiegelhalter. "They don't disappear because you've got lots of the stuff. They get worse."
    – "'Big data' has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever." Tim Harford (2014)

• Questions?

  Paul Clough
  Information School
  University of Sheffield