TRANSCRIPT
-
Big Data: big opportunities and big problems
Paul Clough
Information School
University of Sheffield
-
Outline
• Introduction
• Notions of Big Data
• Opportunities of Big Data
• Challenges of Big Data
• Summary
-
http://www.oreilly.com/data/free/bigdatanow2013.csp
-
http://www.domo.com/blog/2014/04/data-never-sleeps-2-0/
-
What is 'Big Data'?
• "Simply put, it's about data sets so large – in volume, velocity and variety – that they're impossible to manage with conventional database tools." (Michael Friedenberg, Network World)
• "Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it." (Dumbill, 2012)
• "Every day of the week, we create 2.5 quintillion bytes of data. This data comes from everywhere: from sensors used to gather climate information, posts to social media sites, digital pictures and videos posted online, transaction records of online purchases, and from cell phone GPS signals – to name a few. In the 11 years between 2009 and 2020, the size of the 'Digital Universe' will increase 44-fold. That's a 41% increase in capacity every year. In addition, only 5% of the data being created is structured and the remaining 95% is largely unstructured, or at best semi-structured. This is Big Data." (Burlingame & Nielsen, 2013)
-
Big Data 'revolution'
• Cukier & Mayer-Schoenberger (2013) argue that the Big Data revolution consists of
– Collecting large amounts of data rather than smaller samples (from some to all, i.e. N=All)
– Tolerating inaccuracies in larger amounts of data compared to higher-quality smaller amounts (from clean to messy)
– Giving up on knowing the causes and accepting only associations: "Using big data will sometimes mean forgoing the quest for why in return for knowing what"
http://www.foreignaffairs.com/articles/139104/kenneth-neil-cukier-and-viktor-mayer-schoenberger/the-rise-of-big-data
-
Evolution of data analysis
"The idea of analysing data to make sense of what's happening in our businesses has been with us for a long time (in corporations since at least 1954, when UPS started an analytics group), so why do we have to keep coming up with new names to describe it?" (Davenport, 2014:10)
Term | Time frame | Specific meaning
Decision support | 1970-1985 | Use of data analysis to support decision making
Executive support | 1980-1990 | Focus on data analysis for decisions by senior executives
Online Analytical Processing (OLAP) | 1990-2000 | Software for analysing multidimensional data tables
Business Intelligence | 1989-2005 | Tools to support data-driven decisions, with emphasis on reporting
Analytics | 2005-2010 | Focus on statistical and mathematical analysis for decisions
Big Data | 2010-present | Focus on very large, unstructured, fast-moving data
-
Does Big Data = traditional data analytics?
 | Big Data | Traditional analytics
Type of data | Unstructured formats | Formatted in rows and columns
Volume of data | 100 terabytes to petabytes | Tens of terabytes or less
Flow of data | Constant flow of data | Static pool of data
Analysis methods | Machine learning | Hypothesis-based
Primary purpose | Data-based products | Internal decision support and services
Source: (Davenport, 2014:4)
-
'Dimensions' of Big Data
• Analysis from IBM identified the main dimensions or characteristics of Big Data
– Volume (amount of data): the large amount of data being generated and stored (normally in the order of TBs or PBs)
– Variety (forms of data): the range of data types and sources being used, including unstructured data
– Velocity (speed of data): the rate at which data is collected, shared and analysed – often real-time streaming data (e.g., from social media)
– Veracity (reliability of data): uncertainty in data quality (accuracy, provenance, relevance and consistency)
The Vs debate – Gartner got there first! http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiming-gartners-volume-velocity-variety-construct-for-big-data/
-
Name Equals Size in bytes
Bit 1 bit 1/8
Nibble 4 bits 1/2
Byte 8 bits 1
Kilobyte 1024 bytes 1024
Megabyte 1024 kilobytes 1,048,576
Gigabyte 1024 megabytes 1,073,741,824
Terabyte 1024 gigabytes 1,099,511,627,776
Petabyte 1024 terabytes 1,125,899,906,842,624
Exabyte 1024 petabytes 1,152,921,504,606,846,976
Zettabyte 1024 exabytes 1,180,591,620,717,411,303,424
Yottabyte 1024 zettabytes 1,208,925,819,614,629,174,706,176
"There were 5 exabytes of information created between the dawn of civilisation through 2003, but that much information is now created every 2 days, and the pace is increasing." Eric Schmidt, Google
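The units in the table above each multiply the previous one by 1024 (binary prefixes). A minimal sketch in plain Python that reproduces the byte counts:

```python
# Each unit in the table is 1024 times the previous one (binary prefixes).
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

def unit_size(name):
    """Size of a unit in bytes: 1024 raised to its position in the scale."""
    return 1024 ** UNITS.index(name)

print(unit_size("petabyte"))   # 1125899906842624, as in the table
print(unit_size("yottabyte"))  # 1208925819614629174706176
```

Because 1024 = 2^10, a petabyte is 2^50 bytes and a yottabyte 2^80 bytes, which is why the table's numbers grow so quickly.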
-
Volume • Large Hadron Collider at CERN
– Generates around 25 petabytes per year (600 million collisions taking place every second)
• Walmart
– Handles more than 1 million customer transactions every hour
– Transactions imported into databases estimated to contain more than 2.5 petabytes of data
• Facebook (back in 2010)
– 500 million active users
– 100 billion hits per day
– 50 billion photos
– 2 trillion objects cached, with hundreds of millions of requests per second
– 130TB of logs every day
– https://www.facebook.com/notes/facebook-engineering/scaling-facebook-to-500-million-users-and-beyond/409881258919
SAP blog post: "even with rapid growth of data, 95% of enterprises use between 0.5TB and 40TB of data today."
http://www.slideshare.net/BernardMarr/big-data-25-facts
-
Variability
• Event or transaction logs
• Social media
• Sensors
• Internet of Things
– ~50 billion sensors connected to the Internet by 2025
• Smartphones
• Network traffic
• Images, videos and sounds
• Emails
• Blog posts
• ……
Datafication: "… taking all aspects of life and turning them into data. Google's augmented-reality glasses datafy the gaze. Twitter datafies stray thoughts. LinkedIn datafies professional networks."
"A 2012 survey by NewVantage Partners of over fifty executives in large organisations suggests that for large companies, the lack of structure of data is more salient than addressing its size." (Davenport, 2014)
Estimated that 95% of Big Data is unstructured
-
What's in a name?
• Davenport (2014) highlights a number of problems with the name 'Big Data' (not the idea), including
– 'Big' is only one aspect of what's distinctive about new forms of data (structure is often the bigger problem)
– 'Big' is relative and will change
– If the data doesn't fit all the Vs, is it still Big Data?
– The term 'Big Data' is being misused by vendors and marketing companies to refer to any analytics and reporting
"The point is not to be dazzled by the volume of data, but rather to analyse it – to convert it into insights, innovations and business value." (Davenport, 2014)
-
Benefits of Big Data
• Economic benefit: gains in productivity, competitive advantage, and efficiency
• Increased demand for a highly-skilled, data-literate workforce
• Promoting awareness of data and access to large open datasets (civic engagement)
• Enabling better understanding in various domains (e.g. climate trends)
• …
"The main difference between big data and the standard data analytics that we've always done in the past is that big data allows us to predict behaviour. Also, predict events based upon lots of sources of data that we can now combine in ways that we weren't able to before." Paul Malyon, Experian
-
Uses of Big Data in business
http://www.atkearney.com/strategic-it/ideas-insights/article/-/asset_publisher/LCcgOeS4t85g/content/big-data-and-the-creative-destruction-of-today-s-business-models/10192
-
Big Data opportunities
http://www.forbes.com/sites/louiscolumbus/2012/08/16/roundup-of-big-data-forecasts-and-market-estimates-2012/
-
The value of Big Data
• At least three classes of value (Davenport, 2014)
– Cost reductions (e.g., use of Big Data technologies)
– Improvements in decision-making
– Improvements in products and services (e.g., the People You May Know, or PYMK, feature in LinkedIn)
"In his lecture 'The Unreasonable Effectiveness of Data', Peter Norvig, Director of Research at Google, highlights that you gain much better insight from running relatively simple algorithms on large datasets than you do from running complex algorithms on smaller datasets. Simply put, greater volumes of data can provide much better insights." Source: http://www.garycrawford.co.uk/big-data-part-1-big-what/
-
Big Data opportunities
James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers (2012) Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute. Available online: http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
-
Technical challenges with Big Data
• Specialised technologies needed to store and manage Big Data
– Dealing with large datasets (TBs to ZBs)
– Dealing with data of an unstructured nature
– Dealing with real-time streaming data (e.g., Twitter)
• Computational processing and speed of the feedback loop (velocity)
• Problems with analysing large datasets
• Sources of Big Data often require significant pre-processing
http://www.ibm.com/developerworks/library/bd-streamsintro/
-
Further challenges with Big Data
• Alex Pentland (2012) identifies obstacles for Big Data
– The correlation problem – with larger amounts of data, everything becomes statistically significant and you end up discovering meaningless patterns (Bonferroni's Principle)
– The "human understanding" problem – large datasets make understanding underlying data properties difficult (e.g., overfitting and hidden biases)
– The provenance problem – collecting data from reliable sources and tracking subsequent data use (and reuse)
– The privacy problem – consumers are asking about their rights to prevent collection and analysis of data they leave behind
http://blogs.hbr.org/2012/10/big-datas-biggest-obstacles/
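Pentland's correlation problem is easy to demonstrate: test enough unrelated random variables against a target and some will look strongly correlated by pure chance. A minimal sketch in plain Python (the sample size and variable count are invented for illustration, not from the slides):

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n_samples, n_variables = 30, 2000
target = [random.gauss(0, 1) for _ in range(n_samples)]

# None of these 2000 variables has any real relationship with the target...
best = max(
    abs(pearson(target, [random.gauss(0, 1) for _ in range(n_samples)]))
    for _ in range(n_variables)
)
# ...yet the strongest spurious correlation looks impressively "significant".
print(f"strongest correlation found by chance: {best:.2f}")
```

The more variables you screen, the larger the strongest chance correlation becomes – exactly the trap of "fitting millions of search terms" discussed in the Google Flu Trends example later.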
-
http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3FMtJew6E
-
Challenges: sample error and bias
• In 1936, the Republican Alfred Landon stood for election against President Franklin Delano Roosevelt
• Two different surveys predicted the outcome:
– The Literary Digest conducted a postal opinion poll with the aim of reaching 10 million people, a quarter of the electorate. After tabulating an astonishing 2.4 million returns as they flowed in over two months, The Literary Digest announced its conclusion: Landon would win by a convincing 55% to 41%, with a few voters favouring a third candidate
– George Gallup conducted a far smaller survey (3,000 people) and forecast a comfortable victory for Roosevelt
• The election result: Roosevelt crushed Landon by 61% to 37%
• George Gallup understood something that The Literary Digest did not: when it comes to data, size isn't everything
• Opinion pollsters need to deal with two issues: sample error and sample bias
Source: http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3FMtJew6E
-
What went wrong?
• Sample error: the risk that, by chance, the sample does not reflect the true views of the population
– The margin of error reflects this risk, and a larger sample reduces it
– Why did 3,000 work better than 2.4 million?
– Answer: sample bias
• Sample bias: when the sample is not chosen randomly
– The Literary Digest mailed out forms to people on a list it had compiled from automobile registrations and telephone directories – a sample that, at least in 1936, was disproportionately prosperous
– To compound the problem, Landon supporters turned out to be more likely to mail back their answers
– George Gallup took pains to find an unbiased sample because he knew that was far more important than finding a big one
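The contrast between the two 1936 polls can be sketched as a simulation. The 61% Roosevelt share is from the slide (simplified to a two-way split); the over-sampling weight is an invented illustration of the Digest's prosperous, Landon-leaning mailing list:

```python
import random

random.seed(1936)
# True electorate, simplified to a two-way 61/39 split (actual: 61% to 37%).
electorate = ["Roosevelt"] * 61 + ["Landon"] * 39

def poll(sample_size, landon_weight=1.0):
    """Estimate Roosevelt's share; landon_weight > 1 models a biased frame."""
    weights = [landon_weight if v == "Landon" else 1.0 for v in electorate]
    sample = random.choices(electorate, weights=weights, k=sample_size)
    return sample.count("Roosevelt") / sample_size

biased_huge = poll(2_400_000, landon_weight=3.0)  # Digest-style: big but biased
random_small = poll(3_000)                        # Gallup-style: small but random
print(f"biased poll of 2.4m returns: {biased_huge:.1%}")
print(f"random poll of 3,000:        {random_small:.1%}")
```

No matter how large the biased sample grows, its error never shrinks, while the small random sample lands within a percentage point or two of the true figure: sample size cures sample error, not sample bias.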
-
Challenges: N=All?
"Professor Viktor Mayer-Schönberger of Oxford's Internet Institute, co-author of Big Data, told me that his favoured definition of a big data set is one where "N = All" – where we no longer have to sample, but we have the entire background population. Returning officers do not estimate an election result with a representative tally: they count the votes – all the votes. And when "N = All" there is indeed no issue of sampling bias because the sample includes everyone."
"But is "N = All" really a good description of most of the found data sets we are considering? Probably not. "I would challenge the notion that one could ever have all the data," says Patrick Wolfe, a computer scientist and professor of statistics at University College London."
"An example is Twitter. It is in principle possible to record and analyse every message on Twitter and use it to draw conclusions about the public mood. (In practice, most researchers use a subset of that vast "fire hose" of data.) But while we can look at all the tweets, Twitter users are not representative of the population as a whole. (According to the Pew Research Internet Project, in 2013, US-based Twitter users were disproportionately young, urban or suburban, and black.)"
-
Challenges: hidden biases
"Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves."
"Data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities."
"Social science methodologies may make the challenge of understanding big data more complex, but they also bring context-awareness to our research to address serious signal problems. Then we can move from the focus on merely "big" data towards something more three-dimensional: data with depth."
Kate Crawford, 2013, Harvard Business Review blog, http://blogs.hbr.org/2013/04/the-hidden-biases-in-big-data/
-
Things can go wrong: Google Flu Trends
"We've found that certain search terms are good indicators of flu activity. Google Flu Trends uses aggregated Google search data to estimate current flu activity around the world in near real-time." Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, et al. (2009) Detecting influenza epidemics using search engine query data. Nature 457: 1012–1014
-
Things can go wrong: Google Flu Trends
• When people are sick with flu they may search for flu-related information on Google
– Aggregated over lots of people, this data could be used to predict flu outbreaks (collective intelligence)
• Google took the 50 million most commonly searched terms between 2003 and 2008 and compared them against historical influenza data from the Centers for Disease Control and Prevention (CDC)
• Looked at temporal patterns of searches to see whether occurrences correlated with outbreaks of flu in certain areas compared to the CDC's data
– 45 terms were found to correlate with influenza (e.g., "headache" and "runny nose")
• Google could produce accurate estimates 2 weeks earlier than the CDC, offering life-saving insights
-
Things can go wrong: Google Flu Trends
• Search terms correlated by pure chance due to millions of search terms being fitted to the CDC's data
– e.g., "high school basketball"
• Changes in users' search behaviour
– Google's autosuggest
– Media influences
Google's estimates of the spread of flu-like illnesses were overstated by almost a factor of two in Feb 2013. Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. The Parable of Google Flu: Traps in Big Data Analysis. Science, 343 (14 March): 1203–1205. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf
-
Summary
• Big Data is an ill-defined term with many definitions – find a definition you can work with
• Big Data is becoming an obsession with scientists, businesses, governments and the media
• Much value can be gained from Big Data, but it presents challenges
– "We have a new resource here [Big Data]," says Professor David Hand of Imperial College London. "But nobody wants 'data'. What they want are the answers."
– "Data analysis in ignorance of the context can quickly become meaningless or even dangerous." Kate Crawford (2013)
– "There are a lot of small data problems that occur in big data," says Spiegelhalter. "They don't disappear because you've got lots of the stuff. They get worse."
– "'Big data' has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever." Tim Harford (2014)
-
Questions?
Paul Clough
Information School, University of Sheffield