facebook.com/statisticssweden @SCB_nyheter
Unlocking the Full Potential of Big Data
Lilli Japec, Frauke Kreuter
JOS anniversary
June 2015
The report is available at https://www.aapor.org
Task Force Members:
• Lilli Japec, Co-Chair, Statistics Sweden
• Frauke Kreuter, Co-Chair, JPSM at the U. of Maryland, U. of Mannheim & IAB
• Marcus Berg, Stockholm University
• Paul Biemer, RTI International
• Paul Decker, Mathematica Policy Research
• Cliff Lampe, School of Information at the University of Michigan
• Julia Lane, American Institutes for Research
• Cathy O’Neil, Johnson Research Labs
• Abe Usher, HumanGeo Group
AAPOR (American Association for Public Opinion Research)
• a professional organization dedicated to advancing the study of “public opinion,” broadly defined, to include attitudes, norms, values, and behaviors
• promotes best practices and transparency
• works to educate its members as well as policy makers, the media, and the public at large to help them make better use of surveys and survey findings, and to inform them about new developments in the field
• other task force reports available on https://www.aapor.org
Outline of our presentations
• What is Big Data?
• Paradigm shift
• Big Data activities in different organizations
• Skills required
• Big Data process and data quality
UNTIL RECENTLY: three main data sources
Administrative Data
Survey Data
Experiments
NOW
US Aggregated Inflation Series, Monthly Rate, PriceStats Index vs. Official CPI. Accessed January 18, 2015 from the PriceStats website.
Number of vehicles detected in the Netherlands on December 1, 2011 created by Statistics Netherlands (Daas et al. 2013). The vehicle size is shown in different colors; black is small size, red is medium size and green is large size.
Social media sentiment (daily, weekly and monthly) in the Netherlands, June 2010 - November 2013. The development of consumer confidence for the same period is shown in the insert (Daas and Puts 2014).
Big Data
http://www.rosebt.com/blog/data-veracity
Hope that found/organic data
Can replace or augment expensive data collections
More (= better) data for decision making
Information available in (nearly) real time
New paradigm
New business model
• Federal agencies no longer major players
New analytical model
• Outliers
• Fine-grained analysis
• New units of analysis
New sets of skills
• Computer scientists
• Citizen scientists
Different cost structure
Source: Julia Lane
Eurostat Big Data Action Plan and Roadmap
• Pilots exploring the potential of selected big data sources
• The project will also include activities on: methodological frameworks, quality frameworks, metadata frameworks, IT infrastructures, communication, legal frameworks, ethical frameworks, skills and training, and experience sharing.
UNECE and Big Data
The “Sandbox” provides a computing environment to load Big Data sets and tools
• Consumer price indices – experimenting with the computation of price indexes
• Mobile telephone data – statistics on tourism and daily commuting
• Smart meters – statistics on power consumption using data collected from smart meter readings
• Traffic loops – traffic statistics using data from traffic loops
• Social media – using Twitter data to analyze sentiment and tourism flows
• Job portals – computing statistics on job vacancies
• Web scraping – tested methods for automatically collecting data from web sources
UNECE Big Data Inventory
Statistics Netherlands: Roadmap BIG DATA
Two focus projects:
• the use of traffic loop data for transportation statistics
• the use of mobile phone data for daytime population and tourism statistics.
Six other projects:
• the use of internet data for price statistics,
• investigating the use of bank and credit card transactions,
• the use of social media data for detecting trends in social cohesion,
• the use of internet data for encoding enterprise purchases and sales,
• investigating the use of smartcards of public transport for statistics, and
• the use of internet data for statistics about job vacancies.
Source: Pieter Vlag, Statistics Netherlands
Examples from Statistics Sweden
Scanner data to improve the Household Budget Survey
Job vacancy statistics by scraping of the web
To evaluate the use of AIS (Automatic Identification System) data. Cooperation between Statistics Sweden and the agency for Transport Analysis (Trafa). Research funding from the Swedish Innovation Agency (Vinnova).
One day data
Source: Moström and Justesen, Statistics Sweden
SKILLS
What tasks are required to get there?
We have to do this jointly …
• Data Generating Process. Examples: geolocated social media + survey + administrative data
• Data Curation/Storage. Example: Hadoop Distributed File System
• Data Analysis. Example: Hadoop MapReduce; High Frequency Data
• Data Output/Access. Example: map visualization / privacy
• Research Questions. Examples: behavior of interest (migration/political participation/job searches)
Source: Abe Usher
Big words …
What is big data?
What is Hadoop File System? (HDFS)
What is Hadoop MapReduce? (MR)
How do you link surveys with big data?
Source: Abe Usher
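The MapReduce question above can be made concrete with a toy sketch. It runs on a single machine in plain Python and does not use Hadoop; the function names (map_phase, shuffle, reduce_phase) and the input posts are illustrative assumptions, not part of the presentation. The point is only the pattern: map each record to key/value pairs, group by key, then reduce each group.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input records, standing in for short social media posts.
posts = ["big data big hopes", "data quality matters"]
counts = word_count(posts)
print(counts)
```

In a real Hadoop cluster the map and reduce steps run in parallel on many machines and the shuffle moves data between them; the logic per record stays this simple.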
System Administrator
• Storage systems (MySQL, HBase, Spark)
• Cloud computing:
  • Amazon Web Services (AWS)
  • Google Compute Engine
• Hadoop ecosystem
Computer scientist
• Data preparation
• MapReduce algorithms
• Python/R programming
• Hadoop ecosystem
Source: Abe Usher
RESEARCH
What do we know about the data generating process?
Veracity
Who? What? Why?
Who is missing? Who is counted repeatedly?
What is not said / measured? ..and why?
But (at least) one more V
http://www.rosebt.com/blog/data-veracity
[Figure: Terrorist Detector]
Errors in Big Data: An Illustration
Suppose 1 in 1,000,000 people are terrorists
The Big Data Terrorist Detector is 99.9% accurate
The detector says your friend Jack is a terrorist.
What are the odds that Jack is really a terrorist?
Source: Paul Biemer
Answer: about 1 in 1000, i.e., 99.9% of the terrorist detections will be false!
Source: Paul Biemer
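The answer follows from Bayes' rule. The sketch below reproduces the arithmetic under one stated assumption: "99.9% accurate" is read as both a 99.9% chance of flagging a real terrorist and a 99.9% chance of clearing an innocent person. The slide does not spell this out, and the function name is illustrative.

```python
from fractions import Fraction


def posterior_given_alarm(prevalence, sensitivity, specificity):
    """P(terrorist | detector says terrorist) by Bayes' rule."""
    true_alarms = prevalence * sensitivity
    false_alarms = (1 - prevalence) * (1 - specificity)
    return true_alarms / (true_alarms + false_alarms)


prevalence = Fraction(1, 1_000_000)  # 1 in 1,000,000 people are terrorists
accuracy = Fraction(999, 1000)       # assumption: applies to both error types

p = posterior_given_alarm(prevalence, accuracy, accuracy)
print(p)             # exact probability that Jack is really a terrorist
print(float(1 - p))  # share of detections that are false alarms
```

Under this reading the exact probability is 1/1002, which the slide rounds to 1 in 1000. The rare base rate, not the detector's accuracy, drives the result.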
Big Data Process Map
• Generate: Source 1, Source 2, ..., Source K
• ETL: Extract, Transform (Cleanse), Load (Store)
• Analyze: Filter/Reduction (Sampling), Computation/Analysis, (Visualization)
Source: Paul Biemer
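A minimal sketch of the process map as a chain of functions may help fix the stages. Everything here is an assumption made for illustration: the record layout (id, value), the cleansing rule, and the rule that a later record overwrites an earlier one with the same id. None of it comes from the presentation.

```python
import statistics


def extract(sources):
    """Extract: pull raw records out of each source into one list."""
    return [record for source in sources for record in source]


def transform(records):
    """Transform (cleanse): drop records with a missing value, coerce to float."""
    cleaned = []
    for rec in records:
        if rec.get("value") in (None, ""):
            continue
        cleaned.append({"id": rec["id"], "value": float(rec["value"])})
    return cleaned


def load(records, store):
    """Load (store): keep one record per id; later records overwrite earlier ones."""
    for rec in records:
        store[rec["id"]] = rec
    return store


def analyze(store, min_value=0.0):
    """Analyze: filter/reduce the stored records, then compute a mean."""
    kept = [rec["value"] for rec in store.values() if rec["value"] >= min_value]
    return statistics.mean(kept)


# Hypothetical sources with a missing value and a duplicated id.
source_1 = [{"id": "a", "value": "10"}, {"id": "b", "value": ""}]
source_2 = [{"id": "c", "value": 30}, {"id": "a", "value": "20"}]

store = load(transform(extract([source_1, source_2])), {})
result = analyze(store)
print(sorted(store), result)
```

Each function is a place where error can enter: the dropped record for "b" and the overwritten record for "a" both change the final mean, and neither decision is visible in the output.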
Big Data Process Map (Generation stage)
Errors include: low signal/noise ratio; lost signals; failure to capture; non-random (or non-representative) sources; meta-data that are lacking, absent, or erroneous.
Source: Paul Biemer
Big Data Process Map (ETL stage)
Errors include: specification error (including errors in meta-data), matching error, coding error, editing error, data munging errors, and data integration errors.
Source: Paul Biemer
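Matching error is easy to see in a small linkage sketch. The unit names, the turnover field and the normalization rule below are all hypothetical; the sketch only shows how the choice of matching key decides which survey records get linked to an external source.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivially different spellings agree."""
    return " ".join(name.lower().split())


def link(survey_rows, external_rows, key=lambda name: name):
    """Link survey rows to external rows on a (possibly normalized) name key."""
    index = {key(row["name"]): row for row in external_rows}
    matched, unmatched = [], []
    for row in survey_rows:
        hit = index.get(key(row["name"]))
        if hit is None:
            unmatched.append(row["name"])
        else:
            matched.append((row["name"], hit["turnover"]))
    return matched, unmatched


# Hypothetical survey units and external records with inconsistent spelling.
survey = [{"name": "Unit Alpha"}, {"name": "Unit  Beta"}, {"name": "Unit Gamma"}]
external = [{"name": "unit alpha", "turnover": 120},
            {"name": "Unit Beta", "turnover": 80}]

exact_matched, exact_missed = link(survey, external)
norm_matched, norm_missed = link(survey, external, key=normalize)
print(len(exact_matched), len(norm_matched), norm_missed)
```

Exact matching links nothing here, while the normalized key links two of three units. A looser key reduces missed links but raises the risk of false links, and both count as matching error.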
Big Data Process Map (Filter/Reduction stage)
Data are filtered, sampled or otherwise reduced. This may involve further transformations of the data.
Errors include: sampling errors, selectivity errors (or lack of representativity), modeling errors
Source: Paul Biemer
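The difference between sampling error and selectivity error can be sketched with a made-up population. All numbers below are assumptions chosen for illustration: 10,000 units, of which 60% are covered by a big data source, and covered units have a higher value than uncovered ones.

```python
import random
import statistics

random.seed(2015)  # fixed seed so the sketch is repeatable

# Hypothetical population: covered units have value 100, the rest 50.
population = [
    {"covered": i % 10 < 6, "value": 100.0 if i % 10 < 6 else 50.0}
    for i in range(10_000)
]
true_mean = statistics.mean(u["value"] for u in population)

# "Big data" estimate: many records, but only from covered units.
big_source = [u["value"] for u in population if u["covered"]]
big_mean = statistics.mean(big_source)

# Survey estimate: a small simple random sample from the whole population.
sample = random.sample(population, 200)
sample_mean = statistics.mean(u["value"] for u in sample)

print(true_mean, big_mean, round(sample_mean, 1))
```

The 6,000 records from the selective source miss the true mean by a wide margin, and adding more records of the same kind would not help. The sample of 200 carries sampling error only, which shrinks as the sample grows.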
Big Data Process Map (Computation/Analysis stage)
Errors include: modeling errors, inadequate or erroneous adjustments for representativity, computation and algorithmic errors.
Source: Paul Biemer
POTENTIAL
We have to do this jointly …
• Data Generating Process. Examples: geolocated social media + survey + administrative data. Fields: Social Science & Psychology, Humanities, Econ, Business
• Data Curation/Storage. Example: Hadoop Distributed File System. Fields: Math & Computer Science, Applied Statistics
• Data Analysis. Example: Hadoop MapReduce; High Frequency Data. Fields: Economics, Social Sciences, Business, Math & Comp
• Data Output/Access. Example: map visualization / privacy. Fields: Psychology, Law, Math & Comp, Business
• Research Questions. Examples: behavior of interest (migration/political participation/job searches). Fields: any field
... and think about the legal framework