on information quality: can your data do the job? (scecr 2015 keynote)
TRANSCRIPT
1
On Information Quality:
Can Your Data Do the Job?
SCECR 2015 @ Addis Ababa
Galit Shmueli @ Institute of Service Science
National Tsing Hua University, Taiwan
2
eBay dataset with millions of auctions• All camera auctions in 2008• All camera auctions in 2015• Auctions for all items in 1/2012• Sample from all items in 2012
How much will you pay?
Or maybe Craigslist data?
3
“statisticians working in a research environment… may well have to explain that the data are inadequate to answer a particular question.”
Statistics: A Very Short Introduction (Hand 2008)
Pre-data, post-data, post-analysis
What is the potential of a dataset to generate knowledge?
4
Analysis goal
g XAvailable data
fData analysis
method
Utility measure
U
5
Analysis goal
g
Domain GoalWhat, why, when, where, how
→ Analysis GoalExplain, predict, describeEnumerative, analyticExploratory, confirmatory
Quality of Goal Specification• “error of the third kind” - giving the right answer to the wrong
question – Kimball• “Far better an approximate answer to the right question, which
is often vague, than an exact answer to the wrong question, which can always be made precise” - Tukey
6
XAvailable data
Data Source• Primary, secondary• Observational, experiment• Single, multiple sources• Collection instrument, protocol
Data Type• Continuous, categorical, mix• Structured, un-, semi-structured• Cross-sectional, time series, panel,
network, geographical
Data Quality U(X|g)• “Zeroth Problem - How do the data relate to the problem, and
what other data might be relevant?” - Mallows• MIS/Database - usefulness of queried data to person querying it. • Quality of Statistical Data (IMF, OECD) - usefulness of summary
statistics for a particular goal (7 dimensions)
Data Size and Dimension• # observations• # variables
7
fData analysis
method
Analysis Quality• “poor models and poor analysis techniques, or even analyzing the
data in a totally incorrect way.” - Godfrey• Analyst expertise• Software availability• The focus of statistics/econometrics/DM education
Statistical models and methods • Parametric, semi-, non-parametric• Classic, Bayesian
Econometric models
Data mining algorithms
Graphical methodsOperations research methods
8
Utility measure
U
Quality of Utility Measure• Adequate metric from analysis standpoint (R2, holdout data)• Adequate metric from domain standpoint
Domain goal → Analysis goal
• Predictive accuracy, lift • Goodness-of-fit • Statistical power, statistical significance• Strength-of-fit• Expected costs, gains• Bias reduction, bias-variance tradeoff
Analysis utility → Domain utility
9
InfoQ(f,X,g) = U( f(X|g) )
Depends on quality of g, X, f, U and relationship between them
The potential of a particular dataset to achieve a particular goal using a given empirical analysis method
10
Statistical Approaches for Increasing InfoQ
Study Design (Pre-Data)• DOE• Clinical trials• Survey sampling• Computer experiments
Post-Data-Collection• Data cleaning and
preprocessing• Re-weighting, bias
adjustment• Meta analysis
Randomization, Stratification, Blinding, Placebo, Blocking, Replication, Sampling frame,Link data collection protocol with appropriate design
Recovering “real data” vs. “cleaning for the goal”Handling missing values, outlier detection, re-weighting, combining results
11
Assessing InfoQ“Quality of Statistical Data” (Eurostat, OECD, NCSES,…)• Relevance• Accuracy• Timeliness and punctuality• Accessibility• Interpretability• Coherence• Credibility
InfoQ dimensions1. Data resolution2. Data structure3. Data integration4. Temporal relevance5. Chronology of data and goal6. Generalizability7. Construct operationalization8. Communication
3 V’s of Big Data • Volume• Variety• Velocity
Marketing Research• Recency• Accuracy• Availability• Relevance
12
#1 Data ResolutionMeasurement scale and aggregation level
Dutch Flower AuctionsVan HeckKetterGuptaKoppiusLuKambil
How many flowers?How many auctions?Bidder level?
13
#2 Data Structure
Data Types• Time series, cross-sectional, panel• Geographic, spatial, network• Text, audio, video, semantic• Structured, semi-, non-structured• Discrete, continuous
Data Characteristics Corrupted and missing values due to study design or data collection mechanism
Social TV: Real-Time Media Response to TV AdvertisingHill, Nalavade & Benton, KDD 2012
Prediction in Economic NetworksDhar, Geva, Oestreicher-Singer & Sundararajan, ISR 2012
14
Consumer Surplus in Online AuctionsBapna, Jank & Shmueli, ISR 2008
#3 Data Integration
Utility of Linkage Dangers: Privacy Increase or decrease InfoQ?
Music Blogging, Online Sampling, and the Long Tail, Dewan & Ramprasad, ISR 2013
Song radio play data from Nielsen SoundScan + music blog sampling data from The Hype Machine
15
#4 Temporal Relevance
Analysis Timeliness(solving the right
problem too late)
Data Collection
Data Analysis
Study Deployment
t1 t2 t3 t4 t5 t6
Collection Timeliness(relevance to g)
g: Prospective vs. retrospective; longitudinal vs. snapshotNature of X, complexity of f
forecast
16
#5 Chronology of Data & Goalg1: Explain priceg2: Forecast price
Retrospective/prospectiveEx-post availabilityEndogeneity
17
#6 Generalizability
Statistical generalizability
Scientific generalizability
Definition of gChoice of X, f, U
The Hidden Cost of Accommodating Crowdfunder Privacy Preferences: A Randomized Field ExperimentBurtch, Ghose, Wattal
“We found that our treatment had a large, highly negative effect on hiding behavior…(β = -0.279, p < 0.001)”
“It is possible that our results would not extend to a [offline] purchase context, where issues of social capital, reputation, etc. might be less pronounced”
18
#7a Construct Operationalizationχ construct X = θ(χ) operationalization (measurable)
• Causal explanation vs. prediction, description
• Theory vs. data• Data: Questionnaire,
physical measurement
Online Seller
Reputation
Total #ratings
{#pos, #neg} ratings
# recent negative ratings
Text content of feedback
The Digitization of Word-of-Mouth: Promise and Challenges of Online Reputation SystemsDellarocas, Management Science
19
#7b Action OperationalizationDerive concrete actions from the information provided by a study
The Management Insights editor will write a Management Insights paragraph for every paper that is accepted for publication.
One Way Mirrors in Online Dating: A Randomized Experiment Bapna, Ramprasad, Shmueli & Umyarov
Online Reputation Systems: How to Design One That Does What You NeedDellarocas
Social TV: Real-Time Media Response to TV AdvertisingHill, Nalavade & Benton
20
#8 CommunicationVisual, written, verbal presentations & reports
Knowledge must reach the right person at the right time
• Mentoring• Manuscript reviewing• Data made available to
others• EDA and shared
visualization dashboards• Seminars + conferences!
21
“In the last three years, there has been a concerted effort by those in Washington to reduce government spending and reign in the national debt.
One reason for the budget cuts?
Research by two Harvard economists, Ken Rogoff and Carmen Reinhart. The pair found that when a country owes more than 90 percent of their GDP, it slides into recession.”
… Fixing this Excel error transforms high-debt countries from recession to growth
www.marketplace.org/topics/economy/excel-mistake-heard-round-world
22
a brief conversation about marriage equality with a canvasser who revealed that he or she was gay had a big, lasting effect on the voters’ views, as measured by separate online surveys administered before and after the conversation.
May 2015: Independent researchers fail to replicate; noted statistical irregularities in LaCour’s data (baseline outcome data statistically indistinguishable from a national survey ; over-time changes indistinguishable from perfect normally distributed noise)
High or Low InfoQ?
23
Assessing InfoQ in Practice
Rating-based assessment1-5 scale on each dimension:
InfoQ Score = [d1(Y1) d2(Y2) … d8(Y8)]1/8
Experience from two research methods courses
Shmueli and Kenett (2013), “An Information Quality (InfoQ) Framework for Ex-Ante and Ex-Post Evaluation of Empirical Studies”, Proceedings of IDAM 2013, Springer-Verlag.
Assessing Research Proposals: Ex-Ante (Prospective) InfoQ• 2009 Research methods workshop for PhD students• Help students develop and present research proposal• 50 students from OB, OR, marketing, economics
• Each student evaluates his/her proposal on the 8 InfoQ dimensions. Will proposal likely lead to the intended goal?
• Students’ grades derived from an InfoQ score of their proposal submission (presentation + written)
InfoQ Integration (Prospective)
Goal: Make students’ research journey more efficient and more effective
Assessing Research Proposals:Ex-Post (Retrospective) InfoQ
Professional masters degree program that emphasizes statistical practice, methods, data analysis and practical workplace skills. The MSP is for students who are interested in professional careers in business, industry, government, or scientific research.
Masters Program in Statistical Practice
Assessing Research Proposals:Ex-Post (Retrospective) InfoQ
In 2012: 16 studentsStudents instructed to: (60-90 min)1. Read short description of InfoQ and its dimensions2. Read reports on five studies 3. Describe goal, data, analysis, utility for each study4. Evaluate the 8 dimensions for each study
Predicting days with unhealthy air quality in Washington DC
Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators
Quality-of-care factors in U.S. nursing homes
Predicting ZILLOW.com’s Zestimate accuracy
goo.gl/erNPF
Predicting First Day Returns for Japanese IPOs
Feedback:
InfoQ approach helped “sort out all of the information”
Several reported that they will adopt this evaluation approach for future studies
30
InfoQ: Strengths and ChallengesInfoQ approach streamlines questioning of data value• “Why should we invest in data?” – management• Compare value of potential datasets, analyses• Prioritize/rank projects• Strengthen functional – analytical relationship• Evaluation study: InfoQ useful for developing an analysis plan
(prospective) and evaluating an empirical study (retrospective)
To Do:• Improve InfoQ assessment (heterogeneity across different raters)• Alternative InfoQ assessment approaches (pilot study, EDA, other)• Further dimensions (data privacy, human subject compliance & risk)• Effect of technological advances on InfoQ
31In progress…
With discussion