The Present and Future of Statistics
Challenges and Opportunities
Marie Davidian
Department of Statistics
North Carolina State University
http://www4.stat.ncsu.edu/~davidian
1/47
New York Times, August 6, 2009
“I keep saying that the sexy job in the next 10 years will be statisticians” – Hal Varian, Chief Economist, Google
2/47
McKinsey Report, 2011
2011 McKinsey Global Institute report:
Big data: The next frontier for innovation, competition, and productivity
“A significant constraint. . . will be a shortage of . . . people with deep expertise in statistics and data mining. . . a talent gap of 140K - 190K positions in 2018 (in the US)”
http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation
3/47
The Wall Street Journal, March 1, 2013
4/47
Advanced placement statistics
[Figure: AP Statistics Exam Participation, 1997−2013. Horizontal axis: Year; vertical axis: Number of Students (thousands).]
5/47
Rock star
Nate Silver attracted ∼4,000 statisticians to his president’s invited address at the 2013 Joint Statistical Meetings
6/47
Bestsellers
7/47
The Wall Street Journal, November 15, 2013
“Statistics is cool” – Ron Wasserstein, ASA Executive Director
8/47
Rock stars
2013 MacArthur Fellow Susan Murphy
9/47
Rock stars
2013 Prime Minister’s Prize for Science (Australia) recipient Terry Speed
10/47
Rock stars
Sir David Spiegelhalter (aka “Professor Risk”)
11/47
Essential
“Statistical rigor is necessary to justify the inferential leap from data to knowledge”
12/47
Big opportunities
The opportunities for statistics to have major impact are endless
13/47
Big challenge
Big Data and data science
14/47
Big hype
15/47
Big hype, receding
“We are now past the ‘peak of inflated expectations’ of the hype cycle” (http://en.wikipedia.org/wiki/Hype_cycle)
16/47
Investment
17/47
Investment
18/47
Investment
19/47
What is data science?
20/47
Big challenge remains
• Perception that other disciplines are more relevant to data-driven science and business
• Computer science, machine learning, mathematics, engineering, physics, analytics, . . .
• Perception that statistics is old-fashioned and rigid
• Institutes, centers, degree and certificate programs; in many cases statistics is MIA
21/47
For example
“Statistics departments and journals still strongly emphasize a very narrow range of topics and methods and techniques, all driven by a tiny handful of results, many dating from the 1930s. . . the older zombie methods persist in the statistics literature and teaching.”
22/47
For example
Science, February 11, 2011
“My impression is that scientists view statistics not so much as a science but as a ‘bag of tools.’ You have a visibility problem in Science and AAAS.” – Alan Leshner, Chief Executive Officer of the American Association for the Advancement of Science (AAAS), to representatives of the ASA in 2011
23/47
Weird juxtaposition
• We continue to have daily impact in our critical collaborative roles
• We have gotten lots of great press
• Enrollments in undergraduate programs and applications to graduate programs are skyrocketing
• But statistics is still often absent from the discourse on Big Data and data science
• And the importance of statistical rigor is sometimes lost on fellow scientists
• Statistics and statistical principles continue to be misunderstood
24/47
ASA impact
The American Statistical Association has undertaken numerous initiatives to
• Promote the role of statistics in all the sciences
• Highlight the unique perspectives statistics brings to business and policy
• Increase awareness of opportunities for statisticians among students
and thereby enhance our impact on science and society
25/47
Key ASA Staff
Ron Wasserstein Steve Pierson Jeff Myers
26/47
AAAS and Science
AAAS – A focal point for science
• World’s largest general scientific society, ∼120K members
• 261 affiliated scientific societies (including ASA)
• Publisher of Science
• 24 AAAS sections, including Section U on Statistics
2013 presidential initiative
• Raise the profile of statistics within AAAS
27/47
AAAS and Science
• Button campaigns encouraging statisticians to join AAAS
• Section U membership increased 14% from 2012 to 2013
• Section U invited session proposals for AAAS Annual Meetings
• Nominations for AAAS Fellow
28/47
Impacting AAAS and Science
September 26, 2013
• A group of ASA representatives met with Alan Leshner
• Very positive reception!
• We also met with new Science editor-in-chief Marcia McNutt and several senior editors
• Great interest in enhancing the role of statistics and statisticians
29/47
Impacting Science
Science Editor-in-Chief Marcia McNutt
30/47
Science, January 17, 2014
Reproducibility
Science advances on a foundation of trusted discoveries. Reproducing an experiment is one important approach that scientists use to gain confidence in their conclusions. Recently, the scientific community was shaken by reports that a troubling proportion of peer-reviewed preclinical studies are not reproducible. Because confidence in results is of paramount importance to the broad scientific community, we are announcing new initiatives to increase confidence in the studies published in Science. For preclinical studies (one of the targets of recent concern), we will be adopting recommendations of the U.S. National Institute of Neurological Disorders and Stroke (NINDS) for increasing transparency.* Authors will indicate whether there was a pre-experimental plan for data handling (such as how to deal with outliers), whether they conducted a sample size estimation to ensure a sufficient signal-to-noise ratio, whether samples were treated randomly, and whether the experimenter was blind to the conduct of the experiment. These criteria will be included in our author guidelines.
There are a number of reasons why peer-reviewed preclinical studies may not be reproducible. The system under investigation may be more complex than previously thought, so that the experimenter is not actually controlling all independent variables. Authors may not have divulged all of the details of a complicated experiment, making it irreproducible by another lab. It is also expected that through random chance, a certain number of studies will produce false positives. If researchers are not alert to this possibility and have not set appropriately stringent significance tests for their results, the outcome is a study with irreproducible results. Although there is always the possibility that an occasional study is fraudulent, the number of preclinical studies that cannot be reproduced is inconsistent with the idea that all irreproducibility results from misconduct in such research.
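As a rough illustration of the random-chance point in the editorial paragraph above (this sketch is not part of the editorial or the slides): if a study runs m independent tests of hypotheses that are all truly null, each at significance level alpha, then about alpha × m false positives are expected and the chance of at least one is 1 − (1 − alpha)^m. The function names and the values alpha = 0.05 and m = 1, 10, 100 are assumptions chosen only for illustration, and real studies rarely have fully independent tests.

```python
def expected_false_positives(alpha: float, m: int) -> float:
    """Expected number of 'significant' results among m tests of true nulls."""
    return alpha * m


def prob_at_least_one_false_positive(alpha: float, m: int) -> float:
    """Chance that at least one of m independent tests of true nulls
    comes out 'significant' at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    alpha = 0.05  # assumed conventional significance level
    for m in (1, 10, 100):  # assumed numbers of independent tests
        print(
            m,
            round(expected_false_positives(alpha, m), 2),
            round(prob_at_least_one_false_positive(alpha, m), 3),
        )
```

Under these assumptions the chance of at least one spurious finding grows quickly with the number of tests, which is one reason the editorial asks for appropriately stringent significance criteria.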
It is unlikely that the issues with irreproducibility are confined to preclinical studies (social science has been equally noted, for example). Unfortunately, there are no equivalents to the NINDS recommendations for other disciplines that provide a basis for requiring transparency across all fields. For the next 6 months, we will be asking reviewers and editors to identify papers submitted to Science that demonstrate excellence in transparency and instill confidence in the results. This will inform the next steps in implementing reproducibility guidelines. Science Translational Medicine, a sister journal of Science, already enforces the NINDS guidelines for preclinical studies. Both journals also are open to improving on the NINDS recommendations for preclinical studies.
There is also a wide range of sophistication in the application of statistics displayed in research analysis, ranging from practically no statistics, to the routine use of generic software packages, to the application of advanced methods that extract subtle signals from noise. Because reviewers who are chosen for their expertise in the subject matter of a study may not be authorities in statistics as well, statistical errors in manuscripts may slip through undetected. For that reason, with the advice of the American Statistical Association and others, we are adding new members to our Board of Reviewing Editors from the statistics community to ensure that manuscripts receive appropriate scrutiny in their methods of data analysis.
Science’s standards have always been high, and these measures add to steps we have already taken to increase transparency, such as requiring data accessibility. Nevertheless, journals can only do so much to assure readers of the validity of the studies they publish. The ultimate responsibility lies with authors to be completely open with their methods, all of their findings, and the possible pitfalls that could invalidate their conclusions.
10.1126/science.1250475
– Marcia McNutt
EDITORIAL
Marcia McNutt is Editor-in-Chief of Science.
www.sciencemag.org SCIENCE VOL 343 17 JANUARY 2014
*S. C. Landis et al., Nature 490, 187 (2012).
Published by AAAS
31/47
Science, July 4, 2014
SCIENCE sciencemag.org 4 JULY 2014 • VOL 345 Issue 6192 9
EDITORIAL
Raising the bar
Numbers. Lots and lots of numbers. It is hard to find a paper published in Science or any other journal that is not full of numbers. Interpretation of those numbers provides the basis for the conclusions, as well as an assessment of the confidence in those conclusions. But unfortunately, there have been far too many cases where the quantitative analysis of those numbers has been flawed, causing doubt about the authors’ interpretation and uncertainty about the result. Furthermore, it is not realistic to expect that a technical reviewer, chosen for her or his expertise in the topical subject matter or experimental protocol, will also be an expert in data analysis. For that reason, with much help from the American Statistical Association, Science has established, effective 1 July 2014, a Statistical Board of Reviewing Editors (SBoRE), consisting of experts in various aspects of statistics and data analysis, to provide better oversight of the interpretation of observational data.
For those familiar with the role of Science’s Board of Reviewing Editors (BoRE), the function of the SBoRE will be slightly different. Members of the BoRE perform a rapid quality check of manuscripts and recommend which should receive in-depth review by technical specialists. Members of the SBoRE will receive manuscripts that have been identified by editors, BoRE members, or possibly reviewers as needing additional scrutiny of the data analysis or statistical treatment. The SBoRE member assesses what the issue is that requires screening and suggests experts from the statistics community to provide it.
So why is Science taking this additional step? Readers must have confidence in the conclusions published in our journal. We want to continue to take reasonable measures to verify the accuracy of those results. We believe that establishing the SBoRE will help avoid honest mistakes and raise the standards for data analysis, particularly when sophisticated approaches are needed. But even when taking added precautions, no review system is infallible, and no combination of reviewers can be expected to uncover all of the ways in which the interpretation of results may have gone wrong. In particular, it is difficult for reviewers to detect whether authors have approached the study with a lack of bias in their data collection and presentation.
I recall a study that I conducted years ago involving a global analysis of some oceanographic features that I was modeling to understand the rheology of oceanic plates on million-year time scales. I had only a handful of data points, perhaps a dozen or so, and the fit to my model failed a significance test. Clearly, throwing out a few of the data points by declaring them “outliers” would have improved the fit dramatically, and in fact I even recall a reviewer of the paper commenting: “Can’t you make the data fit the model better?”
Really?
The editor published the paper despite the lousy fit of the model to the data. It was not too long before it was realized that those “outliers” were the key to a more complete understanding of the long-term rheological behavior of the oceanic plates. Although the model in the earlier paper needed an overhaul, the fundamental observations, because they were presented without bias, inspired much further progress in the field.
In the years since, I have been amazed at how many scientists have never considered that their data might be presented with bias. There are fundamental truths that may be missed when bias is unintentionally overlooked, or worse yet, when data are “massaged.” Especially as we enter an era of “big data,” we should raise the bar ever higher in scrutinizing the analyses that take us from observations to understanding.
– Marcia McNutt
10.1126/science.1257891
Marcia McNutt is Editor-in-Chief of Science.
32/47
Science Statistics Board of Reviewing Editors
“. . . with much help from the American Statistical Association, Science has established, effective 1 July 2014, a Statistical Board of Reviewing Editors (SBoRE), consisting of experts in various aspects of statistics and data analysis, to provide better oversight of the interpretation of observational data.”
“I have been amazed at how many scientists have never considered that their data might be presented with bias. . . Especially as we enter an era of ‘big data,’ we should raise the bar ever higher in scrutinizing the analyses that take us from observations to understanding.”
33/47
Science Statistics Board of Reviewing Editors
Ron Brookmeyer, UCLA
Alison Motsinger-Reif, NC State University
Giovanni Parmigiani, Dana-Farber Cancer Institute
Richard Smith, University of North Carolina at Chapel Hill
Jane-Ling Wang, University of California, Davis
Chris Wikle, University of Missouri
Ian A. Wilson, The Scripps Research Institute
34/47
Science Statistics Perspectives
35/47
Impacting US federal research priorities
Whitepapers
• Make case that statisticians are essential to tackling national research priorities
• Drive research funding
• National Science Foundation (NSF)
• White House Office of Science and Technology Policy (OSTP)
• Modeled after the success of Computing Community Consortium (CCC) whitepapers
• See Steve’s article in the July 2014 Amstat News
36/47
OSTP Big Data initiative
Discovery with Data: Leveraging Statistics with Computer Science to Transform Science and Society
July 2, 2014
A Working Group of the American Statistical Association1
Summary:
The Big Data Research and Development Initiative is now in its third year and making great strides to address the challenges of Big Data. To further advance this initiative, we describe how statistical thinking can help tackle the many Big Data challenges, emphasizing that often the most productive approach will involve multidisciplinary teams with statistical, computational, mathematical, and scientific domain expertise. With a major Big Data objective of turning data into knowledge, statistics is an essential scientific discipline because of its sophisticated methods for statistical inference, prediction, quantification of uncertainty, and experimental design. Such methods have helped and will continue to enable researchers to make discoveries in science, government, and industry. The paper discusses the statistical components of scientific challenges facing many broad areas being transformed by Big Data—including healthcare, social sciences, civic infrastructure, and the physical sciences—and describes how statistical advances made in collaboration with other scientists can address these challenges. We recommend more ambitious efforts to incentivize researchers of various disciplines to work together on national research priorities in order to achieve better science more quickly. Finally, we emphasize the need to attract, train, and retain the next generation of statisticians necessary to address the research challenges outlined here.
1 Authors: Cynthia Rudin, MIT (Chair); David Dunson, Duke University; Rafael Irizarry, Harvard University; Hongkai Ji, Johns Hopkins University; Eric Laber, North Carolina State University; Jeffrey Leek, Johns Hopkins University; Tyler McCormick, University of Washington; Sherri Rose, Harvard University; Chad Schafer, Carnegie Mellon University; Mark van der Laan, University of California, Berkeley; Larry Wasserman, Carnegie Mellon University; Lingzhou Xue, Pennsylvania State University. Affiliations are for identification purposes only and do not imply an institution’s endorsement of this document.
Cynthia Rudin, Chair
David Dunson, Rafael Irizarry, Hongkai Ji, Eric Laber, Jeff Leek, Tyler McCormick, Sherri Rose, Chad Schafer, Mark van der Laan, Larry Wasserman, Lingzhou Xue
37/47
OSTP BRAIN initiative
STATISTICAL RESEARCH AND TRAINING UNDER THE BRAIN INITIATIVE
A Working Group of the American Statistical Association∗
April 2014
1 Introduction and Summary
The BRAIN (Brain Research through Advancing Innovative Neurotechnologies) Initiative aims to produce a sophisticated understanding of the link between brain and behavior and to uncover new ways to treat, prevent and cure brain disorders.¹ Success in meeting these multifaceted challenges will require scientific and technological paradigms that incorporate novel statistical methods for data acquisition and analysis. Our purpose here is to substantiate this proposition, and to identify implications for training.
Brain research relies on a wide variety of existing methods for collecting human and animal neural data, including neuroimaging (radiography, fMRI, MEG, PET), electrophysiology from multiple electrodes (EEG, ECoG, LFP, spike trains), calcium imaging, optical imaging, optogenetics, and anatomical methods (diffusion imaging, electron microscopy, fluorescent microscopy). Each of these modalities produces data with its own set of statistical and analytical challenges. As neuroscientists improve these techniques and develop new ones, data are being acquired at very large scales. For example, advances in multiple-electrode recording and two-photon calcium imaging have led to an exponential growth in the size of neural populations that can be observed simultaneously, at single-cell resolution (2; 3; 52). Similarly, new anatomical methods have led to a rapid rise in the size and the scale of data, and the resulting level of detail with which brain structure can be investigated (17; 10; 32; 39). Furthermore, both new and existing technologies are often used together, and are increasingly accompanied by rich characterizations of individuals and their behavior, ranging from genetic information to sensor-based monitors of activity.
∗ The working group included Robert E. Kass, Carnegie Mellon University (Chair); Genevera Allen, Rice University; Brian Caffo, Johns Hopkins University; John Cunningham, Columbia University; Uri Eden, Boston University; Timothy D. Johnson, University of Michigan; Martin A. Lindquist, Johns Hopkins University; Thomas A Nichols, University of Warwick; Hernando Ombao, University of California, Irvine; Liam Paninski, Columbia University; Russell T. Shinohara, University of Pennsylvania; Bin Yu, University of California, Berkeley. Affiliations are for identification purposes only and do not imply an institution’s endorsement of this document.
¹ http://www.whitehouse.gov/share/brain-initiative
Rob Kass, Chair
Genevera Allen, Brian Caffo, John Cunningham, Uri Eden, Timothy Johnson, Martin Lindquist, Thomas Nichols, Hernando Ombao, Liam Paninski, Russell T. Shinohara, Bin Yu
38/47
OSTP climate change initiative
Statistical Science: Contributions to the Administration’s
Research Priority on Climate Change
April 2014
A White Paper of the American Statistical Association’s Advisory Committee for
Climate Change Policy1
EXECUTIVE SUMMARY
Data are fundamental to all of science. Data enhance scientific theories and their statistical
analysis suggests new avenues of research and data collection. Climate science is no exception.
Earth’s climate system is complex, involving the interaction of many different kinds of physical
processes and many different time scales. Thus this area of science has a critical dependence on
the examination of all relevant data and the application of statistics for its interpretation. Climate
datasets are increasing in number, size, and complexity and challenge traditional methods of data
analysis. Satellite remote sensing campaigns, automated weather monitoring networks, and
climate-model experiments have contributed to a data explosion that provides a wealth of new
information but can overwhelm standard approaches. Developing new statistical approaches is an
essential part of understanding climate and its impact on society in the presence of uncertainty.
Experience has shown that rapid progress can be made when “big data” is used with statistics to
derive new technologies. Crucial to this success are new statistical methods that recognize
uncertainties in the measurements and the scientific processes but are also tailored to the unique
scientific questions being studied.
This white paper makes the case for the National Science Foundation (NSF) to establish an
interdisciplinary research program around climate, where statisticians have the opportunity to
collaborate with researchers from other disciplines to advance the understanding of the climate
system (e.g., quantification of uncertainties, the development of powerful tests of scientific
hypotheses). Although NSF supports basic and applied statistical research, these efforts often do
not involve scientists and statisticians in partnerships or in teams to address problems in climate
science. This program would also address the critical need for training a new generation of
interdisciplinary researchers who can tackle challenging scientific problems that require complex
data analysis by developing and using the necessary sophisticated statistical methods.
1 Authors: Bruno Sanso, University of California, Santa Cruz (Chair); L. Mark Berliner, Ohio State
University; Daniel S. Cooley, Colorado State University; Peter Craigmile, Ohio State University; Noel A.
Cressie, University of Wollongong; Murali Haran, Pennsylvania State University; Robert B. Lund, Clemson
University; Douglas W. Nychka, National Center for Atmospheric Research; Chris Paciorek, University of
California, Berkeley; Stephan R. Sain, National Center for Atmospheric Research; Richard L Smith,
Statistical and Applied Mathematical Sciences Institute; Michael L. Stein, University of Chicago.
Affiliations are for identification purposes only and do not imply an institution’s endorsement of this
document.
Bruno Sanso, Chair
Mark Berliner, Daniel Cooley, Peter Craigmile, Noel Cressie, Murali Haran, Robert Lund, Doug Nychka, Chris Paciorek, Stephan Sain, Richard Smith, Michael Stein
39/47
Impacting our role in Big Data/data science
Big Data/data science initiative by the ASA presidents (Bob, Marie, Nat) (See my June 2013 Amstat News column)
• Workgroup on curriculum development
• Meetings between ASA representatives and stakeholders from business and technology companies at the forefront of data science (Alexandria, Cincinnati, New York, Palo Alto)
• Training in text analytics at 2014 CSP and JSM
40/47
Data science meetings
Major messages
• Great shortage of statistical talent, concerns over pipeline
• Concerns over ability of fresh PhDs to work independently, identify problems
• Concerns over computational and data manipulation skills
  – Python favored over R
• Communication, collaboration, and leadership skills
• Must be able to “make it to the middle”
41/47
Preparing statisticians to make an impact
• Report of the curriculum development workgroup, 2014 JSM panel (Monday)
• Professional skills development
• Training in statistical leadership (see Nat’s May 2014 Amstat News column)
42/47
Public relations campaign
First national PR campaign for statistics
• Promote the profession, increase visibility
• Students and those who advise/influence them
• Careers in statistics, importance of statistical literacy in everyday life
• Consulting with Stanton Communications
• Not just an “ASA thing”
• See Nat’s June 2014 Amstat News column
43/47
This is Statistics
“It’s not what you think it is.”
44/47
This is Statistics website
http://www.thisisstatistics.org
(going live on August 18, 2014)
45/47
This is Statistics approaches
• Profiles of young statisticians in “cool” positions
• Exploit social media
• Pitch statistics career stories to the media
46/47
Making an impact
I have touched on just a few of the many things the ASA is doing to enhance the impact of statistics
Please contact me or Ron ([email protected]) if you are interested in participating
47/47