soda 501: approaches and issues in big social data spring ... · \approaches and issues in big...
TRANSCRIPT
SoDA 501 Approaches and Issues in Big Social DataSpring 2018
Burt L MonroeOffice Sparks B002 (The Databasement) or Pond 207Course website httpsburtmonroegithubioSoDA501Appointments httpburtmonroeyoucanbookmeContact burtmonroepsuedu 814-867-2726 or 814-865-9215
Description
This seminar is part of the core seminar series for students in the Social Data Analytics dual-titlePhD and doctoral minor The primary objective of the seminar is interdisciplinary exposure toengagement with and integration of the tools practices language and standards used in the col-lection and management of data in the component disciplines of the Social Data Analytics field
Each of you is well on your way toward a PhD ndash formal certification as an ldquoexpertrdquo ndash in one of thecomponent disciplines of Social Data Analytics and has in your coursework and research becomewell versed in one or more of the many computational informational statistical visual analyticor social scientific approaches to data and the issues faced by those approaches Here we areinterested in trying to integrate your multidisciplinary expertise particularly in the context of datathat are social (about or arising from human interaction) and big or intensive (of sufficient scalevariety or complexity to strain the informational computational or cognitive limits of conventionalapproaches to data collection management manipulation or analysis)
The SoDA core seminars are organized around the metaphor of the social data stack The socialdata stack consists of three fuzzily boundaried layers the ldquodata layerrdquo the ldquoanalytics layerrdquo andthe ldquorelevance layerrdquo (Fig 1)
The data layer is comprised of the processes and technologies by which human interactions aretranslated into data about human interactions These are the themes emphasized in SoDA 501ldquoApproaches and Issues in Big Social Datardquo offered in the spring semester Some SoDA IGERTstudents will take more in depth seminars with focus on computational and informational aspects(primarily in Information Sciences amp Technology Geography or engineering departments) and re-search design aspects (primarily in social science departments or Statistics) of the data layer
The analytics layer is comprised of the processes and technologies by which social data are translatedinto knowledge about society These are the themes emphasized in SoDA 502 ldquoApproaches andIssues in Social Data Analyticsrdquo Some of you will take more in-depth seminars on machine statistical learning visual analytics or other statistical or social scientific approaches to inferenceThe relevance layer is comprised of the processes and technologies by which knowledge about societyis translated into value for science or society Within the SoDA seminars this is addressed throughprimarily through exposure to and participation in projects that require an interdisciplinary teamscience approach
1
Figure 1 SoDA and the social data stack
Assignments and Grades
The latter leads us to the main pedagogical components of SoDA 501
bull Engagement in Seminar - 40
Guest Speakers For half of the session most weeks we will host a guest speaker(typically a member of the Graduate Faculty in Social Data Analytics drawn from thefull range of participating disciplines) discussing an active research project or relatedtopic that touches on one or more areas of concern in the course For each speaker wewill have two or more of you acting as ldquodesignated respondentsrdquo with extra responsibilityfor having questions for discussion with the speaker
Readings and Seminar Discussion The readings discussion and what lecturing Iwill do will focus on interdisciplinary integration In part this involves identifying thoseconcepts that may be new to some of you in this setting ndash eg how ldquobigrdquo or ldquomachine
2
learningrdquo or ldquovisual analyticsrdquo approaches challenge conventional social science method-ology or how social scientific thinking challenges emerging practices and conventionalwisdom in data science ndash and tools associated with those concepts In part this involvesinterdisciplinary arbitrage and translation ndash identifying common concepts and structurethat may go by slightly different names in different disciplines and settings To thisend I want you each to send me ndash by Wednesday 700am each week by email ndash listsof terms concepts that you encountered in that weekrsquos reading in three categories (1)termsconcepts that were new but you think you now understand (2) termsconceptsthat seem to be used differently than in the context of your home discipline and (3)termsconcepts you still find confusing
Grading Criteria Full 30 points if you are present every week have made a good faitheffort to provide your lists of confusing terms and concepts on time have thoughtfullyread all of the assignments are prepared to talk about the weekrsquos readings and themesand consistently contribute in ways that are productive to the discussion (good ques-tions thoughtful responses etc) with all of that weighted more heavily when you are adesignated respondent If you donrsquot do any of that 0 points Sliding scale in between
bull Exercises - 20 It is explicitly not an objective of this course to ldquotrainrdquo you in all of thetools we will mention much less those that we could mention a task that would take yearsIn the interest of collectively ldquomoving the ball forwardrdquo for each of you however we will havea small number of assigned exercises Early in the semester these will be done in (assigned)interdisciplinary teams Later exercises will be individual
bull Semester Team Project - 40 You will in a team consisting of at least three disciplinescreate gather andor organizemanipulateprepare for analytics a ldquobigrdquo ldquosocialrdquo datasetThe data must be at least partly social (arise from human interactions) There must be somenontrivial computational or informational element to the project There need not be a finalanalysis of the data but there must be some basic calculation of descriptive statistics over thedata and some demonstration of the validity of the data for the (or an) intended scientificpurpose ndash eg representativeness (and of what) balance randomization measurementvalidity etc
March 1 - Deadline for approval of teams and (proposed) projects
March 15 - 25 Project Review
March 29 - 50 Project Review
April 12 - 75 Project Review
April 26 - Team Project Presentations
May 3 - White Paper and Data Replication Archive Due Submit a 4-5 pagepaper that documents what was done and why discusses problems you encounteredprovides an assessment of the validity of the data for an analytic purpose (or how itmight be validated) and discusses what further work might be done to make the datainto a useful resource for others andor to publish an analysis based on the data Shareyour code and data with me in maximally documented and reproducible form (ideally anotebook stored on github or similar)
3
Course Schedule 2018
January 11
bull Introductions
bull Syllabus (What this course is and isnrsquot How this course (hopefully) works)
bull Further reference
10RulesforData SoftwareCarpentry Lessons
Section ldquoGeneral Resources for Python and Rrdquo
Recommended Python via Anaconda (httpswwwanacondacom) R (httpswwwr-projectorg) amp RStudio (httpswwwrstudiocom) Account on ICS-ACI (httpsicspsuedu) Git (httpsgit-scmcom)
bull Exercise 1 Team Updates 118 Due 125 BitByBit Exercise 26 (a-g) Teams
TeamFrancisco Sara Claire Rosemary Fangcao
TeamFreelin Brittany Arif So Young Xiaoran
TeamKankane Shipi Steve Lulu Omer
January 18
bull Readings (send list of confusing terms concepts by 700am Jan 17 Wednesday)
BitByBit Ch 1 amp 2
CompSocSci Monroe-No Monroe-5Vs
One article from a different discipline listed in the ldquoMultidisciplinary Perspectivesrdquosection excluding Business-BigData
bull Further reference Section ldquoBig Data amp Social Data Analyticsrdquo
bull Exercise 1 Team Updates
January 25
bull Readings (send list of confusing terms concepts by 700 am Jan 24 Wednesday)
GoogleFlu GoogleBooks EmbeddingsBias MachineBias RacistBot BDSS-Census
ResearchMethodsKB ldquoMeasurementrdquo Quinn-Topics
bull Further reference
Section ldquoiexclCuidadordquo
Section ldquoMeasurement Reliability and Validityrdquo
bull Exercise 1 Due Teams Report
4
bull Determine ldquodiscussion leadrdquo dates
bull Exercise 2 Due 28 Wikipedia Google Trends exercise Teams
TeamKelling Claire Brittany Arif
TeamYalcin Omer Lulu Fangcao
TeamPang Rosemary Shipi Sara
TeamSun Xiaoran Steve So Young
February 1
bull Bing Pan (RPTM) ldquoBig Data and Forecasting in Tourismrdquo
bull Readings (send list of confusing terms concepts by 700 am Jan 31 Wednesday)
Bing Pan ldquoIdentifying the Next Non-Stop Flying Market with a Big Data Approachrdquohttpsdoiorg101016jtourman201712008 ldquoGoogle Trends and Tourist Ar-rivals Emerging Biases and Proposed Correctionsrdquo httpsdoiorg101016j
tourman201710014 (See also ldquoForecasting Destination Weekly Hotel Occupancywith Big Datardquo httpjournalssagepubcomdoiabs1011770047287516669050ldquoForecasting tourism demand with composite search indexrdquo httpsdoiorg10
1016jtourman201607005)
UnobtrusiveMeasures
Multivariate-R Chapter 1 (Donrsquot get bogged down when the math starts) LatentVari-ables Chapter 1 (Skim - donrsquot get bogged down in the math) NetflixPrize
ResearchMethodsKB ldquoSamplingrdquo NRCReport Chapter 8 ldquoSampling and MassiveDatardquo BitByBit Chapter 3 ldquoAsking Questionsrdquo
bull Further reference
Section ldquoIndirect Unobtrusive Nonreactive Measures Data Exhaustrdquo
Section ldquoMultiple Measures Latent Variable Measurementrdquo
Section ldquoSampling and Survey Designrdquo
Section ldquoOpen data APIs linked data rdquo
Section ldquoSpace and Timerdquo (readings on Time)
February 8
bull Clio Andris (GEOG) ldquoWhat AirBNB and Yelp can teach us about human behavior incitiesrdquo
bull Readings (send list of confusing terms concepts by 700 am Feb 7 Wednesday)
Clio Andris ldquoUsing Yelp to Find Romance in the City A Case of Restaurants in FourCitiesrdquo httpswwwdropboxcomsuink0zcklwcpo5gYelp_Restaurantspdfdl=
0 ldquoHidden Style in the City An Analysis of Geolocated Airbnb Rental Images in TenMajor Citiesrdquo httpswwwdropboxcomst7y7f6880m4ty7vAirBNB_Analysispdf
dl=0
5
InfoRetrieval Chs 1 2 6 (you may prefer the slides from their classes) My primaryhope here is that you understand the ldquovector space modelrdquo and ldquocosine similarityrdquo(from Chapter 6) My secondary hope is that you understand the basics of Booleaninformation retrieval (and notions like ldquoindexrdquo ldquoinverted indexrdquo and ldquopostingsrdquo) Mytertiary hope is that you are exposed to some basic concepts of text analytics NLP(including ldquotokenizationrdquo ldquonormalizationrdquo ldquostemmingrdquoldquolemmatizationrdquo ldquostop wordsrdquoldquotf-idfrdquo)
FightinWords (FW was used by Jurafsky in httpfirstmondayorgojsindexphp
fmarticleview49443863 and a best-selling book The Language of Food for similarapplications to Andris based on Yelp reviews and now appears in his textbook NLP)
bull Further reference
Section ldquoSpace and Timerdquo
Section ldquoWeb Scrapingrdquo
Section ldquoData Representations Data Mappingsrdquo
Section ldquoFeature selection feature extraction feature engineering rdquo
Section ldquoDatabases and data managementrdquo (esp SQL)
Section ldquoLanguage Text Speech Audiordquo
bull Exercise 2 Due Teams Report
February 15 (Postponed)
bull Conrad Tucker (IE) ldquoCybersecurity Policies and Their Impact on Dynamic Data DrivenApplication Systemsrdquo
bull Readings (send list of confusing terms concepts by 700 am Feb 14 Wednesday)
Conrad Tucker ldquoCybersecurity Policies and Their Impact on Dynamic Data DrivenApplication Systemsrdquo httpieeexploreieeeorgdocument8064151 See alsoldquoGenerative Adversarial Networks for Increasing the Veracity of Big Datardquo http
ieeexploreieeeorgdocument8258219
February 22
bull David Reitter (IST) ldquoComputational Psycholinguisticsrdquo
bull Readings (two weeks worth)
David Reitter ldquoAlignment in Web-Based Dialogue Studies in Big Data ComputationalPsycholinguisticsrdquo (Based on data from the Cancer Survivors Network and Reddit)httpwwwdavid-reittercompubreitter2017alignmentpdf
Similarity
PatternRecognition Ch2 ldquoRepresentationrdquo
6
MMDS Chapter 3 ldquoFinding Similar Itemsrdquo (to understand ldquominhashingrdquo and ldquolocalitysensitive hashingrdquo yoursquoll need to understand ldquohashingrdquo ndash Section 132 also discussedin InfoRetrieval Chapter 3)
bull Further reference
Section ldquoSimilarity Distance Association rdquo
Section ldquoDerived Data Representations rdquo
Section ldquoRecord Linkage Entity Resolution rdquo
Section ldquoClustering Hashing Compression rdquo
March 1
bull Reading
BitByBit Ch 5 (ldquoCreating Mass Collaborationrdquo)
Crowdsourcing ldquoIntroductionrdquo and Ch1 ldquoConcepts Theories and Cases of Crowd-sourcingrdquo
HumanComputation Chs 125
bull Further reference
Section ldquoCrowdsourcing Human Computation Citizen Science Web Experimentsrdquo
Section ldquoMaking Up Data (smoothing convolution kernels )rdquo
Section ldquoVision image videordquo
bull Semester Projects Must be Approved Before Spring Break
March 8 - Spring Break
March 15
bull Daniel DellaPosta (SOC) ldquoNetworks and the Mid-20th Century American Mafiardquo
bull Reading
Dan DellaPosta ldquoNetwork Closure in the Mid-20th Century American Mafiardquo (SocialNetworks 2017) httpswww-sciencedirect-comezaccesslibrariespsuedusciencearticlepiiS037887331630199X ldquoBetween Clique and Corporation Boundary-Spanningin Solidary Groupsrdquo (RampR in American Journal of Sociology in the Box folder)
Networks Chs 1 2
MMDS Ch 5 ldquoLink Analysisrdquo
MarkovVisually
bull Further references
Section ldquoNetworks and Graphsrdquo
7
Section ldquoSimulation resampling Markov Chains rdquo
DeepLearning (different kind of network)
bull 25 Project Review
March 22
bull Reading
DeepLearning Ch2 ldquoLinear Algebrardquo
Shalizi-ADA Chs 16-18 (ldquoPrincipal Components Analysisrdquo ldquoFactor Modelsrdquo ldquoNonlin-ear Dimension Reductionrdquo)
bull Further reference
Section ldquoDimensionality reduction decomposition rdquo
Section ldquoMultiple measures latent variable measurementrdquo
Section ldquoLinear algebra matrix computationsrdquo
March 29
bull Naomi Altman (STAT) ldquoGeneralizing PCArdquo
bull Reading
Altman ldquoGeneralizing PCArdquo 2015 slides httppersonalpsuedunsa1AltmanWebpagePCATorontopdf
MapReduceIntuition (7 minute video providing ldquodivide and conquerrdquo intuition to MapRe-duce)
TidySAC-Video Hadley Wickham on the tidyverse ldquosplit-apply-combinerdquo with dplyrand tidy data (1 hour video) (More detail but perhaps less intuition in book formdiscussion of ldquotidyingrdquo data Chapter 5 of TidyData-R)
MMDS Ch 2 (Read the Chapter 2 in the new ldquoBETArdquo version of the book whichalso touches on Spark and Tensorflow in section 24 ldquoExtensions to MapReducerdquo httpistanfordedu~ullmanmmdsnhtml)
bull Further reference
Section ldquoNonlinear dimension reduction manifold learningrdquo
Section ldquoDatabases and data managementrdquo (esp NoSQL)
Section ldquoTheoretically-structured approaches to data wranglingrdquo
Section ldquoParallelism MapReduce Split-Apply-Combinerdquo
Section ldquoFunctional programmingrdquo
Section ldquoCutting and bleeding edge rdquo
bull 50 Project Review
bull Exercise 3 (Individual) Due 45
8
April 5
bull Prasenjit Mitra (IST) ldquoClassification of Tweets from Disaster Scenariosrdquo
bull Reading
Prasenjit Mitra Rudra et al rdquoSummarizing Situational and Topical InformationDuring Crisesrdquo httpsarxivorgpdf161001561pdf Imran Mitra amp CastillordquoTwitter as a Lifeline Human-annotated Twitter Corpora for NLP of Crisis-relatedMessagesrdquo httpmimranmepapersimran_prasenjit_carlos_lrec2016pdf
NLP (Jurafsky and Martin) Chapters 15 and 16 ldquoVector Semanticsrdquo and ldquoSemanticswith Dense Vectorsrdquo
bull Further reference
Section ldquoLanguage text speech audiordquo
GloVe word2vec word2vecExplained t-SNE
Section ldquoMobile devices distributed sensors rdquo
Section ldquoCrowdsourcing human computation citizen science rdquo
Section ldquoScaling iteration streaming data online algorithmsrdquo
bull Exercise 3 Due
April 12
bull Reading
BitByBit Ch 4 ldquoRunning Experimentsrdquo review Ch 2 sections ldquoNatural experimentsin observable datardquo examples
CausalInference Chs 1-2
bull Further reference
Section ldquoExperimental and observational designs for causal inferencerdquo
Section ldquoCrowdsourcing web experimentsrdquo
Section ldquoHuman subjects rdquo
bull 75 Project Review
April 19
bull Reading
BitByBit Ch 6 ldquoEthicsrdquo
PSU-ORP ldquoCommon Rule and Other Changesrdquo httpswwwresearchpsuedu
irbcommonrulechanges
DataPrivacy
9
Reproducibility
Refresh on MachineBias
bull Further reference
Section ldquoEthics and Scientific Responsibility in Big Social Datardquo
Section ldquoiexclCuidadordquo
April 26 - LAST CLASS MEETING
bull Team Project Presentations
May 3 - Team Projects Due
10
Past Visiting Speakers in SoDA 501 and 502
SoDA 502 (Fall 2017)
bull Chris Zorn (PLSC)
bull Diane Felmlee (SOC)
bull Alan MacEachren (GEOG)
bull Eric Plutzer (PLSC)
bull James LeBreton (PSYCH)
bull Sesa Slavkovic (STAT)
bull Guido Cervone (GEOG)
bull Ashton Verdery (SOC)
bull Maggie Niu (STAT)
bull Reka Albert (PHYS)
SoDA 501 (Spring 2017)
bull Tim Brick (HDFS) ldquoTowards real-time monitoring and intervention using wearable technologyrdquo
bull Aylin Caliskan (Princeton) ldquoA Story of Discrimination and Unfairness Bias in Word Embeddingsrdquo
bull Jay Yonamine (Google - IGERT alum) ldquoData Science in Industryrdquo
bull Johnathan Rush (Illinois) ldquoGeospatial Data Science Workshoprdquo
bull Rick Gilmore (PSYCH) ldquoToward a more reproducible and robust science of human behaviorrdquo
bull Glenn Firebaugh (SOC) ldquoMeasuring Inequality and Segregation with US Census Datardquo
bull Charles Twardy (Sotera) ldquoData Science for Search and Rescuerdquo
bull Anna Smith (Ohio State) ldquoA Hierarchical Model for Network Data in a Latent Hyperbolic Spacerdquo
bull Rebecca Passonneau (CSE) ldquoOmnigraph Rich Feature Representation for Graph Kernel Learningrdquo
bull Alex Klippel (GEOG) ldquoVirtual Reality for Immersive Analyticsrdquo
bull Murali Haran (STAT) ldquoA Computationally Efficient Projection-based Approach for Spatial General-ized Mixed Modelsrdquo
SoDA 502 (Fall 2016)
bull Clio Andris (GEOG) ldquoIntegrating Social Network Data into GISystemsrdquo
bull Jia Li (STAT) ldquoClustering under the Wasserstein Metricrdquo
bull Rachel Smith (CAS) ldquoStigma Networks Perceptions of Sociogramsrdquo
bull Zita Oravecz (HDFS)
bull Bethany Bray (Methodology Center) ldquoLatent Class and Latent Transition Analysisrdquo
bull Dave Hunter (STAT) ldquoModel Based Clustering of Large Networksrdquo
bull Scott Bennett (PLSC) ldquoABM Model of Insurgencyrdquo
bull David Reitter (IST)
bull Suzanna Linn (PLSC) ldquoMethodological Issues in Automated Text Analysis Application to NewsCoverage of the US Economyrdquo
11
SoDA 501 (Spring 2016)
bull Bruce Desmarais (PLSC) ldquoLearning in the Sunshine Analysis of Local Government Email Corporardquo
bull Timothy Brick (HDFS) ldquoMapping and Manipulating Facial Expressionrdquo
bull Qunying Huang (USC) ldquoSocial Media An Emerging Data Source for Human Mobility Studiesrdquo
bull Lingzhou Zue (STAT) ldquoAn Introduction to High-Dimensional Graphical Modelsrdquo
bull Ashton Verdery (SOC) ldquoSampling from Network Datardquo
bull Sarah Battersby (Tableau) ldquoHelping People See and Understand Spatial Datardquo
bull Alexandra Slavkovic (STAT) ldquoStatistical Privacy with Network Datardquo
bull Lee Giles (IST) ldquoMachine Learning for Scholarly Big Datardquo
12
Readings and References (updated Spring 2018)
We will discuss a relatively small subset of the readings listed here and this will vary based on topics andreadings discussed by visiting speakers and you yourselves The remainder are provided here as curatedreferences for more in-depth investigation in both the theory and practice related to the topic (with thelatter heavily weighted toward resources in Python and R)
dagger Material that is at last check made available through the Penn State library Most journal links shouldwork if you are logged in to a Penn State machine or through the Penn State VPN Some article archivesand most books require additional authentication through webaccess If links are broken start directly froma search via httpswwwlibrariespsuedu Some books require installation of e-readers like AdobeDigital Editions Links to lyndacom can be accessed through httplyndapsuedu
Dagger Material that is at last check legally provided for free In some cases these are the preprint versionsof published material
sect Material for which a legal selection is or will be provided through the class Box folder
Big Data amp Social Data Analytics
Overviews
bull Dagger[CompSocSci] David Lazer Alex Pentland Lada Adamic Sinan Aral Albert-Laszlo BarabasiDevon Brewer Nicholas Christakis Noshir Contractor James Fowler Myron Gutmann Tony JebaraGary King Michael Macy Deb Roy and Marshall Van Alstyne 2009 ldquoComputational Social SciencerdquoScience 323(5915)721-3 + Supp Feb 6 httpsciencesciencemagorgcontent3235915
721full httpsgkingharvardedufilesgkingfilesLazPenAda09pdf
bull Dagger[NRCreport] National Research Council 2013 Frontiers in Massive Data Analysis NationalAcademies Press (Free w registration httpwwwnapeducatalogphprecord_id=18374)Ch1 ldquoIntroductionrdquo Ch2 ldquoMassive Data in Science Technology Commerce National DefenseTelecommunications and other Endeavorsrdquo
bull Dagger[MMDS] Jure Leskovec Anand Rajaraman and Jeff Ullman 2014 Mining of Massive DatasetsCambridge University Press httpwwwmmdsorg (BETA version of Third Edition httpi
stanfordedu~ullmanmmdsnhtml)
bull sect[BitByBit] Matthew J Salganik 2018 (Forthcoming) Bit by Bit Social Research in the DigitalAge Princeton University Press Ch 1 ldquoIntroductionrdquo Ch 2 ldquoObserving Behaviorrdquo
bull DSHandbook-Py Ch 1 ldquoIntroduction Becoming a Unicornrdquo Ch2 ldquoThe Data Science Road Maprdquo
Burt-schtick
bull Dagger[Monroe-5Vs] Burt L Monroe 2013 ldquoThe Five Vs of Big Data Political Science Introductionto the Special Issue on Big Data in Political Sciencerdquo Political Analysis 21(V5) 1ndash9 https
doiorg101017S1047198700014315 (Volume Velocity Variety Vinculation Validity)
bull dagger[Monroe-No] Burt L Monroe Jennifer Pan Margaret E Roberts Maya Sen and Betsy Sin-clair 2015 ldquoNo Formal Theory Causal Inference and Big Data Are Not Contradictory Trendsin Political Sciencerdquo PS Political Science amp Politics 48(1) 71ndash4 httpdxdoiorg101017
S1049096514001760
bull dagger[Quinn-Topics] Kevin M Quinn Burt L Monroe Michael Colaresi Michael H Crespin andDragomir R Radev 2010 ldquoHow to Analyze Political Attention with Minimal Assumptions andCostsrdquo American Journal of Political Science 54(1) 209ndash28 httponlinelibrarywileycom
13
doi101111j1540-5907200900427xfull (esp topic modeling as measurement approach tovalidation)
bull dagger[FightinWords] Burt L Monroe Michael Colaresi and Kevin M Quinn 2008 ldquoFightinrsquo WordsLexical Feature Selection and Evaluation for Identifying the Content of Political Conflictrdquo PoliticalAnalysis 16(4) 372-403 httpsdoiorg101093panmpn018 (esp the impact of samplingvariance and regularization through priors)
bull Dagger[BDSS-Census] Big Data Social Science PSU Team 2012 ldquoA Closer Look at the Kaggle CensusDatardquo httpsburtmonroegithubioBDSSKaggleCensus2012 (esp the relevance of the socialprocesses by which data come to exist as data)
Multidisciplinary Perspectives
bull dagger[Business-BigData] Andrew McAfee and Erik Brynjolfsson 2012 ldquoBig Data The ManagementRevolutionrdquo Harvard Business Review 90(10) 61ndash8 October httpshbrorg201210big-
data-the-management-revolution Thomas H Davenport and DJ Patel 2012 ldquoData ScientistThe Sexiest Job of the 21st Centuryrdquo Harvard Business Review 90(10)70-6 October https
hbrorg201210data-scientist-the-sexiest-job-of-the-21st-century
bull dagger[InfoSci-BigData] CL Philip Chen and Chun-Yang Zhang 2014 ldquoData-intensive ApplicationsChallenges Techniques and Technologies A Survey on Big Datardquo Information Sciences 275 314-47httpsdoiorg101016jins201401015
bull dagger[Informatics-BigData] Vasant G Honavar 2014 ldquoThe Promise and Potential of Big Data ACase for Discovery Informaticsrdquo Review of Policy Research 31(4) 326-330 httpsdoiorg10
1111ropr12080
bull Dagger[Stats-BigData] Beate Franke Jean-Francois Ribana Roscher Annie Lee Cathal Smyth ArminHatefi Fuqi Chen Einat Gil Alexander Schwing Alessandro Selvitella Michael M Hoffman RogerGrosse Dietrich Hendricks and Nancy Reid 2016 ldquoStatistical Inference Learning and Models in BigDatardquo International Statistical Review 84(3) 371-89 httponlinelibrarywileycomdoi101111insr12176full
bull dagger[Econ-BigData] Hal R Varian 2013 ldquoBig Data New Tricks for Econometricsrdquo Journal ofEconomic Perspectives 28(2) 3-28 httpsdoiorg101257jep2823
bull dagger[GeoViz-BigData] Alan MacEachren 2017 ldquoLeveraging Big (Geo) Data with (Geo) Visual An-alytics Place as the Next Frontierrdquo In Chenghu Zhou Fenzhen Su Francis Harvey and Jun Xueds Spatial Data Handling in the Big Data Era pp 139-155 Springer httpslink-springer-
comezaccesslibrariespsueduchapter101007978-981-10-4424-3_10
bull dagger[Soc-BigData] David Lazer and Jason Radford 2017 ldquoData ex Machina Introduction to BigDatardquo Annual Review of Sociology 43 19-39 httpsdoiorg101146annurev-soc-060116-
053457
bull sect[Politics-BigData] Keith T Poole L Jason Anasastopolous and James E Monagan III Forthcom-ing ldquoThe lsquoBig Datarsquo Revolution in Political Campaigning and Governancerdquo Oxford Bibliographies inPolitical Science
iexclCuidado Traps Biases Problems Pains Perils
bull dagger[GoogleFlu] David Lazer Ryan Kennedy Gary King and Alessandro Vespignani 2014 ldquoTheParable of Google Flu Traps in Big Data Analysisrdquo Science (343) 14 March httpgking
harvardedufilesgkingfiles0314policyforumffpdf
bull Dagger[MachineBias] ProPublica ldquoMachine Bias Investigating Algorithmic Injusticerdquo Series https
wwwpropublicaorgseriesmachine-bias See especially
14
Julia Angwin Jeff Larson Surya Mattu and Lauren Kirchner 2016 ldquoMachine Biasrdquo ProPub-lica May 23 httpswwwpropublicaorgarticlemachine-bias-risk-assessments-in-
criminal-sentencing
Julia Angwin Madeleine Varner and Ariana Tobin 2017 ldquoFacebook Enabled Adverstisers toReach Jew Hatersrdquo httpswwwpropublicaorgarticlefacebook-enabled-advertisers-
to-reach-jew-haters
bull Dagger[Polling2016] Doug Rivers (Nov 11 2016) ldquoFirst Thoughts on Polling Problems in the 2016 USElectionsrdquo httpstodayyougovcomnews20161111first-thoughts-polling-problems-
2016-us-elections
bull dagger[EventData] Wei Wang Ryan Kennedy David Lazer Naren Ramakrishnan 2016 ldquoGrowing Painsfor Global Monitoring of Societal Eventsrdquo Science 3536307 pp 1502ndash1503 httpsdoiorg101126scienceaaf6758
bull Dagger[GoogleBooks] Eitan Adam Pechenick Christopher M Danforth and Peter Sheridan Dodds 2015ldquoCharacterizing the Google Books Corpus Strong Limits to Inferences of Socio-Cultural and LinguisticEvolutionrdquo PLoS One httpsdoiorg101371journalpone0137041
bull Dagger[OkCupid] Michael Zimmer 2016 ldquoOkCupid Study Reveals the Perils of Big-Data Sciencerdquo Wiredhttpswwwwiredcom201605okcupid-study-reveals-perils-big-data-science May 14
bull dagger[CriticalQuestions] danah boyd and Kate Crawford 2011 ldquoCritical Questions for Big Data Provo-cations for a Cultural Technological and Scholarly Phenomenonrdquo Information Communication ampSociety 15(5) 662ndash79 httpdxdoiorg1010801369118X2012678878
bull Dagger[RacistBot] Daniel Victor 2016 ldquoMicrosoft created a Twitter bot to learn from users It quicklybecame a racist jerkrdquo New York Times httpswwwnytimescom20160325technology
microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk
html
bull MMDS 122-123 on Bonferroni
bull Also BDSS-Census
Research Design and Measurement
Overviews
bull BitByBit
bull Dagger[ResearchMethodsKB] William M Trochim 2006 The Research Methods Knowledge Basehttpwwwsocialresearchmethodsnetkb
bull [7Rules] Glenn Firebaugh 2008 Seven Rules for Social Research Princeton University Press
Measurement Reliability and Validity
bull ResearchMethodsKB ldquoMeasurementrdquo
bull 7Rules Ch 3 ldquoBuild Reality Checks into Your Researchrdquo
bull Reliability see also FightinWords
bull Validity see also Quinn10-Topics Monroe-5Vs
Indirect Unobtrusive Nonreactive Measures Data Exhaust
15
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
Figure 1 SoDA and the social data stack
Assignments and Grades
The latter leads us to the main pedagogical components of SoDA 501
bull Engagement in Seminar - 40
Guest Speakers For half of the session most weeks we will host a guest speaker(typically a member of the Graduate Faculty in Social Data Analytics drawn from thefull range of participating disciplines) discussing an active research project or relatedtopic that touches on one or more areas of concern in the course For each speaker wewill have two or more of you acting as ldquodesignated respondentsrdquo with extra responsibilityfor having questions for discussion with the speaker
Readings and Seminar Discussion The readings discussion and what lecturing Iwill do will focus on interdisciplinary integration In part this involves identifying thoseconcepts that may be new to some of you in this setting ndash eg how ldquobigrdquo or ldquomachine
2
learningrdquo or ldquovisual analyticsrdquo approaches challenge conventional social science method-ology or how social scientific thinking challenges emerging practices and conventionalwisdom in data science ndash and tools associated with those concepts In part this involvesinterdisciplinary arbitrage and translation ndash identifying common concepts and structurethat may go by slightly different names in different disciplines and settings To thisend I want you each to send me ndash by Wednesday 700am each week by email ndash listsof terms concepts that you encountered in that weekrsquos reading in three categories (1)termsconcepts that were new but you think you now understand (2) termsconceptsthat seem to be used differently than in the context of your home discipline and (3)termsconcepts you still find confusing
Grading Criteria Full 30 points if you are present every week have made a good faitheffort to provide your lists of confusing terms and concepts on time have thoughtfullyread all of the assignments are prepared to talk about the weekrsquos readings and themesand consistently contribute in ways that are productive to the discussion (good ques-tions thoughtful responses etc) with all of that weighted more heavily when you are adesignated respondent If you donrsquot do any of that 0 points Sliding scale in between
bull Exercises - 20 It is explicitly not an objective of this course to ldquotrainrdquo you in all of thetools we will mention much less those that we could mention a task that would take yearsIn the interest of collectively ldquomoving the ball forwardrdquo for each of you however we will havea small number of assigned exercises Early in the semester these will be done in (assigned)interdisciplinary teams Later exercises will be individual
bull Semester Team Project - 40 You will in a team consisting of at least three disciplinescreate gather andor organizemanipulateprepare for analytics a ldquobigrdquo ldquosocialrdquo datasetThe data must be at least partly social (arise from human interactions) There must be somenontrivial computational or informational element to the project There need not be a finalanalysis of the data but there must be some basic calculation of descriptive statistics over thedata and some demonstration of the validity of the data for the (or an) intended scientificpurpose ndash eg representativeness (and of what) balance randomization measurementvalidity etc
March 1 - Deadline for approval of teams and (proposed) projects
March 15 - 25 Project Review
March 29 - 50 Project Review
April 12 - 75 Project Review
April 26 - Team Project Presentations
May 3 - White Paper and Data Replication Archive Due Submit a 4-5 pagepaper that documents what was done and why discusses problems you encounteredprovides an assessment of the validity of the data for an analytic purpose (or how itmight be validated) and discusses what further work might be done to make the datainto a useful resource for others andor to publish an analysis based on the data Shareyour code and data with me in maximally documented and reproducible form (ideally anotebook stored on github or similar)
3
Course Schedule 2018
January 11
bull Introductions
bull Syllabus (What this course is and isnrsquot How this course (hopefully) works)
bull Further reference
10RulesforData SoftwareCarpentry Lessons
Section ldquoGeneral Resources for Python and Rrdquo
Recommended Python via Anaconda (httpswwwanacondacom) R (httpswwwr-projectorg) amp RStudio (httpswwwrstudiocom) Account on ICS-ACI (httpsicspsuedu) Git (httpsgit-scmcom)
bull Exercise 1 Team Updates 118 Due 125 BitByBit Exercise 26 (a-g) Teams
TeamFrancisco Sara Claire Rosemary Fangcao
TeamFreelin Brittany Arif So Young Xiaoran
TeamKankane Shipi Steve Lulu Omer
January 18
bull Readings (send list of confusing terms concepts by 700am Jan 17 Wednesday)
BitByBit Ch 1 amp 2
CompSocSci Monroe-No Monroe-5Vs
One article from a different discipline listed in the ldquoMultidisciplinary Perspectivesrdquosection excluding Business-BigData
bull Further reference Section ldquoBig Data amp Social Data Analyticsrdquo
bull Exercise 1 Team Updates
January 25
bull Readings (send list of confusing terms concepts by 700 am Jan 24 Wednesday)
GoogleFlu GoogleBooks EmbeddingsBias MachineBias RacistBot BDSS-Census
ResearchMethodsKB ldquoMeasurementrdquo Quinn-Topics
bull Further reference
Section ldquoiexclCuidadordquo
Section ldquoMeasurement Reliability and Validityrdquo
bull Exercise 1 Due Teams Report
4
bull Determine ldquodiscussion leadrdquo dates
bull Exercise 2 Due 28 Wikipedia Google Trends exercise Teams
TeamKelling Claire Brittany Arif
TeamYalcin Omer Lulu Fangcao
TeamPang Rosemary Shipi Sara
TeamSun Xiaoran Steve So Young
February 1
bull Bing Pan (RPTM) ldquoBig Data and Forecasting in Tourismrdquo
bull Readings (send list of confusing terms concepts by 700 am Jan 31 Wednesday)
Bing Pan ldquoIdentifying the Next Non-Stop Flying Market with a Big Data Approachrdquohttpsdoiorg101016jtourman201712008 ldquoGoogle Trends and Tourist Ar-rivals Emerging Biases and Proposed Correctionsrdquo httpsdoiorg101016j
tourman201710014 (See also ldquoForecasting Destination Weekly Hotel Occupancywith Big Datardquo httpjournalssagepubcomdoiabs1011770047287516669050ldquoForecasting tourism demand with composite search indexrdquo httpsdoiorg10
1016jtourman201607005)
UnobtrusiveMeasures
Multivariate-R Chapter 1 (Donrsquot get bogged down when the math starts) LatentVari-ables Chapter 1 (Skim - donrsquot get bogged down in the math) NetflixPrize
ResearchMethodsKB ldquoSamplingrdquo NRCReport Chapter 8 ldquoSampling and MassiveDatardquo BitByBit Chapter 3 ldquoAsking Questionsrdquo
bull Further reference
Section ldquoIndirect Unobtrusive Nonreactive Measures Data Exhaustrdquo
Section ldquoMultiple Measures Latent Variable Measurementrdquo
Section ldquoSampling and Survey Designrdquo
Section ldquoOpen data APIs linked data rdquo
Section ldquoSpace and Timerdquo (readings on Time)
February 8
bull Clio Andris (GEOG) ldquoWhat AirBNB and Yelp can teach us about human behavior incitiesrdquo
bull Readings (send list of confusing terms concepts by 700 am Feb 7 Wednesday)
Clio Andris ldquoUsing Yelp to Find Romance in the City A Case of Restaurants in FourCitiesrdquo httpswwwdropboxcomsuink0zcklwcpo5gYelp_Restaurantspdfdl=
0 ldquoHidden Style in the City An Analysis of Geolocated Airbnb Rental Images in TenMajor Citiesrdquo httpswwwdropboxcomst7y7f6880m4ty7vAirBNB_Analysispdf
dl=0
5
InfoRetrieval Chs 1 2 6 (you may prefer the slides from their classes) My primaryhope here is that you understand the ldquovector space modelrdquo and ldquocosine similarityrdquo(from Chapter 6) My secondary hope is that you understand the basics of Booleaninformation retrieval (and notions like ldquoindexrdquo ldquoinverted indexrdquo and ldquopostingsrdquo) Mytertiary hope is that you are exposed to some basic concepts of text analytics NLP(including ldquotokenizationrdquo ldquonormalizationrdquo ldquostemmingrdquoldquolemmatizationrdquo ldquostop wordsrdquoldquotf-idfrdquo)
FightinWords (FW was used by Jurafsky in httpfirstmondayorgojsindexphp
fmarticleview49443863 and a best-selling book The Language of Food for similarapplications to Andris based on Yelp reviews and now appears in his textbook NLP)
bull Further reference
Section ldquoSpace and Timerdquo
Section ldquoWeb Scrapingrdquo
Section ldquoData Representations Data Mappingsrdquo
Section ldquoFeature selection feature extraction feature engineering rdquo
Section ldquoDatabases and data managementrdquo (esp SQL)
Section ldquoLanguage Text Speech Audiordquo
bull Exercise 2 Due Teams Report
February 15 (Postponed)
bull Conrad Tucker (IE) ldquoCybersecurity Policies and Their Impact on Dynamic Data DrivenApplication Systemsrdquo
bull Readings (send list of confusing terms concepts by 700 am Feb 14 Wednesday)
Conrad Tucker ldquoCybersecurity Policies and Their Impact on Dynamic Data DrivenApplication Systemsrdquo httpieeexploreieeeorgdocument8064151 See alsoldquoGenerative Adversarial Networks for Increasing the Veracity of Big Datardquo http
ieeexploreieeeorgdocument8258219
February 22
bull David Reitter (IST) ldquoComputational Psycholinguisticsrdquo
bull Readings (two weeks worth)
David Reitter ldquoAlignment in Web-Based Dialogue Studies in Big Data ComputationalPsycholinguisticsrdquo (Based on data from the Cancer Survivors Network and Reddit)httpwwwdavid-reittercompubreitter2017alignmentpdf
Similarity
PatternRecognition Ch2 ldquoRepresentationrdquo
6
MMDS Chapter 3 ldquoFinding Similar Itemsrdquo (to understand ldquominhashingrdquo and ldquolocalitysensitive hashingrdquo yoursquoll need to understand ldquohashingrdquo ndash Section 132 also discussedin InfoRetrieval Chapter 3)
bull Further reference
Section ldquoSimilarity Distance Association rdquo
Section ldquoDerived Data Representations rdquo
Section ldquoRecord Linkage Entity Resolution rdquo
Section ldquoClustering Hashing Compression rdquo
March 1
bull Reading
BitByBit Ch 5 (ldquoCreating Mass Collaborationrdquo)
Crowdsourcing ldquoIntroductionrdquo and Ch1 ldquoConcepts Theories and Cases of Crowd-sourcingrdquo
HumanComputation Chs 125
bull Further reference
Section ldquoCrowdsourcing Human Computation Citizen Science Web Experimentsrdquo
Section ldquoMaking Up Data (smoothing convolution kernels )rdquo
Section ldquoVision image videordquo
bull Semester Projects Must be Approved Before Spring Break
March 8 - Spring Break
March 15
bull Daniel DellaPosta (SOC) ldquoNetworks and the Mid-20th Century American Mafiardquo
bull Reading
Dan DellaPosta ldquoNetwork Closure in the Mid-20th Century American Mafiardquo (SocialNetworks 2017) httpswww-sciencedirect-comezaccesslibrariespsuedusciencearticlepiiS037887331630199X ldquoBetween Clique and Corporation Boundary-Spanningin Solidary Groupsrdquo (RampR in American Journal of Sociology in the Box folder)
Networks Chs 1 2
MMDS Ch 5 ldquoLink Analysisrdquo
MarkovVisually
bull Further references
Section ldquoNetworks and Graphsrdquo
7
Section ldquoSimulation resampling Markov Chains rdquo
DeepLearning (different kind of network)
bull 25 Project Review
March 22
bull Reading
DeepLearning Ch2 ldquoLinear Algebrardquo
Shalizi-ADA Chs 16-18 (ldquoPrincipal Components Analysisrdquo ldquoFactor Modelsrdquo ldquoNonlin-ear Dimension Reductionrdquo)
bull Further reference
Section ldquoDimensionality reduction decomposition rdquo
Section ldquoMultiple measures latent variable measurementrdquo
Section ldquoLinear algebra matrix computationsrdquo
March 29
bull Naomi Altman (STAT) ldquoGeneralizing PCArdquo
bull Reading
Altman ldquoGeneralizing PCArdquo 2015 slides httppersonalpsuedunsa1AltmanWebpagePCATorontopdf
MapReduceIntuition (7 minute video providing ldquodivide and conquerrdquo intuition to MapRe-duce)
TidySAC-Video Hadley Wickham on the tidyverse ldquosplit-apply-combinerdquo with dplyrand tidy data (1 hour video) (More detail but perhaps less intuition in book formdiscussion of ldquotidyingrdquo data Chapter 5 of TidyData-R)
MMDS Ch 2 (Read the Chapter 2 in the new ldquoBETArdquo version of the book whichalso touches on Spark and Tensorflow in section 24 ldquoExtensions to MapReducerdquo httpistanfordedu~ullmanmmdsnhtml)
bull Further reference
Section ldquoNonlinear dimension reduction manifold learningrdquo
Section ldquoDatabases and data managementrdquo (esp NoSQL)
Section ldquoTheoretically-structured approaches to data wranglingrdquo
Section ldquoParallelism MapReduce Split-Apply-Combinerdquo
Section ldquoFunctional programmingrdquo
Section ldquoCutting and bleeding edge rdquo
bull 50 Project Review
bull Exercise 3 (Individual) Due 45
8
April 5
bull Prasenjit Mitra (IST) ldquoClassification of Tweets from Disaster Scenariosrdquo
bull Reading
Prasenjit Mitra Rudra et al rdquoSummarizing Situational and Topical InformationDuring Crisesrdquo httpsarxivorgpdf161001561pdf Imran Mitra amp CastillordquoTwitter as a Lifeline Human-annotated Twitter Corpora for NLP of Crisis-relatedMessagesrdquo httpmimranmepapersimran_prasenjit_carlos_lrec2016pdf
NLP (Jurafsky and Martin) Chapters 15 and 16 ldquoVector Semanticsrdquo and ldquoSemanticswith Dense Vectorsrdquo
bull Further reference
Section ldquoLanguage text speech audiordquo
GloVe word2vec word2vecExplained t-SNE
Section ldquoMobile devices distributed sensors rdquo
Section ldquoCrowdsourcing human computation citizen science rdquo
Section ldquoScaling iteration streaming data online algorithmsrdquo
bull Exercise 3 Due
April 12
bull Reading
BitByBit Ch 4 ldquoRunning Experimentsrdquo review Ch 2 sections ldquoNatural experimentsin observable datardquo examples
CausalInference Chs 1-2
bull Further reference
Section ldquoExperimental and observational designs for causal inferencerdquo
Section ldquoCrowdsourcing web experimentsrdquo
Section ldquoHuman subjects rdquo
bull 75 Project Review
April 19
bull Reading
BitByBit Ch 6 ldquoEthicsrdquo
PSU-ORP ldquoCommon Rule and Other Changesrdquo httpswwwresearchpsuedu
irbcommonrulechanges
DataPrivacy
9
Reproducibility
Refresh on MachineBias
bull Further reference
Section ldquoEthics and Scientific Responsibility in Big Social Datardquo
Section ldquoiexclCuidadordquo
April 26 - LAST CLASS MEETING
bull Team Project Presentations
May 3 - Team Projects Due
10
Past Visiting Speakers in SoDA 501 and 502
SoDA 502 (Fall 2017)
bull Chris Zorn (PLSC)
bull Diane Felmlee (SOC)
bull Alan MacEachren (GEOG)
bull Eric Plutzer (PLSC)
bull James LeBreton (PSYCH)
bull Sesa Slavkovic (STAT)
bull Guido Cervone (GEOG)
bull Ashton Verdery (SOC)
bull Maggie Niu (STAT)
bull Reka Albert (PHYS)
SoDA 501 (Spring 2017)
bull Tim Brick (HDFS) ldquoTowards real-time monitoring and intervention using wearable technologyrdquo
bull Aylin Caliskan (Princeton) ldquoA Story of Discrimination and Unfairness Bias in Word Embeddingsrdquo
bull Jay Yonamine (Google - IGERT alum) ldquoData Science in Industryrdquo
bull Johnathan Rush (Illinois) ldquoGeospatial Data Science Workshoprdquo
bull Rick Gilmore (PSYCH) ldquoToward a more reproducible and robust science of human behaviorrdquo
bull Glenn Firebaugh (SOC) ldquoMeasuring Inequality and Segregation with US Census Datardquo
bull Charles Twardy (Sotera) ldquoData Science for Search and Rescuerdquo
bull Anna Smith (Ohio State) ldquoA Hierarchical Model for Network Data in a Latent Hyperbolic Spacerdquo
bull Rebecca Passonneau (CSE) ldquoOmnigraph Rich Feature Representation for Graph Kernel Learningrdquo
bull Alex Klippel (GEOG) ldquoVirtual Reality for Immersive Analyticsrdquo
bull Murali Haran (STAT) ldquoA Computationally Efficient Projection-based Approach for Spatial General-ized Mixed Modelsrdquo
SoDA 502 (Fall 2016)
bull Clio Andris (GEOG) ldquoIntegrating Social Network Data into GISystemsrdquo
bull Jia Li (STAT) ldquoClustering under the Wasserstein Metricrdquo
bull Rachel Smith (CAS) ldquoStigma Networks Perceptions of Sociogramsrdquo
bull Zita Oravecz (HDFS)
bull Bethany Bray (Methodology Center) ldquoLatent Class and Latent Transition Analysisrdquo
bull Dave Hunter (STAT) ldquoModel Based Clustering of Large Networksrdquo
bull Scott Bennett (PLSC) ldquoABM Model of Insurgencyrdquo
bull David Reitter (IST)
bull Suzanna Linn (PLSC) ldquoMethodological Issues in Automated Text Analysis Application to NewsCoverage of the US Economyrdquo
11
SoDA 501 (Spring 2016)
bull Bruce Desmarais (PLSC) ldquoLearning in the Sunshine Analysis of Local Government Email Corporardquo
bull Timothy Brick (HDFS) ldquoMapping and Manipulating Facial Expressionrdquo
bull Qunying Huang (USC) ldquoSocial Media An Emerging Data Source for Human Mobility Studiesrdquo
bull Lingzhou Zue (STAT) ldquoAn Introduction to High-Dimensional Graphical Modelsrdquo
bull Ashton Verdery (SOC) ldquoSampling from Network Datardquo
bull Sarah Battersby (Tableau) ldquoHelping People See and Understand Spatial Datardquo
bull Alexandra Slavkovic (STAT) ldquoStatistical Privacy with Network Datardquo
bull Lee Giles (IST) ldquoMachine Learning for Scholarly Big Datardquo
12
Readings and References (updated Spring 2018)
We will discuss a relatively small subset of the readings listed here and this will vary based on topics andreadings discussed by visiting speakers and you yourselves The remainder are provided here as curatedreferences for more in-depth investigation in both the theory and practice related to the topic (with thelatter heavily weighted toward resources in Python and R)
dagger Material that is at last check made available through the Penn State library Most journal links shouldwork if you are logged in to a Penn State machine or through the Penn State VPN Some article archivesand most books require additional authentication through webaccess If links are broken start directly froma search via httpswwwlibrariespsuedu Some books require installation of e-readers like AdobeDigital Editions Links to lyndacom can be accessed through httplyndapsuedu
Dagger Material that is at last check legally provided for free In some cases these are the preprint versionsof published material
sect Material for which a legal selection is or will be provided through the class Box folder
Big Data amp Social Data Analytics
Overviews
bull Dagger[CompSocSci] David Lazer Alex Pentland Lada Adamic Sinan Aral Albert-Laszlo BarabasiDevon Brewer Nicholas Christakis Noshir Contractor James Fowler Myron Gutmann Tony JebaraGary King Michael Macy Deb Roy and Marshall Van Alstyne 2009 ldquoComputational Social SciencerdquoScience 323(5915)721-3 + Supp Feb 6 httpsciencesciencemagorgcontent3235915
721full httpsgkingharvardedufilesgkingfilesLazPenAda09pdf
bull Dagger[NRCreport] National Research Council 2013 Frontiers in Massive Data Analysis NationalAcademies Press (Free w registration httpwwwnapeducatalogphprecord_id=18374)Ch1 ldquoIntroductionrdquo Ch2 ldquoMassive Data in Science Technology Commerce National DefenseTelecommunications and other Endeavorsrdquo
bull Dagger[MMDS] Jure Leskovec Anand Rajaraman and Jeff Ullman 2014 Mining of Massive DatasetsCambridge University Press httpwwwmmdsorg (BETA version of Third Edition httpi
stanfordedu~ullmanmmdsnhtml)
bull sect[BitByBit] Matthew J Salganik 2018 (Forthcoming) Bit by Bit Social Research in the DigitalAge Princeton University Press Ch 1 ldquoIntroductionrdquo Ch 2 ldquoObserving Behaviorrdquo
bull DSHandbook-Py Ch 1 ldquoIntroduction Becoming a Unicornrdquo Ch2 ldquoThe Data Science Road Maprdquo
Burt-schtick
bull Dagger[Monroe-5Vs] Burt L Monroe 2013 ldquoThe Five Vs of Big Data Political Science Introductionto the Special Issue on Big Data in Political Sciencerdquo Political Analysis 21(V5) 1ndash9 https
doiorg101017S1047198700014315 (Volume Velocity Variety Vinculation Validity)
bull dagger[Monroe-No] Burt L Monroe Jennifer Pan Margaret E Roberts Maya Sen and Betsy Sin-clair 2015 ldquoNo Formal Theory Causal Inference and Big Data Are Not Contradictory Trendsin Political Sciencerdquo PS Political Science amp Politics 48(1) 71ndash4 httpdxdoiorg101017
S1049096514001760
bull dagger[Quinn-Topics] Kevin M Quinn Burt L Monroe Michael Colaresi Michael H Crespin andDragomir R Radev 2010 ldquoHow to Analyze Political Attention with Minimal Assumptions andCostsrdquo American Journal of Political Science 54(1) 209ndash28 httponlinelibrarywileycom
13
doi101111j1540-5907200900427xfull (esp topic modeling as measurement approach tovalidation)
bull dagger[FightinWords] Burt L Monroe Michael Colaresi and Kevin M Quinn 2008 ldquoFightinrsquo WordsLexical Feature Selection and Evaluation for Identifying the Content of Political Conflictrdquo PoliticalAnalysis 16(4) 372-403 httpsdoiorg101093panmpn018 (esp the impact of samplingvariance and regularization through priors)
bull Dagger[BDSS-Census] Big Data Social Science PSU Team 2012 ldquoA Closer Look at the Kaggle CensusDatardquo httpsburtmonroegithubioBDSSKaggleCensus2012 (esp the relevance of the socialprocesses by which data come to exist as data)
Multidisciplinary Perspectives
bull dagger[Business-BigData] Andrew McAfee and Erik Brynjolfsson 2012 ldquoBig Data The ManagementRevolutionrdquo Harvard Business Review 90(10) 61ndash8 October httpshbrorg201210big-
data-the-management-revolution Thomas H Davenport and DJ Patel 2012 ldquoData ScientistThe Sexiest Job of the 21st Centuryrdquo Harvard Business Review 90(10)70-6 October https
hbrorg201210data-scientist-the-sexiest-job-of-the-21st-century
bull dagger[InfoSci-BigData] CL Philip Chen and Chun-Yang Zhang 2014 ldquoData-intensive ApplicationsChallenges Techniques and Technologies A Survey on Big Datardquo Information Sciences 275 314-47httpsdoiorg101016jins201401015
bull dagger[Informatics-BigData] Vasant G Honavar 2014 ldquoThe Promise and Potential of Big Data ACase for Discovery Informaticsrdquo Review of Policy Research 31(4) 326-330 httpsdoiorg10
1111ropr12080
bull Dagger[Stats-BigData] Beate Franke Jean-Francois Ribana Roscher Annie Lee Cathal Smyth ArminHatefi Fuqi Chen Einat Gil Alexander Schwing Alessandro Selvitella Michael M Hoffman RogerGrosse Dietrich Hendricks and Nancy Reid 2016 ldquoStatistical Inference Learning and Models in BigDatardquo International Statistical Review 84(3) 371-89 httponlinelibrarywileycomdoi101111insr12176full
bull dagger[Econ-BigData] Hal R Varian 2013 ldquoBig Data New Tricks for Econometricsrdquo Journal ofEconomic Perspectives 28(2) 3-28 httpsdoiorg101257jep2823
bull dagger[GeoViz-BigData] Alan MacEachren 2017 ldquoLeveraging Big (Geo) Data with (Geo) Visual An-alytics Place as the Next Frontierrdquo In Chenghu Zhou Fenzhen Su Francis Harvey and Jun Xueds Spatial Data Handling in the Big Data Era pp 139-155 Springer httpslink-springer-
comezaccesslibrariespsueduchapter101007978-981-10-4424-3_10
bull dagger[Soc-BigData] David Lazer and Jason Radford 2017 ldquoData ex Machina Introduction to BigDatardquo Annual Review of Sociology 43 19-39 httpsdoiorg101146annurev-soc-060116-
053457
bull sect[Politics-BigData] Keith T Poole L Jason Anasastopolous and James E Monagan III Forthcom-ing ldquoThe lsquoBig Datarsquo Revolution in Political Campaigning and Governancerdquo Oxford Bibliographies inPolitical Science
iexclCuidado Traps Biases Problems Pains Perils
bull dagger[GoogleFlu] David Lazer Ryan Kennedy Gary King and Alessandro Vespignani 2014 ldquoTheParable of Google Flu Traps in Big Data Analysisrdquo Science (343) 14 March httpgking
harvardedufilesgkingfiles0314policyforumffpdf
bull Dagger[MachineBias] ProPublica ldquoMachine Bias Investigating Algorithmic Injusticerdquo Series https
wwwpropublicaorgseriesmachine-bias See especially
14
Julia Angwin Jeff Larson Surya Mattu and Lauren Kirchner 2016 ldquoMachine Biasrdquo ProPub-lica May 23 httpswwwpropublicaorgarticlemachine-bias-risk-assessments-in-
criminal-sentencing
Julia Angwin Madeleine Varner and Ariana Tobin 2017 ldquoFacebook Enabled Adverstisers toReach Jew Hatersrdquo httpswwwpropublicaorgarticlefacebook-enabled-advertisers-
to-reach-jew-haters
bull Dagger[Polling2016] Doug Rivers (Nov 11 2016) ldquoFirst Thoughts on Polling Problems in the 2016 USElectionsrdquo httpstodayyougovcomnews20161111first-thoughts-polling-problems-
2016-us-elections
bull dagger[EventData] Wei Wang Ryan Kennedy David Lazer Naren Ramakrishnan 2016 ldquoGrowing Painsfor Global Monitoring of Societal Eventsrdquo Science 3536307 pp 1502ndash1503 httpsdoiorg101126scienceaaf6758
bull Dagger[GoogleBooks] Eitan Adam Pechenick Christopher M Danforth and Peter Sheridan Dodds 2015ldquoCharacterizing the Google Books Corpus Strong Limits to Inferences of Socio-Cultural and LinguisticEvolutionrdquo PLoS One httpsdoiorg101371journalpone0137041
bull Dagger[OkCupid] Michael Zimmer 2016 ldquoOkCupid Study Reveals the Perils of Big-Data Sciencerdquo Wiredhttpswwwwiredcom201605okcupid-study-reveals-perils-big-data-science May 14
bull dagger[CriticalQuestions] danah boyd and Kate Crawford 2011 ldquoCritical Questions for Big Data Provo-cations for a Cultural Technological and Scholarly Phenomenonrdquo Information Communication ampSociety 15(5) 662ndash79 httpdxdoiorg1010801369118X2012678878
bull Dagger[RacistBot] Daniel Victor 2016 ldquoMicrosoft created a Twitter bot to learn from users It quicklybecame a racist jerkrdquo New York Times httpswwwnytimescom20160325technology
microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk
html
bull MMDS 122-123 on Bonferroni
bull Also BDSS-Census
Research Design and Measurement
Overviews
bull BitByBit
bull Dagger[ResearchMethodsKB] William M Trochim 2006 The Research Methods Knowledge Basehttpwwwsocialresearchmethodsnetkb
bull [7Rules] Glenn Firebaugh 2008 Seven Rules for Social Research Princeton University Press
Measurement Reliability and Validity
bull ResearchMethodsKB ldquoMeasurementrdquo
bull 7Rules Ch 3 ldquoBuild Reality Checks into Your Researchrdquo
bull Reliability see also FightinWords
bull Validity see also Quinn10-Topics Monroe-5Vs
Indirect Unobtrusive Nonreactive Measures Data Exhaust
15
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
learningrdquo or ldquovisual analyticsrdquo approaches challenge conventional social science method-ology or how social scientific thinking challenges emerging practices and conventionalwisdom in data science ndash and tools associated with those concepts In part this involvesinterdisciplinary arbitrage and translation ndash identifying common concepts and structurethat may go by slightly different names in different disciplines and settings To thisend I want you each to send me ndash by Wednesday 700am each week by email ndash listsof terms concepts that you encountered in that weekrsquos reading in three categories (1)termsconcepts that were new but you think you now understand (2) termsconceptsthat seem to be used differently than in the context of your home discipline and (3)termsconcepts you still find confusing
Grading Criteria Full 30 points if you are present every week have made a good faitheffort to provide your lists of confusing terms and concepts on time have thoughtfullyread all of the assignments are prepared to talk about the weekrsquos readings and themesand consistently contribute in ways that are productive to the discussion (good ques-tions thoughtful responses etc) with all of that weighted more heavily when you are adesignated respondent If you donrsquot do any of that 0 points Sliding scale in between
bull Exercises - 20 It is explicitly not an objective of this course to ldquotrainrdquo you in all of thetools we will mention much less those that we could mention a task that would take yearsIn the interest of collectively ldquomoving the ball forwardrdquo for each of you however we will havea small number of assigned exercises Early in the semester these will be done in (assigned)interdisciplinary teams Later exercises will be individual
bull Semester Team Project - 40 You will in a team consisting of at least three disciplinescreate gather andor organizemanipulateprepare for analytics a ldquobigrdquo ldquosocialrdquo datasetThe data must be at least partly social (arise from human interactions) There must be somenontrivial computational or informational element to the project There need not be a finalanalysis of the data but there must be some basic calculation of descriptive statistics over thedata and some demonstration of the validity of the data for the (or an) intended scientificpurpose ndash eg representativeness (and of what) balance randomization measurementvalidity etc
March 1 - Deadline for approval of teams and (proposed) projects
March 15 - 25 Project Review
March 29 - 50 Project Review
April 12 - 75 Project Review
April 26 - Team Project Presentations
May 3 - White Paper and Data Replication Archive Due Submit a 4-5 pagepaper that documents what was done and why discusses problems you encounteredprovides an assessment of the validity of the data for an analytic purpose (or how itmight be validated) and discusses what further work might be done to make the datainto a useful resource for others andor to publish an analysis based on the data Shareyour code and data with me in maximally documented and reproducible form (ideally anotebook stored on github or similar)
3
Course Schedule 2018
January 11
bull Introductions
bull Syllabus (What this course is and isnrsquot How this course (hopefully) works)
bull Further reference
10RulesforData SoftwareCarpentry Lessons
Section ldquoGeneral Resources for Python and Rrdquo
Recommended Python via Anaconda (httpswwwanacondacom) R (httpswwwr-projectorg) amp RStudio (httpswwwrstudiocom) Account on ICS-ACI (httpsicspsuedu) Git (httpsgit-scmcom)
bull Exercise 1 Team Updates 118 Due 125 BitByBit Exercise 26 (a-g) Teams
TeamFrancisco Sara Claire Rosemary Fangcao
TeamFreelin Brittany Arif So Young Xiaoran
TeamKankane Shipi Steve Lulu Omer
January 18
bull Readings (send list of confusing terms concepts by 700am Jan 17 Wednesday)
BitByBit Ch 1 amp 2
CompSocSci Monroe-No Monroe-5Vs
One article from a different discipline listed in the ldquoMultidisciplinary Perspectivesrdquosection excluding Business-BigData
bull Further reference Section ldquoBig Data amp Social Data Analyticsrdquo
bull Exercise 1 Team Updates
January 25
bull Readings (send list of confusing terms concepts by 700 am Jan 24 Wednesday)
GoogleFlu GoogleBooks EmbeddingsBias MachineBias RacistBot BDSS-Census
ResearchMethodsKB ldquoMeasurementrdquo Quinn-Topics
bull Further reference
Section ldquoiexclCuidadordquo
Section ldquoMeasurement Reliability and Validityrdquo
bull Exercise 1 Due Teams Report
4
bull Determine ldquodiscussion leadrdquo dates
bull Exercise 2 Due 28 Wikipedia Google Trends exercise Teams
TeamKelling Claire Brittany Arif
TeamYalcin Omer Lulu Fangcao
TeamPang Rosemary Shipi Sara
TeamSun Xiaoran Steve So Young
February 1
bull Bing Pan (RPTM) ldquoBig Data and Forecasting in Tourismrdquo
bull Readings (send list of confusing terms concepts by 700 am Jan 31 Wednesday)
Bing Pan ldquoIdentifying the Next Non-Stop Flying Market with a Big Data Approachrdquohttpsdoiorg101016jtourman201712008 ldquoGoogle Trends and Tourist Ar-rivals Emerging Biases and Proposed Correctionsrdquo httpsdoiorg101016j
tourman201710014 (See also ldquoForecasting Destination Weekly Hotel Occupancywith Big Datardquo httpjournalssagepubcomdoiabs1011770047287516669050ldquoForecasting tourism demand with composite search indexrdquo httpsdoiorg10
1016jtourman201607005)
UnobtrusiveMeasures
Multivariate-R Chapter 1 (Donrsquot get bogged down when the math starts) LatentVari-ables Chapter 1 (Skim - donrsquot get bogged down in the math) NetflixPrize
ResearchMethodsKB ldquoSamplingrdquo NRCReport Chapter 8 ldquoSampling and MassiveDatardquo BitByBit Chapter 3 ldquoAsking Questionsrdquo
bull Further reference
Section ldquoIndirect Unobtrusive Nonreactive Measures Data Exhaustrdquo
Section ldquoMultiple Measures Latent Variable Measurementrdquo
Section ldquoSampling and Survey Designrdquo
Section ldquoOpen data APIs linked data rdquo
Section ldquoSpace and Timerdquo (readings on Time)
February 8
bull Clio Andris (GEOG) ldquoWhat AirBNB and Yelp can teach us about human behavior incitiesrdquo
bull Readings (send list of confusing terms concepts by 700 am Feb 7 Wednesday)
Clio Andris ldquoUsing Yelp to Find Romance in the City A Case of Restaurants in FourCitiesrdquo httpswwwdropboxcomsuink0zcklwcpo5gYelp_Restaurantspdfdl=
0 ldquoHidden Style in the City An Analysis of Geolocated Airbnb Rental Images in TenMajor Citiesrdquo httpswwwdropboxcomst7y7f6880m4ty7vAirBNB_Analysispdf
dl=0
5
InfoRetrieval Chs 1 2 6 (you may prefer the slides from their classes) My primaryhope here is that you understand the ldquovector space modelrdquo and ldquocosine similarityrdquo(from Chapter 6) My secondary hope is that you understand the basics of Booleaninformation retrieval (and notions like ldquoindexrdquo ldquoinverted indexrdquo and ldquopostingsrdquo) Mytertiary hope is that you are exposed to some basic concepts of text analytics NLP(including ldquotokenizationrdquo ldquonormalizationrdquo ldquostemmingrdquoldquolemmatizationrdquo ldquostop wordsrdquoldquotf-idfrdquo)
FightinWords (FW was used by Jurafsky in httpfirstmondayorgojsindexphp
fmarticleview49443863 and a best-selling book The Language of Food for similarapplications to Andris based on Yelp reviews and now appears in his textbook NLP)
bull Further reference
Section ldquoSpace and Timerdquo
Section ldquoWeb Scrapingrdquo
Section ldquoData Representations Data Mappingsrdquo
Section ldquoFeature selection feature extraction feature engineering rdquo
Section ldquoDatabases and data managementrdquo (esp SQL)
Section ldquoLanguage Text Speech Audiordquo
bull Exercise 2 Due Teams Report
February 15 (Postponed)
bull Conrad Tucker (IE) ldquoCybersecurity Policies and Their Impact on Dynamic Data DrivenApplication Systemsrdquo
bull Readings (send list of confusing terms concepts by 700 am Feb 14 Wednesday)
Conrad Tucker ldquoCybersecurity Policies and Their Impact on Dynamic Data DrivenApplication Systemsrdquo httpieeexploreieeeorgdocument8064151 See alsoldquoGenerative Adversarial Networks for Increasing the Veracity of Big Datardquo http
ieeexploreieeeorgdocument8258219
February 22
bull David Reitter (IST) ldquoComputational Psycholinguisticsrdquo
bull Readings (two weeks worth)
David Reitter ldquoAlignment in Web-Based Dialogue Studies in Big Data ComputationalPsycholinguisticsrdquo (Based on data from the Cancer Survivors Network and Reddit)httpwwwdavid-reittercompubreitter2017alignmentpdf
Similarity
PatternRecognition Ch2 ldquoRepresentationrdquo
6
MMDS Chapter 3 ldquoFinding Similar Itemsrdquo (to understand ldquominhashingrdquo and ldquolocalitysensitive hashingrdquo yoursquoll need to understand ldquohashingrdquo ndash Section 132 also discussedin InfoRetrieval Chapter 3)
bull Further reference
Section ldquoSimilarity Distance Association rdquo
Section ldquoDerived Data Representations rdquo
Section ldquoRecord Linkage Entity Resolution rdquo
Section ldquoClustering Hashing Compression rdquo
March 1
bull Reading
BitByBit Ch 5 (ldquoCreating Mass Collaborationrdquo)
Crowdsourcing ldquoIntroductionrdquo and Ch1 ldquoConcepts Theories and Cases of Crowd-sourcingrdquo
HumanComputation Chs 125
bull Further reference
Section ldquoCrowdsourcing Human Computation Citizen Science Web Experimentsrdquo
Section ldquoMaking Up Data (smoothing convolution kernels )rdquo
Section ldquoVision image videordquo
bull Semester Projects Must be Approved Before Spring Break
March 8 - Spring Break
March 15
bull Daniel DellaPosta (SOC) ldquoNetworks and the Mid-20th Century American Mafiardquo
bull Reading
Dan DellaPosta ldquoNetwork Closure in the Mid-20th Century American Mafiardquo (SocialNetworks 2017) httpswww-sciencedirect-comezaccesslibrariespsuedusciencearticlepiiS037887331630199X ldquoBetween Clique and Corporation Boundary-Spanningin Solidary Groupsrdquo (RampR in American Journal of Sociology in the Box folder)
Networks Chs 1 2
MMDS Ch 5 ldquoLink Analysisrdquo
MarkovVisually
bull Further references
Section ldquoNetworks and Graphsrdquo
7
Section ldquoSimulation resampling Markov Chains rdquo
DeepLearning (different kind of network)
bull 25 Project Review
March 22
bull Reading
DeepLearning Ch2 ldquoLinear Algebrardquo
Shalizi-ADA Chs 16-18 (ldquoPrincipal Components Analysisrdquo ldquoFactor Modelsrdquo ldquoNonlin-ear Dimension Reductionrdquo)
bull Further reference
Section ldquoDimensionality reduction decomposition rdquo
Section ldquoMultiple measures latent variable measurementrdquo
Section ldquoLinear algebra matrix computationsrdquo
March 29
bull Naomi Altman (STAT) ldquoGeneralizing PCArdquo
bull Reading
Altman ldquoGeneralizing PCArdquo 2015 slides httppersonalpsuedunsa1AltmanWebpagePCATorontopdf
MapReduceIntuition (7 minute video providing ldquodivide and conquerrdquo intuition to MapRe-duce)
TidySAC-Video Hadley Wickham on the tidyverse ldquosplit-apply-combinerdquo with dplyrand tidy data (1 hour video) (More detail but perhaps less intuition in book formdiscussion of ldquotidyingrdquo data Chapter 5 of TidyData-R)
MMDS Ch 2 (Read the Chapter 2 in the new ldquoBETArdquo version of the book whichalso touches on Spark and Tensorflow in section 24 ldquoExtensions to MapReducerdquo httpistanfordedu~ullmanmmdsnhtml)
bull Further reference
Section ldquoNonlinear dimension reduction manifold learningrdquo
Section ldquoDatabases and data managementrdquo (esp NoSQL)
Section ldquoTheoretically-structured approaches to data wranglingrdquo
Section ldquoParallelism MapReduce Split-Apply-Combinerdquo
Section ldquoFunctional programmingrdquo
Section ldquoCutting and bleeding edge rdquo
bull 50 Project Review
bull Exercise 3 (Individual) Due 45
8
April 5
bull Prasenjit Mitra (IST) ldquoClassification of Tweets from Disaster Scenariosrdquo
bull Reading
Prasenjit Mitra Rudra et al rdquoSummarizing Situational and Topical InformationDuring Crisesrdquo httpsarxivorgpdf161001561pdf Imran Mitra amp CastillordquoTwitter as a Lifeline Human-annotated Twitter Corpora for NLP of Crisis-relatedMessagesrdquo httpmimranmepapersimran_prasenjit_carlos_lrec2016pdf
NLP (Jurafsky and Martin) Chapters 15 and 16 ldquoVector Semanticsrdquo and ldquoSemanticswith Dense Vectorsrdquo
bull Further reference
Section ldquoLanguage text speech audiordquo
GloVe word2vec word2vecExplained t-SNE
Section ldquoMobile devices distributed sensors rdquo
Section ldquoCrowdsourcing human computation citizen science rdquo
Section ldquoScaling iteration streaming data online algorithmsrdquo
bull Exercise 3 Due
April 12
bull Reading
BitByBit Ch 4 ldquoRunning Experimentsrdquo review Ch 2 sections ldquoNatural experimentsin observable datardquo examples
CausalInference Chs 1-2
bull Further reference
Section ldquoExperimental and observational designs for causal inferencerdquo
Section ldquoCrowdsourcing web experimentsrdquo
Section ldquoHuman subjects rdquo
bull 75 Project Review
April 19
bull Reading
BitByBit Ch 6 ldquoEthicsrdquo
PSU-ORP ldquoCommon Rule and Other Changesrdquo httpswwwresearchpsuedu
irbcommonrulechanges
DataPrivacy
9
Reproducibility
Refresh on MachineBias
bull Further reference
Section ldquoEthics and Scientific Responsibility in Big Social Datardquo
Section ldquoiexclCuidadordquo
April 26 - LAST CLASS MEETING
bull Team Project Presentations
May 3 - Team Projects Due
10
Past Visiting Speakers in SoDA 501 and 502
SoDA 502 (Fall 2017)
bull Chris Zorn (PLSC)
bull Diane Felmlee (SOC)
bull Alan MacEachren (GEOG)
bull Eric Plutzer (PLSC)
bull James LeBreton (PSYCH)
bull Sesa Slavkovic (STAT)
bull Guido Cervone (GEOG)
bull Ashton Verdery (SOC)
bull Maggie Niu (STAT)
bull Reka Albert (PHYS)
SoDA 501 (Spring 2017)
bull Tim Brick (HDFS) ldquoTowards real-time monitoring and intervention using wearable technologyrdquo
bull Aylin Caliskan (Princeton) ldquoA Story of Discrimination and Unfairness Bias in Word Embeddingsrdquo
bull Jay Yonamine (Google - IGERT alum) ldquoData Science in Industryrdquo
bull Johnathan Rush (Illinois) ldquoGeospatial Data Science Workshoprdquo
bull Rick Gilmore (PSYCH) ldquoToward a more reproducible and robust science of human behaviorrdquo
bull Glenn Firebaugh (SOC) ldquoMeasuring Inequality and Segregation with US Census Datardquo
bull Charles Twardy (Sotera) ldquoData Science for Search and Rescuerdquo
bull Anna Smith (Ohio State) ldquoA Hierarchical Model for Network Data in a Latent Hyperbolic Spacerdquo
bull Rebecca Passonneau (CSE) ldquoOmnigraph Rich Feature Representation for Graph Kernel Learningrdquo
bull Alex Klippel (GEOG) ldquoVirtual Reality for Immersive Analyticsrdquo
bull Murali Haran (STAT) ldquoA Computationally Efficient Projection-based Approach for Spatial General-ized Mixed Modelsrdquo
SoDA 502 (Fall 2016)
bull Clio Andris (GEOG) ldquoIntegrating Social Network Data into GISystemsrdquo
bull Jia Li (STAT) ldquoClustering under the Wasserstein Metricrdquo
bull Rachel Smith (CAS) ldquoStigma Networks Perceptions of Sociogramsrdquo
bull Zita Oravecz (HDFS)
bull Bethany Bray (Methodology Center) ldquoLatent Class and Latent Transition Analysisrdquo
bull Dave Hunter (STAT) ldquoModel Based Clustering of Large Networksrdquo
bull Scott Bennett (PLSC) ldquoABM Model of Insurgencyrdquo
bull David Reitter (IST)
bull Suzanna Linn (PLSC) ldquoMethodological Issues in Automated Text Analysis Application to NewsCoverage of the US Economyrdquo
11
SoDA 501 (Spring 2016)
bull Bruce Desmarais (PLSC) ldquoLearning in the Sunshine Analysis of Local Government Email Corporardquo
bull Timothy Brick (HDFS) ldquoMapping and Manipulating Facial Expressionrdquo
bull Qunying Huang (USC) ldquoSocial Media An Emerging Data Source for Human Mobility Studiesrdquo
bull Lingzhou Zue (STAT) ldquoAn Introduction to High-Dimensional Graphical Modelsrdquo
bull Ashton Verdery (SOC) ldquoSampling from Network Datardquo
bull Sarah Battersby (Tableau) ldquoHelping People See and Understand Spatial Datardquo
bull Alexandra Slavkovic (STAT) ldquoStatistical Privacy with Network Datardquo
bull Lee Giles (IST) ldquoMachine Learning for Scholarly Big Datardquo
12
Readings and References (updated Spring 2018)
We will discuss a relatively small subset of the readings listed here and this will vary based on topics andreadings discussed by visiting speakers and you yourselves The remainder are provided here as curatedreferences for more in-depth investigation in both the theory and practice related to the topic (with thelatter heavily weighted toward resources in Python and R)
dagger Material that is at last check made available through the Penn State library Most journal links shouldwork if you are logged in to a Penn State machine or through the Penn State VPN Some article archivesand most books require additional authentication through webaccess If links are broken start directly froma search via httpswwwlibrariespsuedu Some books require installation of e-readers like AdobeDigital Editions Links to lyndacom can be accessed through httplyndapsuedu
Dagger Material that is at last check legally provided for free In some cases these are the preprint versionsof published material
sect Material for which a legal selection is or will be provided through the class Box folder
Big Data amp Social Data Analytics
Overviews
bull Dagger[CompSocSci] David Lazer Alex Pentland Lada Adamic Sinan Aral Albert-Laszlo BarabasiDevon Brewer Nicholas Christakis Noshir Contractor James Fowler Myron Gutmann Tony JebaraGary King Michael Macy Deb Roy and Marshall Van Alstyne 2009 ldquoComputational Social SciencerdquoScience 323(5915)721-3 + Supp Feb 6 httpsciencesciencemagorgcontent3235915
721full httpsgkingharvardedufilesgkingfilesLazPenAda09pdf
bull Dagger[NRCreport] National Research Council 2013 Frontiers in Massive Data Analysis NationalAcademies Press (Free w registration httpwwwnapeducatalogphprecord_id=18374)Ch1 ldquoIntroductionrdquo Ch2 ldquoMassive Data in Science Technology Commerce National DefenseTelecommunications and other Endeavorsrdquo
bull Dagger[MMDS] Jure Leskovec Anand Rajaraman and Jeff Ullman 2014 Mining of Massive DatasetsCambridge University Press httpwwwmmdsorg (BETA version of Third Edition httpi
stanfordedu~ullmanmmdsnhtml)
bull sect[BitByBit] Matthew J Salganik 2018 (Forthcoming) Bit by Bit Social Research in the DigitalAge Princeton University Press Ch 1 ldquoIntroductionrdquo Ch 2 ldquoObserving Behaviorrdquo
bull DSHandbook-Py Ch 1 ldquoIntroduction Becoming a Unicornrdquo Ch2 ldquoThe Data Science Road Maprdquo
Burt-schtick
bull Dagger[Monroe-5Vs] Burt L Monroe 2013 ldquoThe Five Vs of Big Data Political Science Introductionto the Special Issue on Big Data in Political Sciencerdquo Political Analysis 21(V5) 1ndash9 https
doiorg101017S1047198700014315 (Volume Velocity Variety Vinculation Validity)
bull dagger[Monroe-No] Burt L Monroe Jennifer Pan Margaret E Roberts Maya Sen and Betsy Sin-clair 2015 ldquoNo Formal Theory Causal Inference and Big Data Are Not Contradictory Trendsin Political Sciencerdquo PS Political Science amp Politics 48(1) 71ndash4 httpdxdoiorg101017
S1049096514001760
bull dagger[Quinn-Topics] Kevin M Quinn Burt L Monroe Michael Colaresi Michael H Crespin andDragomir R Radev 2010 ldquoHow to Analyze Political Attention with Minimal Assumptions andCostsrdquo American Journal of Political Science 54(1) 209ndash28 httponlinelibrarywileycom
13
doi101111j1540-5907200900427xfull (esp topic modeling as measurement approach tovalidation)
bull dagger[FightinWords] Burt L Monroe Michael Colaresi and Kevin M Quinn 2008 ldquoFightinrsquo WordsLexical Feature Selection and Evaluation for Identifying the Content of Political Conflictrdquo PoliticalAnalysis 16(4) 372-403 httpsdoiorg101093panmpn018 (esp the impact of samplingvariance and regularization through priors)
bull Dagger[BDSS-Census] Big Data Social Science PSU Team 2012 ldquoA Closer Look at the Kaggle CensusDatardquo httpsburtmonroegithubioBDSSKaggleCensus2012 (esp the relevance of the socialprocesses by which data come to exist as data)
Multidisciplinary Perspectives
bull dagger[Business-BigData] Andrew McAfee and Erik Brynjolfsson 2012 ldquoBig Data The ManagementRevolutionrdquo Harvard Business Review 90(10) 61ndash8 October httpshbrorg201210big-
data-the-management-revolution Thomas H Davenport and DJ Patel 2012 ldquoData ScientistThe Sexiest Job of the 21st Centuryrdquo Harvard Business Review 90(10)70-6 October https
hbrorg201210data-scientist-the-sexiest-job-of-the-21st-century
bull dagger[InfoSci-BigData] CL Philip Chen and Chun-Yang Zhang 2014 ldquoData-intensive ApplicationsChallenges Techniques and Technologies A Survey on Big Datardquo Information Sciences 275 314-47httpsdoiorg101016jins201401015
bull dagger[Informatics-BigData] Vasant G Honavar 2014 ldquoThe Promise and Potential of Big Data ACase for Discovery Informaticsrdquo Review of Policy Research 31(4) 326-330 httpsdoiorg10
1111ropr12080
bull Dagger[Stats-BigData] Beate Franke Jean-Francois Ribana Roscher Annie Lee Cathal Smyth ArminHatefi Fuqi Chen Einat Gil Alexander Schwing Alessandro Selvitella Michael M Hoffman RogerGrosse Dietrich Hendricks and Nancy Reid 2016 ldquoStatistical Inference Learning and Models in BigDatardquo International Statistical Review 84(3) 371-89 httponlinelibrarywileycomdoi101111insr12176full
bull dagger[Econ-BigData] Hal R Varian 2013 ldquoBig Data New Tricks for Econometricsrdquo Journal ofEconomic Perspectives 28(2) 3-28 httpsdoiorg101257jep2823
bull dagger[GeoViz-BigData] Alan MacEachren 2017 ldquoLeveraging Big (Geo) Data with (Geo) Visual An-alytics Place as the Next Frontierrdquo In Chenghu Zhou Fenzhen Su Francis Harvey and Jun Xueds Spatial Data Handling in the Big Data Era pp 139-155 Springer httpslink-springer-
comezaccesslibrariespsueduchapter101007978-981-10-4424-3_10
bull dagger[Soc-BigData] David Lazer and Jason Radford 2017 ldquoData ex Machina Introduction to BigDatardquo Annual Review of Sociology 43 19-39 httpsdoiorg101146annurev-soc-060116-
053457
bull sect[Politics-BigData] Keith T Poole L Jason Anasastopolous and James E Monagan III Forthcom-ing ldquoThe lsquoBig Datarsquo Revolution in Political Campaigning and Governancerdquo Oxford Bibliographies inPolitical Science
iexclCuidado Traps Biases Problems Pains Perils
bull dagger[GoogleFlu] David Lazer Ryan Kennedy Gary King and Alessandro Vespignani 2014 ldquoTheParable of Google Flu Traps in Big Data Analysisrdquo Science (343) 14 March httpgking
harvardedufilesgkingfiles0314policyforumffpdf
bull Dagger[MachineBias] ProPublica ldquoMachine Bias Investigating Algorithmic Injusticerdquo Series https
wwwpropublicaorgseriesmachine-bias See especially
14
Julia Angwin Jeff Larson Surya Mattu and Lauren Kirchner 2016 ldquoMachine Biasrdquo ProPub-lica May 23 httpswwwpropublicaorgarticlemachine-bias-risk-assessments-in-
criminal-sentencing
Julia Angwin Madeleine Varner and Ariana Tobin 2017 ldquoFacebook Enabled Adverstisers toReach Jew Hatersrdquo httpswwwpropublicaorgarticlefacebook-enabled-advertisers-
to-reach-jew-haters
bull Dagger[Polling2016] Doug Rivers (Nov 11 2016) ldquoFirst Thoughts on Polling Problems in the 2016 USElectionsrdquo httpstodayyougovcomnews20161111first-thoughts-polling-problems-
2016-us-elections
bull dagger[EventData] Wei Wang Ryan Kennedy David Lazer Naren Ramakrishnan 2016 ldquoGrowing Painsfor Global Monitoring of Societal Eventsrdquo Science 3536307 pp 1502ndash1503 httpsdoiorg101126scienceaaf6758
bull Dagger[GoogleBooks] Eitan Adam Pechenick Christopher M Danforth and Peter Sheridan Dodds 2015ldquoCharacterizing the Google Books Corpus Strong Limits to Inferences of Socio-Cultural and LinguisticEvolutionrdquo PLoS One httpsdoiorg101371journalpone0137041
bull Dagger[OkCupid] Michael Zimmer 2016 ldquoOkCupid Study Reveals the Perils of Big-Data Sciencerdquo Wiredhttpswwwwiredcom201605okcupid-study-reveals-perils-big-data-science May 14
bull dagger[CriticalQuestions] danah boyd and Kate Crawford 2011 ldquoCritical Questions for Big Data Provo-cations for a Cultural Technological and Scholarly Phenomenonrdquo Information Communication ampSociety 15(5) 662ndash79 httpdxdoiorg1010801369118X2012678878
bull Dagger[RacistBot] Daniel Victor 2016 ldquoMicrosoft created a Twitter bot to learn from users It quicklybecame a racist jerkrdquo New York Times httpswwwnytimescom20160325technology
microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk
html
bull MMDS 122-123 on Bonferroni
bull Also BDSS-Census
Research Design and Measurement
Overviews
bull BitByBit
bull Dagger[ResearchMethodsKB] William M Trochim 2006 The Research Methods Knowledge Basehttpwwwsocialresearchmethodsnetkb
bull [7Rules] Glenn Firebaugh 2008 Seven Rules for Social Research Princeton University Press
Measurement Reliability and Validity
bull ResearchMethodsKB ldquoMeasurementrdquo
bull 7Rules Ch 3 ldquoBuild Reality Checks into Your Researchrdquo
bull Reliability see also FightinWords
bull Validity see also Quinn10-Topics Monroe-5Vs
Indirect Unobtrusive Nonreactive Measures Data Exhaust
15
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
Course Schedule 2018
January 11
bull Introductions
bull Syllabus (What this course is and isnrsquot How this course (hopefully) works)
bull Further reference
10RulesforData SoftwareCarpentry Lessons
Section ldquoGeneral Resources for Python and Rrdquo
Recommended Python via Anaconda (httpswwwanacondacom) R (httpswwwr-projectorg) amp RStudio (httpswwwrstudiocom) Account on ICS-ACI (httpsicspsuedu) Git (httpsgit-scmcom)
bull Exercise 1 Team Updates 118 Due 125 BitByBit Exercise 26 (a-g) Teams
TeamFrancisco Sara Claire Rosemary Fangcao
TeamFreelin Brittany Arif So Young Xiaoran
TeamKankane Shipi Steve Lulu Omer
January 18
bull Readings (send list of confusing terms concepts by 700am Jan 17 Wednesday)
BitByBit Ch 1 amp 2
CompSocSci Monroe-No Monroe-5Vs
One article from a different discipline listed in the ldquoMultidisciplinary Perspectivesrdquosection excluding Business-BigData
bull Further reference Section ldquoBig Data amp Social Data Analyticsrdquo
bull Exercise 1 Team Updates
January 25
bull Readings (send list of confusing terms concepts by 700 am Jan 24 Wednesday)
GoogleFlu GoogleBooks EmbeddingsBias MachineBias RacistBot BDSS-Census
ResearchMethodsKB ldquoMeasurementrdquo Quinn-Topics
bull Further reference
Section ldquoiexclCuidadordquo
Section ldquoMeasurement Reliability and Validityrdquo
bull Exercise 1 Due Teams Report
4
bull Determine ldquodiscussion leadrdquo dates
bull Exercise 2 Due 28 Wikipedia Google Trends exercise Teams
TeamKelling Claire Brittany Arif
TeamYalcin Omer Lulu Fangcao
TeamPang Rosemary Shipi Sara
TeamSun Xiaoran Steve So Young
February 1
bull Bing Pan (RPTM) ldquoBig Data and Forecasting in Tourismrdquo
bull Readings (send list of confusing terms concepts by 700 am Jan 31 Wednesday)
Bing Pan ldquoIdentifying the Next Non-Stop Flying Market with a Big Data Approachrdquohttpsdoiorg101016jtourman201712008 ldquoGoogle Trends and Tourist Ar-rivals Emerging Biases and Proposed Correctionsrdquo httpsdoiorg101016j
tourman201710014 (See also ldquoForecasting Destination Weekly Hotel Occupancywith Big Datardquo httpjournalssagepubcomdoiabs1011770047287516669050ldquoForecasting tourism demand with composite search indexrdquo httpsdoiorg10
1016jtourman201607005)
UnobtrusiveMeasures
Multivariate-R Chapter 1 (Donrsquot get bogged down when the math starts) LatentVari-ables Chapter 1 (Skim - donrsquot get bogged down in the math) NetflixPrize
ResearchMethodsKB ldquoSamplingrdquo NRCReport Chapter 8 ldquoSampling and MassiveDatardquo BitByBit Chapter 3 ldquoAsking Questionsrdquo
bull Further reference
Section ldquoIndirect Unobtrusive Nonreactive Measures Data Exhaustrdquo
Section ldquoMultiple Measures Latent Variable Measurementrdquo
Section ldquoSampling and Survey Designrdquo
Section ldquoOpen data APIs linked data rdquo
Section ldquoSpace and Timerdquo (readings on Time)
February 8
bull Clio Andris (GEOG) ldquoWhat AirBNB and Yelp can teach us about human behavior incitiesrdquo
bull Readings (send list of confusing terms concepts by 700 am Feb 7 Wednesday)
Clio Andris ldquoUsing Yelp to Find Romance in the City A Case of Restaurants in FourCitiesrdquo httpswwwdropboxcomsuink0zcklwcpo5gYelp_Restaurantspdfdl=
0 ldquoHidden Style in the City An Analysis of Geolocated Airbnb Rental Images in TenMajor Citiesrdquo httpswwwdropboxcomst7y7f6880m4ty7vAirBNB_Analysispdf
dl=0
5
InfoRetrieval Chs 1 2 6 (you may prefer the slides from their classes) My primaryhope here is that you understand the ldquovector space modelrdquo and ldquocosine similarityrdquo(from Chapter 6) My secondary hope is that you understand the basics of Booleaninformation retrieval (and notions like ldquoindexrdquo ldquoinverted indexrdquo and ldquopostingsrdquo) Mytertiary hope is that you are exposed to some basic concepts of text analytics NLP(including ldquotokenizationrdquo ldquonormalizationrdquo ldquostemmingrdquoldquolemmatizationrdquo ldquostop wordsrdquoldquotf-idfrdquo)
FightinWords (FW was used by Jurafsky in httpfirstmondayorgojsindexphp
fmarticleview49443863 and a best-selling book The Language of Food for similarapplications to Andris based on Yelp reviews and now appears in his textbook NLP)
bull Further reference
Section ldquoSpace and Timerdquo
Section ldquoWeb Scrapingrdquo
Section ldquoData Representations Data Mappingsrdquo
Section ldquoFeature selection feature extraction feature engineering rdquo
Section ldquoDatabases and data managementrdquo (esp SQL)
Section ldquoLanguage Text Speech Audiordquo
bull Exercise 2 Due Teams Report
February 15 (Postponed)
bull Conrad Tucker (IE) ldquoCybersecurity Policies and Their Impact on Dynamic Data DrivenApplication Systemsrdquo
bull Readings (send list of confusing terms concepts by 700 am Feb 14 Wednesday)
Conrad Tucker ldquoCybersecurity Policies and Their Impact on Dynamic Data DrivenApplication Systemsrdquo httpieeexploreieeeorgdocument8064151 See alsoldquoGenerative Adversarial Networks for Increasing the Veracity of Big Datardquo http
ieeexploreieeeorgdocument8258219
February 22
bull David Reitter (IST) ldquoComputational Psycholinguisticsrdquo
bull Readings (two weeks worth)
David Reitter ldquoAlignment in Web-Based Dialogue Studies in Big Data ComputationalPsycholinguisticsrdquo (Based on data from the Cancer Survivors Network and Reddit)httpwwwdavid-reittercompubreitter2017alignmentpdf
Similarity
PatternRecognition Ch2 ldquoRepresentationrdquo
6
MMDS Chapter 3 ldquoFinding Similar Itemsrdquo (to understand ldquominhashingrdquo and ldquolocalitysensitive hashingrdquo yoursquoll need to understand ldquohashingrdquo ndash Section 132 also discussedin InfoRetrieval Chapter 3)
bull Further reference
Section ldquoSimilarity Distance Association rdquo
Section ldquoDerived Data Representations rdquo
Section ldquoRecord Linkage Entity Resolution rdquo
Section ldquoClustering Hashing Compression rdquo
March 1
bull Reading
BitByBit Ch 5 (ldquoCreating Mass Collaborationrdquo)
Crowdsourcing ldquoIntroductionrdquo and Ch1 ldquoConcepts Theories and Cases of Crowd-sourcingrdquo
HumanComputation Chs 125
bull Further reference
Section ldquoCrowdsourcing Human Computation Citizen Science Web Experimentsrdquo
Section ldquoMaking Up Data (smoothing convolution kernels )rdquo
Section ldquoVision image videordquo
bull Semester Projects Must be Approved Before Spring Break
March 8 - Spring Break
March 15
bull Daniel DellaPosta (SOC) ldquoNetworks and the Mid-20th Century American Mafiardquo
bull Reading
Dan DellaPosta ldquoNetwork Closure in the Mid-20th Century American Mafiardquo (SocialNetworks 2017) httpswww-sciencedirect-comezaccesslibrariespsuedusciencearticlepiiS037887331630199X ldquoBetween Clique and Corporation Boundary-Spanningin Solidary Groupsrdquo (RampR in American Journal of Sociology in the Box folder)
Networks Chs 1 2
MMDS Ch 5 ldquoLink Analysisrdquo
MarkovVisually
bull Further references
Section ldquoNetworks and Graphsrdquo
7
Section ldquoSimulation resampling Markov Chains rdquo
DeepLearning (different kind of network)
bull 25 Project Review
March 22
bull Reading
DeepLearning Ch2 ldquoLinear Algebrardquo
Shalizi-ADA Chs 16-18 (ldquoPrincipal Components Analysisrdquo ldquoFactor Modelsrdquo ldquoNonlin-ear Dimension Reductionrdquo)
bull Further reference
Section ldquoDimensionality reduction decomposition rdquo
Section ldquoMultiple measures latent variable measurementrdquo
Section ldquoLinear algebra matrix computationsrdquo
March 29
bull Naomi Altman (STAT) ldquoGeneralizing PCArdquo
bull Reading
Altman ldquoGeneralizing PCArdquo 2015 slides httppersonalpsuedunsa1AltmanWebpagePCATorontopdf
MapReduceIntuition (7 minute video providing ldquodivide and conquerrdquo intuition to MapRe-duce)
TidySAC-Video Hadley Wickham on the tidyverse ldquosplit-apply-combinerdquo with dplyrand tidy data (1 hour video) (More detail but perhaps less intuition in book formdiscussion of ldquotidyingrdquo data Chapter 5 of TidyData-R)
MMDS Ch 2 (Read the Chapter 2 in the new ldquoBETArdquo version of the book whichalso touches on Spark and Tensorflow in section 24 ldquoExtensions to MapReducerdquo httpistanfordedu~ullmanmmdsnhtml)
bull Further reference
Section ldquoNonlinear dimension reduction manifold learningrdquo
Section ldquoDatabases and data managementrdquo (esp NoSQL)
Section ldquoTheoretically-structured approaches to data wranglingrdquo
Section ldquoParallelism MapReduce Split-Apply-Combinerdquo
Section ldquoFunctional programmingrdquo
Section ldquoCutting and bleeding edge rdquo
bull 50 Project Review
bull Exercise 3 (Individual) Due 45
8
April 5
bull Prasenjit Mitra (IST) ldquoClassification of Tweets from Disaster Scenariosrdquo
bull Reading
Prasenjit Mitra Rudra et al rdquoSummarizing Situational and Topical InformationDuring Crisesrdquo httpsarxivorgpdf161001561pdf Imran Mitra amp CastillordquoTwitter as a Lifeline Human-annotated Twitter Corpora for NLP of Crisis-relatedMessagesrdquo httpmimranmepapersimran_prasenjit_carlos_lrec2016pdf
NLP (Jurafsky and Martin) Chapters 15 and 16 ldquoVector Semanticsrdquo and ldquoSemanticswith Dense Vectorsrdquo
bull Further reference
Section ldquoLanguage text speech audiordquo
GloVe word2vec word2vecExplained t-SNE
Section ldquoMobile devices distributed sensors rdquo
Section ldquoCrowdsourcing human computation citizen science rdquo
Section ldquoScaling iteration streaming data online algorithmsrdquo
bull Exercise 3 Due
April 12
bull Reading
BitByBit Ch 4 ldquoRunning Experimentsrdquo review Ch 2 sections ldquoNatural experimentsin observable datardquo examples
CausalInference Chs 1-2
bull Further reference
Section ldquoExperimental and observational designs for causal inferencerdquo
Section ldquoCrowdsourcing web experimentsrdquo
Section ldquoHuman subjects rdquo
bull 75 Project Review
April 19
bull Reading
BitByBit Ch 6 ldquoEthicsrdquo
PSU-ORP ldquoCommon Rule and Other Changesrdquo httpswwwresearchpsuedu
irbcommonrulechanges
DataPrivacy
9
Reproducibility
Refresh on MachineBias
bull Further reference
Section ldquoEthics and Scientific Responsibility in Big Social Datardquo
Section ldquoiexclCuidadordquo
April 26 - LAST CLASS MEETING
bull Team Project Presentations
May 3 - Team Projects Due
10
Past Visiting Speakers in SoDA 501 and 502
SoDA 502 (Fall 2017)
bull Chris Zorn (PLSC)
bull Diane Felmlee (SOC)
bull Alan MacEachren (GEOG)
bull Eric Plutzer (PLSC)
bull James LeBreton (PSYCH)
bull Sesa Slavkovic (STAT)
bull Guido Cervone (GEOG)
bull Ashton Verdery (SOC)
bull Maggie Niu (STAT)
bull Reka Albert (PHYS)
SoDA 501 (Spring 2017)
bull Tim Brick (HDFS) ldquoTowards real-time monitoring and intervention using wearable technologyrdquo
bull Aylin Caliskan (Princeton) ldquoA Story of Discrimination and Unfairness Bias in Word Embeddingsrdquo
bull Jay Yonamine (Google - IGERT alum) ldquoData Science in Industryrdquo
bull Johnathan Rush (Illinois) ldquoGeospatial Data Science Workshoprdquo
bull Rick Gilmore (PSYCH) ldquoToward a more reproducible and robust science of human behaviorrdquo
bull Glenn Firebaugh (SOC) ldquoMeasuring Inequality and Segregation with US Census Datardquo
bull Charles Twardy (Sotera) ldquoData Science for Search and Rescuerdquo
bull Anna Smith (Ohio State) ldquoA Hierarchical Model for Network Data in a Latent Hyperbolic Spacerdquo
bull Rebecca Passonneau (CSE) ldquoOmnigraph Rich Feature Representation for Graph Kernel Learningrdquo
bull Alex Klippel (GEOG) ldquoVirtual Reality for Immersive Analyticsrdquo
bull Murali Haran (STAT) ldquoA Computationally Efficient Projection-based Approach for Spatial General-ized Mixed Modelsrdquo
SoDA 502 (Fall 2016)
bull Clio Andris (GEOG) ldquoIntegrating Social Network Data into GISystemsrdquo
bull Jia Li (STAT) ldquoClustering under the Wasserstein Metricrdquo
bull Rachel Smith (CAS) ldquoStigma Networks Perceptions of Sociogramsrdquo
bull Zita Oravecz (HDFS)
bull Bethany Bray (Methodology Center) ldquoLatent Class and Latent Transition Analysisrdquo
bull Dave Hunter (STAT) ldquoModel Based Clustering of Large Networksrdquo
bull Scott Bennett (PLSC) ldquoABM Model of Insurgencyrdquo
bull David Reitter (IST)
bull Suzanna Linn (PLSC) ldquoMethodological Issues in Automated Text Analysis Application to NewsCoverage of the US Economyrdquo
11
SoDA 501 (Spring 2016)
bull Bruce Desmarais (PLSC) ldquoLearning in the Sunshine Analysis of Local Government Email Corporardquo
bull Timothy Brick (HDFS) ldquoMapping and Manipulating Facial Expressionrdquo
bull Qunying Huang (USC) ldquoSocial Media An Emerging Data Source for Human Mobility Studiesrdquo
bull Lingzhou Zue (STAT) ldquoAn Introduction to High-Dimensional Graphical Modelsrdquo
bull Ashton Verdery (SOC) ldquoSampling from Network Datardquo
bull Sarah Battersby (Tableau) ldquoHelping People See and Understand Spatial Datardquo
bull Alexandra Slavkovic (STAT) ldquoStatistical Privacy with Network Datardquo
bull Lee Giles (IST) ldquoMachine Learning for Scholarly Big Datardquo
12
Readings and References (updated Spring 2018)
We will discuss a relatively small subset of the readings listed here and this will vary based on topics andreadings discussed by visiting speakers and you yourselves The remainder are provided here as curatedreferences for more in-depth investigation in both the theory and practice related to the topic (with thelatter heavily weighted toward resources in Python and R)
dagger Material that is at last check made available through the Penn State library Most journal links shouldwork if you are logged in to a Penn State machine or through the Penn State VPN Some article archivesand most books require additional authentication through webaccess If links are broken start directly froma search via httpswwwlibrariespsuedu Some books require installation of e-readers like AdobeDigital Editions Links to lyndacom can be accessed through httplyndapsuedu
Dagger Material that is at last check legally provided for free In some cases these are the preprint versionsof published material
sect Material for which a legal selection is or will be provided through the class Box folder
Big Data amp Social Data Analytics
Overviews
bull Dagger[CompSocSci] David Lazer Alex Pentland Lada Adamic Sinan Aral Albert-Laszlo BarabasiDevon Brewer Nicholas Christakis Noshir Contractor James Fowler Myron Gutmann Tony JebaraGary King Michael Macy Deb Roy and Marshall Van Alstyne 2009 ldquoComputational Social SciencerdquoScience 323(5915)721-3 + Supp Feb 6 httpsciencesciencemagorgcontent3235915
721full httpsgkingharvardedufilesgkingfilesLazPenAda09pdf
bull Dagger[NRCreport] National Research Council 2013 Frontiers in Massive Data Analysis NationalAcademies Press (Free w registration httpwwwnapeducatalogphprecord_id=18374)Ch1 ldquoIntroductionrdquo Ch2 ldquoMassive Data in Science Technology Commerce National DefenseTelecommunications and other Endeavorsrdquo
bull Dagger[MMDS] Jure Leskovec Anand Rajaraman and Jeff Ullman 2014 Mining of Massive DatasetsCambridge University Press httpwwwmmdsorg (BETA version of Third Edition httpi
stanfordedu~ullmanmmdsnhtml)
bull sect[BitByBit] Matthew J Salganik 2018 (Forthcoming) Bit by Bit Social Research in the DigitalAge Princeton University Press Ch 1 ldquoIntroductionrdquo Ch 2 ldquoObserving Behaviorrdquo
bull DSHandbook-Py Ch 1 ldquoIntroduction Becoming a Unicornrdquo Ch2 ldquoThe Data Science Road Maprdquo
Burt-schtick
bull Dagger[Monroe-5Vs] Burt L Monroe 2013 ldquoThe Five Vs of Big Data Political Science Introductionto the Special Issue on Big Data in Political Sciencerdquo Political Analysis 21(V5) 1ndash9 https
doiorg101017S1047198700014315 (Volume Velocity Variety Vinculation Validity)
bull dagger[Monroe-No] Burt L Monroe Jennifer Pan Margaret E Roberts Maya Sen and Betsy Sin-clair 2015 ldquoNo Formal Theory Causal Inference and Big Data Are Not Contradictory Trendsin Political Sciencerdquo PS Political Science amp Politics 48(1) 71ndash4 httpdxdoiorg101017
S1049096514001760
bull dagger[Quinn-Topics] Kevin M Quinn Burt L Monroe Michael Colaresi Michael H Crespin andDragomir R Radev 2010 ldquoHow to Analyze Political Attention with Minimal Assumptions andCostsrdquo American Journal of Political Science 54(1) 209ndash28 httponlinelibrarywileycom
13
doi101111j1540-5907200900427xfull (esp topic modeling as measurement approach tovalidation)
bull dagger[FightinWords] Burt L Monroe Michael Colaresi and Kevin M Quinn 2008 ldquoFightinrsquo WordsLexical Feature Selection and Evaluation for Identifying the Content of Political Conflictrdquo PoliticalAnalysis 16(4) 372-403 httpsdoiorg101093panmpn018 (esp the impact of samplingvariance and regularization through priors)
bull Dagger[BDSS-Census] Big Data Social Science PSU Team 2012 ldquoA Closer Look at the Kaggle CensusDatardquo httpsburtmonroegithubioBDSSKaggleCensus2012 (esp the relevance of the socialprocesses by which data come to exist as data)
Multidisciplinary Perspectives
bull dagger[Business-BigData] Andrew McAfee and Erik Brynjolfsson 2012 ldquoBig Data The ManagementRevolutionrdquo Harvard Business Review 90(10) 61ndash8 October httpshbrorg201210big-
data-the-management-revolution Thomas H Davenport and DJ Patel 2012 ldquoData ScientistThe Sexiest Job of the 21st Centuryrdquo Harvard Business Review 90(10)70-6 October https
hbrorg201210data-scientist-the-sexiest-job-of-the-21st-century
bull dagger[InfoSci-BigData] CL Philip Chen and Chun-Yang Zhang 2014 ldquoData-intensive ApplicationsChallenges Techniques and Technologies A Survey on Big Datardquo Information Sciences 275 314-47httpsdoiorg101016jins201401015
bull dagger[Informatics-BigData] Vasant G Honavar 2014 ldquoThe Promise and Potential of Big Data ACase for Discovery Informaticsrdquo Review of Policy Research 31(4) 326-330 httpsdoiorg10
1111ropr12080
bull Dagger[Stats-BigData] Beate Franke Jean-Francois Ribana Roscher Annie Lee Cathal Smyth ArminHatefi Fuqi Chen Einat Gil Alexander Schwing Alessandro Selvitella Michael M Hoffman RogerGrosse Dietrich Hendricks and Nancy Reid 2016 ldquoStatistical Inference Learning and Models in BigDatardquo International Statistical Review 84(3) 371-89 httponlinelibrarywileycomdoi101111insr12176full
bull dagger[Econ-BigData] Hal R Varian 2013 ldquoBig Data New Tricks for Econometricsrdquo Journal ofEconomic Perspectives 28(2) 3-28 httpsdoiorg101257jep2823
bull dagger[GeoViz-BigData] Alan MacEachren 2017 ldquoLeveraging Big (Geo) Data with (Geo) Visual An-alytics Place as the Next Frontierrdquo In Chenghu Zhou Fenzhen Su Francis Harvey and Jun Xueds Spatial Data Handling in the Big Data Era pp 139-155 Springer httpslink-springer-
comezaccesslibrariespsueduchapter101007978-981-10-4424-3_10
bull dagger[Soc-BigData] David Lazer and Jason Radford 2017 ldquoData ex Machina Introduction to BigDatardquo Annual Review of Sociology 43 19-39 httpsdoiorg101146annurev-soc-060116-
053457
bull sect[Politics-BigData] Keith T Poole L Jason Anasastopolous and James E Monagan III Forthcom-ing ldquoThe lsquoBig Datarsquo Revolution in Political Campaigning and Governancerdquo Oxford Bibliographies inPolitical Science
iexclCuidado Traps Biases Problems Pains Perils
bull dagger[GoogleFlu] David Lazer Ryan Kennedy Gary King and Alessandro Vespignani 2014 ldquoTheParable of Google Flu Traps in Big Data Analysisrdquo Science (343) 14 March httpgking
harvardedufilesgkingfiles0314policyforumffpdf
bull Dagger[MachineBias] ProPublica ldquoMachine Bias Investigating Algorithmic Injusticerdquo Series https
wwwpropublicaorgseriesmachine-bias See especially
14
Julia Angwin Jeff Larson Surya Mattu and Lauren Kirchner 2016 ldquoMachine Biasrdquo ProPub-lica May 23 httpswwwpropublicaorgarticlemachine-bias-risk-assessments-in-
criminal-sentencing
Julia Angwin Madeleine Varner and Ariana Tobin 2017 ldquoFacebook Enabled Adverstisers toReach Jew Hatersrdquo httpswwwpropublicaorgarticlefacebook-enabled-advertisers-
to-reach-jew-haters
bull Dagger[Polling2016] Doug Rivers (Nov 11 2016) ldquoFirst Thoughts on Polling Problems in the 2016 USElectionsrdquo httpstodayyougovcomnews20161111first-thoughts-polling-problems-
2016-us-elections
bull dagger[EventData] Wei Wang Ryan Kennedy David Lazer Naren Ramakrishnan 2016 ldquoGrowing Painsfor Global Monitoring of Societal Eventsrdquo Science 3536307 pp 1502ndash1503 httpsdoiorg101126scienceaaf6758
bull Dagger[GoogleBooks] Eitan Adam Pechenick Christopher M Danforth and Peter Sheridan Dodds 2015ldquoCharacterizing the Google Books Corpus Strong Limits to Inferences of Socio-Cultural and LinguisticEvolutionrdquo PLoS One httpsdoiorg101371journalpone0137041
bull Dagger[OkCupid] Michael Zimmer 2016 ldquoOkCupid Study Reveals the Perils of Big-Data Sciencerdquo Wiredhttpswwwwiredcom201605okcupid-study-reveals-perils-big-data-science May 14
bull dagger[CriticalQuestions] danah boyd and Kate Crawford 2011 ldquoCritical Questions for Big Data Provo-cations for a Cultural Technological and Scholarly Phenomenonrdquo Information Communication ampSociety 15(5) 662ndash79 httpdxdoiorg1010801369118X2012678878
bull Dagger[RacistBot] Daniel Victor 2016 ldquoMicrosoft created a Twitter bot to learn from users It quicklybecame a racist jerkrdquo New York Times httpswwwnytimescom20160325technology
microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk
html
bull MMDS 122-123 on Bonferroni
bull Also BDSS-Census
Research Design and Measurement
Overviews
bull BitByBit
bull Dagger[ResearchMethodsKB] William M Trochim 2006 The Research Methods Knowledge Basehttpwwwsocialresearchmethodsnetkb
bull [7Rules] Glenn Firebaugh 2008 Seven Rules for Social Research Princeton University Press
Measurement Reliability and Validity
bull ResearchMethodsKB ldquoMeasurementrdquo
bull 7Rules Ch 3 ldquoBuild Reality Checks into Your Researchrdquo
bull Reliability see also FightinWords
bull Validity see also Quinn10-Topics Monroe-5Vs
Indirect Unobtrusive Nonreactive Measures Data Exhaust
15
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
bull Determine ldquodiscussion leadrdquo dates
bull Exercise 2 Due 28 Wikipedia Google Trends exercise Teams
TeamKelling Claire Brittany Arif
TeamYalcin Omer Lulu Fangcao
TeamPang Rosemary Shipi Sara
TeamSun Xiaoran Steve So Young
February 1
bull Bing Pan (RPTM) ldquoBig Data and Forecasting in Tourismrdquo
bull Readings (send list of confusing terms concepts by 700 am Jan 31 Wednesday)
Bing Pan ldquoIdentifying the Next Non-Stop Flying Market with a Big Data Approachrdquohttpsdoiorg101016jtourman201712008 ldquoGoogle Trends and Tourist Ar-rivals Emerging Biases and Proposed Correctionsrdquo httpsdoiorg101016j
tourman201710014 (See also ldquoForecasting Destination Weekly Hotel Occupancywith Big Datardquo httpjournalssagepubcomdoiabs1011770047287516669050ldquoForecasting tourism demand with composite search indexrdquo httpsdoiorg10
1016jtourman201607005)
UnobtrusiveMeasures
Multivariate-R Chapter 1 (Donrsquot get bogged down when the math starts) LatentVari-ables Chapter 1 (Skim - donrsquot get bogged down in the math) NetflixPrize
ResearchMethodsKB ldquoSamplingrdquo NRCReport Chapter 8 ldquoSampling and MassiveDatardquo BitByBit Chapter 3 ldquoAsking Questionsrdquo
bull Further reference
Section ldquoIndirect Unobtrusive Nonreactive Measures Data Exhaustrdquo
Section ldquoMultiple Measures Latent Variable Measurementrdquo
Section ldquoSampling and Survey Designrdquo
Section ldquoOpen data APIs linked data rdquo
Section ldquoSpace and Timerdquo (readings on Time)
February 8
bull Clio Andris (GEOG) ldquoWhat AirBNB and Yelp can teach us about human behavior incitiesrdquo
bull Readings (send list of confusing terms concepts by 700 am Feb 7 Wednesday)
Clio Andris ldquoUsing Yelp to Find Romance in the City A Case of Restaurants in FourCitiesrdquo httpswwwdropboxcomsuink0zcklwcpo5gYelp_Restaurantspdfdl=
0 ldquoHidden Style in the City An Analysis of Geolocated Airbnb Rental Images in TenMajor Citiesrdquo httpswwwdropboxcomst7y7f6880m4ty7vAirBNB_Analysispdf
dl=0
5
InfoRetrieval Chs 1 2 6 (you may prefer the slides from their classes) My primaryhope here is that you understand the ldquovector space modelrdquo and ldquocosine similarityrdquo(from Chapter 6) My secondary hope is that you understand the basics of Booleaninformation retrieval (and notions like ldquoindexrdquo ldquoinverted indexrdquo and ldquopostingsrdquo) Mytertiary hope is that you are exposed to some basic concepts of text analytics NLP(including ldquotokenizationrdquo ldquonormalizationrdquo ldquostemmingrdquoldquolemmatizationrdquo ldquostop wordsrdquoldquotf-idfrdquo)
FightinWords (FW was used by Jurafsky in httpfirstmondayorgojsindexphp
fmarticleview49443863 and a best-selling book The Language of Food for similarapplications to Andris based on Yelp reviews and now appears in his textbook NLP)
bull Further reference
Section ldquoSpace and Timerdquo
Section ldquoWeb Scrapingrdquo
Section ldquoData Representations Data Mappingsrdquo
Section ldquoFeature selection feature extraction feature engineering rdquo
Section ldquoDatabases and data managementrdquo (esp SQL)
Section ldquoLanguage Text Speech Audiordquo
bull Exercise 2 Due Teams Report
February 15 (Postponed)
bull Conrad Tucker (IE) ldquoCybersecurity Policies and Their Impact on Dynamic Data DrivenApplication Systemsrdquo
bull Readings (send list of confusing terms concepts by 700 am Feb 14 Wednesday)
Conrad Tucker ldquoCybersecurity Policies and Their Impact on Dynamic Data DrivenApplication Systemsrdquo httpieeexploreieeeorgdocument8064151 See alsoldquoGenerative Adversarial Networks for Increasing the Veracity of Big Datardquo http
ieeexploreieeeorgdocument8258219
February 22
bull David Reitter (IST) ldquoComputational Psycholinguisticsrdquo
bull Readings (two weeks worth)
David Reitter ldquoAlignment in Web-Based Dialogue Studies in Big Data ComputationalPsycholinguisticsrdquo (Based on data from the Cancer Survivors Network and Reddit)httpwwwdavid-reittercompubreitter2017alignmentpdf
Similarity
PatternRecognition Ch2 ldquoRepresentationrdquo
6
MMDS Chapter 3 ldquoFinding Similar Itemsrdquo (to understand ldquominhashingrdquo and ldquolocalitysensitive hashingrdquo yoursquoll need to understand ldquohashingrdquo ndash Section 132 also discussedin InfoRetrieval Chapter 3)
bull Further reference
Section ldquoSimilarity Distance Association rdquo
Section ldquoDerived Data Representations rdquo
Section ldquoRecord Linkage Entity Resolution rdquo
Section ldquoClustering Hashing Compression rdquo
March 1
bull Reading
BitByBit Ch 5 (ldquoCreating Mass Collaborationrdquo)
Crowdsourcing ldquoIntroductionrdquo and Ch1 ldquoConcepts Theories and Cases of Crowd-sourcingrdquo
HumanComputation Chs 125
bull Further reference
Section ldquoCrowdsourcing Human Computation Citizen Science Web Experimentsrdquo
Section ldquoMaking Up Data (smoothing convolution kernels )rdquo
Section ldquoVision image videordquo
bull Semester Projects Must be Approved Before Spring Break
March 8 - Spring Break
March 15
bull Daniel DellaPosta (SOC) ldquoNetworks and the Mid-20th Century American Mafiardquo
bull Reading
Dan DellaPosta ldquoNetwork Closure in the Mid-20th Century American Mafiardquo (SocialNetworks 2017) httpswww-sciencedirect-comezaccesslibrariespsuedusciencearticlepiiS037887331630199X ldquoBetween Clique and Corporation Boundary-Spanningin Solidary Groupsrdquo (RampR in American Journal of Sociology in the Box folder)
Networks Chs 1 2
MMDS Ch 5 ldquoLink Analysisrdquo
MarkovVisually
bull Further references
Section ldquoNetworks and Graphsrdquo
7
Section ldquoSimulation resampling Markov Chains rdquo
DeepLearning (different kind of network)
bull 25 Project Review
March 22
bull Reading
DeepLearning Ch2 ldquoLinear Algebrardquo
Shalizi-ADA Chs 16-18 (ldquoPrincipal Components Analysisrdquo ldquoFactor Modelsrdquo ldquoNonlin-ear Dimension Reductionrdquo)
bull Further reference
Section ldquoDimensionality reduction decomposition rdquo
Section ldquoMultiple measures latent variable measurementrdquo
Section ldquoLinear algebra matrix computationsrdquo
March 29
bull Naomi Altman (STAT) ldquoGeneralizing PCArdquo
bull Reading
Altman ldquoGeneralizing PCArdquo 2015 slides httppersonalpsuedunsa1AltmanWebpagePCATorontopdf
MapReduceIntuition (7 minute video providing ldquodivide and conquerrdquo intuition to MapRe-duce)
TidySAC-Video Hadley Wickham on the tidyverse ldquosplit-apply-combinerdquo with dplyrand tidy data (1 hour video) (More detail but perhaps less intuition in book formdiscussion of ldquotidyingrdquo data Chapter 5 of TidyData-R)
MMDS Ch 2 (Read the Chapter 2 in the new ldquoBETArdquo version of the book whichalso touches on Spark and Tensorflow in section 24 ldquoExtensions to MapReducerdquo httpistanfordedu~ullmanmmdsnhtml)
bull Further reference
Section ldquoNonlinear dimension reduction manifold learningrdquo
Section ldquoDatabases and data managementrdquo (esp NoSQL)
Section ldquoTheoretically-structured approaches to data wranglingrdquo
Section ldquoParallelism MapReduce Split-Apply-Combinerdquo
Section ldquoFunctional programmingrdquo
Section ldquoCutting and bleeding edge rdquo
bull 50 Project Review
bull Exercise 3 (Individual) Due 45
8
April 5
bull Prasenjit Mitra (IST) ldquoClassification of Tweets from Disaster Scenariosrdquo
bull Reading
Prasenjit Mitra Rudra et al rdquoSummarizing Situational and Topical InformationDuring Crisesrdquo httpsarxivorgpdf161001561pdf Imran Mitra amp CastillordquoTwitter as a Lifeline Human-annotated Twitter Corpora for NLP of Crisis-relatedMessagesrdquo httpmimranmepapersimran_prasenjit_carlos_lrec2016pdf
NLP (Jurafsky and Martin) Chapters 15 and 16 ldquoVector Semanticsrdquo and ldquoSemanticswith Dense Vectorsrdquo
bull Further reference
Section ldquoLanguage text speech audiordquo
GloVe word2vec word2vecExplained t-SNE
Section ldquoMobile devices distributed sensors rdquo
Section ldquoCrowdsourcing human computation citizen science rdquo
Section ldquoScaling iteration streaming data online algorithmsrdquo
bull Exercise 3 Due
April 12
bull Reading
BitByBit Ch 4 ldquoRunning Experimentsrdquo review Ch 2 sections ldquoNatural experimentsin observable datardquo examples
CausalInference Chs 1-2
bull Further reference
Section ldquoExperimental and observational designs for causal inferencerdquo
Section ldquoCrowdsourcing web experimentsrdquo
Section ldquoHuman subjects rdquo
bull 75 Project Review
April 19
bull Reading
BitByBit Ch 6 ldquoEthicsrdquo
PSU-ORP ldquoCommon Rule and Other Changesrdquo httpswwwresearchpsuedu
irbcommonrulechanges
DataPrivacy
9
Reproducibility
Refresh on MachineBias
bull Further reference
Section ldquoEthics and Scientific Responsibility in Big Social Datardquo
Section ldquoiexclCuidadordquo
April 26 - LAST CLASS MEETING
bull Team Project Presentations
May 3 - Team Projects Due
10
Past Visiting Speakers in SoDA 501 and 502
SoDA 502 (Fall 2017)
bull Chris Zorn (PLSC)
bull Diane Felmlee (SOC)
bull Alan MacEachren (GEOG)
bull Eric Plutzer (PLSC)
bull James LeBreton (PSYCH)
bull Sesa Slavkovic (STAT)
bull Guido Cervone (GEOG)
bull Ashton Verdery (SOC)
bull Maggie Niu (STAT)
bull Reka Albert (PHYS)
SoDA 501 (Spring 2017)
bull Tim Brick (HDFS) ldquoTowards real-time monitoring and intervention using wearable technologyrdquo
bull Aylin Caliskan (Princeton) ldquoA Story of Discrimination and Unfairness Bias in Word Embeddingsrdquo
bull Jay Yonamine (Google - IGERT alum) ldquoData Science in Industryrdquo
bull Johnathan Rush (Illinois) ldquoGeospatial Data Science Workshoprdquo
bull Rick Gilmore (PSYCH) ldquoToward a more reproducible and robust science of human behaviorrdquo
bull Glenn Firebaugh (SOC) ldquoMeasuring Inequality and Segregation with US Census Datardquo
bull Charles Twardy (Sotera) ldquoData Science for Search and Rescuerdquo
bull Anna Smith (Ohio State) ldquoA Hierarchical Model for Network Data in a Latent Hyperbolic Spacerdquo
bull Rebecca Passonneau (CSE) ldquoOmnigraph Rich Feature Representation for Graph Kernel Learningrdquo
bull Alex Klippel (GEOG) ldquoVirtual Reality for Immersive Analyticsrdquo
bull Murali Haran (STAT) ldquoA Computationally Efficient Projection-based Approach for Spatial General-ized Mixed Modelsrdquo
SoDA 502 (Fall 2016)
bull Clio Andris (GEOG) ldquoIntegrating Social Network Data into GISystemsrdquo
bull Jia Li (STAT) ldquoClustering under the Wasserstein Metricrdquo
bull Rachel Smith (CAS) ldquoStigma Networks Perceptions of Sociogramsrdquo
bull Zita Oravecz (HDFS)
bull Bethany Bray (Methodology Center) ldquoLatent Class and Latent Transition Analysisrdquo
bull Dave Hunter (STAT) ldquoModel Based Clustering of Large Networksrdquo
bull Scott Bennett (PLSC) ldquoABM Model of Insurgencyrdquo
bull David Reitter (IST)
bull Suzanna Linn (PLSC) ldquoMethodological Issues in Automated Text Analysis Application to NewsCoverage of the US Economyrdquo
11
SoDA 501 (Spring 2016)
bull Bruce Desmarais (PLSC) ldquoLearning in the Sunshine Analysis of Local Government Email Corporardquo
bull Timothy Brick (HDFS) ldquoMapping and Manipulating Facial Expressionrdquo
bull Qunying Huang (USC) ldquoSocial Media An Emerging Data Source for Human Mobility Studiesrdquo
bull Lingzhou Zue (STAT) ldquoAn Introduction to High-Dimensional Graphical Modelsrdquo
bull Ashton Verdery (SOC) ldquoSampling from Network Datardquo
bull Sarah Battersby (Tableau) ldquoHelping People See and Understand Spatial Datardquo
bull Alexandra Slavkovic (STAT) ldquoStatistical Privacy with Network Datardquo
bull Lee Giles (IST) ldquoMachine Learning for Scholarly Big Datardquo
12
Readings and References (updated Spring 2018)
We will discuss a relatively small subset of the readings listed here and this will vary based on topics andreadings discussed by visiting speakers and you yourselves The remainder are provided here as curatedreferences for more in-depth investigation in both the theory and practice related to the topic (with thelatter heavily weighted toward resources in Python and R)
dagger Material that is at last check made available through the Penn State library Most journal links shouldwork if you are logged in to a Penn State machine or through the Penn State VPN Some article archivesand most books require additional authentication through webaccess If links are broken start directly froma search via httpswwwlibrariespsuedu Some books require installation of e-readers like AdobeDigital Editions Links to lyndacom can be accessed through httplyndapsuedu
Dagger Material that is at last check legally provided for free In some cases these are the preprint versionsof published material
sect Material for which a legal selection is or will be provided through the class Box folder
Big Data amp Social Data Analytics
Overviews
bull Dagger[CompSocSci] David Lazer Alex Pentland Lada Adamic Sinan Aral Albert-Laszlo BarabasiDevon Brewer Nicholas Christakis Noshir Contractor James Fowler Myron Gutmann Tony JebaraGary King Michael Macy Deb Roy and Marshall Van Alstyne 2009 ldquoComputational Social SciencerdquoScience 323(5915)721-3 + Supp Feb 6 httpsciencesciencemagorgcontent3235915
721full httpsgkingharvardedufilesgkingfilesLazPenAda09pdf
bull Dagger[NRCreport] National Research Council 2013 Frontiers in Massive Data Analysis NationalAcademies Press (Free w registration httpwwwnapeducatalogphprecord_id=18374)Ch1 ldquoIntroductionrdquo Ch2 ldquoMassive Data in Science Technology Commerce National DefenseTelecommunications and other Endeavorsrdquo
bull Dagger[MMDS] Jure Leskovec Anand Rajaraman and Jeff Ullman 2014 Mining of Massive DatasetsCambridge University Press httpwwwmmdsorg (BETA version of Third Edition httpi
stanfordedu~ullmanmmdsnhtml)
bull sect[BitByBit] Matthew J Salganik 2018 (Forthcoming) Bit by Bit Social Research in the DigitalAge Princeton University Press Ch 1 ldquoIntroductionrdquo Ch 2 ldquoObserving Behaviorrdquo
bull DSHandbook-Py Ch 1 ldquoIntroduction Becoming a Unicornrdquo Ch2 ldquoThe Data Science Road Maprdquo
Burt-schtick
bull Dagger[Monroe-5Vs] Burt L Monroe 2013 ldquoThe Five Vs of Big Data Political Science Introductionto the Special Issue on Big Data in Political Sciencerdquo Political Analysis 21(V5) 1ndash9 https
doiorg101017S1047198700014315 (Volume Velocity Variety Vinculation Validity)
bull dagger[Monroe-No] Burt L Monroe Jennifer Pan Margaret E Roberts Maya Sen and Betsy Sin-clair 2015 ldquoNo Formal Theory Causal Inference and Big Data Are Not Contradictory Trendsin Political Sciencerdquo PS Political Science amp Politics 48(1) 71ndash4 httpdxdoiorg101017
S1049096514001760
bull dagger[Quinn-Topics] Kevin M Quinn Burt L Monroe Michael Colaresi Michael H Crespin andDragomir R Radev 2010 ldquoHow to Analyze Political Attention with Minimal Assumptions andCostsrdquo American Journal of Political Science 54(1) 209ndash28 httponlinelibrarywileycom
13
doi101111j1540-5907200900427xfull (esp topic modeling as measurement approach tovalidation)
bull dagger[FightinWords] Burt L Monroe Michael Colaresi and Kevin M Quinn 2008 ldquoFightinrsquo WordsLexical Feature Selection and Evaluation for Identifying the Content of Political Conflictrdquo PoliticalAnalysis 16(4) 372-403 httpsdoiorg101093panmpn018 (esp the impact of samplingvariance and regularization through priors)
bull Dagger[BDSS-Census] Big Data Social Science PSU Team 2012 ldquoA Closer Look at the Kaggle CensusDatardquo httpsburtmonroegithubioBDSSKaggleCensus2012 (esp the relevance of the socialprocesses by which data come to exist as data)
Multidisciplinary Perspectives
bull dagger[Business-BigData] Andrew McAfee and Erik Brynjolfsson 2012 ldquoBig Data The ManagementRevolutionrdquo Harvard Business Review 90(10) 61ndash8 October httpshbrorg201210big-
data-the-management-revolution Thomas H Davenport and DJ Patel 2012 ldquoData ScientistThe Sexiest Job of the 21st Centuryrdquo Harvard Business Review 90(10)70-6 October https
hbrorg201210data-scientist-the-sexiest-job-of-the-21st-century
bull dagger[InfoSci-BigData] CL Philip Chen and Chun-Yang Zhang 2014 ldquoData-intensive ApplicationsChallenges Techniques and Technologies A Survey on Big Datardquo Information Sciences 275 314-47httpsdoiorg101016jins201401015
bull dagger[Informatics-BigData] Vasant G Honavar 2014 ldquoThe Promise and Potential of Big Data ACase for Discovery Informaticsrdquo Review of Policy Research 31(4) 326-330 httpsdoiorg10
1111ropr12080
bull Dagger[Stats-BigData] Beate Franke Jean-Francois Ribana Roscher Annie Lee Cathal Smyth ArminHatefi Fuqi Chen Einat Gil Alexander Schwing Alessandro Selvitella Michael M Hoffman RogerGrosse Dietrich Hendricks and Nancy Reid 2016 ldquoStatistical Inference Learning and Models in BigDatardquo International Statistical Review 84(3) 371-89 httponlinelibrarywileycomdoi101111insr12176full
bull dagger[Econ-BigData] Hal R Varian 2013 ldquoBig Data New Tricks for Econometricsrdquo Journal ofEconomic Perspectives 28(2) 3-28 httpsdoiorg101257jep2823
bull dagger[GeoViz-BigData] Alan MacEachren 2017 ldquoLeveraging Big (Geo) Data with (Geo) Visual An-alytics Place as the Next Frontierrdquo In Chenghu Zhou Fenzhen Su Francis Harvey and Jun Xueds Spatial Data Handling in the Big Data Era pp 139-155 Springer httpslink-springer-
comezaccesslibrariespsueduchapter101007978-981-10-4424-3_10
bull dagger[Soc-BigData] David Lazer and Jason Radford 2017 ldquoData ex Machina Introduction to BigDatardquo Annual Review of Sociology 43 19-39 httpsdoiorg101146annurev-soc-060116-
053457
bull sect[Politics-BigData] Keith T Poole L Jason Anasastopolous and James E Monagan III Forthcom-ing ldquoThe lsquoBig Datarsquo Revolution in Political Campaigning and Governancerdquo Oxford Bibliographies inPolitical Science
iexclCuidado Traps Biases Problems Pains Perils
bull dagger[GoogleFlu] David Lazer Ryan Kennedy Gary King and Alessandro Vespignani 2014 ldquoTheParable of Google Flu Traps in Big Data Analysisrdquo Science (343) 14 March httpgking
harvardedufilesgkingfiles0314policyforumffpdf
bull Dagger[MachineBias] ProPublica ldquoMachine Bias Investigating Algorithmic Injusticerdquo Series https
wwwpropublicaorgseriesmachine-bias See especially
14
Julia Angwin Jeff Larson Surya Mattu and Lauren Kirchner 2016 ldquoMachine Biasrdquo ProPub-lica May 23 httpswwwpropublicaorgarticlemachine-bias-risk-assessments-in-
criminal-sentencing
Julia Angwin Madeleine Varner and Ariana Tobin 2017 ldquoFacebook Enabled Adverstisers toReach Jew Hatersrdquo httpswwwpropublicaorgarticlefacebook-enabled-advertisers-
to-reach-jew-haters
bull Dagger[Polling2016] Doug Rivers (Nov 11 2016) ldquoFirst Thoughts on Polling Problems in the 2016 USElectionsrdquo httpstodayyougovcomnews20161111first-thoughts-polling-problems-
2016-us-elections
bull dagger[EventData] Wei Wang Ryan Kennedy David Lazer Naren Ramakrishnan 2016 ldquoGrowing Painsfor Global Monitoring of Societal Eventsrdquo Science 3536307 pp 1502ndash1503 httpsdoiorg101126scienceaaf6758
bull Dagger[GoogleBooks] Eitan Adam Pechenick Christopher M Danforth and Peter Sheridan Dodds 2015ldquoCharacterizing the Google Books Corpus Strong Limits to Inferences of Socio-Cultural and LinguisticEvolutionrdquo PLoS One httpsdoiorg101371journalpone0137041
bull Dagger[OkCupid] Michael Zimmer 2016 ldquoOkCupid Study Reveals the Perils of Big-Data Sciencerdquo Wiredhttpswwwwiredcom201605okcupid-study-reveals-perils-big-data-science May 14
bull dagger[CriticalQuestions] danah boyd and Kate Crawford 2011 ldquoCritical Questions for Big Data Provo-cations for a Cultural Technological and Scholarly Phenomenonrdquo Information Communication ampSociety 15(5) 662ndash79 httpdxdoiorg1010801369118X2012678878
bull Dagger[RacistBot] Daniel Victor 2016 ldquoMicrosoft created a Twitter bot to learn from users It quicklybecame a racist jerkrdquo New York Times httpswwwnytimescom20160325technology
microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk
html
bull MMDS 122-123 on Bonferroni
bull Also BDSS-Census
Research Design and Measurement
Overviews
bull BitByBit
bull Dagger[ResearchMethodsKB] William M Trochim 2006 The Research Methods Knowledge Basehttpwwwsocialresearchmethodsnetkb
bull [7Rules] Glenn Firebaugh 2008 Seven Rules for Social Research Princeton University Press
Measurement Reliability and Validity
bull ResearchMethodsKB ldquoMeasurementrdquo
bull 7Rules Ch 3 ldquoBuild Reality Checks into Your Researchrdquo
bull Reliability see also FightinWords
bull Validity see also Quinn10-Topics Monroe-5Vs
Indirect Unobtrusive Nonreactive Measures Data Exhaust
15
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
InfoRetrieval Chs 1 2 6 (you may prefer the slides from their classes) My primaryhope here is that you understand the ldquovector space modelrdquo and ldquocosine similarityrdquo(from Chapter 6) My secondary hope is that you understand the basics of Booleaninformation retrieval (and notions like ldquoindexrdquo ldquoinverted indexrdquo and ldquopostingsrdquo) Mytertiary hope is that you are exposed to some basic concepts of text analytics NLP(including ldquotokenizationrdquo ldquonormalizationrdquo ldquostemmingrdquoldquolemmatizationrdquo ldquostop wordsrdquoldquotf-idfrdquo)
FightinWords (FW was used by Jurafsky in httpfirstmondayorgojsindexphp
fmarticleview49443863 and a best-selling book The Language of Food for similarapplications to Andris based on Yelp reviews and now appears in his textbook NLP)
bull Further reference
Section ldquoSpace and Timerdquo
Section ldquoWeb Scrapingrdquo
Section ldquoData Representations Data Mappingsrdquo
Section ldquoFeature selection feature extraction feature engineering rdquo
Section ldquoDatabases and data managementrdquo (esp SQL)
Section ldquoLanguage Text Speech Audiordquo
bull Exercise 2 Due Teams Report
February 15 (Postponed)
bull Conrad Tucker (IE) ldquoCybersecurity Policies and Their Impact on Dynamic Data DrivenApplication Systemsrdquo
bull Readings (send list of confusing terms concepts by 700 am Feb 14 Wednesday)
Conrad Tucker ldquoCybersecurity Policies and Their Impact on Dynamic Data DrivenApplication Systemsrdquo httpieeexploreieeeorgdocument8064151 See alsoldquoGenerative Adversarial Networks for Increasing the Veracity of Big Datardquo http
ieeexploreieeeorgdocument8258219
February 22
bull David Reitter (IST) ldquoComputational Psycholinguisticsrdquo
bull Readings (two weeks worth)
David Reitter ldquoAlignment in Web-Based Dialogue Studies in Big Data ComputationalPsycholinguisticsrdquo (Based on data from the Cancer Survivors Network and Reddit)httpwwwdavid-reittercompubreitter2017alignmentpdf
Similarity
PatternRecognition Ch2 ldquoRepresentationrdquo
6
MMDS Chapter 3 ldquoFinding Similar Itemsrdquo (to understand ldquominhashingrdquo and ldquolocalitysensitive hashingrdquo yoursquoll need to understand ldquohashingrdquo ndash Section 132 also discussedin InfoRetrieval Chapter 3)
bull Further reference
Section ldquoSimilarity Distance Association rdquo
Section ldquoDerived Data Representations rdquo
Section ldquoRecord Linkage Entity Resolution rdquo
Section ldquoClustering Hashing Compression rdquo
March 1
bull Reading
BitByBit Ch 5 (ldquoCreating Mass Collaborationrdquo)
Crowdsourcing ldquoIntroductionrdquo and Ch1 ldquoConcepts Theories and Cases of Crowd-sourcingrdquo
HumanComputation Chs 125
bull Further reference
Section ldquoCrowdsourcing Human Computation Citizen Science Web Experimentsrdquo
Section ldquoMaking Up Data (smoothing convolution kernels )rdquo
Section ldquoVision image videordquo
bull Semester Projects Must be Approved Before Spring Break
March 8 - Spring Break
March 15
bull Daniel DellaPosta (SOC) ldquoNetworks and the Mid-20th Century American Mafiardquo
bull Reading
Dan DellaPosta ldquoNetwork Closure in the Mid-20th Century American Mafiardquo (SocialNetworks 2017) httpswww-sciencedirect-comezaccesslibrariespsuedusciencearticlepiiS037887331630199X ldquoBetween Clique and Corporation Boundary-Spanningin Solidary Groupsrdquo (RampR in American Journal of Sociology in the Box folder)
Networks Chs 1 2
MMDS Ch 5 ldquoLink Analysisrdquo
MarkovVisually
bull Further references
Section ldquoNetworks and Graphsrdquo
7
Section ldquoSimulation resampling Markov Chains rdquo
DeepLearning (different kind of network)
bull 25 Project Review
March 22
bull Reading
DeepLearning Ch2 ldquoLinear Algebrardquo
Shalizi-ADA Chs 16-18 (ldquoPrincipal Components Analysisrdquo ldquoFactor Modelsrdquo ldquoNonlin-ear Dimension Reductionrdquo)
bull Further reference
Section ldquoDimensionality reduction decomposition rdquo
Section ldquoMultiple measures latent variable measurementrdquo
Section ldquoLinear algebra matrix computationsrdquo
March 29
bull Naomi Altman (STAT) ldquoGeneralizing PCArdquo
bull Reading
Altman ldquoGeneralizing PCArdquo 2015 slides httppersonalpsuedunsa1AltmanWebpagePCATorontopdf
MapReduceIntuition (7 minute video providing ldquodivide and conquerrdquo intuition to MapRe-duce)
TidySAC-Video Hadley Wickham on the tidyverse ldquosplit-apply-combinerdquo with dplyrand tidy data (1 hour video) (More detail but perhaps less intuition in book formdiscussion of ldquotidyingrdquo data Chapter 5 of TidyData-R)
MMDS Ch 2 (Read the Chapter 2 in the new ldquoBETArdquo version of the book whichalso touches on Spark and Tensorflow in section 24 ldquoExtensions to MapReducerdquo httpistanfordedu~ullmanmmdsnhtml)
bull Further reference
Section ldquoNonlinear dimension reduction manifold learningrdquo
Section ldquoDatabases and data managementrdquo (esp NoSQL)
Section ldquoTheoretically-structured approaches to data wranglingrdquo
Section ldquoParallelism MapReduce Split-Apply-Combinerdquo
Section ldquoFunctional programmingrdquo
Section ldquoCutting and bleeding edge rdquo
bull 50 Project Review
bull Exercise 3 (Individual) Due 45
8
April 5
bull Prasenjit Mitra (IST) ldquoClassification of Tweets from Disaster Scenariosrdquo
bull Reading
Prasenjit Mitra Rudra et al rdquoSummarizing Situational and Topical InformationDuring Crisesrdquo httpsarxivorgpdf161001561pdf Imran Mitra amp CastillordquoTwitter as a Lifeline Human-annotated Twitter Corpora for NLP of Crisis-relatedMessagesrdquo httpmimranmepapersimran_prasenjit_carlos_lrec2016pdf
NLP (Jurafsky and Martin) Chapters 15 and 16 ldquoVector Semanticsrdquo and ldquoSemanticswith Dense Vectorsrdquo
bull Further reference
Section ldquoLanguage text speech audiordquo
GloVe word2vec word2vecExplained t-SNE
Section ldquoMobile devices distributed sensors rdquo
Section ldquoCrowdsourcing human computation citizen science rdquo
Section ldquoScaling iteration streaming data online algorithmsrdquo
bull Exercise 3 Due
April 12
bull Reading
BitByBit Ch 4 ldquoRunning Experimentsrdquo review Ch 2 sections ldquoNatural experimentsin observable datardquo examples
CausalInference Chs 1-2
bull Further reference
Section ldquoExperimental and observational designs for causal inferencerdquo
Section ldquoCrowdsourcing web experimentsrdquo
Section ldquoHuman subjects rdquo
bull 75 Project Review
April 19
bull Reading
BitByBit Ch 6 ldquoEthicsrdquo
PSU-ORP ldquoCommon Rule and Other Changesrdquo httpswwwresearchpsuedu
irbcommonrulechanges
DataPrivacy
9
Reproducibility
Refresh on MachineBias
bull Further reference
Section ldquoEthics and Scientific Responsibility in Big Social Datardquo
Section ldquoiexclCuidadordquo
April 26 - LAST CLASS MEETING
bull Team Project Presentations
May 3 - Team Projects Due
10
Past Visiting Speakers in SoDA 501 and 502
SoDA 502 (Fall 2017)
bull Chris Zorn (PLSC)
bull Diane Felmlee (SOC)
bull Alan MacEachren (GEOG)
bull Eric Plutzer (PLSC)
bull James LeBreton (PSYCH)
bull Sesa Slavkovic (STAT)
bull Guido Cervone (GEOG)
bull Ashton Verdery (SOC)
bull Maggie Niu (STAT)
bull Reka Albert (PHYS)
SoDA 501 (Spring 2017)
bull Tim Brick (HDFS) ldquoTowards real-time monitoring and intervention using wearable technologyrdquo
bull Aylin Caliskan (Princeton) ldquoA Story of Discrimination and Unfairness Bias in Word Embeddingsrdquo
bull Jay Yonamine (Google - IGERT alum) ldquoData Science in Industryrdquo
bull Johnathan Rush (Illinois) ldquoGeospatial Data Science Workshoprdquo
bull Rick Gilmore (PSYCH) ldquoToward a more reproducible and robust science of human behaviorrdquo
bull Glenn Firebaugh (SOC) ldquoMeasuring Inequality and Segregation with US Census Datardquo
bull Charles Twardy (Sotera) ldquoData Science for Search and Rescuerdquo
bull Anna Smith (Ohio State) ldquoA Hierarchical Model for Network Data in a Latent Hyperbolic Spacerdquo
bull Rebecca Passonneau (CSE) ldquoOmnigraph Rich Feature Representation for Graph Kernel Learningrdquo
bull Alex Klippel (GEOG) ldquoVirtual Reality for Immersive Analyticsrdquo
bull Murali Haran (STAT) ldquoA Computationally Efficient Projection-based Approach for Spatial General-ized Mixed Modelsrdquo
SoDA 502 (Fall 2016)
bull Clio Andris (GEOG) ldquoIntegrating Social Network Data into GISystemsrdquo
bull Jia Li (STAT) ldquoClustering under the Wasserstein Metricrdquo
bull Rachel Smith (CAS) ldquoStigma Networks Perceptions of Sociogramsrdquo
bull Zita Oravecz (HDFS)
bull Bethany Bray (Methodology Center) ldquoLatent Class and Latent Transition Analysisrdquo
bull Dave Hunter (STAT) ldquoModel Based Clustering of Large Networksrdquo
bull Scott Bennett (PLSC) ldquoABM Model of Insurgencyrdquo
bull David Reitter (IST)
bull Suzanna Linn (PLSC) ldquoMethodological Issues in Automated Text Analysis Application to NewsCoverage of the US Economyrdquo
11
SoDA 501 (Spring 2016)
bull Bruce Desmarais (PLSC) ldquoLearning in the Sunshine Analysis of Local Government Email Corporardquo
bull Timothy Brick (HDFS) ldquoMapping and Manipulating Facial Expressionrdquo
bull Qunying Huang (USC) ldquoSocial Media An Emerging Data Source for Human Mobility Studiesrdquo
bull Lingzhou Zue (STAT) ldquoAn Introduction to High-Dimensional Graphical Modelsrdquo
bull Ashton Verdery (SOC) ldquoSampling from Network Datardquo
bull Sarah Battersby (Tableau) ldquoHelping People See and Understand Spatial Datardquo
bull Alexandra Slavkovic (STAT) ldquoStatistical Privacy with Network Datardquo
bull Lee Giles (IST) ldquoMachine Learning for Scholarly Big Datardquo
12
Readings and References (updated Spring 2018)
We will discuss a relatively small subset of the readings listed here and this will vary based on topics andreadings discussed by visiting speakers and you yourselves The remainder are provided here as curatedreferences for more in-depth investigation in both the theory and practice related to the topic (with thelatter heavily weighted toward resources in Python and R)
dagger Material that is at last check made available through the Penn State library Most journal links shouldwork if you are logged in to a Penn State machine or through the Penn State VPN Some article archivesand most books require additional authentication through webaccess If links are broken start directly froma search via httpswwwlibrariespsuedu Some books require installation of e-readers like AdobeDigital Editions Links to lyndacom can be accessed through httplyndapsuedu
Dagger Material that is at last check legally provided for free In some cases these are the preprint versionsof published material
sect Material for which a legal selection is or will be provided through the class Box folder
Big Data amp Social Data Analytics
Overviews
bull Dagger[CompSocSci] David Lazer Alex Pentland Lada Adamic Sinan Aral Albert-Laszlo BarabasiDevon Brewer Nicholas Christakis Noshir Contractor James Fowler Myron Gutmann Tony JebaraGary King Michael Macy Deb Roy and Marshall Van Alstyne 2009 ldquoComputational Social SciencerdquoScience 323(5915)721-3 + Supp Feb 6 httpsciencesciencemagorgcontent3235915
721full httpsgkingharvardedufilesgkingfilesLazPenAda09pdf
bull Dagger[NRCreport] National Research Council 2013 Frontiers in Massive Data Analysis NationalAcademies Press (Free w registration httpwwwnapeducatalogphprecord_id=18374)Ch1 ldquoIntroductionrdquo Ch2 ldquoMassive Data in Science Technology Commerce National DefenseTelecommunications and other Endeavorsrdquo
bull Dagger[MMDS] Jure Leskovec Anand Rajaraman and Jeff Ullman 2014 Mining of Massive DatasetsCambridge University Press httpwwwmmdsorg (BETA version of Third Edition httpi
stanfordedu~ullmanmmdsnhtml)
bull sect[BitByBit] Matthew J Salganik 2018 (Forthcoming) Bit by Bit Social Research in the DigitalAge Princeton University Press Ch 1 ldquoIntroductionrdquo Ch 2 ldquoObserving Behaviorrdquo
bull DSHandbook-Py Ch 1 ldquoIntroduction Becoming a Unicornrdquo Ch2 ldquoThe Data Science Road Maprdquo
Burt-schtick
bull Dagger[Monroe-5Vs] Burt L Monroe 2013 ldquoThe Five Vs of Big Data Political Science Introductionto the Special Issue on Big Data in Political Sciencerdquo Political Analysis 21(V5) 1ndash9 https
doiorg101017S1047198700014315 (Volume Velocity Variety Vinculation Validity)
bull dagger[Monroe-No] Burt L Monroe Jennifer Pan Margaret E Roberts Maya Sen and Betsy Sin-clair 2015 ldquoNo Formal Theory Causal Inference and Big Data Are Not Contradictory Trendsin Political Sciencerdquo PS Political Science amp Politics 48(1) 71ndash4 httpdxdoiorg101017
S1049096514001760
bull dagger[Quinn-Topics] Kevin M Quinn Burt L Monroe Michael Colaresi Michael H Crespin andDragomir R Radev 2010 ldquoHow to Analyze Political Attention with Minimal Assumptions andCostsrdquo American Journal of Political Science 54(1) 209ndash28 httponlinelibrarywileycom
13
doi101111j1540-5907200900427xfull (esp topic modeling as measurement approach tovalidation)
bull dagger[FightinWords] Burt L Monroe Michael Colaresi and Kevin M Quinn 2008 ldquoFightinrsquo WordsLexical Feature Selection and Evaluation for Identifying the Content of Political Conflictrdquo PoliticalAnalysis 16(4) 372-403 httpsdoiorg101093panmpn018 (esp the impact of samplingvariance and regularization through priors)
bull Dagger[BDSS-Census] Big Data Social Science PSU Team 2012 ldquoA Closer Look at the Kaggle CensusDatardquo httpsburtmonroegithubioBDSSKaggleCensus2012 (esp the relevance of the socialprocesses by which data come to exist as data)
Multidisciplinary Perspectives
bull dagger[Business-BigData] Andrew McAfee and Erik Brynjolfsson 2012 ldquoBig Data The ManagementRevolutionrdquo Harvard Business Review 90(10) 61ndash8 October httpshbrorg201210big-
data-the-management-revolution Thomas H Davenport and DJ Patel 2012 ldquoData ScientistThe Sexiest Job of the 21st Centuryrdquo Harvard Business Review 90(10)70-6 October https
hbrorg201210data-scientist-the-sexiest-job-of-the-21st-century
bull dagger[InfoSci-BigData] CL Philip Chen and Chun-Yang Zhang 2014 ldquoData-intensive ApplicationsChallenges Techniques and Technologies A Survey on Big Datardquo Information Sciences 275 314-47httpsdoiorg101016jins201401015
bull dagger[Informatics-BigData] Vasant G Honavar 2014 ldquoThe Promise and Potential of Big Data ACase for Discovery Informaticsrdquo Review of Policy Research 31(4) 326-330 httpsdoiorg10
1111ropr12080
bull Dagger[Stats-BigData] Beate Franke Jean-Francois Ribana Roscher Annie Lee Cathal Smyth ArminHatefi Fuqi Chen Einat Gil Alexander Schwing Alessandro Selvitella Michael M Hoffman RogerGrosse Dietrich Hendricks and Nancy Reid 2016 ldquoStatistical Inference Learning and Models in BigDatardquo International Statistical Review 84(3) 371-89 httponlinelibrarywileycomdoi101111insr12176full
bull dagger[Econ-BigData] Hal R Varian 2013 ldquoBig Data New Tricks for Econometricsrdquo Journal ofEconomic Perspectives 28(2) 3-28 httpsdoiorg101257jep2823
bull dagger[GeoViz-BigData] Alan MacEachren 2017 ldquoLeveraging Big (Geo) Data with (Geo) Visual An-alytics Place as the Next Frontierrdquo In Chenghu Zhou Fenzhen Su Francis Harvey and Jun Xueds Spatial Data Handling in the Big Data Era pp 139-155 Springer httpslink-springer-
comezaccesslibrariespsueduchapter101007978-981-10-4424-3_10
bull dagger[Soc-BigData] David Lazer and Jason Radford 2017 ldquoData ex Machina Introduction to BigDatardquo Annual Review of Sociology 43 19-39 httpsdoiorg101146annurev-soc-060116-
053457
bull sect[Politics-BigData] Keith T Poole L Jason Anasastopolous and James E Monagan III Forthcom-ing ldquoThe lsquoBig Datarsquo Revolution in Political Campaigning and Governancerdquo Oxford Bibliographies inPolitical Science
iexclCuidado Traps Biases Problems Pains Perils
bull dagger[GoogleFlu] David Lazer Ryan Kennedy Gary King and Alessandro Vespignani 2014 ldquoTheParable of Google Flu Traps in Big Data Analysisrdquo Science (343) 14 March httpgking
harvardedufilesgkingfiles0314policyforumffpdf
bull Dagger[MachineBias] ProPublica ldquoMachine Bias Investigating Algorithmic Injusticerdquo Series https
wwwpropublicaorgseriesmachine-bias See especially
14
Julia Angwin Jeff Larson Surya Mattu and Lauren Kirchner 2016 ldquoMachine Biasrdquo ProPub-lica May 23 httpswwwpropublicaorgarticlemachine-bias-risk-assessments-in-
criminal-sentencing
Julia Angwin Madeleine Varner and Ariana Tobin 2017 ldquoFacebook Enabled Adverstisers toReach Jew Hatersrdquo httpswwwpropublicaorgarticlefacebook-enabled-advertisers-
to-reach-jew-haters
bull Dagger[Polling2016] Doug Rivers (Nov 11 2016) ldquoFirst Thoughts on Polling Problems in the 2016 USElectionsrdquo httpstodayyougovcomnews20161111first-thoughts-polling-problems-
2016-us-elections
bull dagger[EventData] Wei Wang Ryan Kennedy David Lazer Naren Ramakrishnan 2016 ldquoGrowing Painsfor Global Monitoring of Societal Eventsrdquo Science 3536307 pp 1502ndash1503 httpsdoiorg101126scienceaaf6758
bull Dagger[GoogleBooks] Eitan Adam Pechenick Christopher M Danforth and Peter Sheridan Dodds 2015ldquoCharacterizing the Google Books Corpus Strong Limits to Inferences of Socio-Cultural and LinguisticEvolutionrdquo PLoS One httpsdoiorg101371journalpone0137041
bull Dagger[OkCupid] Michael Zimmer 2016 ldquoOkCupid Study Reveals the Perils of Big-Data Sciencerdquo Wiredhttpswwwwiredcom201605okcupid-study-reveals-perils-big-data-science May 14
bull dagger[CriticalQuestions] danah boyd and Kate Crawford 2011 ldquoCritical Questions for Big Data Provo-cations for a Cultural Technological and Scholarly Phenomenonrdquo Information Communication ampSociety 15(5) 662ndash79 httpdxdoiorg1010801369118X2012678878
bull Dagger[RacistBot] Daniel Victor 2016 ldquoMicrosoft created a Twitter bot to learn from users It quicklybecame a racist jerkrdquo New York Times httpswwwnytimescom20160325technology
microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk
html
bull MMDS 122-123 on Bonferroni
bull Also BDSS-Census
Research Design and Measurement
Overviews
bull BitByBit
bull Dagger[ResearchMethodsKB] William M Trochim 2006 The Research Methods Knowledge Basehttpwwwsocialresearchmethodsnetkb
bull [7Rules] Glenn Firebaugh 2008 Seven Rules for Social Research Princeton University Press
Measurement Reliability and Validity
bull ResearchMethodsKB ldquoMeasurementrdquo
bull 7Rules Ch 3 ldquoBuild Reality Checks into Your Researchrdquo
bull Reliability see also FightinWords
bull Validity see also Quinn10-Topics Monroe-5Vs
Indirect Unobtrusive Nonreactive Measures Data Exhaust
15
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
MMDS Chapter 3 ldquoFinding Similar Itemsrdquo (to understand ldquominhashingrdquo and ldquolocalitysensitive hashingrdquo yoursquoll need to understand ldquohashingrdquo ndash Section 132 also discussedin InfoRetrieval Chapter 3)
bull Further reference
Section ldquoSimilarity Distance Association rdquo
Section ldquoDerived Data Representations rdquo
Section ldquoRecord Linkage Entity Resolution rdquo
Section ldquoClustering Hashing Compression rdquo
March 1
bull Reading
BitByBit Ch 5 (ldquoCreating Mass Collaborationrdquo)
Crowdsourcing ldquoIntroductionrdquo and Ch1 ldquoConcepts Theories and Cases of Crowd-sourcingrdquo
HumanComputation Chs 125
bull Further reference
Section ldquoCrowdsourcing Human Computation Citizen Science Web Experimentsrdquo
Section ldquoMaking Up Data (smoothing convolution kernels )rdquo
Section ldquoVision image videordquo
bull Semester Projects Must be Approved Before Spring Break
March 8 - Spring Break
March 15
bull Daniel DellaPosta (SOC) ldquoNetworks and the Mid-20th Century American Mafiardquo
bull Reading
Dan DellaPosta ldquoNetwork Closure in the Mid-20th Century American Mafiardquo (SocialNetworks 2017) httpswww-sciencedirect-comezaccesslibrariespsuedusciencearticlepiiS037887331630199X ldquoBetween Clique and Corporation Boundary-Spanningin Solidary Groupsrdquo (RampR in American Journal of Sociology in the Box folder)
Networks Chs 1 2
MMDS Ch 5 ldquoLink Analysisrdquo
MarkovVisually
bull Further references
Section ldquoNetworks and Graphsrdquo
7
Section ldquoSimulation resampling Markov Chains rdquo
DeepLearning (different kind of network)
bull 25 Project Review
March 22
bull Reading
DeepLearning Ch2 ldquoLinear Algebrardquo
Shalizi-ADA Chs 16-18 (ldquoPrincipal Components Analysisrdquo ldquoFactor Modelsrdquo ldquoNonlin-ear Dimension Reductionrdquo)
bull Further reference
Section ldquoDimensionality reduction decomposition rdquo
Section ldquoMultiple measures latent variable measurementrdquo
Section ldquoLinear algebra matrix computationsrdquo
March 29
bull Naomi Altman (STAT) ldquoGeneralizing PCArdquo
bull Reading
Altman ldquoGeneralizing PCArdquo 2015 slides httppersonalpsuedunsa1AltmanWebpagePCATorontopdf
MapReduceIntuition (7 minute video providing ldquodivide and conquerrdquo intuition to MapRe-duce)
TidySAC-Video Hadley Wickham on the tidyverse ldquosplit-apply-combinerdquo with dplyrand tidy data (1 hour video) (More detail but perhaps less intuition in book formdiscussion of ldquotidyingrdquo data Chapter 5 of TidyData-R)
MMDS Ch 2 (Read the Chapter 2 in the new ldquoBETArdquo version of the book whichalso touches on Spark and Tensorflow in section 24 ldquoExtensions to MapReducerdquo httpistanfordedu~ullmanmmdsnhtml)
bull Further reference
Section ldquoNonlinear dimension reduction manifold learningrdquo
Section ldquoDatabases and data managementrdquo (esp NoSQL)
Section ldquoTheoretically-structured approaches to data wranglingrdquo
Section ldquoParallelism MapReduce Split-Apply-Combinerdquo
Section ldquoFunctional programmingrdquo
Section ldquoCutting and bleeding edge rdquo
bull 50 Project Review
bull Exercise 3 (Individual) Due 45
8
April 5
bull Prasenjit Mitra (IST) ldquoClassification of Tweets from Disaster Scenariosrdquo
bull Reading
Prasenjit Mitra Rudra et al rdquoSummarizing Situational and Topical InformationDuring Crisesrdquo httpsarxivorgpdf161001561pdf Imran Mitra amp CastillordquoTwitter as a Lifeline Human-annotated Twitter Corpora for NLP of Crisis-relatedMessagesrdquo httpmimranmepapersimran_prasenjit_carlos_lrec2016pdf
NLP (Jurafsky and Martin) Chapters 15 and 16 ldquoVector Semanticsrdquo and ldquoSemanticswith Dense Vectorsrdquo
bull Further reference
Section ldquoLanguage text speech audiordquo
GloVe word2vec word2vecExplained t-SNE
Section ldquoMobile devices distributed sensors rdquo
Section ldquoCrowdsourcing human computation citizen science rdquo
Section ldquoScaling iteration streaming data online algorithmsrdquo
bull Exercise 3 Due
April 12
bull Reading
BitByBit Ch 4 ldquoRunning Experimentsrdquo review Ch 2 sections ldquoNatural experimentsin observable datardquo examples
CausalInference Chs 1-2
bull Further reference
Section ldquoExperimental and observational designs for causal inferencerdquo
Section ldquoCrowdsourcing web experimentsrdquo
Section ldquoHuman subjects rdquo
bull 75 Project Review
April 19
bull Reading
BitByBit Ch 6 ldquoEthicsrdquo
PSU-ORP ldquoCommon Rule and Other Changesrdquo httpswwwresearchpsuedu
irbcommonrulechanges
DataPrivacy
9
Reproducibility
Refresh on MachineBias
bull Further reference
Section ldquoEthics and Scientific Responsibility in Big Social Datardquo
Section ldquoiexclCuidadordquo
April 26 - LAST CLASS MEETING
bull Team Project Presentations
May 3 - Team Projects Due
10
Past Visiting Speakers in SoDA 501 and 502
SoDA 502 (Fall 2017)
bull Chris Zorn (PLSC)
bull Diane Felmlee (SOC)
bull Alan MacEachren (GEOG)
bull Eric Plutzer (PLSC)
bull James LeBreton (PSYCH)
bull Sesa Slavkovic (STAT)
bull Guido Cervone (GEOG)
bull Ashton Verdery (SOC)
bull Maggie Niu (STAT)
bull Reka Albert (PHYS)
SoDA 501 (Spring 2017)
bull Tim Brick (HDFS) ldquoTowards real-time monitoring and intervention using wearable technologyrdquo
bull Aylin Caliskan (Princeton) ldquoA Story of Discrimination and Unfairness Bias in Word Embeddingsrdquo
bull Jay Yonamine (Google - IGERT alum) ldquoData Science in Industryrdquo
bull Johnathan Rush (Illinois) ldquoGeospatial Data Science Workshoprdquo
bull Rick Gilmore (PSYCH) ldquoToward a more reproducible and robust science of human behaviorrdquo
bull Glenn Firebaugh (SOC) ldquoMeasuring Inequality and Segregation with US Census Datardquo
bull Charles Twardy (Sotera) ldquoData Science for Search and Rescuerdquo
bull Anna Smith (Ohio State) ldquoA Hierarchical Model for Network Data in a Latent Hyperbolic Spacerdquo
bull Rebecca Passonneau (CSE) ldquoOmnigraph Rich Feature Representation for Graph Kernel Learningrdquo
bull Alex Klippel (GEOG) ldquoVirtual Reality for Immersive Analyticsrdquo
bull Murali Haran (STAT) ldquoA Computationally Efficient Projection-based Approach for Spatial General-ized Mixed Modelsrdquo
SoDA 502 (Fall 2016)
bull Clio Andris (GEOG) ldquoIntegrating Social Network Data into GISystemsrdquo
bull Jia Li (STAT) ldquoClustering under the Wasserstein Metricrdquo
bull Rachel Smith (CAS) ldquoStigma Networks Perceptions of Sociogramsrdquo
bull Zita Oravecz (HDFS)
bull Bethany Bray (Methodology Center) ldquoLatent Class and Latent Transition Analysisrdquo
bull Dave Hunter (STAT) ldquoModel Based Clustering of Large Networksrdquo
bull Scott Bennett (PLSC) ldquoABM Model of Insurgencyrdquo
bull David Reitter (IST)
bull Suzanna Linn (PLSC) ldquoMethodological Issues in Automated Text Analysis Application to NewsCoverage of the US Economyrdquo
11
SoDA 501 (Spring 2016)
bull Bruce Desmarais (PLSC) ldquoLearning in the Sunshine Analysis of Local Government Email Corporardquo
bull Timothy Brick (HDFS) ldquoMapping and Manipulating Facial Expressionrdquo
bull Qunying Huang (USC) ldquoSocial Media An Emerging Data Source for Human Mobility Studiesrdquo
bull Lingzhou Zue (STAT) ldquoAn Introduction to High-Dimensional Graphical Modelsrdquo
bull Ashton Verdery (SOC) ldquoSampling from Network Datardquo
bull Sarah Battersby (Tableau) ldquoHelping People See and Understand Spatial Datardquo
bull Alexandra Slavkovic (STAT) ldquoStatistical Privacy with Network Datardquo
bull Lee Giles (IST) ldquoMachine Learning for Scholarly Big Datardquo
12
Readings and References (updated Spring 2018)
We will discuss a relatively small subset of the readings listed here and this will vary based on topics andreadings discussed by visiting speakers and you yourselves The remainder are provided here as curatedreferences for more in-depth investigation in both the theory and practice related to the topic (with thelatter heavily weighted toward resources in Python and R)
dagger Material that is at last check made available through the Penn State library Most journal links shouldwork if you are logged in to a Penn State machine or through the Penn State VPN Some article archivesand most books require additional authentication through webaccess If links are broken start directly froma search via httpswwwlibrariespsuedu Some books require installation of e-readers like AdobeDigital Editions Links to lyndacom can be accessed through httplyndapsuedu
Dagger Material that is at last check legally provided for free In some cases these are the preprint versionsof published material
sect Material for which a legal selection is or will be provided through the class Box folder
Big Data amp Social Data Analytics
Overviews
bull Dagger[CompSocSci] David Lazer Alex Pentland Lada Adamic Sinan Aral Albert-Laszlo BarabasiDevon Brewer Nicholas Christakis Noshir Contractor James Fowler Myron Gutmann Tony JebaraGary King Michael Macy Deb Roy and Marshall Van Alstyne 2009 ldquoComputational Social SciencerdquoScience 323(5915)721-3 + Supp Feb 6 httpsciencesciencemagorgcontent3235915
721full httpsgkingharvardedufilesgkingfilesLazPenAda09pdf
bull Dagger[NRCreport] National Research Council 2013 Frontiers in Massive Data Analysis NationalAcademies Press (Free w registration httpwwwnapeducatalogphprecord_id=18374)Ch1 ldquoIntroductionrdquo Ch2 ldquoMassive Data in Science Technology Commerce National DefenseTelecommunications and other Endeavorsrdquo
bull Dagger[MMDS] Jure Leskovec Anand Rajaraman and Jeff Ullman 2014 Mining of Massive DatasetsCambridge University Press httpwwwmmdsorg (BETA version of Third Edition httpi
stanfordedu~ullmanmmdsnhtml)
bull sect[BitByBit] Matthew J Salganik 2018 (Forthcoming) Bit by Bit Social Research in the DigitalAge Princeton University Press Ch 1 ldquoIntroductionrdquo Ch 2 ldquoObserving Behaviorrdquo
bull DSHandbook-Py Ch 1 ldquoIntroduction Becoming a Unicornrdquo Ch2 ldquoThe Data Science Road Maprdquo
Burt-schtick
bull Dagger[Monroe-5Vs] Burt L Monroe 2013 ldquoThe Five Vs of Big Data Political Science Introductionto the Special Issue on Big Data in Political Sciencerdquo Political Analysis 21(V5) 1ndash9 https
doiorg101017S1047198700014315 (Volume Velocity Variety Vinculation Validity)
bull dagger[Monroe-No] Burt L Monroe Jennifer Pan Margaret E Roberts Maya Sen and Betsy Sin-clair 2015 ldquoNo Formal Theory Causal Inference and Big Data Are Not Contradictory Trendsin Political Sciencerdquo PS Political Science amp Politics 48(1) 71ndash4 httpdxdoiorg101017
S1049096514001760
bull dagger[Quinn-Topics] Kevin M Quinn Burt L Monroe Michael Colaresi Michael H Crespin andDragomir R Radev 2010 ldquoHow to Analyze Political Attention with Minimal Assumptions andCostsrdquo American Journal of Political Science 54(1) 209ndash28 httponlinelibrarywileycom
13
doi101111j1540-5907200900427xfull (esp topic modeling as measurement approach tovalidation)
bull dagger[FightinWords] Burt L Monroe Michael Colaresi and Kevin M Quinn 2008 ldquoFightinrsquo WordsLexical Feature Selection and Evaluation for Identifying the Content of Political Conflictrdquo PoliticalAnalysis 16(4) 372-403 httpsdoiorg101093panmpn018 (esp the impact of samplingvariance and regularization through priors)
bull Dagger[BDSS-Census] Big Data Social Science PSU Team 2012 ldquoA Closer Look at the Kaggle CensusDatardquo httpsburtmonroegithubioBDSSKaggleCensus2012 (esp the relevance of the socialprocesses by which data come to exist as data)
Multidisciplinary Perspectives
bull dagger[Business-BigData] Andrew McAfee and Erik Brynjolfsson 2012 ldquoBig Data The ManagementRevolutionrdquo Harvard Business Review 90(10) 61ndash8 October httpshbrorg201210big-
data-the-management-revolution Thomas H Davenport and DJ Patel 2012 ldquoData ScientistThe Sexiest Job of the 21st Centuryrdquo Harvard Business Review 90(10)70-6 October https
hbrorg201210data-scientist-the-sexiest-job-of-the-21st-century
bull dagger[InfoSci-BigData] CL Philip Chen and Chun-Yang Zhang 2014 ldquoData-intensive ApplicationsChallenges Techniques and Technologies A Survey on Big Datardquo Information Sciences 275 314-47httpsdoiorg101016jins201401015
bull dagger[Informatics-BigData] Vasant G Honavar 2014 ldquoThe Promise and Potential of Big Data ACase for Discovery Informaticsrdquo Review of Policy Research 31(4) 326-330 httpsdoiorg10
1111ropr12080
bull Dagger[Stats-BigData] Beate Franke Jean-Francois Ribana Roscher Annie Lee Cathal Smyth ArminHatefi Fuqi Chen Einat Gil Alexander Schwing Alessandro Selvitella Michael M Hoffman RogerGrosse Dietrich Hendricks and Nancy Reid 2016 ldquoStatistical Inference Learning and Models in BigDatardquo International Statistical Review 84(3) 371-89 httponlinelibrarywileycomdoi101111insr12176full
bull dagger[Econ-BigData] Hal R Varian 2013 ldquoBig Data New Tricks for Econometricsrdquo Journal ofEconomic Perspectives 28(2) 3-28 httpsdoiorg101257jep2823
bull dagger[GeoViz-BigData] Alan MacEachren 2017 ldquoLeveraging Big (Geo) Data with (Geo) Visual An-alytics Place as the Next Frontierrdquo In Chenghu Zhou Fenzhen Su Francis Harvey and Jun Xueds Spatial Data Handling in the Big Data Era pp 139-155 Springer httpslink-springer-
comezaccesslibrariespsueduchapter101007978-981-10-4424-3_10
bull dagger[Soc-BigData] David Lazer and Jason Radford 2017 ldquoData ex Machina Introduction to BigDatardquo Annual Review of Sociology 43 19-39 httpsdoiorg101146annurev-soc-060116-
053457
bull sect[Politics-BigData] Keith T Poole L Jason Anasastopolous and James E Monagan III Forthcom-ing ldquoThe lsquoBig Datarsquo Revolution in Political Campaigning and Governancerdquo Oxford Bibliographies inPolitical Science
iexclCuidado Traps Biases Problems Pains Perils
bull dagger[GoogleFlu] David Lazer Ryan Kennedy Gary King and Alessandro Vespignani 2014 ldquoTheParable of Google Flu Traps in Big Data Analysisrdquo Science (343) 14 March httpgking
harvardedufilesgkingfiles0314policyforumffpdf
bull Dagger[MachineBias] ProPublica ldquoMachine Bias Investigating Algorithmic Injusticerdquo Series https
wwwpropublicaorgseriesmachine-bias See especially
14
Julia Angwin Jeff Larson Surya Mattu and Lauren Kirchner 2016 ldquoMachine Biasrdquo ProPub-lica May 23 httpswwwpropublicaorgarticlemachine-bias-risk-assessments-in-
criminal-sentencing
Julia Angwin Madeleine Varner and Ariana Tobin 2017 ldquoFacebook Enabled Adverstisers toReach Jew Hatersrdquo httpswwwpropublicaorgarticlefacebook-enabled-advertisers-
to-reach-jew-haters
bull Dagger[Polling2016] Doug Rivers (Nov 11 2016) ldquoFirst Thoughts on Polling Problems in the 2016 USElectionsrdquo httpstodayyougovcomnews20161111first-thoughts-polling-problems-
2016-us-elections
bull dagger[EventData] Wei Wang Ryan Kennedy David Lazer Naren Ramakrishnan 2016 ldquoGrowing Painsfor Global Monitoring of Societal Eventsrdquo Science 3536307 pp 1502ndash1503 httpsdoiorg101126scienceaaf6758
bull Dagger[GoogleBooks] Eitan Adam Pechenick Christopher M Danforth and Peter Sheridan Dodds 2015ldquoCharacterizing the Google Books Corpus Strong Limits to Inferences of Socio-Cultural and LinguisticEvolutionrdquo PLoS One httpsdoiorg101371journalpone0137041
bull Dagger[OkCupid] Michael Zimmer 2016 ldquoOkCupid Study Reveals the Perils of Big-Data Sciencerdquo Wiredhttpswwwwiredcom201605okcupid-study-reveals-perils-big-data-science May 14
bull dagger[CriticalQuestions] danah boyd and Kate Crawford 2011 ldquoCritical Questions for Big Data Provo-cations for a Cultural Technological and Scholarly Phenomenonrdquo Information Communication ampSociety 15(5) 662ndash79 httpdxdoiorg1010801369118X2012678878
bull Dagger[RacistBot] Daniel Victor 2016 ldquoMicrosoft created a Twitter bot to learn from users It quicklybecame a racist jerkrdquo New York Times httpswwwnytimescom20160325technology
microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk
html
bull MMDS 122-123 on Bonferroni
bull Also BDSS-Census
Research Design and Measurement
Overviews
bull BitByBit
bull Dagger[ResearchMethodsKB] William M Trochim 2006 The Research Methods Knowledge Basehttpwwwsocialresearchmethodsnetkb
bull [7Rules] Glenn Firebaugh 2008 Seven Rules for Social Research Princeton University Press
Measurement Reliability and Validity
bull ResearchMethodsKB ldquoMeasurementrdquo
bull 7Rules Ch 3 ldquoBuild Reality Checks into Your Researchrdquo
bull Reliability see also FightinWords
bull Validity see also Quinn10-Topics Monroe-5Vs
Indirect Unobtrusive Nonreactive Measures Data Exhaust
15
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
Section ldquoSimulation resampling Markov Chains rdquo
DeepLearning (different kind of network)
bull 25 Project Review
March 22
bull Reading
DeepLearning Ch2 ldquoLinear Algebrardquo
Shalizi-ADA Chs 16-18 (ldquoPrincipal Components Analysisrdquo ldquoFactor Modelsrdquo ldquoNonlin-ear Dimension Reductionrdquo)
bull Further reference
Section ldquoDimensionality reduction decomposition rdquo
Section ldquoMultiple measures latent variable measurementrdquo
Section ldquoLinear algebra matrix computationsrdquo
March 29
bull Naomi Altman (STAT) ldquoGeneralizing PCArdquo
bull Reading
Altman ldquoGeneralizing PCArdquo 2015 slides httppersonalpsuedunsa1AltmanWebpagePCATorontopdf
MapReduceIntuition (7 minute video providing ldquodivide and conquerrdquo intuition to MapRe-duce)
TidySAC-Video Hadley Wickham on the tidyverse ldquosplit-apply-combinerdquo with dplyrand tidy data (1 hour video) (More detail but perhaps less intuition in book formdiscussion of ldquotidyingrdquo data Chapter 5 of TidyData-R)
MMDS Ch 2 (Read the Chapter 2 in the new ldquoBETArdquo version of the book whichalso touches on Spark and Tensorflow in section 24 ldquoExtensions to MapReducerdquo httpistanfordedu~ullmanmmdsnhtml)
bull Further reference
Section ldquoNonlinear dimension reduction manifold learningrdquo
Section ldquoDatabases and data managementrdquo (esp NoSQL)
Section ldquoTheoretically-structured approaches to data wranglingrdquo
Section ldquoParallelism MapReduce Split-Apply-Combinerdquo
Section ldquoFunctional programmingrdquo
Section ldquoCutting and bleeding edge rdquo
bull 50 Project Review
bull Exercise 3 (Individual) Due 45
8
April 5
bull Prasenjit Mitra (IST) ldquoClassification of Tweets from Disaster Scenariosrdquo
bull Reading
Prasenjit Mitra Rudra et al rdquoSummarizing Situational and Topical InformationDuring Crisesrdquo httpsarxivorgpdf161001561pdf Imran Mitra amp CastillordquoTwitter as a Lifeline Human-annotated Twitter Corpora for NLP of Crisis-relatedMessagesrdquo httpmimranmepapersimran_prasenjit_carlos_lrec2016pdf
NLP (Jurafsky and Martin) Chapters 15 and 16 ldquoVector Semanticsrdquo and ldquoSemanticswith Dense Vectorsrdquo
bull Further reference
Section ldquoLanguage text speech audiordquo
GloVe word2vec word2vecExplained t-SNE
Section ldquoMobile devices distributed sensors rdquo
Section ldquoCrowdsourcing human computation citizen science rdquo
Section ldquoScaling iteration streaming data online algorithmsrdquo
bull Exercise 3 Due
April 12
bull Reading
BitByBit Ch 4 ldquoRunning Experimentsrdquo review Ch 2 sections ldquoNatural experimentsin observable datardquo examples
CausalInference Chs 1-2
bull Further reference
Section ldquoExperimental and observational designs for causal inferencerdquo
Section ldquoCrowdsourcing web experimentsrdquo
Section ldquoHuman subjects rdquo
bull 75 Project Review
April 19
bull Reading
BitByBit Ch 6 ldquoEthicsrdquo
PSU-ORP ldquoCommon Rule and Other Changesrdquo httpswwwresearchpsuedu
irbcommonrulechanges
DataPrivacy
9
Reproducibility
Refresh on MachineBias
bull Further reference
Section ldquoEthics and Scientific Responsibility in Big Social Datardquo
Section ldquoiexclCuidadordquo
April 26 - LAST CLASS MEETING
bull Team Project Presentations
May 3 - Team Projects Due
10
Past Visiting Speakers in SoDA 501 and 502
SoDA 502 (Fall 2017)
bull Chris Zorn (PLSC)
bull Diane Felmlee (SOC)
bull Alan MacEachren (GEOG)
bull Eric Plutzer (PLSC)
bull James LeBreton (PSYCH)
bull Sesa Slavkovic (STAT)
bull Guido Cervone (GEOG)
bull Ashton Verdery (SOC)
bull Maggie Niu (STAT)
bull Reka Albert (PHYS)
SoDA 501 (Spring 2017)
bull Tim Brick (HDFS) ldquoTowards real-time monitoring and intervention using wearable technologyrdquo
bull Aylin Caliskan (Princeton) ldquoA Story of Discrimination and Unfairness Bias in Word Embeddingsrdquo
bull Jay Yonamine (Google - IGERT alum) ldquoData Science in Industryrdquo
bull Johnathan Rush (Illinois) ldquoGeospatial Data Science Workshoprdquo
bull Rick Gilmore (PSYCH) ldquoToward a more reproducible and robust science of human behaviorrdquo
bull Glenn Firebaugh (SOC) ldquoMeasuring Inequality and Segregation with US Census Datardquo
bull Charles Twardy (Sotera) ldquoData Science for Search and Rescuerdquo
bull Anna Smith (Ohio State) ldquoA Hierarchical Model for Network Data in a Latent Hyperbolic Spacerdquo
bull Rebecca Passonneau (CSE) ldquoOmnigraph Rich Feature Representation for Graph Kernel Learningrdquo
bull Alex Klippel (GEOG) ldquoVirtual Reality for Immersive Analyticsrdquo
bull Murali Haran (STAT) ldquoA Computationally Efficient Projection-based Approach for Spatial General-ized Mixed Modelsrdquo
SoDA 502 (Fall 2016)
bull Clio Andris (GEOG) ldquoIntegrating Social Network Data into GISystemsrdquo
bull Jia Li (STAT) ldquoClustering under the Wasserstein Metricrdquo
bull Rachel Smith (CAS) ldquoStigma Networks Perceptions of Sociogramsrdquo
bull Zita Oravecz (HDFS)
bull Bethany Bray (Methodology Center) ldquoLatent Class and Latent Transition Analysisrdquo
bull Dave Hunter (STAT) ldquoModel Based Clustering of Large Networksrdquo
bull Scott Bennett (PLSC) ldquoABM Model of Insurgencyrdquo
bull David Reitter (IST)
bull Suzanna Linn (PLSC) ldquoMethodological Issues in Automated Text Analysis Application to NewsCoverage of the US Economyrdquo
11
SoDA 501 (Spring 2016)
bull Bruce Desmarais (PLSC) ldquoLearning in the Sunshine Analysis of Local Government Email Corporardquo
bull Timothy Brick (HDFS) ldquoMapping and Manipulating Facial Expressionrdquo
bull Qunying Huang (USC) ldquoSocial Media An Emerging Data Source for Human Mobility Studiesrdquo
bull Lingzhou Zue (STAT) ldquoAn Introduction to High-Dimensional Graphical Modelsrdquo
bull Ashton Verdery (SOC) ldquoSampling from Network Datardquo
bull Sarah Battersby (Tableau) ldquoHelping People See and Understand Spatial Datardquo
bull Alexandra Slavkovic (STAT) ldquoStatistical Privacy with Network Datardquo
bull Lee Giles (IST) ldquoMachine Learning for Scholarly Big Datardquo
12
Readings and References (updated Spring 2018)
We will discuss a relatively small subset of the readings listed here and this will vary based on topics andreadings discussed by visiting speakers and you yourselves The remainder are provided here as curatedreferences for more in-depth investigation in both the theory and practice related to the topic (with thelatter heavily weighted toward resources in Python and R)
dagger Material that is at last check made available through the Penn State library Most journal links shouldwork if you are logged in to a Penn State machine or through the Penn State VPN Some article archivesand most books require additional authentication through webaccess If links are broken start directly froma search via httpswwwlibrariespsuedu Some books require installation of e-readers like AdobeDigital Editions Links to lyndacom can be accessed through httplyndapsuedu
Dagger Material that is at last check legally provided for free In some cases these are the preprint versionsof published material
sect Material for which a legal selection is or will be provided through the class Box folder
Big Data amp Social Data Analytics
Overviews
bull Dagger[CompSocSci] David Lazer Alex Pentland Lada Adamic Sinan Aral Albert-Laszlo BarabasiDevon Brewer Nicholas Christakis Noshir Contractor James Fowler Myron Gutmann Tony JebaraGary King Michael Macy Deb Roy and Marshall Van Alstyne 2009 ldquoComputational Social SciencerdquoScience 323(5915)721-3 + Supp Feb 6 httpsciencesciencemagorgcontent3235915
721full httpsgkingharvardedufilesgkingfilesLazPenAda09pdf
bull Dagger[NRCreport] National Research Council 2013 Frontiers in Massive Data Analysis NationalAcademies Press (Free w registration httpwwwnapeducatalogphprecord_id=18374)Ch1 ldquoIntroductionrdquo Ch2 ldquoMassive Data in Science Technology Commerce National DefenseTelecommunications and other Endeavorsrdquo
bull Dagger[MMDS] Jure Leskovec Anand Rajaraman and Jeff Ullman 2014 Mining of Massive DatasetsCambridge University Press httpwwwmmdsorg (BETA version of Third Edition httpi
stanfordedu~ullmanmmdsnhtml)
bull sect[BitByBit] Matthew J Salganik 2018 (Forthcoming) Bit by Bit Social Research in the DigitalAge Princeton University Press Ch 1 ldquoIntroductionrdquo Ch 2 ldquoObserving Behaviorrdquo
bull DSHandbook-Py Ch 1 ldquoIntroduction Becoming a Unicornrdquo Ch2 ldquoThe Data Science Road Maprdquo
Burt-schtick
bull Dagger[Monroe-5Vs] Burt L Monroe 2013 ldquoThe Five Vs of Big Data Political Science Introductionto the Special Issue on Big Data in Political Sciencerdquo Political Analysis 21(V5) 1ndash9 https
doiorg101017S1047198700014315 (Volume Velocity Variety Vinculation Validity)
bull dagger[Monroe-No] Burt L Monroe Jennifer Pan Margaret E Roberts Maya Sen and Betsy Sin-clair 2015 ldquoNo Formal Theory Causal Inference and Big Data Are Not Contradictory Trendsin Political Sciencerdquo PS Political Science amp Politics 48(1) 71ndash4 httpdxdoiorg101017
S1049096514001760
bull dagger[Quinn-Topics] Kevin M Quinn Burt L Monroe Michael Colaresi Michael H Crespin andDragomir R Radev 2010 ldquoHow to Analyze Political Attention with Minimal Assumptions andCostsrdquo American Journal of Political Science 54(1) 209ndash28 httponlinelibrarywileycom
13
doi101111j1540-5907200900427xfull (esp topic modeling as measurement approach tovalidation)
bull dagger[FightinWords] Burt L Monroe Michael Colaresi and Kevin M Quinn 2008 ldquoFightinrsquo WordsLexical Feature Selection and Evaluation for Identifying the Content of Political Conflictrdquo PoliticalAnalysis 16(4) 372-403 httpsdoiorg101093panmpn018 (esp the impact of samplingvariance and regularization through priors)
bull Dagger[BDSS-Census] Big Data Social Science PSU Team 2012 ldquoA Closer Look at the Kaggle CensusDatardquo httpsburtmonroegithubioBDSSKaggleCensus2012 (esp the relevance of the socialprocesses by which data come to exist as data)
Multidisciplinary Perspectives
bull dagger[Business-BigData] Andrew McAfee and Erik Brynjolfsson 2012 ldquoBig Data The ManagementRevolutionrdquo Harvard Business Review 90(10) 61ndash8 October httpshbrorg201210big-
data-the-management-revolution Thomas H Davenport and DJ Patel 2012 ldquoData ScientistThe Sexiest Job of the 21st Centuryrdquo Harvard Business Review 90(10)70-6 October https
hbrorg201210data-scientist-the-sexiest-job-of-the-21st-century
bull dagger[InfoSci-BigData] CL Philip Chen and Chun-Yang Zhang 2014 ldquoData-intensive ApplicationsChallenges Techniques and Technologies A Survey on Big Datardquo Information Sciences 275 314-47httpsdoiorg101016jins201401015
bull dagger[Informatics-BigData] Vasant G Honavar 2014 ldquoThe Promise and Potential of Big Data ACase for Discovery Informaticsrdquo Review of Policy Research 31(4) 326-330 httpsdoiorg10
1111ropr12080
bull Dagger[Stats-BigData] Beate Franke Jean-Francois Ribana Roscher Annie Lee Cathal Smyth ArminHatefi Fuqi Chen Einat Gil Alexander Schwing Alessandro Selvitella Michael M Hoffman RogerGrosse Dietrich Hendricks and Nancy Reid 2016 ldquoStatistical Inference Learning and Models in BigDatardquo International Statistical Review 84(3) 371-89 httponlinelibrarywileycomdoi101111insr12176full
bull dagger[Econ-BigData] Hal R Varian 2013 ldquoBig Data New Tricks for Econometricsrdquo Journal ofEconomic Perspectives 28(2) 3-28 httpsdoiorg101257jep2823
bull dagger[GeoViz-BigData] Alan MacEachren 2017 ldquoLeveraging Big (Geo) Data with (Geo) Visual An-alytics Place as the Next Frontierrdquo In Chenghu Zhou Fenzhen Su Francis Harvey and Jun Xueds Spatial Data Handling in the Big Data Era pp 139-155 Springer httpslink-springer-
comezaccesslibrariespsueduchapter101007978-981-10-4424-3_10
bull dagger[Soc-BigData] David Lazer and Jason Radford 2017 ldquoData ex Machina Introduction to BigDatardquo Annual Review of Sociology 43 19-39 httpsdoiorg101146annurev-soc-060116-
053457
bull sect[Politics-BigData] Keith T Poole L Jason Anasastopolous and James E Monagan III Forthcom-ing ldquoThe lsquoBig Datarsquo Revolution in Political Campaigning and Governancerdquo Oxford Bibliographies inPolitical Science
iexclCuidado Traps Biases Problems Pains Perils
bull dagger[GoogleFlu] David Lazer Ryan Kennedy Gary King and Alessandro Vespignani 2014 ldquoTheParable of Google Flu Traps in Big Data Analysisrdquo Science (343) 14 March httpgking
harvardedufilesgkingfiles0314policyforumffpdf
bull Dagger[MachineBias] ProPublica ldquoMachine Bias Investigating Algorithmic Injusticerdquo Series https
wwwpropublicaorgseriesmachine-bias See especially
14
Julia Angwin Jeff Larson Surya Mattu and Lauren Kirchner 2016 ldquoMachine Biasrdquo ProPub-lica May 23 httpswwwpropublicaorgarticlemachine-bias-risk-assessments-in-
criminal-sentencing
Julia Angwin Madeleine Varner and Ariana Tobin 2017 ldquoFacebook Enabled Adverstisers toReach Jew Hatersrdquo httpswwwpropublicaorgarticlefacebook-enabled-advertisers-
to-reach-jew-haters
bull Dagger[Polling2016] Doug Rivers (Nov 11 2016) ldquoFirst Thoughts on Polling Problems in the 2016 USElectionsrdquo httpstodayyougovcomnews20161111first-thoughts-polling-problems-
2016-us-elections
bull dagger[EventData] Wei Wang Ryan Kennedy David Lazer Naren Ramakrishnan 2016 ldquoGrowing Painsfor Global Monitoring of Societal Eventsrdquo Science 3536307 pp 1502ndash1503 httpsdoiorg101126scienceaaf6758
bull Dagger[GoogleBooks] Eitan Adam Pechenick Christopher M Danforth and Peter Sheridan Dodds 2015ldquoCharacterizing the Google Books Corpus Strong Limits to Inferences of Socio-Cultural and LinguisticEvolutionrdquo PLoS One httpsdoiorg101371journalpone0137041
bull Dagger[OkCupid] Michael Zimmer 2016 ldquoOkCupid Study Reveals the Perils of Big-Data Sciencerdquo Wiredhttpswwwwiredcom201605okcupid-study-reveals-perils-big-data-science May 14
bull dagger[CriticalQuestions] danah boyd and Kate Crawford 2011 ldquoCritical Questions for Big Data Provo-cations for a Cultural Technological and Scholarly Phenomenonrdquo Information Communication ampSociety 15(5) 662ndash79 httpdxdoiorg1010801369118X2012678878
bull Dagger[RacistBot] Daniel Victor 2016 ldquoMicrosoft created a Twitter bot to learn from users It quicklybecame a racist jerkrdquo New York Times httpswwwnytimescom20160325technology
microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk
html
bull MMDS 122-123 on Bonferroni
bull Also BDSS-Census
Research Design and Measurement
Overviews
bull BitByBit
bull Dagger[ResearchMethodsKB] William M Trochim 2006 The Research Methods Knowledge Basehttpwwwsocialresearchmethodsnetkb
bull [7Rules] Glenn Firebaugh 2008 Seven Rules for Social Research Princeton University Press
Measurement Reliability and Validity
bull ResearchMethodsKB ldquoMeasurementrdquo
bull 7Rules Ch 3 ldquoBuild Reality Checks into Your Researchrdquo
bull Reliability see also FightinWords
bull Validity see also Quinn10-Topics Monroe-5Vs
Indirect Unobtrusive Nonreactive Measures Data Exhaust
15
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
April 5
bull Prasenjit Mitra (IST) ldquoClassification of Tweets from Disaster Scenariosrdquo
bull Reading
Prasenjit Mitra Rudra et al rdquoSummarizing Situational and Topical InformationDuring Crisesrdquo httpsarxivorgpdf161001561pdf Imran Mitra amp CastillordquoTwitter as a Lifeline Human-annotated Twitter Corpora for NLP of Crisis-relatedMessagesrdquo httpmimranmepapersimran_prasenjit_carlos_lrec2016pdf
NLP (Jurafsky and Martin) Chapters 15 and 16 ldquoVector Semanticsrdquo and ldquoSemanticswith Dense Vectorsrdquo
bull Further reference
Section ldquoLanguage text speech audiordquo
GloVe word2vec word2vecExplained t-SNE
Section ldquoMobile devices distributed sensors rdquo
Section ldquoCrowdsourcing human computation citizen science rdquo
Section ldquoScaling iteration streaming data online algorithmsrdquo
bull Exercise 3 Due
April 12
bull Reading
BitByBit Ch 4 ldquoRunning Experimentsrdquo review Ch 2 sections ldquoNatural experimentsin observable datardquo examples
CausalInference Chs 1-2
bull Further reference
Section ldquoExperimental and observational designs for causal inferencerdquo
Section ldquoCrowdsourcing web experimentsrdquo
Section ldquoHuman subjects rdquo
bull 75 Project Review
April 19
bull Reading
BitByBit Ch 6 ldquoEthicsrdquo
PSU-ORP ldquoCommon Rule and Other Changesrdquo httpswwwresearchpsuedu
irbcommonrulechanges
DataPrivacy
9
Reproducibility
Refresh on MachineBias
bull Further reference
Section ldquoEthics and Scientific Responsibility in Big Social Datardquo
Section ldquoiexclCuidadordquo
April 26 - LAST CLASS MEETING
bull Team Project Presentations
May 3 - Team Projects Due
10
Past Visiting Speakers in SoDA 501 and 502
SoDA 502 (Fall 2017)
bull Chris Zorn (PLSC)
bull Diane Felmlee (SOC)
bull Alan MacEachren (GEOG)
bull Eric Plutzer (PLSC)
bull James LeBreton (PSYCH)
bull Sesa Slavkovic (STAT)
bull Guido Cervone (GEOG)
bull Ashton Verdery (SOC)
bull Maggie Niu (STAT)
bull Reka Albert (PHYS)
SoDA 501 (Spring 2017)
bull Tim Brick (HDFS) ldquoTowards real-time monitoring and intervention using wearable technologyrdquo
bull Aylin Caliskan (Princeton) ldquoA Story of Discrimination and Unfairness Bias in Word Embeddingsrdquo
bull Jay Yonamine (Google - IGERT alum) ldquoData Science in Industryrdquo
bull Johnathan Rush (Illinois) ldquoGeospatial Data Science Workshoprdquo
bull Rick Gilmore (PSYCH) ldquoToward a more reproducible and robust science of human behaviorrdquo
bull Glenn Firebaugh (SOC) ldquoMeasuring Inequality and Segregation with US Census Datardquo
bull Charles Twardy (Sotera) ldquoData Science for Search and Rescuerdquo
bull Anna Smith (Ohio State) ldquoA Hierarchical Model for Network Data in a Latent Hyperbolic Spacerdquo
bull Rebecca Passonneau (CSE) ldquoOmnigraph Rich Feature Representation for Graph Kernel Learningrdquo
bull Alex Klippel (GEOG) ldquoVirtual Reality for Immersive Analyticsrdquo
bull Murali Haran (STAT) ldquoA Computationally Efficient Projection-based Approach for Spatial General-ized Mixed Modelsrdquo
SoDA 502 (Fall 2016)
bull Clio Andris (GEOG) ldquoIntegrating Social Network Data into GISystemsrdquo
bull Jia Li (STAT) ldquoClustering under the Wasserstein Metricrdquo
bull Rachel Smith (CAS) ldquoStigma Networks Perceptions of Sociogramsrdquo
bull Zita Oravecz (HDFS)
bull Bethany Bray (Methodology Center) ldquoLatent Class and Latent Transition Analysisrdquo
bull Dave Hunter (STAT) ldquoModel Based Clustering of Large Networksrdquo
bull Scott Bennett (PLSC) ldquoABM Model of Insurgencyrdquo
bull David Reitter (IST)
bull Suzanna Linn (PLSC) ldquoMethodological Issues in Automated Text Analysis Application to NewsCoverage of the US Economyrdquo
11
SoDA 501 (Spring 2016)
bull Bruce Desmarais (PLSC) ldquoLearning in the Sunshine Analysis of Local Government Email Corporardquo
bull Timothy Brick (HDFS) ldquoMapping and Manipulating Facial Expressionrdquo
bull Qunying Huang (USC) ldquoSocial Media An Emerging Data Source for Human Mobility Studiesrdquo
bull Lingzhou Zue (STAT) ldquoAn Introduction to High-Dimensional Graphical Modelsrdquo
bull Ashton Verdery (SOC) ldquoSampling from Network Datardquo
bull Sarah Battersby (Tableau) ldquoHelping People See and Understand Spatial Datardquo
bull Alexandra Slavkovic (STAT) ldquoStatistical Privacy with Network Datardquo
bull Lee Giles (IST) ldquoMachine Learning for Scholarly Big Datardquo
12
Readings and References (updated Spring 2018)
We will discuss a relatively small subset of the readings listed here and this will vary based on topics andreadings discussed by visiting speakers and you yourselves The remainder are provided here as curatedreferences for more in-depth investigation in both the theory and practice related to the topic (with thelatter heavily weighted toward resources in Python and R)
dagger Material that is at last check made available through the Penn State library Most journal links shouldwork if you are logged in to a Penn State machine or through the Penn State VPN Some article archivesand most books require additional authentication through webaccess If links are broken start directly froma search via httpswwwlibrariespsuedu Some books require installation of e-readers like AdobeDigital Editions Links to lyndacom can be accessed through httplyndapsuedu
Dagger Material that is at last check legally provided for free In some cases these are the preprint versionsof published material
sect Material for which a legal selection is or will be provided through the class Box folder
Big Data amp Social Data Analytics
Overviews
bull Dagger[CompSocSci] David Lazer Alex Pentland Lada Adamic Sinan Aral Albert-Laszlo BarabasiDevon Brewer Nicholas Christakis Noshir Contractor James Fowler Myron Gutmann Tony JebaraGary King Michael Macy Deb Roy and Marshall Van Alstyne 2009 ldquoComputational Social SciencerdquoScience 323(5915)721-3 + Supp Feb 6 httpsciencesciencemagorgcontent3235915
721full httpsgkingharvardedufilesgkingfilesLazPenAda09pdf
bull Dagger[NRCreport] National Research Council 2013 Frontiers in Massive Data Analysis NationalAcademies Press (Free w registration httpwwwnapeducatalogphprecord_id=18374)Ch1 ldquoIntroductionrdquo Ch2 ldquoMassive Data in Science Technology Commerce National DefenseTelecommunications and other Endeavorsrdquo
bull Dagger[MMDS] Jure Leskovec Anand Rajaraman and Jeff Ullman 2014 Mining of Massive DatasetsCambridge University Press httpwwwmmdsorg (BETA version of Third Edition httpi
stanfordedu~ullmanmmdsnhtml)
bull sect[BitByBit] Matthew J Salganik 2018 (Forthcoming) Bit by Bit Social Research in the DigitalAge Princeton University Press Ch 1 ldquoIntroductionrdquo Ch 2 ldquoObserving Behaviorrdquo
bull DSHandbook-Py Ch 1 ldquoIntroduction Becoming a Unicornrdquo Ch2 ldquoThe Data Science Road Maprdquo
Burt-schtick
bull Dagger[Monroe-5Vs] Burt L Monroe 2013 ldquoThe Five Vs of Big Data Political Science Introductionto the Special Issue on Big Data in Political Sciencerdquo Political Analysis 21(V5) 1ndash9 https
doiorg101017S1047198700014315 (Volume Velocity Variety Vinculation Validity)
bull dagger[Monroe-No] Burt L Monroe Jennifer Pan Margaret E Roberts Maya Sen and Betsy Sin-clair 2015 ldquoNo Formal Theory Causal Inference and Big Data Are Not Contradictory Trendsin Political Sciencerdquo PS Political Science amp Politics 48(1) 71ndash4 httpdxdoiorg101017
S1049096514001760
bull dagger[Quinn-Topics] Kevin M Quinn Burt L Monroe Michael Colaresi Michael H Crespin andDragomir R Radev 2010 ldquoHow to Analyze Political Attention with Minimal Assumptions andCostsrdquo American Journal of Political Science 54(1) 209ndash28 httponlinelibrarywileycom
13
doi101111j1540-5907200900427xfull (esp topic modeling as measurement approach tovalidation)
bull dagger[FightinWords] Burt L Monroe Michael Colaresi and Kevin M Quinn 2008 ldquoFightinrsquo WordsLexical Feature Selection and Evaluation for Identifying the Content of Political Conflictrdquo PoliticalAnalysis 16(4) 372-403 httpsdoiorg101093panmpn018 (esp the impact of samplingvariance and regularization through priors)
bull Dagger[BDSS-Census] Big Data Social Science PSU Team 2012 ldquoA Closer Look at the Kaggle CensusDatardquo httpsburtmonroegithubioBDSSKaggleCensus2012 (esp the relevance of the socialprocesses by which data come to exist as data)
Multidisciplinary Perspectives
bull dagger[Business-BigData] Andrew McAfee and Erik Brynjolfsson 2012 ldquoBig Data The ManagementRevolutionrdquo Harvard Business Review 90(10) 61ndash8 October httpshbrorg201210big-
data-the-management-revolution Thomas H Davenport and DJ Patel 2012 ldquoData ScientistThe Sexiest Job of the 21st Centuryrdquo Harvard Business Review 90(10)70-6 October https
hbrorg201210data-scientist-the-sexiest-job-of-the-21st-century
bull dagger[InfoSci-BigData] CL Philip Chen and Chun-Yang Zhang 2014 ldquoData-intensive ApplicationsChallenges Techniques and Technologies A Survey on Big Datardquo Information Sciences 275 314-47httpsdoiorg101016jins201401015
bull dagger[Informatics-BigData] Vasant G Honavar 2014 ldquoThe Promise and Potential of Big Data ACase for Discovery Informaticsrdquo Review of Policy Research 31(4) 326-330 httpsdoiorg10
1111ropr12080
bull Dagger[Stats-BigData] Beate Franke Jean-Francois Ribana Roscher Annie Lee Cathal Smyth ArminHatefi Fuqi Chen Einat Gil Alexander Schwing Alessandro Selvitella Michael M Hoffman RogerGrosse Dietrich Hendricks and Nancy Reid 2016 ldquoStatistical Inference Learning and Models in BigDatardquo International Statistical Review 84(3) 371-89 httponlinelibrarywileycomdoi101111insr12176full
bull dagger[Econ-BigData] Hal R Varian 2013 ldquoBig Data New Tricks for Econometricsrdquo Journal ofEconomic Perspectives 28(2) 3-28 httpsdoiorg101257jep2823
bull dagger[GeoViz-BigData] Alan MacEachren 2017 ldquoLeveraging Big (Geo) Data with (Geo) Visual An-alytics Place as the Next Frontierrdquo In Chenghu Zhou Fenzhen Su Francis Harvey and Jun Xueds Spatial Data Handling in the Big Data Era pp 139-155 Springer httpslink-springer-
comezaccesslibrariespsueduchapter101007978-981-10-4424-3_10
bull dagger[Soc-BigData] David Lazer and Jason Radford 2017 ldquoData ex Machina Introduction to BigDatardquo Annual Review of Sociology 43 19-39 httpsdoiorg101146annurev-soc-060116-
053457
bull sect[Politics-BigData] Keith T Poole L Jason Anasastopolous and James E Monagan III Forthcom-ing ldquoThe lsquoBig Datarsquo Revolution in Political Campaigning and Governancerdquo Oxford Bibliographies inPolitical Science
iexclCuidado Traps Biases Problems Pains Perils
bull dagger[GoogleFlu] David Lazer Ryan Kennedy Gary King and Alessandro Vespignani 2014 ldquoTheParable of Google Flu Traps in Big Data Analysisrdquo Science (343) 14 March httpgking
harvardedufilesgkingfiles0314policyforumffpdf
bull Dagger[MachineBias] ProPublica ldquoMachine Bias Investigating Algorithmic Injusticerdquo Series https
wwwpropublicaorgseriesmachine-bias See especially
14
Julia Angwin Jeff Larson Surya Mattu and Lauren Kirchner 2016 ldquoMachine Biasrdquo ProPub-lica May 23 httpswwwpropublicaorgarticlemachine-bias-risk-assessments-in-
criminal-sentencing
Julia Angwin Madeleine Varner and Ariana Tobin 2017 ldquoFacebook Enabled Adverstisers toReach Jew Hatersrdquo httpswwwpropublicaorgarticlefacebook-enabled-advertisers-
to-reach-jew-haters
bull Dagger[Polling2016] Doug Rivers (Nov 11 2016) ldquoFirst Thoughts on Polling Problems in the 2016 USElectionsrdquo httpstodayyougovcomnews20161111first-thoughts-polling-problems-
2016-us-elections
bull dagger[EventData] Wei Wang Ryan Kennedy David Lazer Naren Ramakrishnan 2016 ldquoGrowing Painsfor Global Monitoring of Societal Eventsrdquo Science 3536307 pp 1502ndash1503 httpsdoiorg101126scienceaaf6758
bull Dagger[GoogleBooks] Eitan Adam Pechenick Christopher M Danforth and Peter Sheridan Dodds 2015ldquoCharacterizing the Google Books Corpus Strong Limits to Inferences of Socio-Cultural and LinguisticEvolutionrdquo PLoS One httpsdoiorg101371journalpone0137041
bull Dagger[OkCupid] Michael Zimmer 2016 ldquoOkCupid Study Reveals the Perils of Big-Data Sciencerdquo Wiredhttpswwwwiredcom201605okcupid-study-reveals-perils-big-data-science May 14
bull dagger[CriticalQuestions] danah boyd and Kate Crawford 2011 ldquoCritical Questions for Big Data Provo-cations for a Cultural Technological and Scholarly Phenomenonrdquo Information Communication ampSociety 15(5) 662ndash79 httpdxdoiorg1010801369118X2012678878
bull Dagger[RacistBot] Daniel Victor 2016 ldquoMicrosoft created a Twitter bot to learn from users It quicklybecame a racist jerkrdquo New York Times httpswwwnytimescom20160325technology
microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk
html
bull MMDS 122-123 on Bonferroni
bull Also BDSS-Census
Research Design and Measurement
Overviews
bull BitByBit
bull Dagger[ResearchMethodsKB] William M Trochim 2006 The Research Methods Knowledge Basehttpwwwsocialresearchmethodsnetkb
bull [7Rules] Glenn Firebaugh 2008 Seven Rules for Social Research Princeton University Press
Measurement Reliability and Validity
bull ResearchMethodsKB ldquoMeasurementrdquo
bull 7Rules Ch 3 ldquoBuild Reality Checks into Your Researchrdquo
bull Reliability see also FightinWords
bull Validity see also Quinn10-Topics Monroe-5Vs
Indirect Unobtrusive Nonreactive Measures Data Exhaust
15
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
Reproducibility
Refresh on MachineBias
bull Further reference
Section ldquoEthics and Scientific Responsibility in Big Social Datardquo
Section ldquoiexclCuidadordquo
April 26 - LAST CLASS MEETING
bull Team Project Presentations
May 3 - Team Projects Due
10
Past Visiting Speakers in SoDA 501 and 502
SoDA 502 (Fall 2017)
bull Chris Zorn (PLSC)
bull Diane Felmlee (SOC)
bull Alan MacEachren (GEOG)
bull Eric Plutzer (PLSC)
bull James LeBreton (PSYCH)
bull Sesa Slavkovic (STAT)
bull Guido Cervone (GEOG)
bull Ashton Verdery (SOC)
bull Maggie Niu (STAT)
bull Reka Albert (PHYS)
SoDA 501 (Spring 2017)
bull Tim Brick (HDFS) ldquoTowards real-time monitoring and intervention using wearable technologyrdquo
bull Aylin Caliskan (Princeton) ldquoA Story of Discrimination and Unfairness Bias in Word Embeddingsrdquo
bull Jay Yonamine (Google - IGERT alum) ldquoData Science in Industryrdquo
bull Johnathan Rush (Illinois) ldquoGeospatial Data Science Workshoprdquo
bull Rick Gilmore (PSYCH) ldquoToward a more reproducible and robust science of human behaviorrdquo
bull Glenn Firebaugh (SOC) ldquoMeasuring Inequality and Segregation with US Census Datardquo
bull Charles Twardy (Sotera) ldquoData Science for Search and Rescuerdquo
bull Anna Smith (Ohio State) ldquoA Hierarchical Model for Network Data in a Latent Hyperbolic Spacerdquo
bull Rebecca Passonneau (CSE) ldquoOmnigraph Rich Feature Representation for Graph Kernel Learningrdquo
bull Alex Klippel (GEOG) ldquoVirtual Reality for Immersive Analyticsrdquo
bull Murali Haran (STAT) ldquoA Computationally Efficient Projection-based Approach for Spatial General-ized Mixed Modelsrdquo
SoDA 502 (Fall 2016)
bull Clio Andris (GEOG) ldquoIntegrating Social Network Data into GISystemsrdquo
bull Jia Li (STAT) ldquoClustering under the Wasserstein Metricrdquo
bull Rachel Smith (CAS) ldquoStigma Networks Perceptions of Sociogramsrdquo
bull Zita Oravecz (HDFS)
bull Bethany Bray (Methodology Center) ldquoLatent Class and Latent Transition Analysisrdquo
bull Dave Hunter (STAT) ldquoModel Based Clustering of Large Networksrdquo
bull Scott Bennett (PLSC) ldquoABM Model of Insurgencyrdquo
bull David Reitter (IST)
bull Suzanna Linn (PLSC) ldquoMethodological Issues in Automated Text Analysis Application to NewsCoverage of the US Economyrdquo
11
SoDA 501 (Spring 2016)
bull Bruce Desmarais (PLSC) ldquoLearning in the Sunshine Analysis of Local Government Email Corporardquo
bull Timothy Brick (HDFS) ldquoMapping and Manipulating Facial Expressionrdquo
bull Qunying Huang (USC) ldquoSocial Media An Emerging Data Source for Human Mobility Studiesrdquo
bull Lingzhou Zue (STAT) ldquoAn Introduction to High-Dimensional Graphical Modelsrdquo
bull Ashton Verdery (SOC) ldquoSampling from Network Datardquo
bull Sarah Battersby (Tableau) ldquoHelping People See and Understand Spatial Datardquo
bull Alexandra Slavkovic (STAT) ldquoStatistical Privacy with Network Datardquo
bull Lee Giles (IST) ldquoMachine Learning for Scholarly Big Datardquo
12
Readings and References (updated Spring 2018)
We will discuss a relatively small subset of the readings listed here and this will vary based on topics andreadings discussed by visiting speakers and you yourselves The remainder are provided here as curatedreferences for more in-depth investigation in both the theory and practice related to the topic (with thelatter heavily weighted toward resources in Python and R)
dagger Material that is at last check made available through the Penn State library Most journal links shouldwork if you are logged in to a Penn State machine or through the Penn State VPN Some article archivesand most books require additional authentication through webaccess If links are broken start directly froma search via httpswwwlibrariespsuedu Some books require installation of e-readers like AdobeDigital Editions Links to lyndacom can be accessed through httplyndapsuedu
Dagger Material that is at last check legally provided for free In some cases these are the preprint versionsof published material
sect Material for which a legal selection is or will be provided through the class Box folder
Big Data amp Social Data Analytics
Overviews
bull Dagger[CompSocSci] David Lazer Alex Pentland Lada Adamic Sinan Aral Albert-Laszlo BarabasiDevon Brewer Nicholas Christakis Noshir Contractor James Fowler Myron Gutmann Tony JebaraGary King Michael Macy Deb Roy and Marshall Van Alstyne 2009 ldquoComputational Social SciencerdquoScience 323(5915)721-3 + Supp Feb 6 httpsciencesciencemagorgcontent3235915
721full httpsgkingharvardedufilesgkingfilesLazPenAda09pdf
bull Dagger[NRCreport] National Research Council 2013 Frontiers in Massive Data Analysis NationalAcademies Press (Free w registration httpwwwnapeducatalogphprecord_id=18374)Ch1 ldquoIntroductionrdquo Ch2 ldquoMassive Data in Science Technology Commerce National DefenseTelecommunications and other Endeavorsrdquo
bull Dagger[MMDS] Jure Leskovec Anand Rajaraman and Jeff Ullman 2014 Mining of Massive DatasetsCambridge University Press httpwwwmmdsorg (BETA version of Third Edition httpi
stanfordedu~ullmanmmdsnhtml)
bull sect[BitByBit] Matthew J Salganik 2018 (Forthcoming) Bit by Bit Social Research in the DigitalAge Princeton University Press Ch 1 ldquoIntroductionrdquo Ch 2 ldquoObserving Behaviorrdquo
bull DSHandbook-Py Ch 1 ldquoIntroduction Becoming a Unicornrdquo Ch2 ldquoThe Data Science Road Maprdquo
Burt-schtick
bull Dagger[Monroe-5Vs] Burt L Monroe 2013 ldquoThe Five Vs of Big Data Political Science Introductionto the Special Issue on Big Data in Political Sciencerdquo Political Analysis 21(V5) 1ndash9 https
doiorg101017S1047198700014315 (Volume Velocity Variety Vinculation Validity)
bull dagger[Monroe-No] Burt L Monroe Jennifer Pan Margaret E Roberts Maya Sen and Betsy Sin-clair 2015 ldquoNo Formal Theory Causal Inference and Big Data Are Not Contradictory Trendsin Political Sciencerdquo PS Political Science amp Politics 48(1) 71ndash4 httpdxdoiorg101017
S1049096514001760
bull dagger[Quinn-Topics] Kevin M Quinn Burt L Monroe Michael Colaresi Michael H Crespin andDragomir R Radev 2010 ldquoHow to Analyze Political Attention with Minimal Assumptions andCostsrdquo American Journal of Political Science 54(1) 209ndash28 httponlinelibrarywileycom
13
doi101111j1540-5907200900427xfull (esp topic modeling as measurement approach tovalidation)
bull dagger[FightinWords] Burt L Monroe Michael Colaresi and Kevin M Quinn 2008 ldquoFightinrsquo WordsLexical Feature Selection and Evaluation for Identifying the Content of Political Conflictrdquo PoliticalAnalysis 16(4) 372-403 httpsdoiorg101093panmpn018 (esp the impact of samplingvariance and regularization through priors)
bull Dagger[BDSS-Census] Big Data Social Science PSU Team 2012 ldquoA Closer Look at the Kaggle CensusDatardquo httpsburtmonroegithubioBDSSKaggleCensus2012 (esp the relevance of the socialprocesses by which data come to exist as data)
Multidisciplinary Perspectives
bull dagger[Business-BigData] Andrew McAfee and Erik Brynjolfsson 2012 ldquoBig Data The ManagementRevolutionrdquo Harvard Business Review 90(10) 61ndash8 October httpshbrorg201210big-
data-the-management-revolution Thomas H Davenport and DJ Patel 2012 ldquoData ScientistThe Sexiest Job of the 21st Centuryrdquo Harvard Business Review 90(10)70-6 October https
hbrorg201210data-scientist-the-sexiest-job-of-the-21st-century
bull dagger[InfoSci-BigData] CL Philip Chen and Chun-Yang Zhang 2014 ldquoData-intensive ApplicationsChallenges Techniques and Technologies A Survey on Big Datardquo Information Sciences 275 314-47httpsdoiorg101016jins201401015
bull dagger[Informatics-BigData] Vasant G Honavar 2014 ldquoThe Promise and Potential of Big Data ACase for Discovery Informaticsrdquo Review of Policy Research 31(4) 326-330 httpsdoiorg10
1111ropr12080
bull Dagger[Stats-BigData] Beate Franke Jean-Francois Ribana Roscher Annie Lee Cathal Smyth ArminHatefi Fuqi Chen Einat Gil Alexander Schwing Alessandro Selvitella Michael M Hoffman RogerGrosse Dietrich Hendricks and Nancy Reid 2016 ldquoStatistical Inference Learning and Models in BigDatardquo International Statistical Review 84(3) 371-89 httponlinelibrarywileycomdoi101111insr12176full
bull dagger[Econ-BigData] Hal R Varian 2013 ldquoBig Data New Tricks for Econometricsrdquo Journal ofEconomic Perspectives 28(2) 3-28 httpsdoiorg101257jep2823
bull dagger[GeoViz-BigData] Alan MacEachren 2017 ldquoLeveraging Big (Geo) Data with (Geo) Visual An-alytics Place as the Next Frontierrdquo In Chenghu Zhou Fenzhen Su Francis Harvey and Jun Xueds Spatial Data Handling in the Big Data Era pp 139-155 Springer httpslink-springer-
comezaccesslibrariespsueduchapter101007978-981-10-4424-3_10
bull dagger[Soc-BigData] David Lazer and Jason Radford 2017 ldquoData ex Machina Introduction to BigDatardquo Annual Review of Sociology 43 19-39 httpsdoiorg101146annurev-soc-060116-
053457
bull sect[Politics-BigData] Keith T Poole L Jason Anasastopolous and James E Monagan III Forthcom-ing ldquoThe lsquoBig Datarsquo Revolution in Political Campaigning and Governancerdquo Oxford Bibliographies inPolitical Science
iexclCuidado Traps Biases Problems Pains Perils
bull dagger[GoogleFlu] David Lazer Ryan Kennedy Gary King and Alessandro Vespignani 2014 ldquoTheParable of Google Flu Traps in Big Data Analysisrdquo Science (343) 14 March httpgking
harvardedufilesgkingfiles0314policyforumffpdf
bull Dagger[MachineBias] ProPublica ldquoMachine Bias Investigating Algorithmic Injusticerdquo Series https
wwwpropublicaorgseriesmachine-bias See especially
14
Julia Angwin Jeff Larson Surya Mattu and Lauren Kirchner 2016 ldquoMachine Biasrdquo ProPub-lica May 23 httpswwwpropublicaorgarticlemachine-bias-risk-assessments-in-
criminal-sentencing
Julia Angwin Madeleine Varner and Ariana Tobin 2017 ldquoFacebook Enabled Adverstisers toReach Jew Hatersrdquo httpswwwpropublicaorgarticlefacebook-enabled-advertisers-
to-reach-jew-haters
bull Dagger[Polling2016] Doug Rivers (Nov 11 2016) ldquoFirst Thoughts on Polling Problems in the 2016 USElectionsrdquo httpstodayyougovcomnews20161111first-thoughts-polling-problems-
2016-us-elections
bull dagger[EventData] Wei Wang Ryan Kennedy David Lazer Naren Ramakrishnan 2016 ldquoGrowing Painsfor Global Monitoring of Societal Eventsrdquo Science 3536307 pp 1502ndash1503 httpsdoiorg101126scienceaaf6758
bull Dagger[GoogleBooks] Eitan Adam Pechenick Christopher M Danforth and Peter Sheridan Dodds 2015ldquoCharacterizing the Google Books Corpus Strong Limits to Inferences of Socio-Cultural and LinguisticEvolutionrdquo PLoS One httpsdoiorg101371journalpone0137041
bull Dagger[OkCupid] Michael Zimmer 2016 ldquoOkCupid Study Reveals the Perils of Big-Data Sciencerdquo Wiredhttpswwwwiredcom201605okcupid-study-reveals-perils-big-data-science May 14
bull dagger[CriticalQuestions] danah boyd and Kate Crawford 2011 ldquoCritical Questions for Big Data Provo-cations for a Cultural Technological and Scholarly Phenomenonrdquo Information Communication ampSociety 15(5) 662ndash79 httpdxdoiorg1010801369118X2012678878
bull Dagger[RacistBot] Daniel Victor 2016 ldquoMicrosoft created a Twitter bot to learn from users It quicklybecame a racist jerkrdquo New York Times httpswwwnytimescom20160325technology
microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk
html
bull MMDS 122-123 on Bonferroni
bull Also BDSS-Census
Research Design and Measurement
Overviews
bull BitByBit
bull Dagger[ResearchMethodsKB] William M Trochim 2006 The Research Methods Knowledge Basehttpwwwsocialresearchmethodsnetkb
bull [7Rules] Glenn Firebaugh 2008 Seven Rules for Social Research Princeton University Press
Measurement Reliability and Validity
bull ResearchMethodsKB ldquoMeasurementrdquo
bull 7Rules Ch 3 ldquoBuild Reality Checks into Your Researchrdquo
bull Reliability see also FightinWords
bull Validity see also Quinn10-Topics Monroe-5Vs
Indirect Unobtrusive Nonreactive Measures Data Exhaust
15
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
Past Visiting Speakers in SoDA 501 and 502
SoDA 502 (Fall 2017)
bull Chris Zorn (PLSC)
bull Diane Felmlee (SOC)
bull Alan MacEachren (GEOG)
bull Eric Plutzer (PLSC)
bull James LeBreton (PSYCH)
bull Sesa Slavkovic (STAT)
bull Guido Cervone (GEOG)
bull Ashton Verdery (SOC)
bull Maggie Niu (STAT)
bull Reka Albert (PHYS)
SoDA 501 (Spring 2017)
bull Tim Brick (HDFS) ldquoTowards real-time monitoring and intervention using wearable technologyrdquo
bull Aylin Caliskan (Princeton) ldquoA Story of Discrimination and Unfairness Bias in Word Embeddingsrdquo
bull Jay Yonamine (Google - IGERT alum) ldquoData Science in Industryrdquo
bull Johnathan Rush (Illinois) ldquoGeospatial Data Science Workshoprdquo
bull Rick Gilmore (PSYCH) ldquoToward a more reproducible and robust science of human behaviorrdquo
bull Glenn Firebaugh (SOC) ldquoMeasuring Inequality and Segregation with US Census Datardquo
bull Charles Twardy (Sotera) ldquoData Science for Search and Rescuerdquo
bull Anna Smith (Ohio State) ldquoA Hierarchical Model for Network Data in a Latent Hyperbolic Spacerdquo
bull Rebecca Passonneau (CSE) ldquoOmnigraph Rich Feature Representation for Graph Kernel Learningrdquo
bull Alex Klippel (GEOG) ldquoVirtual Reality for Immersive Analyticsrdquo
bull Murali Haran (STAT) ldquoA Computationally Efficient Projection-based Approach for Spatial General-ized Mixed Modelsrdquo
SoDA 502 (Fall 2016)
bull Clio Andris (GEOG) ldquoIntegrating Social Network Data into GISystemsrdquo
bull Jia Li (STAT) ldquoClustering under the Wasserstein Metricrdquo
bull Rachel Smith (CAS) ldquoStigma Networks Perceptions of Sociogramsrdquo
bull Zita Oravecz (HDFS)
bull Bethany Bray (Methodology Center) ldquoLatent Class and Latent Transition Analysisrdquo
bull Dave Hunter (STAT) ldquoModel Based Clustering of Large Networksrdquo
bull Scott Bennett (PLSC) ldquoABM Model of Insurgencyrdquo
bull David Reitter (IST)
bull Suzanna Linn (PLSC) ldquoMethodological Issues in Automated Text Analysis Application to NewsCoverage of the US Economyrdquo
11
SoDA 501 (Spring 2016)
bull Bruce Desmarais (PLSC) ldquoLearning in the Sunshine Analysis of Local Government Email Corporardquo
bull Timothy Brick (HDFS) ldquoMapping and Manipulating Facial Expressionrdquo
bull Qunying Huang (USC) ldquoSocial Media An Emerging Data Source for Human Mobility Studiesrdquo
bull Lingzhou Zue (STAT) ldquoAn Introduction to High-Dimensional Graphical Modelsrdquo
bull Ashton Verdery (SOC) ldquoSampling from Network Datardquo
bull Sarah Battersby (Tableau) ldquoHelping People See and Understand Spatial Datardquo
bull Alexandra Slavkovic (STAT) ldquoStatistical Privacy with Network Datardquo
bull Lee Giles (IST) ldquoMachine Learning for Scholarly Big Datardquo
12
Readings and References (updated Spring 2018)
We will discuss a relatively small subset of the readings listed here and this will vary based on topics andreadings discussed by visiting speakers and you yourselves The remainder are provided here as curatedreferences for more in-depth investigation in both the theory and practice related to the topic (with thelatter heavily weighted toward resources in Python and R)
dagger Material that is at last check made available through the Penn State library Most journal links shouldwork if you are logged in to a Penn State machine or through the Penn State VPN Some article archivesand most books require additional authentication through webaccess If links are broken start directly froma search via httpswwwlibrariespsuedu Some books require installation of e-readers like AdobeDigital Editions Links to lyndacom can be accessed through httplyndapsuedu
Dagger Material that is at last check legally provided for free In some cases these are the preprint versionsof published material
sect Material for which a legal selection is or will be provided through the class Box folder
Big Data amp Social Data Analytics
Overviews
bull Dagger[CompSocSci] David Lazer Alex Pentland Lada Adamic Sinan Aral Albert-Laszlo BarabasiDevon Brewer Nicholas Christakis Noshir Contractor James Fowler Myron Gutmann Tony JebaraGary King Michael Macy Deb Roy and Marshall Van Alstyne 2009 ldquoComputational Social SciencerdquoScience 323(5915)721-3 + Supp Feb 6 httpsciencesciencemagorgcontent3235915
721full httpsgkingharvardedufilesgkingfilesLazPenAda09pdf
bull Dagger[NRCreport] National Research Council 2013 Frontiers in Massive Data Analysis NationalAcademies Press (Free w registration httpwwwnapeducatalogphprecord_id=18374)Ch1 ldquoIntroductionrdquo Ch2 ldquoMassive Data in Science Technology Commerce National DefenseTelecommunications and other Endeavorsrdquo
bull Dagger[MMDS] Jure Leskovec Anand Rajaraman and Jeff Ullman 2014 Mining of Massive DatasetsCambridge University Press httpwwwmmdsorg (BETA version of Third Edition httpi
stanfordedu~ullmanmmdsnhtml)
bull sect[BitByBit] Matthew J Salganik 2018 (Forthcoming) Bit by Bit Social Research in the DigitalAge Princeton University Press Ch 1 ldquoIntroductionrdquo Ch 2 ldquoObserving Behaviorrdquo
bull DSHandbook-Py Ch 1 ldquoIntroduction Becoming a Unicornrdquo Ch2 ldquoThe Data Science Road Maprdquo
Burt-schtick
bull Dagger[Monroe-5Vs] Burt L Monroe 2013 ldquoThe Five Vs of Big Data Political Science Introductionto the Special Issue on Big Data in Political Sciencerdquo Political Analysis 21(V5) 1ndash9 https
doiorg101017S1047198700014315 (Volume Velocity Variety Vinculation Validity)
bull dagger[Monroe-No] Burt L Monroe Jennifer Pan Margaret E Roberts Maya Sen and Betsy Sin-clair 2015 ldquoNo Formal Theory Causal Inference and Big Data Are Not Contradictory Trendsin Political Sciencerdquo PS Political Science amp Politics 48(1) 71ndash4 httpdxdoiorg101017
S1049096514001760
bull dagger[Quinn-Topics] Kevin M Quinn Burt L Monroe Michael Colaresi Michael H Crespin andDragomir R Radev 2010 ldquoHow to Analyze Political Attention with Minimal Assumptions andCostsrdquo American Journal of Political Science 54(1) 209ndash28 httponlinelibrarywileycom
13
doi101111j1540-5907200900427xfull (esp topic modeling as measurement approach tovalidation)
bull dagger[FightinWords] Burt L Monroe Michael Colaresi and Kevin M Quinn 2008 ldquoFightinrsquo WordsLexical Feature Selection and Evaluation for Identifying the Content of Political Conflictrdquo PoliticalAnalysis 16(4) 372-403 httpsdoiorg101093panmpn018 (esp the impact of samplingvariance and regularization through priors)
bull Dagger[BDSS-Census] Big Data Social Science PSU Team 2012 ldquoA Closer Look at the Kaggle CensusDatardquo httpsburtmonroegithubioBDSSKaggleCensus2012 (esp the relevance of the socialprocesses by which data come to exist as data)
Multidisciplinary Perspectives
bull dagger[Business-BigData] Andrew McAfee and Erik Brynjolfsson 2012 ldquoBig Data The ManagementRevolutionrdquo Harvard Business Review 90(10) 61ndash8 October httpshbrorg201210big-
data-the-management-revolution Thomas H Davenport and DJ Patel 2012 ldquoData ScientistThe Sexiest Job of the 21st Centuryrdquo Harvard Business Review 90(10)70-6 October https
hbrorg201210data-scientist-the-sexiest-job-of-the-21st-century
bull dagger[InfoSci-BigData] CL Philip Chen and Chun-Yang Zhang 2014 ldquoData-intensive ApplicationsChallenges Techniques and Technologies A Survey on Big Datardquo Information Sciences 275 314-47httpsdoiorg101016jins201401015
bull dagger[Informatics-BigData] Vasant G Honavar 2014 ldquoThe Promise and Potential of Big Data ACase for Discovery Informaticsrdquo Review of Policy Research 31(4) 326-330 httpsdoiorg10
1111ropr12080
bull Dagger[Stats-BigData] Beate Franke Jean-Francois Ribana Roscher Annie Lee Cathal Smyth ArminHatefi Fuqi Chen Einat Gil Alexander Schwing Alessandro Selvitella Michael M Hoffman RogerGrosse Dietrich Hendricks and Nancy Reid 2016 ldquoStatistical Inference Learning and Models in BigDatardquo International Statistical Review 84(3) 371-89 httponlinelibrarywileycomdoi101111insr12176full
bull dagger[Econ-BigData] Hal R Varian 2013 ldquoBig Data New Tricks for Econometricsrdquo Journal ofEconomic Perspectives 28(2) 3-28 httpsdoiorg101257jep2823
bull dagger[GeoViz-BigData] Alan MacEachren 2017 ldquoLeveraging Big (Geo) Data with (Geo) Visual An-alytics Place as the Next Frontierrdquo In Chenghu Zhou Fenzhen Su Francis Harvey and Jun Xueds Spatial Data Handling in the Big Data Era pp 139-155 Springer httpslink-springer-
comezaccesslibrariespsueduchapter101007978-981-10-4424-3_10
bull dagger[Soc-BigData] David Lazer and Jason Radford 2017 ldquoData ex Machina Introduction to BigDatardquo Annual Review of Sociology 43 19-39 httpsdoiorg101146annurev-soc-060116-
053457
bull sect[Politics-BigData] Keith T Poole L Jason Anasastopolous and James E Monagan III Forthcom-ing ldquoThe lsquoBig Datarsquo Revolution in Political Campaigning and Governancerdquo Oxford Bibliographies inPolitical Science
iexclCuidado Traps Biases Problems Pains Perils
bull dagger[GoogleFlu] David Lazer Ryan Kennedy Gary King and Alessandro Vespignani 2014 ldquoTheParable of Google Flu Traps in Big Data Analysisrdquo Science (343) 14 March httpgking
harvardedufilesgkingfiles0314policyforumffpdf
bull Dagger[MachineBias] ProPublica ldquoMachine Bias Investigating Algorithmic Injusticerdquo Series https
wwwpropublicaorgseriesmachine-bias See especially
14
Julia Angwin Jeff Larson Surya Mattu and Lauren Kirchner 2016 ldquoMachine Biasrdquo ProPub-lica May 23 httpswwwpropublicaorgarticlemachine-bias-risk-assessments-in-
criminal-sentencing
Julia Angwin Madeleine Varner and Ariana Tobin 2017 ldquoFacebook Enabled Adverstisers toReach Jew Hatersrdquo httpswwwpropublicaorgarticlefacebook-enabled-advertisers-
to-reach-jew-haters
bull Dagger[Polling2016] Doug Rivers (Nov 11 2016) ldquoFirst Thoughts on Polling Problems in the 2016 USElectionsrdquo httpstodayyougovcomnews20161111first-thoughts-polling-problems-
2016-us-elections
bull dagger[EventData] Wei Wang Ryan Kennedy David Lazer Naren Ramakrishnan 2016 ldquoGrowing Painsfor Global Monitoring of Societal Eventsrdquo Science 3536307 pp 1502ndash1503 httpsdoiorg101126scienceaaf6758
bull Dagger[GoogleBooks] Eitan Adam Pechenick Christopher M Danforth and Peter Sheridan Dodds 2015ldquoCharacterizing the Google Books Corpus Strong Limits to Inferences of Socio-Cultural and LinguisticEvolutionrdquo PLoS One httpsdoiorg101371journalpone0137041
bull Dagger[OkCupid] Michael Zimmer 2016 ldquoOkCupid Study Reveals the Perils of Big-Data Sciencerdquo Wiredhttpswwwwiredcom201605okcupid-study-reveals-perils-big-data-science May 14
bull dagger[CriticalQuestions] danah boyd and Kate Crawford 2011 ldquoCritical Questions for Big Data Provo-cations for a Cultural Technological and Scholarly Phenomenonrdquo Information Communication ampSociety 15(5) 662ndash79 httpdxdoiorg1010801369118X2012678878
bull Dagger[RacistBot] Daniel Victor 2016 ldquoMicrosoft created a Twitter bot to learn from users It quicklybecame a racist jerkrdquo New York Times httpswwwnytimescom20160325technology
microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk
html
bull MMDS 122-123 on Bonferroni
bull Also BDSS-Census
Research Design and Measurement
Overviews
bull BitByBit
bull Dagger[ResearchMethodsKB] William M Trochim 2006 The Research Methods Knowledge Basehttpwwwsocialresearchmethodsnetkb
bull [7Rules] Glenn Firebaugh 2008 Seven Rules for Social Research Princeton University Press
Measurement Reliability and Validity
bull ResearchMethodsKB ldquoMeasurementrdquo
bull 7Rules Ch 3 ldquoBuild Reality Checks into Your Researchrdquo
bull Reliability see also FightinWords
bull Validity see also Quinn10-Topics Monroe-5Vs
Indirect Unobtrusive Nonreactive Measures Data Exhaust
15
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
SoDA 501 (Spring 2016)
bull Bruce Desmarais (PLSC) ldquoLearning in the Sunshine Analysis of Local Government Email Corporardquo
bull Timothy Brick (HDFS) ldquoMapping and Manipulating Facial Expressionrdquo
bull Qunying Huang (USC) ldquoSocial Media An Emerging Data Source for Human Mobility Studiesrdquo
bull Lingzhou Zue (STAT) ldquoAn Introduction to High-Dimensional Graphical Modelsrdquo
bull Ashton Verdery (SOC) ldquoSampling from Network Datardquo
bull Sarah Battersby (Tableau) ldquoHelping People See and Understand Spatial Datardquo
bull Alexandra Slavkovic (STAT) ldquoStatistical Privacy with Network Datardquo
bull Lee Giles (IST) ldquoMachine Learning for Scholarly Big Datardquo
12
Readings and References (updated Spring 2018)
We will discuss a relatively small subset of the readings listed here and this will vary based on topics andreadings discussed by visiting speakers and you yourselves The remainder are provided here as curatedreferences for more in-depth investigation in both the theory and practice related to the topic (with thelatter heavily weighted toward resources in Python and R)
dagger Material that is at last check made available through the Penn State library Most journal links shouldwork if you are logged in to a Penn State machine or through the Penn State VPN Some article archivesand most books require additional authentication through webaccess If links are broken start directly froma search via httpswwwlibrariespsuedu Some books require installation of e-readers like AdobeDigital Editions Links to lyndacom can be accessed through httplyndapsuedu
Dagger Material that is at last check legally provided for free In some cases these are the preprint versionsof published material
sect Material for which a legal selection is or will be provided through the class Box folder
Big Data amp Social Data Analytics
Overviews
bull Dagger[CompSocSci] David Lazer Alex Pentland Lada Adamic Sinan Aral Albert-Laszlo BarabasiDevon Brewer Nicholas Christakis Noshir Contractor James Fowler Myron Gutmann Tony JebaraGary King Michael Macy Deb Roy and Marshall Van Alstyne 2009 ldquoComputational Social SciencerdquoScience 323(5915)721-3 + Supp Feb 6 httpsciencesciencemagorgcontent3235915
721full httpsgkingharvardedufilesgkingfilesLazPenAda09pdf
bull Dagger[NRCreport] National Research Council 2013 Frontiers in Massive Data Analysis NationalAcademies Press (Free w registration httpwwwnapeducatalogphprecord_id=18374)Ch1 ldquoIntroductionrdquo Ch2 ldquoMassive Data in Science Technology Commerce National DefenseTelecommunications and other Endeavorsrdquo
bull Dagger[MMDS] Jure Leskovec Anand Rajaraman and Jeff Ullman 2014 Mining of Massive DatasetsCambridge University Press httpwwwmmdsorg (BETA version of Third Edition httpi
stanfordedu~ullmanmmdsnhtml)
bull sect[BitByBit] Matthew J Salganik 2018 (Forthcoming) Bit by Bit Social Research in the DigitalAge Princeton University Press Ch 1 ldquoIntroductionrdquo Ch 2 ldquoObserving Behaviorrdquo
bull DSHandbook-Py Ch 1 ldquoIntroduction Becoming a Unicornrdquo Ch2 ldquoThe Data Science Road Maprdquo
Burt-schtick
bull Dagger[Monroe-5Vs] Burt L Monroe 2013 ldquoThe Five Vs of Big Data Political Science Introductionto the Special Issue on Big Data in Political Sciencerdquo Political Analysis 21(V5) 1ndash9 https
doiorg101017S1047198700014315 (Volume Velocity Variety Vinculation Validity)
bull dagger[Monroe-No] Burt L Monroe Jennifer Pan Margaret E Roberts Maya Sen and Betsy Sin-clair 2015 ldquoNo Formal Theory Causal Inference and Big Data Are Not Contradictory Trendsin Political Sciencerdquo PS Political Science amp Politics 48(1) 71ndash4 httpdxdoiorg101017
S1049096514001760
bull dagger[Quinn-Topics] Kevin M Quinn Burt L Monroe Michael Colaresi Michael H Crespin andDragomir R Radev 2010 ldquoHow to Analyze Political Attention with Minimal Assumptions andCostsrdquo American Journal of Political Science 54(1) 209ndash28 httponlinelibrarywileycom
13
doi101111j1540-5907200900427xfull (esp topic modeling as measurement approach tovalidation)
bull dagger[FightinWords] Burt L Monroe Michael Colaresi and Kevin M Quinn 2008 ldquoFightinrsquo WordsLexical Feature Selection and Evaluation for Identifying the Content of Political Conflictrdquo PoliticalAnalysis 16(4) 372-403 httpsdoiorg101093panmpn018 (esp the impact of samplingvariance and regularization through priors)
bull Dagger[BDSS-Census] Big Data Social Science PSU Team 2012 ldquoA Closer Look at the Kaggle CensusDatardquo httpsburtmonroegithubioBDSSKaggleCensus2012 (esp the relevance of the socialprocesses by which data come to exist as data)
Multidisciplinary Perspectives
bull dagger[Business-BigData] Andrew McAfee and Erik Brynjolfsson 2012 ldquoBig Data The ManagementRevolutionrdquo Harvard Business Review 90(10) 61ndash8 October httpshbrorg201210big-
data-the-management-revolution Thomas H Davenport and DJ Patel 2012 ldquoData ScientistThe Sexiest Job of the 21st Centuryrdquo Harvard Business Review 90(10)70-6 October https
hbrorg201210data-scientist-the-sexiest-job-of-the-21st-century
bull dagger[InfoSci-BigData] CL Philip Chen and Chun-Yang Zhang 2014 ldquoData-intensive ApplicationsChallenges Techniques and Technologies A Survey on Big Datardquo Information Sciences 275 314-47httpsdoiorg101016jins201401015
bull dagger[Informatics-BigData] Vasant G Honavar 2014 ldquoThe Promise and Potential of Big Data ACase for Discovery Informaticsrdquo Review of Policy Research 31(4) 326-330 httpsdoiorg10
1111ropr12080
bull Dagger[Stats-BigData] Beate Franke Jean-Francois Ribana Roscher Annie Lee Cathal Smyth ArminHatefi Fuqi Chen Einat Gil Alexander Schwing Alessandro Selvitella Michael M Hoffman RogerGrosse Dietrich Hendricks and Nancy Reid 2016 ldquoStatistical Inference Learning and Models in BigDatardquo International Statistical Review 84(3) 371-89 httponlinelibrarywileycomdoi101111insr12176full
bull dagger[Econ-BigData] Hal R Varian 2013 ldquoBig Data New Tricks for Econometricsrdquo Journal ofEconomic Perspectives 28(2) 3-28 httpsdoiorg101257jep2823
bull dagger[GeoViz-BigData] Alan MacEachren 2017 ldquoLeveraging Big (Geo) Data with (Geo) Visual An-alytics Place as the Next Frontierrdquo In Chenghu Zhou Fenzhen Su Francis Harvey and Jun Xueds Spatial Data Handling in the Big Data Era pp 139-155 Springer httpslink-springer-
comezaccesslibrariespsueduchapter101007978-981-10-4424-3_10
bull dagger[Soc-BigData] David Lazer and Jason Radford 2017 ldquoData ex Machina Introduction to BigDatardquo Annual Review of Sociology 43 19-39 httpsdoiorg101146annurev-soc-060116-
053457
bull sect[Politics-BigData] Keith T Poole L Jason Anasastopolous and James E Monagan III Forthcom-ing ldquoThe lsquoBig Datarsquo Revolution in Political Campaigning and Governancerdquo Oxford Bibliographies inPolitical Science
iexclCuidado Traps Biases Problems Pains Perils
bull dagger[GoogleFlu] David Lazer Ryan Kennedy Gary King and Alessandro Vespignani 2014 ldquoTheParable of Google Flu Traps in Big Data Analysisrdquo Science (343) 14 March httpgking
harvardedufilesgkingfiles0314policyforumffpdf
bull Dagger[MachineBias] ProPublica ldquoMachine Bias Investigating Algorithmic Injusticerdquo Series https
wwwpropublicaorgseriesmachine-bias See especially
14
Julia Angwin Jeff Larson Surya Mattu and Lauren Kirchner 2016 ldquoMachine Biasrdquo ProPub-lica May 23 httpswwwpropublicaorgarticlemachine-bias-risk-assessments-in-
criminal-sentencing
Julia Angwin Madeleine Varner and Ariana Tobin 2017 ldquoFacebook Enabled Adverstisers toReach Jew Hatersrdquo httpswwwpropublicaorgarticlefacebook-enabled-advertisers-
to-reach-jew-haters
bull Dagger[Polling2016] Doug Rivers (Nov 11 2016) ldquoFirst Thoughts on Polling Problems in the 2016 USElectionsrdquo httpstodayyougovcomnews20161111first-thoughts-polling-problems-
2016-us-elections
bull dagger[EventData] Wei Wang Ryan Kennedy David Lazer Naren Ramakrishnan 2016 ldquoGrowing Painsfor Global Monitoring of Societal Eventsrdquo Science 3536307 pp 1502ndash1503 httpsdoiorg101126scienceaaf6758
bull Dagger[GoogleBooks] Eitan Adam Pechenick Christopher M Danforth and Peter Sheridan Dodds 2015ldquoCharacterizing the Google Books Corpus Strong Limits to Inferences of Socio-Cultural and LinguisticEvolutionrdquo PLoS One httpsdoiorg101371journalpone0137041
bull Dagger[OkCupid] Michael Zimmer 2016 ldquoOkCupid Study Reveals the Perils of Big-Data Sciencerdquo Wiredhttpswwwwiredcom201605okcupid-study-reveals-perils-big-data-science May 14
bull dagger[CriticalQuestions] danah boyd and Kate Crawford 2011 ldquoCritical Questions for Big Data Provo-cations for a Cultural Technological and Scholarly Phenomenonrdquo Information Communication ampSociety 15(5) 662ndash79 httpdxdoiorg1010801369118X2012678878
bull Dagger[RacistBot] Daniel Victor 2016 ldquoMicrosoft created a Twitter bot to learn from users It quicklybecame a racist jerkrdquo New York Times httpswwwnytimescom20160325technology
microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk
html
bull MMDS 122-123 on Bonferroni
bull Also BDSS-Census
Research Design and Measurement
Overviews
bull BitByBit
bull Dagger[ResearchMethodsKB] William M Trochim 2006 The Research Methods Knowledge Basehttpwwwsocialresearchmethodsnetkb
bull [7Rules] Glenn Firebaugh 2008 Seven Rules for Social Research Princeton University Press
Measurement Reliability and Validity
bull ResearchMethodsKB ldquoMeasurementrdquo
bull 7Rules Ch 3 ldquoBuild Reality Checks into Your Researchrdquo
bull Reliability see also FightinWords
bull Validity see also Quinn10-Topics Monroe-5Vs
Indirect Unobtrusive Nonreactive Measures Data Exhaust
15
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
Readings and References (updated Spring 2018)
We will discuss a relatively small subset of the readings listed here and this will vary based on topics andreadings discussed by visiting speakers and you yourselves The remainder are provided here as curatedreferences for more in-depth investigation in both the theory and practice related to the topic (with thelatter heavily weighted toward resources in Python and R)
dagger Material that is at last check made available through the Penn State library Most journal links shouldwork if you are logged in to a Penn State machine or through the Penn State VPN Some article archivesand most books require additional authentication through webaccess If links are broken start directly froma search via httpswwwlibrariespsuedu Some books require installation of e-readers like AdobeDigital Editions Links to lyndacom can be accessed through httplyndapsuedu
Dagger Material that is at last check legally provided for free In some cases these are the preprint versionsof published material
sect Material for which a legal selection is or will be provided through the class Box folder
Big Data amp Social Data Analytics
Overviews
bull Dagger[CompSocSci] David Lazer Alex Pentland Lada Adamic Sinan Aral Albert-Laszlo BarabasiDevon Brewer Nicholas Christakis Noshir Contractor James Fowler Myron Gutmann Tony JebaraGary King Michael Macy Deb Roy and Marshall Van Alstyne 2009 ldquoComputational Social SciencerdquoScience 323(5915)721-3 + Supp Feb 6 httpsciencesciencemagorgcontent3235915
721full httpsgkingharvardedufilesgkingfilesLazPenAda09pdf
bull Dagger[NRCreport] National Research Council 2013 Frontiers in Massive Data Analysis NationalAcademies Press (Free w registration httpwwwnapeducatalogphprecord_id=18374)Ch1 ldquoIntroductionrdquo Ch2 ldquoMassive Data in Science Technology Commerce National DefenseTelecommunications and other Endeavorsrdquo
bull Dagger[MMDS] Jure Leskovec Anand Rajaraman and Jeff Ullman 2014 Mining of Massive DatasetsCambridge University Press httpwwwmmdsorg (BETA version of Third Edition httpi
stanfordedu~ullmanmmdsnhtml)
bull sect[BitByBit] Matthew J Salganik 2018 (Forthcoming) Bit by Bit Social Research in the DigitalAge Princeton University Press Ch 1 ldquoIntroductionrdquo Ch 2 ldquoObserving Behaviorrdquo
bull DSHandbook-Py Ch 1 ldquoIntroduction Becoming a Unicornrdquo Ch2 ldquoThe Data Science Road Maprdquo
Burt-schtick
bull Dagger[Monroe-5Vs] Burt L Monroe 2013 ldquoThe Five Vs of Big Data Political Science Introductionto the Special Issue on Big Data in Political Sciencerdquo Political Analysis 21(V5) 1ndash9 https
doiorg101017S1047198700014315 (Volume Velocity Variety Vinculation Validity)
bull dagger[Monroe-No] Burt L Monroe Jennifer Pan Margaret E Roberts Maya Sen and Betsy Sin-clair 2015 ldquoNo Formal Theory Causal Inference and Big Data Are Not Contradictory Trendsin Political Sciencerdquo PS Political Science amp Politics 48(1) 71ndash4 httpdxdoiorg101017
S1049096514001760
bull dagger[Quinn-Topics] Kevin M Quinn Burt L Monroe Michael Colaresi Michael H Crespin andDragomir R Radev 2010 ldquoHow to Analyze Political Attention with Minimal Assumptions andCostsrdquo American Journal of Political Science 54(1) 209ndash28 httponlinelibrarywileycom
13
doi101111j1540-5907200900427xfull (esp topic modeling as measurement approach tovalidation)
bull dagger[FightinWords] Burt L Monroe Michael Colaresi and Kevin M Quinn 2008 ldquoFightinrsquo WordsLexical Feature Selection and Evaluation for Identifying the Content of Political Conflictrdquo PoliticalAnalysis 16(4) 372-403 httpsdoiorg101093panmpn018 (esp the impact of samplingvariance and regularization through priors)
bull Dagger[BDSS-Census] Big Data Social Science PSU Team 2012 ldquoA Closer Look at the Kaggle CensusDatardquo httpsburtmonroegithubioBDSSKaggleCensus2012 (esp the relevance of the socialprocesses by which data come to exist as data)
Multidisciplinary Perspectives
bull dagger[Business-BigData] Andrew McAfee and Erik Brynjolfsson 2012 ldquoBig Data The ManagementRevolutionrdquo Harvard Business Review 90(10) 61ndash8 October httpshbrorg201210big-
data-the-management-revolution Thomas H Davenport and DJ Patel 2012 ldquoData ScientistThe Sexiest Job of the 21st Centuryrdquo Harvard Business Review 90(10)70-6 October https
hbrorg201210data-scientist-the-sexiest-job-of-the-21st-century
bull dagger[InfoSci-BigData] CL Philip Chen and Chun-Yang Zhang 2014 ldquoData-intensive ApplicationsChallenges Techniques and Technologies A Survey on Big Datardquo Information Sciences 275 314-47httpsdoiorg101016jins201401015
bull dagger[Informatics-BigData] Vasant G Honavar 2014 ldquoThe Promise and Potential of Big Data ACase for Discovery Informaticsrdquo Review of Policy Research 31(4) 326-330 httpsdoiorg10
1111ropr12080
bull Dagger[Stats-BigData] Beate Franke Jean-Francois Ribana Roscher Annie Lee Cathal Smyth ArminHatefi Fuqi Chen Einat Gil Alexander Schwing Alessandro Selvitella Michael M Hoffman RogerGrosse Dietrich Hendricks and Nancy Reid 2016 ldquoStatistical Inference Learning and Models in BigDatardquo International Statistical Review 84(3) 371-89 httponlinelibrarywileycomdoi101111insr12176full
bull dagger[Econ-BigData] Hal R Varian 2013 ldquoBig Data New Tricks for Econometricsrdquo Journal ofEconomic Perspectives 28(2) 3-28 httpsdoiorg101257jep2823
bull dagger[GeoViz-BigData] Alan MacEachren 2017 ldquoLeveraging Big (Geo) Data with (Geo) Visual An-alytics Place as the Next Frontierrdquo In Chenghu Zhou Fenzhen Su Francis Harvey and Jun Xueds Spatial Data Handling in the Big Data Era pp 139-155 Springer httpslink-springer-
comezaccesslibrariespsueduchapter101007978-981-10-4424-3_10
bull dagger[Soc-BigData] David Lazer and Jason Radford 2017 ldquoData ex Machina Introduction to BigDatardquo Annual Review of Sociology 43 19-39 httpsdoiorg101146annurev-soc-060116-
053457
bull sect[Politics-BigData] Keith T Poole L Jason Anasastopolous and James E Monagan III Forthcom-ing ldquoThe lsquoBig Datarsquo Revolution in Political Campaigning and Governancerdquo Oxford Bibliographies inPolitical Science
iexclCuidado Traps Biases Problems Pains Perils
bull dagger[GoogleFlu] David Lazer Ryan Kennedy Gary King and Alessandro Vespignani 2014 ldquoTheParable of Google Flu Traps in Big Data Analysisrdquo Science (343) 14 March httpgking
harvardedufilesgkingfiles0314policyforumffpdf
bull Dagger[MachineBias] ProPublica ldquoMachine Bias Investigating Algorithmic Injusticerdquo Series https
wwwpropublicaorgseriesmachine-bias See especially
14
Julia Angwin Jeff Larson Surya Mattu and Lauren Kirchner 2016 ldquoMachine Biasrdquo ProPub-lica May 23 httpswwwpropublicaorgarticlemachine-bias-risk-assessments-in-
criminal-sentencing
Julia Angwin Madeleine Varner and Ariana Tobin 2017 ldquoFacebook Enabled Adverstisers toReach Jew Hatersrdquo httpswwwpropublicaorgarticlefacebook-enabled-advertisers-
to-reach-jew-haters
bull Dagger[Polling2016] Doug Rivers (Nov 11 2016) ldquoFirst Thoughts on Polling Problems in the 2016 USElectionsrdquo httpstodayyougovcomnews20161111first-thoughts-polling-problems-
2016-us-elections
bull dagger[EventData] Wei Wang Ryan Kennedy David Lazer Naren Ramakrishnan 2016 ldquoGrowing Painsfor Global Monitoring of Societal Eventsrdquo Science 3536307 pp 1502ndash1503 httpsdoiorg101126scienceaaf6758
bull Dagger[GoogleBooks] Eitan Adam Pechenick Christopher M Danforth and Peter Sheridan Dodds 2015ldquoCharacterizing the Google Books Corpus Strong Limits to Inferences of Socio-Cultural and LinguisticEvolutionrdquo PLoS One httpsdoiorg101371journalpone0137041
bull Dagger[OkCupid] Michael Zimmer 2016 ldquoOkCupid Study Reveals the Perils of Big-Data Sciencerdquo Wiredhttpswwwwiredcom201605okcupid-study-reveals-perils-big-data-science May 14
bull dagger[CriticalQuestions] danah boyd and Kate Crawford 2011 ldquoCritical Questions for Big Data Provo-cations for a Cultural Technological and Scholarly Phenomenonrdquo Information Communication ampSociety 15(5) 662ndash79 httpdxdoiorg1010801369118X2012678878
bull Dagger[RacistBot] Daniel Victor 2016 ldquoMicrosoft created a Twitter bot to learn from users It quicklybecame a racist jerkrdquo New York Times httpswwwnytimescom20160325technology
microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk
html
bull MMDS 122-123 on Bonferroni
bull Also BDSS-Census
Research Design and Measurement
Overviews
bull BitByBit
bull Dagger[ResearchMethodsKB] William M Trochim 2006 The Research Methods Knowledge Basehttpwwwsocialresearchmethodsnetkb
bull [7Rules] Glenn Firebaugh 2008 Seven Rules for Social Research Princeton University Press
Measurement Reliability and Validity
bull ResearchMethodsKB ldquoMeasurementrdquo
bull 7Rules Ch 3 ldquoBuild Reality Checks into Your Researchrdquo
bull Reliability see also FightinWords
bull Validity see also Quinn10-Topics Monroe-5Vs
Indirect Unobtrusive Nonreactive Measures Data Exhaust
15
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
doi101111j1540-5907200900427xfull (esp topic modeling as measurement approach tovalidation)
bull dagger[FightinWords] Burt L Monroe Michael Colaresi and Kevin M Quinn 2008 ldquoFightinrsquo WordsLexical Feature Selection and Evaluation for Identifying the Content of Political Conflictrdquo PoliticalAnalysis 16(4) 372-403 httpsdoiorg101093panmpn018 (esp the impact of samplingvariance and regularization through priors)
bull Dagger[BDSS-Census] Big Data Social Science PSU Team 2012 ldquoA Closer Look at the Kaggle CensusDatardquo httpsburtmonroegithubioBDSSKaggleCensus2012 (esp the relevance of the socialprocesses by which data come to exist as data)
Multidisciplinary Perspectives
bull dagger[Business-BigData] Andrew McAfee and Erik Brynjolfsson 2012 ldquoBig Data The ManagementRevolutionrdquo Harvard Business Review 90(10) 61ndash8 October httpshbrorg201210big-
data-the-management-revolution Thomas H Davenport and DJ Patel 2012 ldquoData ScientistThe Sexiest Job of the 21st Centuryrdquo Harvard Business Review 90(10)70-6 October https
hbrorg201210data-scientist-the-sexiest-job-of-the-21st-century
bull dagger[InfoSci-BigData] CL Philip Chen and Chun-Yang Zhang 2014 ldquoData-intensive ApplicationsChallenges Techniques and Technologies A Survey on Big Datardquo Information Sciences 275 314-47httpsdoiorg101016jins201401015
bull dagger[Informatics-BigData] Vasant G Honavar 2014 ldquoThe Promise and Potential of Big Data ACase for Discovery Informaticsrdquo Review of Policy Research 31(4) 326-330 httpsdoiorg10
1111ropr12080
bull Dagger[Stats-BigData] Beate Franke Jean-Francois Ribana Roscher Annie Lee Cathal Smyth ArminHatefi Fuqi Chen Einat Gil Alexander Schwing Alessandro Selvitella Michael M Hoffman RogerGrosse Dietrich Hendricks and Nancy Reid 2016 ldquoStatistical Inference Learning and Models in BigDatardquo International Statistical Review 84(3) 371-89 httponlinelibrarywileycomdoi101111insr12176full
bull dagger[Econ-BigData] Hal R Varian 2013 ldquoBig Data New Tricks for Econometricsrdquo Journal ofEconomic Perspectives 28(2) 3-28 httpsdoiorg101257jep2823
bull dagger[GeoViz-BigData] Alan MacEachren 2017 ldquoLeveraging Big (Geo) Data with (Geo) Visual An-alytics Place as the Next Frontierrdquo In Chenghu Zhou Fenzhen Su Francis Harvey and Jun Xueds Spatial Data Handling in the Big Data Era pp 139-155 Springer httpslink-springer-
comezaccesslibrariespsueduchapter101007978-981-10-4424-3_10
bull dagger[Soc-BigData] David Lazer and Jason Radford 2017 ldquoData ex Machina Introduction to BigDatardquo Annual Review of Sociology 43 19-39 httpsdoiorg101146annurev-soc-060116-
053457
bull sect[Politics-BigData] Keith T Poole L Jason Anasastopolous and James E Monagan III Forthcom-ing ldquoThe lsquoBig Datarsquo Revolution in Political Campaigning and Governancerdquo Oxford Bibliographies inPolitical Science
iexclCuidado Traps Biases Problems Pains Perils
bull dagger[GoogleFlu] David Lazer Ryan Kennedy Gary King and Alessandro Vespignani 2014 ldquoTheParable of Google Flu Traps in Big Data Analysisrdquo Science (343) 14 March httpgking
harvardedufilesgkingfiles0314policyforumffpdf
bull Dagger[MachineBias] ProPublica ldquoMachine Bias Investigating Algorithmic Injusticerdquo Series https
wwwpropublicaorgseriesmachine-bias See especially
14
Julia Angwin Jeff Larson Surya Mattu and Lauren Kirchner 2016 ldquoMachine Biasrdquo ProPub-lica May 23 httpswwwpropublicaorgarticlemachine-bias-risk-assessments-in-
criminal-sentencing
Julia Angwin Madeleine Varner and Ariana Tobin 2017 ldquoFacebook Enabled Adverstisers toReach Jew Hatersrdquo httpswwwpropublicaorgarticlefacebook-enabled-advertisers-
to-reach-jew-haters
bull Dagger[Polling2016] Doug Rivers (Nov 11 2016) ldquoFirst Thoughts on Polling Problems in the 2016 USElectionsrdquo httpstodayyougovcomnews20161111first-thoughts-polling-problems-
2016-us-elections
bull dagger[EventData] Wei Wang Ryan Kennedy David Lazer Naren Ramakrishnan 2016 ldquoGrowing Painsfor Global Monitoring of Societal Eventsrdquo Science 3536307 pp 1502ndash1503 httpsdoiorg101126scienceaaf6758
bull Dagger[GoogleBooks] Eitan Adam Pechenick Christopher M Danforth and Peter Sheridan Dodds 2015ldquoCharacterizing the Google Books Corpus Strong Limits to Inferences of Socio-Cultural and LinguisticEvolutionrdquo PLoS One httpsdoiorg101371journalpone0137041
bull Dagger[OkCupid] Michael Zimmer 2016 ldquoOkCupid Study Reveals the Perils of Big-Data Sciencerdquo Wiredhttpswwwwiredcom201605okcupid-study-reveals-perils-big-data-science May 14
bull dagger[CriticalQuestions] danah boyd and Kate Crawford 2011 ldquoCritical Questions for Big Data Provo-cations for a Cultural Technological and Scholarly Phenomenonrdquo Information Communication ampSociety 15(5) 662ndash79 httpdxdoiorg1010801369118X2012678878
bull Dagger[RacistBot] Daniel Victor 2016 ldquoMicrosoft created a Twitter bot to learn from users It quicklybecame a racist jerkrdquo New York Times httpswwwnytimescom20160325technology
microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk
html
bull MMDS 122-123 on Bonferroni
bull Also BDSS-Census
Research Design and Measurement
Overviews
bull BitByBit
bull Dagger[ResearchMethodsKB] William M Trochim 2006 The Research Methods Knowledge Basehttpwwwsocialresearchmethodsnetkb
bull [7Rules] Glenn Firebaugh 2008 Seven Rules for Social Research Princeton University Press
Measurement Reliability and Validity
bull ResearchMethodsKB ldquoMeasurementrdquo
bull 7Rules Ch 3 ldquoBuild Reality Checks into Your Researchrdquo
bull Reliability see also FightinWords
bull Validity see also Quinn10-Topics Monroe-5Vs
Indirect Unobtrusive Nonreactive Measures Data Exhaust
15
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
Julia Angwin Jeff Larson Surya Mattu and Lauren Kirchner 2016 ldquoMachine Biasrdquo ProPub-lica May 23 httpswwwpropublicaorgarticlemachine-bias-risk-assessments-in-
criminal-sentencing
Julia Angwin Madeleine Varner and Ariana Tobin 2017 ldquoFacebook Enabled Adverstisers toReach Jew Hatersrdquo httpswwwpropublicaorgarticlefacebook-enabled-advertisers-
to-reach-jew-haters
bull Dagger[Polling2016] Doug Rivers (Nov 11 2016) ldquoFirst Thoughts on Polling Problems in the 2016 USElectionsrdquo httpstodayyougovcomnews20161111first-thoughts-polling-problems-
2016-us-elections
bull dagger[EventData] Wei Wang Ryan Kennedy David Lazer Naren Ramakrishnan 2016 ldquoGrowing Painsfor Global Monitoring of Societal Eventsrdquo Science 3536307 pp 1502ndash1503 httpsdoiorg101126scienceaaf6758
bull Dagger[GoogleBooks] Eitan Adam Pechenick Christopher M Danforth and Peter Sheridan Dodds 2015ldquoCharacterizing the Google Books Corpus Strong Limits to Inferences of Socio-Cultural and LinguisticEvolutionrdquo PLoS One httpsdoiorg101371journalpone0137041
bull Dagger[OkCupid] Michael Zimmer 2016 ldquoOkCupid Study Reveals the Perils of Big-Data Sciencerdquo Wiredhttpswwwwiredcom201605okcupid-study-reveals-perils-big-data-science May 14
bull dagger[CriticalQuestions] danah boyd and Kate Crawford 2011 ldquoCritical Questions for Big Data Provo-cations for a Cultural Technological and Scholarly Phenomenonrdquo Information Communication ampSociety 15(5) 662ndash79 httpdxdoiorg1010801369118X2012678878
bull Dagger[RacistBot] Daniel Victor 2016 ldquoMicrosoft created a Twitter bot to learn from users It quicklybecame a racist jerkrdquo New York Times httpswwwnytimescom20160325technology
microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk
html
bull MMDS 122-123 on Bonferroni
bull Also BDSS-Census
Research Design and Measurement
Overviews
bull BitByBit
bull Dagger[ResearchMethodsKB] William M Trochim 2006 The Research Methods Knowledge Basehttpwwwsocialresearchmethodsnetkb
bull [7Rules] Glenn Firebaugh 2008 Seven Rules for Social Research Princeton University Press
Measurement Reliability and Validity
bull ResearchMethodsKB ldquoMeasurementrdquo
bull 7Rules Ch 3 ldquoBuild Reality Checks into Your Researchrdquo
bull Reliability see also FightinWords
bull Validity see also Quinn10-Topics Monroe-5Vs
Indirect Unobtrusive Nonreactive Measures Data Exhaust
15
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
bull dagger[UnobtrusiveMeasures] Raymond M Lee 2015 ldquoUnobtrusive Measuresrdquo Oxford Bibliographieshttpsdoiorg101093OBO9780199846740-0048 (Canonical cite is Eugene J Webb 1966Unobtrusive Measures or Webb Donald T Campbell Richard D Schwartz Lee Sechrest 1999Unobtrusive Measures rev ed Sage)
bull BitByBit Examples in Chapter 2
Multiple Measures Latent Variable Measurement
bull 7Rules Ch 4 ldquoReplicate Where Possiblerdquo
bull dagger[LatentVariables] David J Bartholomew Martin Knott and Irini Moustaki 2011 Latent VariableModels and Factor Analysis A Unified Approach Wiley httpsiteebrarycomezaccess
librariespsuedulibpennstatedetailactiondocID=10483308) (esp Ch1 ldquoBasic ideas andexamplesrdquo)
bull dagger[Multivariate-R] Brian Everitt and Torsten Hothorn 2011 An Introduction to Applied MultivariateAnalysis with R Springer httplinkspringercomezaccesslibrariespsuedubook10
10072F978-1-4419-9650-3
bull Shalizi-ADA Ch17 ldquoFactor Modelsrdquo
bull Dagger[NetflixPrize] Edwin Chen 2011 ldquoWinning the Netflix Prize A Summaryrdquo
bull DaggerCRAN httpscranr-projectorgwebviewsMultivariatehtmlhttpscranr-projectorgwebviewsPsychometricshtmlhttpscranr-projectorgwebviewsClusterhtml
Sampling and Survey Design
bull dagger[Sampling] Steven K Thompson 2012 Sampling 3rd ed httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling
bull ResearchMethodsKB ldquoSamplingrdquo
bull NRCreport Ch 8 ldquoSampling and Massive Datardquo
bull BitByBit Ch 3 ldquoAsking Questionsrdquo
bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp
rep=rep1amptype=pdf)
bull dagger[NetworkSampling] Ted Mouw and Ashton M Verdery 2012 ldquoNetwork Sampling with Memory AProposal for More Efficient Sampling from Social Networksrdquo Sociological Methodology 42(1)206ndash56httpsdxdoiorg1011772F0081175012461248
bull See also FightinWords (re hidden heteroskedasticity in sample variance)
bull Dagger[HashDontSample] Mudit Uppal 2016 ldquoProbabilistic data structures in the Big data world (+code)rdquo (re ldquoHash donrsquot samplerdquo) httpsmediumcommuppalprobabilistic-data-structures-
in-the-big-data-world-code-b9387cff0c55
bull MMDS Hash Functions (132) Sampling in streams (42)
bull DaggerCRAN httpscranr-projectorgwebviewsOfficialStatisticshtml (ldquoComplex SurveyDesignrdquo ldquoSmall Area Estimationrdquo)
Experimental and Observational Designs for Causal Inference
16
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
bull Dagger[CausalInference] Miguel A Hernan James M Robins Forthcoming (2017) Causal Infer-ence Chapman amp HallCRC Preprint httpswwwhsphharvardedumiguel-hernancausal-
inference-book Chs 1 ldquoA definition of causal effectrdquo Ch 2 ldquoRandomized experimentsrdquo Ch3 ldquoObservational studiesrdquo (See Ch 6 ldquoGraphical representation of causal effectsrdquo for integrationwith Judea Pearl approach)
bull BitByBit Chapter 4 ldquoRunning Experimentsrdquo Ch 2 Natural experiments in observable data exam-ples
bull ResearchMethodsKB ldquoDesignrdquo
bull Firebaugh-7Rules Ch 2 ldquoLook for Differences that Make a Difference and Report Themrdquo sectCh5 ldquoCompare Like with Likerdquo Ch 6 ldquoUse Panel Data to Study Individual Change and RepeatedCross-Section Data to Study Social Changerdquo
bull Shalizi-ADA Part IV ldquoCausal Inferencerdquo
bull Monroe-No
bull CRAN httpscranr-projectorgwebviewsExperimentalDesignhtml
Technologies for Primary and Secondary Data Collection
Mobile Devices Distributed Sensors Wearable Sensors Remote Sensing
bull dagger[RealityMining] Nathan Eagle and Alex (Sandy) Pentland 2006 ldquoReality Mining SensingComplex Social Systemsrdquo Personal and Ubiquitous Computing 10(4) 255ndash68 httpsdoiorg
101007s00779-005-0046-3
bull dagger[QuantifiedSelf ] Melanie Swan 2013 ldquoThe Quantified Self Fundamental Disruption in Big DataScience and Biological Discoveryrdquo Big Data 1(2) 85ndash99 httpsdoiorg101089big2012
0002
bull dagger[SensorData] Charu C Aggarwal (Ed) 2013 Managing and Mining Sensor Data Springer httplinkspringercomezaccesslibrariespsuedubook1010072F978-1-4614-6309-2 Ch1ldquoAn Introduction to Sensor Data Analyticsrdquo
bull dagger[NightLights] Thushyanthan Baskaran Brian Min Yogesh Uppal 2015 ldquoElection cycles andelectricity provision Evidence from a quasi-experiment with Indian special electionsrdquo Journal ofPublic Economics 12664-73 httpsdoiorg101016jjpubeco201503011
Crowdsourcing Human Computation Citizen Science Web Experiments
bull dagger[Crowdsourcing] Daren C Brabham 2013 Crowdsourcing MIT Press httpsiteebrary
comezaccesslibrariespsuedulibpennstatedetailactiondocID=10692208 ldquoIntroduc-tionrdquo Ch1 ldquoConcepts Theories and Cases of Crowdsourcingrdquo
bull BitByBit Ch 5 ldquoCreating Mass Collaborationrdquo
bull NRCreport Ch 9 ldquoHuman Interaction with Datardquo
bull dagger[MTurk] Krista Casler Lydia Bickel and Elizabeth Hackett 2013 ldquoSeparate but Equal AComparison of Participants and Data Gathered via Amazonrsquos MTurk Social Media and Face-to-FaceBehavioral Testingrdquo Computers in Human Behavior 29(6) 2156ndash60 httpdoiorg101016j
chb201305009
bull dagger[LabintheWild] Katharina Reinecke and Krzysztof Z Gajos 2015 ldquoLabintheWild ConductingLarge-Scale Online Experiments With Uncompensated Samplesrdquo In Proceedings of the 18th ACMConference on Computer Supported Cooperative Work amp Social Computing (CSCW rsquo15) 1364ndash1378httpdxdoiorg10114526751332675246
17
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
bull dagger[TweetmentEffects] Kevin Munger 2017 ldquoTweetment Effects on the Tweeted ExperimentallyReducing Racist Harassmentrdquo Political Behavior 39(3) 629ndash49 httpdoiorg101016jchb
201305009
bull dagger[HumanComputation] Edith Law and Luis van Ahn 2011 Human Computation Morgan amp Clay-pool httpwwwmorganclaypoolcomezaccesslibrariespsuedudoipdf102200S00371ED1V01Y201107AIM013
Open Data File Formats APIs Semantic Web Linked Data
bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo
bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-
data
bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices
file-formats
bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis
bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https
marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom
bull dagger[LinkedData] Tom Heath and Christian Bizer 2011 Linked Data Evolving the Web into a GlobalData Space Morgan amp Claypool httpwwwmorganclaypoolcomezaccesslibrariespsu
edudoiabs102200S00334ED1V01Y201102WBE001 Ch 1 ldquoIntroductionrdquo Ch 2 ldquoPrinciples ofLinked Datardquo
bull dagger[SemanticWeb] Nikolaos Konstantinos and Dimitrios-Emmanuel Spanos 2015 Materializing theWeb of Linked Data Springer httplinkspringercomezaccesslibrariespsuedubook
1010072F978-3-319-16074-0 Ch 1 ldquoIntroduction Linked Data and the Semantic Webrdquo
bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww
morganclaypoolcomtocwbe111
Web Scraping
bull dagger[TheoryDrivenScraping] Richard N Landers Robert C Brusso Katelyn J Cavanaugh andAndrew B Colmus 2016 ldquoA Primer on Theory-Driven Web Scraping Automatic Extraction ofBig Data from the Internet for Use in Psychological Researchrdquo Psychological Methods 4 475ndash492httpdxdoiorgezaccesslibrariespsuedu101037met0000081
bull Dagger[Scraping-Py] Al Sweigart 2015 Automate the Boring Stuff with Python Practical Programmingfor Total Beginners Ch 11 Web-Scraping httpsautomatetheboringstuffcomchapter11
bull dagger[Scraping-R] Simon Munzert Christian Rubba Peter Meiszligner and Dominic Nyhuis 2014 Au-tomated Data Collection with R A Practical Guide to Web Scraping and Text Mining Wileyhttponlinelibrarywileycombook1010029781118834732
bull DaggerCRAN httpscranr-projectorgwebviewsWebTechnologieshtml
Ethics amp Scientific Responsibility in Big Social Data
Human Subjects Consent Privacy
bull BitByBit Ch 6 (ldquoEthicsrdquo)
bull Dagger[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)
Human Subjects Research IRB httpswwwresearchpsueduirb
18
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
Revised Common Rule httpswwwresearchpsueduirbcommonrulechanges
Responsible Conduct of Research httpswwwresearchpsuedueducationrcr
Research Misconduct httpswwwresearchpsueduresearchmisconduct
SARI PSU (Scientific and Research Integrity Training) httpswwwresearchpsuedu
trainingsari
bull Dagger[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Interneet Data Analysis)2012 ldquoThe Menlo Report Ethical Principles Guiding Information and Communication Technol-ogy Researchrdquo and companion report ldquoApplying Ethical Principles rdquo httpswwwcaidaorg
publicationspapers2012menlo_report_actual_formatted
bull Dagger[BigDataEthics-CBDES] Jacob Metcalf Emily F Keller and danah boyd 2016 ldquoPerspec-tives on Big Data Ethics and Societyrdquo Council for Big Data Ethics and Society httpbdes
datasocietynetcouncil-outputperspectives-on-big-data-ethics-and-society
bull Dagger[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers)2012 ldquoEthical Decision-Making and Internet Research Recommendations of the AoIR Ethics WorkingCommittee (Version 20)rdquo httpsaoirorgreportsethics2pdf and guidelines chart https
aoirorgwp-contentuploads201701aoir_ethics_graphic_2016pdf
bull Dagger[BigDataEthics-Wired] Sarah Zhang 2016 ldquoScientists are Just as Confused about the Ethics ofBig-Data Research as Yourdquo Wired httpswwwwiredcom201605scientists-just-confused-ethics-big-data-research
bull dagger[BigDataEthics-HerschelMori] Richard Herschel and Virginia M Mori 2017 ldquoEthics amp BigDatardquo Technology in Society 49 31-36 httpdoiorg101016jtechsoc201703003
The Science of Data Privacy
bull dagger[BigDataPrivacy] Terence Craig and Mary E Ludloff 2011 Privacy and Big Data OrsquoReilly MediahttppensueblibcompatronFullRecordaspxp=781814
bull Dagger[DataAnalysisPrivacy] John Abowd Lorenzo Alvisi Cynthia Dwork Sampath Kannan AshwinMachanavajjhala Jerome Reiter 2017 ldquoPrivacy-Preserving Data Analysis for the Federal Statisti-cal Agenciesrdquo A Computing Community Consortium white paper httpsarxivorgabs1701
00752
bull dagger[DataPrivacy] Stephen E Fienberg and Aleksandra B Slavkovic 2011 ldquoData Privacy and Con-fidentialityrdquo International Encyclopedia of Statistical Science 342ndash5 httpdoiorg978-3-642-
04898-2_202
bull Dagger[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic 2016 ldquoInference using noisy degreesDifferentially private β-model and synthetic graphsrdquo Annals of Statistics 44(1) 87-112 http
projecteuclidorgeuclidaos1449755958
bull dagger[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu 2010 Privacy-Preserving Data Publishing An Overview Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00237ED1V01Y201003DTM002
bull DataMatching Ch 8 ldquoPrivacy Aspects of Data Matchingrdquo
Transparency Reproducibility and Team Science
bull Dagger[10RulesforData] Alyssa Goodman Alberto Pepe Alexander W Blocker Christine L BorgmanKyle Cranmer Merce Crosas Rosanne Di Stefano Yolanda Gil Paul Groth Margaret HedstromDavid W Hogg Vinay Kashyap Ashish Mahabal Aneta Siemiginowska and Aleksandra Slavkovic(2014) ldquoTen Simple Rules for the Care and Feeding of Scientific Datardquo PLoS Computational Biology10(4) e1003542 httpsdoiorg101371journalpcbi1003542 (Note esp curated resourcesfor reproducible research)
19
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
bull dagger[Transparency] E Miguel C Camerer K Casey J Cohen K M Esterling A Gerber R Glen-nerster D P Green M Humphreys G Imbens D Laitin T Madon L Nelson B A Nosek M Pe-tersen R Sedlmayr J P Simmons U Simonsohn M Van der Laan 2014 ldquoPromoting Transparencyin Social Science Researchrdquo Science 343(6166) 30ndash1 httpsdoiorg101126science1245317
bull dagger[Reproducibility] Marcus R Munafo Brian A Nosek Dorothy V M Bishop Katherine S ButtonChristopher D Chambers Nathalie Percie du Sert Uri Simonsohn Eric-Jan Wagenmakers JenniferJ Ware and John P A Ioannidis 2017 ldquoA Manifesto for Reproducible Sciencerdquo Nature HumanBehavior 0021(2017) httpsdoiorg101038s41562-016-0021
bull Dagger[SoftwareCarpentry] esp ldquoLessonsrdquo httpsoftware-carpentryorglessons
bull DSHandbook-Py Ch 9 ldquoTechnical Communication and Documentationrdquo Ch 15 ldquoSoftware Engi-neering Best Practicesrdquo
bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001
bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)
bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml
Social Bias Fair Algorithms
bull MachineBias
bull Dagger[EmbeddingsBias] Aylin Caliskan Joanna J Bryson and Arvind Narayanan 2017 ldquoSemanticsDerived Automatically from Language Corpora Contain Human Biasesrdquo Science httpsarxiv
orgabs160807187
bull Dagger[Debiasing] Tolga Bolukbasi Kai-Wei Chang James Zou Venkatesh Saligrama Adam Kalai 2016ldquoMan is to Computer Programmer as Woman is to Homemaker Debiasing Word Embeddingsrdquohttpsarxivorgabs160706520
bull Dagger[AvoidingBias] Moritz Hardt Eric Price Nathan Srebro 2016 ldquoEquality of Opportunity inSupervised Learningrdquo httpsarxivorgabs161002413
bull Dagger[InevitableBias] Jon Kleinberg Sendhil Mullainathan Manish Raghavan 2016 ldquoInherent Trade-Offs in the Fair Determination of Risk Scoresrdquo httpsarxivorgabs160905807
Databases and Data Management
bull NRCreport Ch 3 ldquoScaling the Infrastructure for Data Managementrdquo
bull dagger[SQL] Jan L Harrington 2010 SQL Clearly Explained Elsevier httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780123756978
bull DSHandbook-Py Ch 14 ldquoDatabasesrdquo
bull dagger[noSQL] Guy Harrison 2015 Next Generation Databases NoSQL NewSQL and Big Data Apresshttpslink-springer-comezaccesslibrariespsuedubook101007978-1-4842-1329-2
bull dagger[Cloud] Divyakant Agrawal Sudipto Das and Amr El Abbadi 2012 Data Management inthe Cloud Challenges and Opportunities Morgan amp Claypool httpwwwmorganclaypoolcom
ezaccesslibrariespsuedudoipdfplus102200S00456ED1V01Y201211DTM032
20
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
bull daggerMorgan amp Claypool Synthesis Lectures on Data Management httpwwwmorganclaypoolcom
tocdtm11
Data Wrangling
Theoretically-structured approaches to data wrangling
bull Dagger[TidyData-R] Garrett Grolemund and Hadley Wickham 2017 R for Data Science (httpr4dshadconz) esp Ch 12 ldquoTidy Datardquo also Wickham 2014 ldquoTidy Datardquo Journalof Statistical Software 59(10) httpwwwjstatsoftorgv59i10paper Tools The tidyversehttptidyverseorg
bull Dagger[DataScience-Py] Jake VanderPlas 2016 Python Data Science Handbook (httpsgithubcomjakevdpPythonDSHandbook-Py) esp Ch 3 on pandas httppandaspydataorg
bull Dagger[DataCarpentry] Colin Gillespie and Robin Lovelace 2017 Efficient R Programming https
csgillespiegithubioefficientR esp Ch 6 on ldquoefficient data carpentryrdquo
bull sect[Wrangler] Joseph M Hellerstein Jeffrey Heer Tye Rattenbury and Sean Kandel 2017 DataWrangling Practical Techniques for Data Preparation Tool Trifacta Wrangler httpwwwtrifactacomproductswrangler
Data wrangling practice
bull DSHandbook-Py Ch 4 ldquoData Munging String Manipulation Regular Expressions and Data Clean-ingrdquo
bull dagger[DataSimplification] Jules J Berman 2016 Data Simplification Taming Information with OpenSource Tools Elsevier httpwwwsciencedirectcomezaccesslibrariespsueduscience
book9780128037812
bull dagger[Wrangling-Py] Jacqueline Kazil Katharine Jarmul 2016 Data Wrangling with Python OrsquoReillyhttpproquestcombosafaribooksonlinecomezaccesslibrariespsuedu9781491948804
bull dagger[Wrangling-R] Bradley C Boehmke 2016 Data Wrangling with R Springer httplink
springercomezaccesslibrariespsuedubook1010072F978-3-319-45599-0
Record Linkage Entity Resolution Deduplication
bull dagger[DataMatching] Peter Christen 2012 Data Matching Concepts and Techniques for Record Link-age Entity Resolution and Duplicate Detectionhttplinkspringercomezaccesslibrariespsuedubook1010072F978-3-642-31164-2
(esp Ch 2 ldquoThe Data Matching Processrdquo)
bull DataSimplification Chapter 5 ldquoIdentifying and Deidentifying Datardquo
bull Dagger[EntityResolution] Lise Getoor and Ashwin Machanavajjhala 2013 ldquoEntity Resolution for BigDatardquo Tutorial KDD httpwwwumiacsumdedu~getoorTutorialsER_KDD2013pdf
bull Dagger[SyrianCasualties] Peter Sadosky Anshumali Shrivastava Megan Price and Rebecca C Steorts2015 ldquoBlocking Methods Applied to Casualty Records from the Syrian Conflictrdquo httpsarxiv
orgabs151007714 (For more on blocking see DataMatching Ch 4 ldquoIndexingrdquo)
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoStatis-tical Matching and Record Linkagerdquo
ldquoMaking up datardquo Imputation Smoothers Kernels Priors Filters Teleportation Negative Sampling Con-volution Augmentation Adversarial Training
21
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
bull dagger[Imputation] Yi Deng Changgee Chang Moges Seyoum Ido and Qi Long 2016 ldquoMultiple Im-putation for General Missing Data Patterns in the Presence of High-dimensional Datardquo ScientificReports 6(21689) httpswwwnaturecomarticlessrep21689
bull Dagger[Overimputation] Matthew Blackwell James Honaker and Gary King 2017 ldquoA Unified Ap-proach to Measurement Error and Missing Data Overview and Applicationsrdquo Sociological Methodsamp Research 46(3) 303-341 httpgkingharvardedufilesgkingfilesmeasurepdf
bull Shalizi-ADA Sect 15 (Linear Smoothers) Ch 8 (Splines) Sect 144 (Kernel Density Estimates)DataScience-Python NB 0513 ldquoKernel Density Estimationrdquo
bull FightinWords (re Bayesian priors as additional data impact of priors on regularization)
bull DeepLearning Section 75 (Data Augmentation) 712 (Dropout) 713 (Adversarial Examples) Chap-ter 9 (Convolutional Networks)
bull dagger[Adversarial] Ian J Goodfellow Jonathon Shlens Christian Szegedy 2015 ldquoExplaining and Har-vesting Adversarial Examplesrdquo httpsarxivorgabs14126572
bull See also resampling and simulation methods
bull See also feature engineering preprocessing
bull CRAN Task Views httpscranr-projectorgwebviewsOfficialStatisticshtml ldquoImpu-tationrdquo
(Direct) Data Representations Data Mappings
bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo
bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu
book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo
bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)
bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-
45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo
bull See also Social Data Structures
Similarity Distance Association Covariance The Kernel Trick
bull PatternRecognition Section 23 ldquoProximity Measuresrdquo
bull Dagger[Similarity] Brendan OrsquoConnor 2012 ldquoCosine similarity Pearson correlation and OLS coeffi-cientsrdquo httpsbrenoconcomblog201203cosine-similarity-pearson-correlation-and-
ols-coefficients
bull For relatively comprehensive lists see also
M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc
downloaddoi=10112126533amprep=rep1amptype=pdf
Seung-Seok Choi Sung-Hyuk Cha Charles C Tappert 2010 ldquoA Survey of Binary Similarity and DistanceMeasuresrdquo Journal of Systemics Cybernetics amp Informatics 8(1)43-8 httpwwwiiisciorgJournalCV$
scipdfsGS315JGpdf
22
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
Sung-Hyuk Cha 2007 ldquoComprehensive Survey on DistanceSimilarity Measures between Probability DensityFunctionsrdquo International Journal of Mathematical Models and Methods in Applied Sciences 4(1) 300-7 http
csispaceeductappertdpsd861-12session4-p2pdf
Anna Huang 2008 ldquoSimilarity Measures for Text Document Clusteringrdquo Proceedings of the 6th New ZealandComputer Science Research Student Conference 49-56 httpwwwnzcsrsc08canterburyacnzsiteproceedingsIndividual_Paperspg049_Similarity_Measures_for_Text_Document_Clusteringpdf
bull Michael B Jordan ldquoThe Kernel Trickrdquo (Lecture Notes) httpspeopleeecsberkeleyedu
~jordancourses281B-spring04lectureslec3pdf
bull Eric Kim 2017 ldquoEverything You Ever Wanted to Know about the Kernel Trick (But Were Afraidto Ask)rdquo httpwwweric-kimneteric-kim-netposts1kernel_trick_blog_ekim_12_20_
2017pdf
Derived Data Representations - Dimensionality Reduction Compression De-composition Embeddings
The groupings here and under the measurement multivariate statistics section are particularly arbitraryFor example ldquok-Means clusteringrdquo can be viewed as a technique for ldquodimensionality reductionrdquo ldquocompres-sionrdquo ldquofeature extractionrdquo ldquolatent variable measurementrdquo ldquounsupervised learningrdquo ldquocollaborative filteringrdquo
Clustering hashing quantization blocking compression
bull Dagger[Compression] Khalid Sayood 2012 Introduction to Data Compression 4th ed Springer httpwwwsciencedirectcomezaccesslibrariespsuedusciencebook9780124157965 (eg cod-ing blocking via vector quantization)
bull MMDS Ch3 ldquoFinding Similar Itemsrdquo (minhashing locality sensitive hashing)
bull Multivariate-R Ch 6 ldquoClusteringrdquo
bull DataScience-Python NB 0511 ldquok-Means Clusteringrdquo NG 0512 ldquoGaussian Mixture Modelsrdquo
bull Dagger[KMeansHashing] Kaiming He Fang Wen Jian Sun 2013 ldquoK-means Hashing An Affinity-Preserving Quantization Method for Learning Binary Compact Codesrdquo CVPR httpswwwcv-foundationorgopenaccesscontent_cvpr_2013papersHe_K-Means_Hashing_An_2013_CVPR_paper
bull dagger[Squashing] Madigan D Raghavan N Dumouchel W Nason M Posse C and RidgewayG (2002) ldquoLikelihood-based data squashing A modeling approach to instance constructionrdquo DataMining and Knowledge Discovery 6(2) 173-190 httpdxdoiorgezaccesslibrariespsu
edu101023A1014095614948
bull dagger[Core-sets] Piotr Indyk Sepideh Mahabadi Mohammad Mahdian Vahab S Mirrokni 2014 ldquoCom-posable core-sets for diversity and coverage maximizationrdquo PODS rsquo14 100-8 httpsdoiorg10114525945382594560
Feature selection feature extraction feature engineering weighting preprocessing
bull PatternRecognition Sections 26-7 ldquoFeature Selection Feature Extractionrdquo
bull DSHandbook-Py Ch 7 ldquoInterlude Feature Extraction Ideasrdquo
bull DataScience-Python NB 0504 ldquoFeature Engineeringrdquo
bull Features in text NLP and InfoRetrieval re tfidf and similar FightinWords
bull Features in images DataScience-Python NB 0514 ldquoImage Featuresrdquo
23
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
Dimensionality reduction decomposition factorization change of basis reparameterization matrix com-pletion latent variables source separation
bull MMDS Ch 9 ldquoRecommendation Systemsrdquo Ch11 ldquoDimensionality Reductionrdquo
bull DSHandbook-Py Ch 10 ldquoUnsupervised Learning Clustering and Dimensionality Reductionrdquo
bull Dagger[Shalizi-ADA] Cosma Rohilla Shalizi 2017 Advanced Data Analysis from an Elementary Pointof View Ch 16 (ldquoPrincipal Components Analysisrdquo) Ch 17 (ldquoFactor Modelsrdquo) httpwwwstat
cmuedu~cshaliziADAfaEPoV
See also Multivariate-R Ch 3 ldquoPrincipal Components Analysisrdquo Ch 4 ldquoMultidimensional ScalingrdquoCh 5 ldquoExploratory Factor Analysisrdquo Latent
bull DeepLearning Ch 2 ldquoLinear Algebrardquo Ch 13 ldquoLinear Factor Modelsrdquo
bull Dagger[GloVe] Jeffrey Pennington Richard Socher Christopher Manning 2014 ldquoGloVe Global vectorsfor word representationrdquo EMNLP httpsnlpstanfordeduprojectsglove
bull dagger[NMF] Daniel D Lee and H Sebastian Seung 1999 ldquoLearning the parts of objects by non-negativematrix factorizationrdquo Nature 401788-791 httpdxdoiorgezaccesslibrariespsuedu
10103844565
bull dagger[CUR] Michael W Mahoney and Petros Drineas 2009 ldquoCUR matrix decompositions for improveddata analysisrdquo PNAS httpwwwpnasorgcontent1063697full
bull Dagger[ICA] Aapo Hyvarinen and Erkki Oja 2000 ldquoIndependent Component Analysis Algorithmsand Applicationsrdquo Neural Networks 13(4-5) 411-30 httpswwwcshelsinkifiuahyvarin
papersNN00newpdf
bull [RandomProjection] Ella Bingham and Heikki Mannilla 2001 ldquoRandom projection in dimension-ality reduction applications to image and text datardquo KDD httpsdoiorg1011452F502512
502546
bull Compression eg ldquoTransform codingrdquo ldquoWaveletsrdquo
Nonlinear dimensionality reduction Manifold learning
bull DeepLearning Section 5113 ldquoManifold Learningrdquo
bull DataScience-Python NB 0510 ldquoManifold Learningrdquo (Locally linear embedding [LLE] Isomap)
bull Dagger[KernelPCA] Sebastian Raschka 2014 ldquoKernel tricks and nonlinear dimensionality reduction viaRBF kernel PCArdquo httpsebastianraschkacomArticles2014_kernel_pcahtml
bull Shalizi-ADA Ch 18 ldquoNonlinear Dimensionality Reductionrdquo (LLE)
bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http
webcseohio-stateedu~belkin8papersLEM_NC_03pdf
bull DeepLearning Ch14 (ldquoAutoencodersrdquo)
bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781
bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722
bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9
vandermaaten08avandermaaten08apdf
24
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
Computation and Scaling Up
Scientific computing computation at scale
bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo
bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo
bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https
peopledukeedu~ccc14sta-663indexhtml
bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml
Numerical computing
bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning
bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml
Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)
bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)
bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo
bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml
Linear algebra matrix computations
bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press
bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-
science-repoRecommender-Systems-[Netflix]pdf
bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf
bull Dagger[RandomSVD] Andrew Tulloch 2009 ldquoFast Randomized SVDrdquo httpsresearchfbcom
fast-randomized-svd
bull Dagger[FactorSGD] Rainer Gemulla Peter J Haas Erik Nijkamp and Yannis Sismanis 2011 ldquoLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descentrdquo KDD httpwwwcs
utahedu~hariteachingbigdatagemulla11dsgdpdf
bull Dagger[Sparse] Max Grossman 2015 ldquo101 Ways to Store a Sparse Matrixrdquo httpsmediumcom
jmaxg3101-ways-to-store-a-sparse-matrix-c7f2bf15a229
bull Dagger[DontInvert] John D Cook 2010 ldquoDonrsquot Invert that Matrixrdquo httpswwwjohndcookcom
blog20100119dont-invert-that-matrix
Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference
25
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev
markov-chains
bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)
bull Bayes ComputationalStatistics-Py
bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo
bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml
bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml
Parallelism MapReduce Split-Apply-Combine
bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https
wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube
bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)
bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo
bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo
bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)
bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww
rstudiocomresourcesvideosdata-science-in-the-tidyverse
bull See also FactorSGD
Functional Programming
bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises
bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo
bull See also Haskell
Scaling iteration streaming data online algorithms (Spark)
bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo
bull MMDS Ch 4 ldquoMining Data Streamsrdquo
bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu
software
bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu
edubook101007978-1-4842-0964-6
bull DSHandbook-Py Ch 13 ldquoBig Datardquo
bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo
General Resources for Python and R
26
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http
onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919
bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR
bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython
Cutting and Bleeding Edge of Data Science Languages
bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-
comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction
docID=1901910 DaggerScala site httpswwwscala-langorg (See also )
bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom
libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang
org
bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess
librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg
bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https
link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu
detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg
bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww
udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg
bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai
Social Data Structures
Space and Time
bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094
bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford
bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007
2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016
bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16
27
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo
bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998
bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill
bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings
bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution
bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran
r-projectorgwebviewsTimeSerieshtml
Network Graphs
bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome
kleinbernetworks-book
bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo
bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https
link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8
bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook
1010072F978-3-319-53004-8
Hierarchy Clustered Data Aggregation Mixed Models
bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457
bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction
docID=10748641
Social Data Channels
Language Text Speech Audio
bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3
bull InfoRetrieval
bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP
28
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan
mps028
bull Quinn-Topics
bull FightinWords
bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http
linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8
bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo
bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml
bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml
bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam
pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI
bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11
bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11
Vision Image Video
bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook
bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo
bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution
bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww
morganclaypoolcomtocivm11
bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom
toccov11
Approaches to Learning from Data (The Analytics Layer)
bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo
bull MMDS (data-mining)
bull CausalInference
bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21
bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen
econometrics
29
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu
~garethISLindexhtml
bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer
bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http
linkspringercombook1010072F978-0-387-92407-6
bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007
2F978-1-4614-2299-0
bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer
bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038
nature14539)
See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources
bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook
101007978-981-10-2534-1
30
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
Penn State Policy Statements
Academic Integrity
Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts
Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others
Disability Accomodation
Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)
In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations
Psychological Services
Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation
bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395
bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400
bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741
Educational Equity
Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)
31
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32
Background Knowledge
Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others
With regard to specialization students are expected to have advanced (graduate) training in ONEof the following
bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR
bull statistics (as would be the case for a second-year PhD student in Statistics) OR
bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR
bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)
This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor
With regard to general preparation students are expected to have ALL of the following technicalknowledge
bull basic programming skills including basic facility with R ampor Python AND
bull basic knowledge of relational databases ampor geographic information systems AND
bull basic knowledge of probability applied statistics ampor social science research design AND
bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)
It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program
To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)
32