introduction. computational journalism week 1
Jonathan Stray, Columbia University, Fall 2015
Syllabus at http://www.compjournalism.com/?p=133
-
Frontiers of Computational Journalism
Columbia Journalism School
Week 1: Introduction
September 11, 2015
-
Lecture 1: Basics
Computer Science and Journalism
Course Structure
Interpreting High Dimensional Data
-
Computational Journalism: Definitions
Broadly defined, it can involve changing how stories are discovered, presented, aggregated, monetized, and archived. Computation can advance journalism by drawing on innovations in topic detection, video analysis, personalization, aggregation, visualization, and sensemaking. - Cohen, Hamilton, Turner, Computational Journalism, 2011
-
Computational Journalism: Definitions
Stories will emerge from stacks of financial disclosure forms, court records, legislative hearings, officials' calendars or meeting notes, and regulators' email messages that no one today has time or money to mine. With a suite of reporting tools, a journalist will be able to scan, transcribe, analyze, and visualize the patterns in these documents. - Cohen, Hamilton, Turner, Computational Journalism, 2011
-
Cohen et al. model
[Diagram labels: Data, Reporting, User, Computer Science]
-
CS for presentation / interaction
[Diagram labels: Data, Reporting, CS, User]
-
Filter stories for user
[Diagram labels: Data, Reporting, Filtering, User; CS marked at several points]
-
Examples of filters
o Facebook news feed
o What an editor puts on the front page
o Google News
o Reddit's comment system
o Twitter
o Techmeme
o New York Times recommendation system
-
http://snap.stanford.edu/nifty
-
Kony 2012 early network, by Gilad Lotan
-
CS in Journalism
[Diagram labels: Data, Reporting, Filtering, User, Effects; CS marked at several points]
-
Journalism with algorithms vs.
Journalism about algorithms
-
Websites Vary Prices, Deals Based on Users' Information - Valentino-DeVries, Singer-Vine and Soltani, WSJ, 2012
-
Message Machine - Jeff Larson, Al Shaw, ProPublica, 2012
-
Where does data come from?
-
Computer Science in Journalism
Reporting
Presentation
Filtering
Tracking
Algorithmic accountability
-
Quantification
Data
-
Journalism as a cycle
[Diagram labels: Data, Reporting, Filtering, User, Effects; CS marked at several points]
-
Computational Journalism: Definitions
the application of computer science to the problems of public information, knowledge, and belief, by practitioners who see their mission as outside of both commerce and government. - Jonathan Stray, A Computational Journalism Reading List, 2011
-
Course Structure
o Information retrieval: TF-IDF, search engines
o Text analysis: clustering and topic modeling
o Information filtering systems
o Social network analysis
o Knowledge representation
o Drawing conclusions from data
o Writing about data
o Information security
o Tracking flow and effects
-
Natural Language Processing
Visualization
Sociology
Artificial Intelligence
Cognitive Science
Statistics
Graph Theory
Clustering
Text Analysis
Filter Design
Social Network Analysis
Knowledge Representation
Drawing Conclusions
Information Retrieval
Epistemology
-
Administration
Assignment after each class
Four assignments require programming, but your writing counts for more than your code!
Course blog http://compjournalism.com
Final project for 6-pt students only
-
Grading
Dual degree students
Pass/Fail. Final project: paper, story, or software.
Non-journalism students
80% assignments, 20% class participation
-
Definition of data?
-
My definition of data
a collection of related pieces of recorded information
-
structured data
-
unstructured data
-
Quantification
$$
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
$$
-
Other things that are tricky to quantify, but quantified anyway
o Intelligence
o Academic performance
o Gender
o Race, ethnicity, nationality
o Number of sexual harassment incidents
o Income
o Political ideology
o ...
-
Different types of quantitative
Numeric
o continuous
o countable
o bounded?
o units of measurement?
Categorical
o finite, e.g. {on, off}
o infinite, e.g. {red, yellow, blue, ... chartreuse}
o ordered?
o equivalence classes or other structure?
-
Different types of scales
Temperature
Continuous scale, fixed zero point, physical units, comparative, uniform
Likert scale
Discrete scale, no fixed origin, abstract units, comparative, non-uniform
-
Likert scales are non-uniform
-
No averages on a non-uniform scale
It's not linear, so is 2X1 twice as good?
(X1+c) (X2+c) X1 X2
Lots of things don't make much sense, such as
sum(X1 ... XN) / N = ?
Average is not well defined! (Nor std dev, etc.)
But rank order statistics are robust.
And all of this might not be a problem in practice.
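The claim about averages versus rank order statistics can be checked with a small sketch. The responses and the alternative spacing of the scale points below are made up for illustration; the only assumption is that any order-preserving relabeling of a Likert scale is as defensible as the 1 to 5 labels.

```python
from statistics import mean, median

# Hypothetical 5-point Likert responses from two groups (made-up data).
group_a = [1, 1, 4, 5, 5]
group_b = [3, 3, 3, 3, 3]

# Assumed alternative spacing: same order of scale points, unequal gaps.
respaced = {1: 1, 2: 6, 3: 7, 4: 8, 5: 9}

def relabel(values, mapping):
    """Apply an order-preserving relabeling of the scale points."""
    return [mapping[v] for v in values]

a2 = relabel(group_a, respaced)
b2 = relabel(group_b, respaced)

# The comparison of means depends on the arbitrary spacing of the labels.
mean_a_higher_before = mean(group_a) > mean(group_b)
mean_a_higher_after = mean(a2) > mean(b2)

# The comparison of medians (a rank order statistic) does not.
median_a_higher_before = median(group_a) > median(group_b)
median_a_higher_after = median(a2) > median(b2)

print(mean_a_higher_before, mean_a_higher_after)
print(median_a_higher_before, median_a_higher_after)
```

With the original labels group A has the higher mean, with the respaced labels group B does, while the median comparison is the same under both labelings.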
-
Other issues with quantitative
Where did the data come from?
o physical measurement
o computer logging
o human recording
What are the sources of error?
o measurement error
o missing data
o ambiguity in human classification
o process errors
o intentional bias / deception
-
Vector representation of objects
Fundamental representation for many data mining, clustering, machine learning, visualization, NLP, etc. algorithms.
$$
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
$$
Each x_i is a numerical or categorical feature
N = number of features or "dimension"
-
Examples of features
o number of claws
o latitude
o color {red, yellow, blue}
o number of break-ins
o 1 for bought X, 0 for did not buy X
o time, duration, etc.
o number of times word Y appears in document
o votes cast
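As a rough illustration of how features like these become a vector, the sketch below encodes one record with two numeric features and one categorical feature. The field names (`claws`, `latitude`, `color`) and the one-hot encoding of the category are assumptions chosen for the example, not something the slides prescribe.

```python
# Assumed finite category set for the hypothetical "color" feature.
COLORS = ["red", "yellow", "blue"]

def one_hot(value, categories):
    """Encode a categorical value as 0/1 indicators, one per category."""
    return [1.0 if value == c else 0.0 for c in categories]

def to_vector(record):
    """Turn a dict of features into a flat numeric vector x_1 ... x_N."""
    numeric = [float(record["claws"]), float(record["latitude"])]
    return numeric + one_hot(record["color"], COLORS)

# Made-up example record.
x = to_vector({"claws": 4, "latitude": 40.8, "color": "yellow"})
print(x)       # the vector
print(len(x))  # N, the number of features
```

Here N is 5: two numeric features plus three indicator slots for the color category.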
-
Feature selection
Technical meaning in machine learning etc.: which variables matter?
We're journalists, so we're interested in an earlier process: how to describe the world in numbers?
-
Choosing Features
Journalism
How do we represent the world numerically?
$$
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
$$
Machine learning
Which variables carry the most information?
$$
\begin{bmatrix} x_{f(1)} \\ x_{f(2)} \\ \vdots \\ x_{f(k)} \end{bmatrix}
\quad \text{where } k < N
$$
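The machine-learning step, picking k of the N features, might look something like the sketch below. It uses per-feature variance as a crude stand-in for "carries the most information"; that criterion is an assumption made for this example, and the function name `select_features` and the data are hypothetical.

```python
from statistics import pvariance

def select_features(rows, k):
    """Return (indices, reduced_rows) keeping the k highest-variance columns.

    The returned indices play the role of f(1) ... f(k) in the slide.
    Variance is only a rough proxy for how informative a feature is.
    """
    n = len(rows[0])
    variances = [pvariance([row[j] for row in rows]) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: variances[j], reverse=True)
    chosen = sorted(ranked[:k])  # keep the original column order
    reduced = [[row[j] for j in chosen] for row in rows]
    return chosen, reduced

# Made-up data: column 0 is constant, so it carries no information here.
rows = [
    [1.0, 5.0, 0.0],
    [1.0, 7.0, 1.0],
    [1.0, 9.0, 0.0],
    [1.0, 11.0, 1.0],
]
chosen, reduced = select_features(rows, k=2)
print(chosen)
print(reduced)
```

The constant column is dropped and each row shrinks from N = 3 to k = 2 entries.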
-
Examples of vector representations
Obvious
o movies watched / items purchased
o legislative voting history for a politician
o crime locations
Less obvious, but standard
o document vector space model
o psychological survey results
Tricky research problem: disparate field types
o corporate filing document
o Wikileaks SIGACT
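The document vector space model mentioned above can be sketched with plain word counts: each document becomes a vector with one entry per vocabulary word. The tokenizer and the two sample sentences are assumptions for illustration, and raw counts are used here with no weighting.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def document_vectors(docs):
    """Return (vocabulary, vectors) with one count vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    vectors = [[c[w] for w in vocab] for c in counts]
    return vocab, vectors

# Made-up documents.
docs = ["The mayor met the council.", "Council votes on budget"]
vocab, vectors = document_vectors(docs)
print(vocab)
for v in vectors:
    print(v)
```

Every document gets a vector of the same length, so documents of any size can be compared entry by entry.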
-
What can we do with vectors?
Predict one variable based on others
o this is called regression
o or maybe "classification"
o supervised machine learning
Group similar items together
o this is clustering
o or maybe "classification" with unknown categories
o unsupervised machine learning
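Both uses can be sketched in a few lines. The slides name only "regression" and "clustering"; the specific techniques below (an ordinary least squares line fit and plain k-means with caller-supplied starting centroids) are choices made for this example, and all data is made up.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from p to each centroid.
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(mean(dim) for dim in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Supervised: predict y from x.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Unsupervised: group similar points together.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(slope, intercept)
print(centroids)
```

The regression needs known y values to learn from, while the clustering sees only the points themselves, which is the supervised versus unsupervised distinction in the list above.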