
Corpus-Based Discourse Analysis
Recent Developments and Future Directions

Philipp Heinrich

Computational Corpus Linguistics Group
Friedrich-Alexander University of Erlangen-Nuremberg

http://philipp-heinrich.eu

Seoul, October 1, 2018

Philipp Heinrich, MSc (FAU) Corpus-Based Discourse Analysis October 1, 2018 1 / 32


Computational Linguistics

source: Nautilus (Christopher D. Manning)

Corpus Linguistics

- a corpus
  - is a collection of machine-readable texts
  - can be processed and analyzed using methods from computational linguistics
  - can be a sample of authentic language data and can as such be representative of a language (variety)
- corpus linguistics
  - creation and processing of corpora
  - analysis and interpretation of corpora
- research questions in corpus linguistics
  - main goal: research of language usage
  - empirical testing of linguistic hypotheses
  - language varieties and dialects
  - corpus-based grammars, psycholinguistics, ...

1 Introduction
  - Computational Corpus Linguistics
  - Methods in CCL
2 Corpus-Based Discourse Analysis
  - Basic Methodology
  - Case Studies
  - Extensions
3 The Future of CCL
  - Deep Learning and CCL
  - Towards a Hermeneutic Cyborg

Introduction: Methods in CCL

Keywords

- keywords are words that occur more frequently in a text than what would be expected assuming random variation
- keywords are calculated with respect to a reference corpus
- contingency table of observed frequencies for every word $w$:

            corpus 1       corpus 2
  $w$       $k_1$          $k_2$
  $\neg w$  $n_1 - k_1$    $n_2 - k_2$


The same contingency table in general notation, with row sums $R_i$, column sums $C_j$ and sample size $N$ (where $O := O_{11}$):

            corpus 1    corpus 2
  $w$       $O_{11}$    $O_{12}$    $= R_1$
  $\neg w$  $O_{21}$    $O_{22}$    $= R_2$
            $= C_1$     $= C_2$     $= N$
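The bookkeeping behind the keyword contingency table can be sketched in a few lines of Python. This is only an illustration: the function name `contingency_table` and the two toy corpora are assumptions made for the example and do not come from the talk.

```python
from collections import Counter


def contingency_table(word, target_tokens, reference_tokens):
    """Return ((O11, O12), (O21, O22)) for `word` in two token lists."""
    k1 = Counter(target_tokens)[word]      # occurrences of w in corpus 1
    k2 = Counter(reference_tokens)[word]   # occurrences of w in corpus 2
    n1, n2 = len(target_tokens), len(reference_tokens)
    return ((k1, k2), (n1 - k1, n2 - k2))


# Toy data (hypothetical), only meant to show the bookkeeping.
corpus_1 = "the reactor leak forced the town to evacuate".split()
corpus_2 = "the town held a summer festival near the river".split()

table = contingency_table("the", corpus_1, corpus_2)
(o11, o12), (o21, o22) = table
row_sums = (o11 + o12, o21 + o22)   # R1, R2
col_sums = (o11 + o21, o12 + o22)   # C1, C2 (the two corpus sizes)
total = sum(row_sums)               # N
```

The column sums are simply the sizes of the two corpora, which is why the table can be written either with $k_i$ and $n_i$ or with $O_{ij}$, $R_i$, $C_j$ and $N$.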

Indifference (Independence)

- association measures (AMs) quantify how far the observed frequencies diverge from the frequencies expected under independence in the contingency table
- indifference table (where $E := E_{11}$):

            corpus 1                        corpus 2
  $w$       $E_{11} = \frac{R_1 C_1}{N}$    $E_{12} = \frac{R_1 C_2}{N}$    $= R_1$
  $\neg w$  $E_{21} = \frac{R_2 C_1}{N}$    $E_{22} = \frac{R_2 C_2}{N}$    $= R_2$
            $= C_1$                         $= C_2$                         $= N$
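As a small worked illustration of the indifference table, the following Python sketch computes $E_{ij} = R_i C_j / N$ from a table of observed counts. The function name and the counts are hypothetical and chosen only for the example.

```python
def expected_frequencies(observed):
    """Expected counts E_ij = R_i * C_j / N for a 2x2 table of observed counts."""
    (o11, o12), (o21, o22) = observed
    r1, r2 = o11 + o12, o21 + o22   # row sums R1, R2
    c1, c2 = o11 + o21, o12 + o22   # column sums C1, C2
    n = r1 + r2                     # sample size N
    return ((r1 * c1 / n, r1 * c2 / n),
            (r2 * c1 / n, r2 * c2 / n))


# Hypothetical counts: w occurs 30 times in corpus 1 and 10 times in corpus 2.
observed = ((30, 10), (970, 1990))
expected = expected_frequencies(observed)
```

By construction the expected table has the same row and column sums as the observed one; only the distribution of counts across the cells differs.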

Statistical Association Measures

            corpus 1                 corpus 2
  $w$       $O_{11}$ vs. $E_{11}$    $O_{12}$ vs. $E_{12}$    $R_1$
  $\neg w$  $O_{21}$ vs. $E_{21}$    $O_{22}$ vs. $E_{22}$    $R_2$
            $= C_1$                  $= C_2$                  $= N$

\[ \text{t-score} = \frac{O - E}{\sqrt{O}} \]

\[ \text{LL} = 2 \sum_{ij} O_{ij} \log \frac{O_{ij}}{E_{ij}} \]

\[ \chi^2 = \sum_{ij} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

\[ \text{PoiL} = e^{-E_{11}} \, \frac{E_{11}^{O_{11}}}{O_{11}!} \]

\[ \text{Fisher} = \sum_{k = O_{11}}^{\min\{R_1, C_1\}} \frac{\binom{C_1}{k} \binom{C_2}{R_1 - k}}{\binom{N}{R_1}} \]

...
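The measures above translate fairly directly into code. The following Python sketch is a plain transcription of the formulas for a single 2x2 table, not an optimized or numerically robust implementation; all function names and the example counts are assumptions. It requires Python 3.8 or later for `math.comb`, and `t_score` is undefined when $O_{11} = 0$.

```python
import math


def expected(obs):
    """Expected counts E_ij = R_i * C_j / N under independence."""
    (o11, o12), (o21, o22) = obs
    r1, r2, c1, c2 = o11 + o12, o21 + o22, o11 + o21, o12 + o22
    n = r1 + r2
    return ((r1 * c1 / n, r1 * c2 / n), (r2 * c1 / n, r2 * c2 / n))


def t_score(obs):
    o, e = obs[0][0], expected(obs)[0][0]
    return (o - e) / math.sqrt(o)


def log_likelihood(obs):
    exp = expected(obs)
    # Cells with O_ij = 0 contribute nothing (0 * log 0 is taken as 0).
    return 2 * sum(o * math.log(o / e)
                   for o_row, e_row in zip(obs, exp)
                   for o, e in zip(o_row, e_row) if o > 0)


def chi_squared(obs):
    exp = expected(obs)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(obs, exp)
               for o, e in zip(o_row, e_row))


def poisson_likelihood(obs):
    o, e = obs[0][0], expected(obs)[0][0]
    return math.exp(-e) * e ** o / math.factorial(o)


def fisher(obs):
    """One-sided Fisher p-value: sum of hypergeometric terms for k >= O11."""
    (o11, o12), (o21, o22) = obs
    r1, c1, c2 = o11 + o12, o11 + o21, o12 + o22
    n = c1 + c2
    return sum(math.comb(c1, k) * math.comb(c2, r1 - k)
               for k in range(o11, min(r1, c1) + 1)) / math.comb(n, r1)


obs = ((8, 2), (2, 8))   # hypothetical counts; every expected cell is 5
```

Note that the measures live on different scales: t-score, LL and $\chi^2$ grow with the strength of the evidence, whereas PoiL and Fisher are probabilities that shrink as the evidence gets stronger, so rankings have to be sorted accordingly.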

Collocations

- distributional hypothesis (Firth, 1957):
  - “you shall know a word by the company it keeps”
  - “one of the meanings of night is its collocability with dark, and of dark, of course, its collocation with night”
- collocations are based on observed co-occurrence frequencies of word pairs $(w_1, w_2)$:

              $w_2$       $\neg w_2$
  $w_1$       $O_{11}$    $O_{12}$      $= R_1$
  $\neg w_1$  $O_{21}$    $O_{22}$      $= R_2$
              $= C_1$     $= C_2$       $= N$

- different types of co-occurrence


Surface Co-occurrence

A vast deal of coolness and a peculiar degree of judgement, are requisite in catching a hat. A man must not be precipitate, or he runs over it; he must not rush into the opposite extreme, or he loses it altogether. [...] There was a fine gentle wind, and Mr. Pickwick’s hat rolled sportively before it. The wind puffed, and Mr. Pickwick puffed, and the hat rolled over and over as merrily as a lively porpoise in a strong tide; and on it might have rolled, far beyond Mr. Pickwick’s reach, had not its course been providentially stopped, just as that gentleman was on the point of resigning it to its fate.
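Surface co-occurrence is usually counted within a fixed window of tokens around each occurrence of the node word. A minimal sketch, assuming a symmetric window of three tokens and naive whitespace tokenization (both are illustrative choices, not taken from the slides):

```python
from collections import Counter


def surface_cooccurrences(tokens, node, span=3):
    """Count tokens within `span` positions left or right of each `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        lo, hi = max(0, i - span), min(len(tokens), i + span + 1)
        # Everything in the window except the node position itself.
        counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts


text = "the wind puffed and the hat rolled over and over as merrily as a porpoise"
window = surface_cooccurrences(text.split(), "hat", span=3)
```

In this toy example only the first "over" falls inside the window around "hat"; the second one is four tokens away and is therefore not counted.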

Textual Co-occurrence

1. A vast deal of coolness and a peculiar degree of judgement, are requisite in catching a hat.
2. A man must not be precipitate, or he runs over it;
3. he must not rush into the opposite extreme, or he loses it altogether.
4. There was a fine gentle wind, and Mr. Pickwick’s hat rolled sportively before it.
5. The wind puffed, and Mr. Pickwick puffed, and the hat rolled over and over as merrily as a lively porpoise in a strong tide;

  sentence   hat    over
  1          hat    -
  2          -      over
  3          -      -
  4          hat    -
  5          hat    over
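With textual co-occurrence, each textual unit (here: a sentence) counts once, no matter how often a word appears in it. The sketch below turns the presence pattern of hat and over in the five sentences into a 2x2 table; the function name is hypothetical and the units are reduced to the two words of interest.

```python
def textual_contingency(units, w1, w2):
    """2x2 table of textual units (given as token sets) containing w1 and/or w2."""
    o11 = sum(1 for u in units if w1 in u and w2 in u)
    o12 = sum(1 for u in units if w1 in u and w2 not in u)
    o21 = sum(1 for u in units if w1 not in u and w2 in u)
    o22 = len(units) - o11 - o12 - o21
    return ((o11, o12), (o21, o22))


# Presence of "hat" and "over" in the five sentences of the example.
units = [{"hat"}, {"over"}, set(), {"hat"}, {"hat", "over"}]
hat_over = textual_contingency(units, "hat", "over")
```

The repeated "over and over" in the last sentence contributes only once, which is the main difference from window-based surface counts.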

Syntactic Co-occurrence

In an open barouche [...] stood a stout old gentleman, in a blue coat and bright buttons, corduroy breeches and top-boots; two young ladies in scarfs and feathers; a young gentleman apparently enamoured of one of the young ladies in scarfs and feathers; a lady of doubtful age, probably the aunt of the aforesaid; and [...]

Adjective-noun pairs:

- open barouche
- stout gentleman
- old gentleman
- blue coat
- bright button
- young lady
- young gentleman
- young lady
- doubtful age

Collocates of bucket (noun)

  noun        f      verb        f     adjective      f
  water       183    throw       36    large          37
  spade       31     fill        29    single-record  5
  plastic     36     randomize   9     cold           13
  slop        14     empty       14    galvanized     4
  size        41     tip         10    ten-record     3
  mop         16     kick        12    full           20
  record      38     hold        31    empty          9
  bucket      18     carry       26    steaming       4
  ice         22     put         36    full-track     2
  seat        20     chuck       7     multi-record   2
  coal        16     weep        7     small          21
  density     11     pour        9     leaky          3
  brigade     10     douse       4     bottomless     3
  algorithm   9      fetch       7     galvanised     3
  shovel      7      store       7     iced           3
  container   10     drop        9     clean          7
  oats        7      pick        11    wooden         6
  sand        12     use         31    old            19
  Rhino       7      tire        3     ice-cold       2
  champagne   10     rinse       3     anti-sweat     1


Corpus-Based Discourse Analysis: Basic Methodology

From Text to Discourse

- Foucault (1969): discourses as statements in conversation
- interpretation of text means categorizing
  - utterances
  - sentences
  - paragraphs
  - tweets
- the categories
  - are not known a priori
  - must be made up on the fly by the hermeneutic interpreter
- CDA is fundamentally different from (statistical) text classification
- ultimate goal of critical discourse analysis: discover what is said by whom (power relations)


Concordances


Corpus-Based Discourse Analysis (CDA)

- CDA means analyzing and deconstructing concordance lines
  - concordances are the essence of discourses
- finding discourses: nodes + attitudes
  - (topic) nodes can be defined by keywords or (more generally) corpus queries
  - attitudes: collocates that are retrieved by statistical methods
- examples
  - “refugees as victims” (Baker, 2006)
  - “Fukushima as worst case scenario”
- in practice:
  - look at the (n best) collocates of the topic node
  - categorize them into groups that are made up on the fly
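Since concordance lines are the raw material of this workflow, a minimal keyword-in-context (KWIC) sketch may help to make the idea concrete. The function name, the context width and the toy sentence are assumptions for illustration; real corpus tools use query languages and annotated corpora rather than whitespace-split strings.

```python
def concordance(tokens, node, context=4):
    """Return KWIC lines as (left context, node, right context) triples."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append((left, token, right))
    return lines


# Hypothetical sentence, used only to show the output format.
tokens = ("refugees are often described as victims "
          "while refugees themselves rarely speak").split()
kwic = concordance(tokens, "refugees")

for left, node, right in kwic:
    print(f"{left:>30}  [{node}]  {right}")
```

Reading such lines side by side is what allows the interpreter to group collocates into categories such as "refugees as victims".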


Collocations


Corpus-Based Discourse Analysis: Case Studies

Case Studies

- refugees (KhosraviNik, 2010)
  - “The representation of refugees, asylum seekers and immigrants in British newspapers: a critical discourse analysis.”
  - CDA investigation on discursive strategies employed by various British newspapers between 1996-2006 in the ways they represent refugees, asylum seekers and immigrants.
- gender (Baker, 2014)
  - “Using Corpora to Analyze Gender”
  - collection of case studies wrt. changes in sexist and non-sexist language use over time, personal adverts, press representation of gay men, and the ways that boys and girls are constructed through language
- LGBT (Love and Baker, 2015)
  - “The hate that dare not speak its name?”
  - How have the British Parliamentary arguments against LGBT equality changed in response to decreasing social acceptability of discriminatory language against minority groups?


PhD project: Exploring the Fukushima Effect

- identification and analysis of the tempo-spatial propagation of discourses in the transnational algorithmic public sphere
- case study: Fukushima Effect (cf. Gono’i, 2015)
  - attitudes and opinions towards energy sources
- data: mass and social media (German, Japanese)
  - intra- and transmedial and -national
  - “edited mass communication” vs. “mass self-communication”
- further information:
  - www.linguistik.fau.de/projects/efe/
  - funded by the Emerging Fields Initiative of FAU
  - Team:
    - Chair of Computational Corpus Linguistics
    - Chair of Japanese Studies
    - Chair of Communication Science
    - Chair of Visual Computing

Corpora – Social Media (Twitter)

- German Twitter
  - 10,266,835 original posts
  - linguistic annotation:
    - tokenization: SoMaJo (Proisl and Uhrig, 2016)
    - POS-tagging: SoMeWeTa (Proisl, 2018)
    - lemmatization: work in progress
- Japanese Twitter
  - 411,452,027 original posts
  - linguistic annotation:
    - special dictionary: ipadic-neologd (Sato et al., 2017)

Corpus-Based Discourse Analysis: Extensions

Identification of Social Bots (Schäfer et al., 2017)

1. normalization of texts
2. mapping of normalized strings onto tweet ids
3. extension: hierarchical clustering based on Levenshtein distance

Footprint of a social bot net:

- number of near duplicates
- number of user accounts
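The first two steps, plus the distance function needed for the third, might look roughly as follows in Python. The normalization rules used here (dropping URLs, @mentions and a leading "RT", lowercasing) are assumptions made for the example; the slides do not state which rules Schäfer et al. (2017) actually applied, and the hierarchical clustering itself is left out.

```python
import re
from collections import defaultdict


def normalize(text):
    """Assumed normalization: drop URLs, @mentions and 'RT', lowercase, squeeze spaces."""
    text = re.sub(r"https?://\S+|@\w+|\bRT\b", " ", text)
    return " ".join(text.lower().split())


def group_near_duplicates(tweets):
    """Map each normalized string to the (tweet id, user) pairs that share it."""
    groups = defaultdict(list)
    for tweet_id, user, text in tweets:
        groups[normalize(text)].append((tweet_id, user))
    return groups


def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical tweets as (id, user, text) triples.
tweets = [
    (1, "a", "Vote now! http://x.example/1"),
    (2, "b", "RT @a vote now!"),
    (3, "c", "Lovely weather today"),
]
groups = group_near_duplicates(tweets)

# Footprint per group: number of near duplicates and number of user accounts.
footprint = {text: (len(members), len({user for _, user in members}))
             for text, members in groups.items()}
```

Groups with many near duplicates spread over many accounts are the candidates for a bot net; the Levenshtein distance could then be used to merge groups whose normalized strings differ only slightly.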

Identification of Social Bots during the Japanese General Election of 2014

Visualization

- high-dimensional word embeddings (Word2Vec) (Mikolov et al., 2013)
  - based on shallow, two-layer neural networks
  - capturing co-occurrence information of words in 50–1000 dimensions
- t-distributed stochastic neighbour embedding (t-SNE) (van der Maaten and Hinton, 2008)
  - project high-dimensional embeddings onto two-dimensional plane
  - semantically similar items are pre-grouped together
- size of lexical items represents association strength towards (topic) node (Evert, 2008)
  - different AMs retrieve different sets of collocates and sizes
- see Heinrich et al. (2018); Heinrich and Schäfer (2018)


Visualizing Collocational Profiles (node: Fukushima)


Visualizing Collocational Profiles (node: Nuclear Phase-Out)

Higher-Order Collocates

1. discourse collocates
   - straightforward generalization with respect to textual co-occurrence
   - look at co-occurrence frequencies of tweets that were identified to be part of the discourse at hand (topic + attitude)
   - collocates represent lexical items that are particularly important for the discourse
2. second-order topic-collocates
   - look at co-occurrence frequencies of one set of lexical items $c$ in tweets that are about a certain topic $t$
   - for all $w$: compare co-occurrence frequencies of $w$ with $c$ among tweets that contain $t$ with marginal frequencies of $w$ in all tweets that contain $t$
   - collocates of $c$ that are particularly important for the topic $t$
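One way to read the second-order procedure is: restrict the corpus to tweets that contain the topic $t$, then build an ordinary textual co-occurrence table of every word $w$ against the item set $c$ inside that sub-corpus. The following Python sketch follows this reading; the function name, the table layout (rows $w$ / $\neg w$, columns $c$ / $\neg c$) and the toy tweets are assumptions, not the author's implementation.

```python
def second_order_tables(tweets, topic, items):
    """Build, for every word w, a 2x2 table of w against the item set c,
    counting only tweets that contain the topic t."""
    items = set(items)
    on_topic = [set(tokens) for tokens in tweets if topic in tokens]
    vocabulary = set().union(*on_topic) - items - {topic}
    tables = {}
    for w in vocabulary:
        o11 = sum(1 for t in on_topic if w in t and t & items)
        o12 = sum(1 for t in on_topic if w in t and not t & items)
        o21 = sum(1 for t in on_topic if w not in t and t & items)
        o22 = len(on_topic) - o11 - o12 - o21
        tables[w] = ((o11, o12), (o21, o22))
    return tables


# Hypothetical, already tokenized tweets.
tweets = [
    ["germany", "nuclear", "phaseout", "decided"],
    ["germany", "nuclear", "phaseout"],
    ["japan", "nuclear", "restart"],
    ["germany", "football", "phaseout"],   # off-topic, ignored
]
tables = second_order_tables(tweets, topic="nuclear", items={"germany"})
```

Each resulting table can then be scored with any association measure; the row sums $R_1$ are exactly the marginal frequencies of $w$ among the tweets that contain $t$.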


Second-Order Collocates

Figure: Paragraph-collocates of Germany in the FAZ corpus.


Second-Order Collocates

Figure: Collocates of Germany in energy-transition paragraphs.


The Future of CCL: Deep Learning and CCL

Deep Learning and AI

- artificial neural networks
  - general end-to-end ML algorithms
  - origins in the 1950s
  - recent hype due to improvements in processing power
- amazing performance in
  - visual object recognition
  - OCR
  - text categorization
  - machine translation
  - strategic games (Go)
  - simulating humans (Google Assistant)

Will human input become irrelevant?

- standard toolbox of corpus linguistics:
  - concordancing
  - frequencies and frequency comparison
  - collocations
- these techniques have been around for 50 years!
- AI techniques outperform humans when it comes to real-world applications
  - even the creation of gold-standard data (manual annotation) becomes less and less important
  - why bother with rule-based systems?


Digital Humanities

source: Voyant Tools

The Future of CCL: Towards a Hermeneutic Cyborg

Towards a Hermeneutic Cyborg

1. interoperability
   - query tool → quantitative data → visualization
   - exchange quantitative results and manual grouping across systems
2. interactivity
   - integrate larger part of workflow into corpus software
   - maintain connection to concordances
   - implement visualization components in analysis tools
3. integration
   - key challenge: how to feed back information from manual grouping into quantitative procedures?
   - applied to CDA: how to update discourse embeddings?


Mixed-Methods Discourse Analysis

Thanks for listening. Questions?

References

Paul Baker. Using Corpora in Discourse Analysis. Continuum, London, 2006.

Paul Baker. Using Corpora to Analyze Gender. Bloomsbury Publishing, 2014.

Stefan Evert. Corpora and collocations. In Anke Lüdeling and Merja Kytö, editors, Corpus Linguistics. An International Handbook, chapter 58. Mouton de Gruyter, Berlin, 2008.

J.R. Firth. Papers in Linguistics, 1934-1951. Oxford University Press, 1957.

Michel Foucault. L’Archéologie du savoir. Éditions Gallimard, Paris, 1969.

Ikuo Gono’i. 2015-nen ANPO, Minshushugi wo futatabi hajimeru wakamono-tachi (ANPO in 2015. The Youth that is restarting Democracy), 2015.

Philipp Heinrich and Fabian Schäfer. Extending corpus-based discourse analysis for exploring Japanese social media. In Proceedings of the Asia Pacific Corpus Linguistics Conference 2018, 2018.

Philipp Heinrich, Christoph Adrian, Olena Kalashnikova, Fabian Schäfer, and Stefan Evert. A Transnational Analysis of News and Tweets about Nuclear Phase-Out in the Aftermath of the Fukushima Incident. In Andreas Witt, Jana Diesner, and Georg Rehm, editors, Proceedings of the LREC 2018 “Workshop on Computational Impact Detection from Text Data”, Paris, 2018. ELRA.

Majid KhosraviNik. The representation of refugees, asylum seekers and immigrants in British newspapers: a critical discourse analysis. Journal of Language and Politics, 9(1):1–28, 2010.

Robbie Love and Paul Baker. The hate that dare not speak its name? Journal of Language Aggression and Conflict, 3(1):57–86, October 2015.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.


Thomas Proisl. SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC’18), 2018.

Thomas Proisl and Peter Uhrig. SoMaJo: State-of-the-art tokenization for German web and social media texts. In Paul Cook, Stefan Evert, Roland Schäfer, and Egon Stemle, editors, Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, pages 57–62, Berlin, 2016. Association for Computational Linguistics.

Toshinori Sato, Taiichi Hashimoto, and Manabu Okumura. Implementation of a word segmentation dictionary called mecab-ipadic-neologd and study on how to use it effectively for information retrieval (in Japanese). In Proceedings of the Twenty-third Annual Meeting of the Association for Natural Language Processing, pages NLP2017–B6–1. The Association for Natural Language Processing, 2017.

Fabian Schäfer, Stefan Evert, and Philipp Heinrich. Japan’s 2014 General Election: Political Bots, Right-Wing Internet Activism and PM Abe Shinzo’s Hidden Nationalist Agenda. Big Data, 5:1–16, 2017.

L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
