From gene sequencing to genre sequencing: A corpus-based analysis of British patents of invention, 1711-2011
Nicholas Groom and Jack Grieve
Centre for Corpus Research
Department of English Language and Applied Linguistics
Aims
1. To present a novel corpus-based methodology for diachronic genre analysis
2. To use this method to identify changes in the patent specification genre over three centuries
3. To address a fundamental theoretical question in genre studies
The paradox of genre
o Genres are, by definition, ‘generic’, which is to say, relatively consistent from one instance to the next.
o And yet we know that genres change over time, sometimes quite radically.
o How does genre change happen?
How does genre change happen?
o Is it Darwinian, i.e. a constant and gradual process of natural selection?
o Or is it Kuhnian, i.e. characterized by periods of stability punctuated by sudden and dramatic ‘paradigm shifts’?
How does genre change happen?
o Diachronic studies of the scientific research article genre (e.g. Gross et al. 2002): gradual ‘evolution’ from C17th to the present
o Psychiatric case history genre (Berkenkotter 2009): periods of stability punctuated by two ‘revolutions’:
1. Freud (c.1905)
2. American Psychiatric Association DSM III (1980)
o Studies of more genres needed!
‘Genre’
o What do we mean by ‘genre’?
o In everyday language (and in literary theory), ‘genre’ ≈ ‘text type’
o For linguists and rhetoricians, ‘genre’ ≠ ‘text type’.
o This is because form and function do not always match up
Same form, different functions
Different forms, same function
‘Genre’
o Linguists and rhetoricians define genre in terms of social function rather than textual form
o Rhetorical theory:
– Miller (1984: 159): genres are “typified rhetorical actions based in recurrent situations”
o Systemic-Functional Linguistics:
– Martin (2005: 13): “genre represents the system of staged goal-oriented social processes through which social subjects in a given culture live their lives.”
o English for Specific Purposes (ESP):
– Swales (1990: 58): “A genre comprises a class of communicative events, the members of which share some set of communicative purposes.”
o Corpus linguistics:
– Biber (1988: 170): “Genre categories are determined on the basis of external criteria relating to the speaker's purpose and topic; they are assigned on the basis of use rather than on the basis of form.”
Reconciling genre and corpus
o Linguists and rhetoricians define genre in terms of social function rather than textual form.
o However … “… while genre is not limited to its form, form is indeed an important aspect of genre” (Tardy & Swales 2014: 166).
Reconciling genre and corpus
o Bazerman (1988: 62): “the formal features that are shared by the corpus of texts in a genre and by which we usually recognize a text’s inclusion in a genre, are the linguistic/symbolic solution to a problem in social interaction”.
o So, which “formal features” should corpus linguists focus on?
Which ‘formal features’?
o Tardy & Swales (2014: 166-167): “Users – and in some cases, non-users – generally recognize a genre based on formal features like lexis, grammar, organizational patterns, topics, and even document format and associated visuals.”
Which ‘formal features’?
Biber and Conrad (2009: 16):
Which ‘formal features’?
Tardy & Swales
o lexis
o grammar
o organizational patterns
o topics
o document format and associated visuals
Biber & Conrad
o specialized expressions
o rhetorical organization
o formatting
o usually once-occurring in the text, in a particular place in the text
Which ‘formal features’?
Tardy & Swales
o lexis
o grammar
o organizational patterns
o document format and associated visuals
Biber & Conrad
o specialized expressions
o rhetorical organization
o formatting
o usually once-occurring in the text, in a particular place in the text
Previous work
o Most previous corpus-based genre studies have focused on ‘rhetorical organization’
– a.k.a. ‘corpus-based move analysis’
– E.g. Biber et al. (2007); Upton & Cohen (2009)
Our approach
o We try to stay ‘on the surface’ as much as possible, and focus strongly on the sequencing of textual elements (hence ‘sequence analysis’ rather than ‘move analysis’)
Our approach
o Aim: describe sequencing of formal features
– in individual exemplar texts, and
– across texts diachronically
o … in order to see how prototypical generic forms become established, and how they change over time
o Empirical focus: patent specification genre, 1711-2011
Patents
o Intellectual property protection for inventions with industrial applicability.
o Rationale for patenting:
– Inventor potentially benefits financially from period of protection
– This in turn incentivizes scientific and technological innovation
– The public benefits from this, and from requirement that patent must describe invention in detail; knowledge becomes public property on expiry of patent.
History
o ‘Patents’ ← ‘Letters Patent’, i.e. ‘open letter’
o = royal proclamation granting a right (written records from 1201).
o Many issued for (often spurious) monopoly rights.
o Statute of Monopolies 1623: abolished all manufacturing and commercial patents except for those granted for “the sole workinge or makinge of any manner of new manufactures within this Realme, to the true and first inventor or inventors of such manufactures”.
History
o Until end of C17th, patent grant was on condition that the inventor would, after a period of seven years, take on apprentices “and teach them the knowledge and mystery of the said new invention”.
o During the early C18th, transmission of knowledge via apprenticeships was replaced by requirement for the inventor to lodge a written specification, describing the invention in full.
o So, patents played an important role in shift from oral to literate culture
History
o The patenting system was a key driver of the Industrial Revolution (1760 - c.1840) (Nuvolari & Tartari 2011; Bottomley 2014).
Source: Nuvolari and Tartari (2011: 102)
History
o By mid C19th, modern patent systems were being established in the UK, USA and elsewhere.
o C20th: emergence of international regimes, e.g. EPO (1949) and PCT (1970)
o Patents Act 1977: assimilated UK patents into European system.
Today
Source: http://www.wipo.int/ipstats/en/charts/ipfactsandfigures2016.html
Why patents?
o Important and historically significant genre.
o Studied extensively in some fields (e.g. NLP, legal and economic history, science and technology studies, rhetorical studies …)
o But hardly studied at all by linguists.
o We hope to change this!
o Ideal focus for investigating our theoretical question.
Why patents?
o Unusually, the patent specification is a genre that can be traced all the way back to its very first exemplar: Nasmith’s patent, 1711
Why patents?
o We know that the patent specification genre has changed dramatically over the last 300 years
1711 2011
Why patents?
o But how has this change happened: through gradual and constant ‘evolutionary’ modifications, or through sudden and dramatic ‘revolutionary’ shifts?
Corpus
o BLEPAS
o British Library & Espacenet Patent Archive
Sample
o One text per year from 1711 to 2011
o ‘long and thin’ as opposed to ‘short and fat’ (Rissanen 2000; Kohnen 2007). (Ultimate aim = long and fat!)
Dataset
o Dataset for current analysis covers 276 years between 1734 and 2011. The underlying dataset also includes texts from 1711 to 1733, but there are only 4 years with data in that span, so we have excluded them from this preliminary analysis.
o We also lack data for 1739 and 1758 because no patents were issued in those two years.
Dataset
o The dataset for the current study consists of a list of 276 short strings of alphabetical characters.
o Each string represents the generic structure of a single randomly selected patent for each year from 1734 to 2011.
How dataset was built
Step 1: We reduced each text in the corpus to a code string representing that text as a sequence of ‘formal features’
Example text, annotated with move codes in order of appearance:
f: Salutation
a: Declaration of grant of patent
b: Statement of condition of grant
c: Description of invention
d: Witness statement and signature
h: Other witness signature(s)
e: Confirmation that specification has been enrolled within specified time limit
i: Drawings
How dataset was built
Step 1: We reduced each text in the corpus to a code string which represents that text as a sequence of generic features
= fabcdhei
How dataset was built
Step 2: We recorded each code string in a spreadsheet
How dataset was built
Step 3: We read the spreadsheet into R as a dataframe for further processing and analysis.
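Steps 1 to 3 can be sketched in a few lines. The slides describe a spreadsheet read into R; the Python below is only an illustrative stand-in. The code-to-label table follows the worked example on the slides, while the function name `encode_moves`, the record layout, and the rule that a move running over several pages is coded once are assumptions.

```python
# Move codes and labels as given in the worked example on the slides.
MOVE_LABELS = {
    "f": "Salutation",
    "a": "Declaration of grant of patent",
    "b": "Statement of condition of grant",
    "c": "Description of invention",
    "d": "Witness statement and signature",
    "h": "Other witness signature(s)",
    "e": "Confirmation that specification has been enrolled within specified time limit",
    "i": "Drawings",
}
LABEL_TO_CODE = {label: code for code, label in MOVE_LABELS.items()}


def encode_moves(labels):
    """Reduce a text, given as an ordered list of move labels, to a code string.

    Assumption: a move that continues across pages is coded once, so
    consecutive repeats of the same label are collapsed.
    """
    codes = []
    for label in labels:
        code = LABEL_TO_CODE[label]
        if not codes or codes[-1] != code:
            codes.append(code)
    return "".join(codes)


# The annotated example: moves "a" and "c" each span two pages.
example_text = [MOVE_LABELS[code] for code in "faabccdhei"]

# Hypothetical stand-in for one spreadsheet row (Step 2).
records = [{"text_id": "example", "sequence": encode_moves(example_text)}]
print(records[0]["sequence"])  # fabcdhei
```

Keeping the sequence as a plain string means every later step (edit distance, clustering, n-grams) can operate on it directly.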
Overview of sequence types
o 74 sequence types (final set will be smaller!)
o Many are slight variants of most frequent types
Diachronic distribution of sequence types
o How are these different sequence types distributed across the period of our study?
or ???
Diachronic distribution of sequence types
o To answer this question, we use string edit distance (commonly used in DNA sequencing)
o String edit distance measures the number of operations (i.e. insertion, deletion, substitution) needed to transform one string to another.
o E.g. gene → genre has a string edit distance of 1 (one insertion: the letter ‘r’).
String edit distance analysis
o We use the stringdist() function in R, applying the default optimal string alignment metric (OSA), also known as restricted Damerau-Levenshtein distance.
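To make the metric concrete, here is a minimal Python sketch of optimal string alignment distance. It is not the R implementation used in the study; the function name `osa_distance` is an assumption, and the code simply follows the textbook dynamic-programming definition (insertion, deletion, substitution, plus transposition of two adjacent characters, each costing 1).

```python
def osa_distance(a, b):
    """Optimal string alignment (restricted Damerau-Levenshtein) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("gene", "genre"))  # 1
print(osa_distance("ab", "ba"))       # 1 (a single transposition)
```

For move-code strings, a transposition corresponds to two neighbouring moves swapping places, which OSA counts as one change where plain Levenshtein distance would count two.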
String edit distance analysis
o First, we plot string edit distance between all adjacent patent sequences.
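The adjacent-year comparison can be sketched as follows. This is an illustration in Python, not the authors' R code; `osa` and `adjacent_distances` are hypothetical names, and the code strings and year indices are invented for the example. Years with no data are simply absent from the mapping, so each text is compared with the next available one.

```python
def osa(a, b):
    """Optimal string alignment distance (the metric named on the slides)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def adjacent_distances(sequences_by_year):
    """Distance between each year's code string and the next available year's."""
    years = sorted(sequences_by_year)
    return [
        (earlier, later, osa(sequences_by_year[earlier], sequences_by_year[later]))
        for earlier, later in zip(years, years[1:])
    ]


# Invented code strings keyed by a placeholder year index; not corpus data.
toy = {1: "fabcdhei", 2: "fabcdhei", 3: "fabcdei", 4: "zcdi"}
for earlier, later, dist in adjacent_distances(toy):
    print(earlier, later, dist)
```

Plotting the third element of each tuple against the later year gives the kind of adjacent-distance timeline the slides refer to.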
Adjacent string edit distance
Multivariate Analysis of String Edit Distance
o Next, we compute string edit distance between all years, using this information to cluster patents by year.
o First, we make a distance matrix of edit distances between all pairs of strings using the stringdistmatrix() function in R.
o Then we run a simple metric multidimensional scaling to reduce this matrix down to two dimensions.
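The scaling step can be illustrated with classical (Torgerson) scaling, one common form of metric multidimensional scaling; the slides do not say which variant was used, so treat this as an assumption. The pure-Python sketch below takes a symmetric distance matrix as input (in the study, the pairwise edit distances), double-centres the squared distances, and extracts the leading eigenvectors by power iteration. The function names and the toy matrix are hypothetical.

```python
import math


def _matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def classical_mds(dist, dims=2, iterations=200):
    """Embed items in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the dominant eigenvector of the deflated matrix.
        v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
        for _ in range(iterations):
            w = _matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(x * y for x, y in zip(v, _matvec(b, v)))
        # Edit distances need not be Euclidean; a dimension whose eigenvalue
        # is not positive is left at zero.
        if eigenvalue > 1e-9:
            scale = math.sqrt(eigenvalue)
            for i in range(n):
                coords[i][k] = v[i] * scale
            for i in range(n):
                for j in range(n):
                    b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Toy distance matrix (three items lying on a line); not corpus data.
toy = [[0, 1, 3],
       [1, 0, 2],
       [3, 2, 0]]
for point in classical_mds(toy):
    print([round(x, 3) for x in point])
```

Power iteration is slow for large matrices; a numerical library would normally be used for a few hundred texts.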
Multivariate Analysis of String Edit Distance
o To make clusters more clearly visible, we also analyzed the distance matrix by applying a hierarchical cluster analysis (using Ward’s method).
o This yields 5 main clusters.
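As a rough illustration of the clustering step, the sketch below runs agglomerative clustering with a Ward-style merge rule on a distance matrix and stops at a requested number of clusters. It uses the Lance-Williams update on squared distances; the slides do not say which variant of Ward's method was applied, so this choice, the function name `ward_clusters`, and the toy matrix are all assumptions.

```python
def ward_clusters(dist, k):
    """Merge items bottom-up with Ward's criterion until k clusters remain."""
    n = len(dist)
    members = {i: [i] for i in range(n)}
    # Pairwise squared distances, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(members) > k:
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])
        n_a, n_b = len(members[a]), len(members[b])
        new = next_id
        next_id += 1
        for c in members:
            if c in (a, b):
                continue
            n_c = len(members[c])
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Lance-Williams update for Ward's method.
            d[(c, new)] = ((n_a + n_c) * d_ac + (n_b + n_c) * d_bc
                           - n_c * d_ab) / (n_a + n_b + n_c)
        members[new] = members.pop(a) + members.pop(b)
        d = {pair: value for pair, value in d.items()
             if a not in pair and b not in pair}
    return sorted(sorted(group) for group in members.values())


# Toy distances between five items at positions 0, 1, 10, 11, 12; not corpus data.
positions = [0, 1, 10, 11, 12]
toy = [[abs(p - q) for q in positions] for p in positions]
print(ward_clusters(toy, 2))  # [[0, 1], [2, 3, 4]]
```

Calling the function with `k=5` on the full distance matrix would give a five-way partition comparable in spirit to the one reported, though not necessarily identical to it.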
Cluster dendrogram
Multivariate Analysis of String Edit Distance
o Interestingly, these 5 main clusters turn out to be identified with distinct time periods when plotted on a timeline:
Patent Law Amendment Act 1852
Patents, Designs and Trademarks Act, 1883; Paris Convention 1883.
European Convention on the International Classification of Patents 1954
Patents Act 1977
Individual move analysis
o We also decided to trace the diachronic distribution of each of the individual elements in our coding scheme (‘move codes’), regardless of position within each string.
o Interestingly, many of these seem to appear and disappear extremely abruptly during the period of the analysis:
Move f: Salutation
Move a: Declaration of grant of patent
Move b: Statement of condition of grant
Move z: Abstract
Individual move analysis
o Is this abruptness simply an artefact of our ‘one-text-per-year’ sampling method?
o NO: a few moves do appear and disappear more gradually:
Move J: Drawings
Move h: Statement of petition
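Tracing individual moves regardless of position amounts to asking, for each code, in which years it occurs at all. A minimal Python sketch, with a hypothetical function name and invented code strings keyed by a placeholder year index:

```python
def years_by_move(sequences_by_year):
    """Map each move code to the sorted list of years in which it occurs."""
    result = {}
    for year in sorted(sequences_by_year):
        # set() ignores position and repetition within a string.
        for code in set(sequences_by_year[year]):
            result.setdefault(code, []).append(year)
    return result


# Invented data for illustration; not taken from the corpus.
toy = {1: "fabc", 2: "fabc", 3: "abcz", 4: "cz"}
spans = years_by_move(toy)
for code in sorted(spans):
    print(code, spans[code][0], spans[code][-1], len(spans[code]))
```

Comparing the first and last year of each move with the number of years in between in which it is attested gives a quick indication of whether a move switches on and off abruptly or fades in and out.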
Interim conclusions
o Genre change: evolution or revolution?
o For patents: both processes can be observed; seems to depend to some extent on the kind of analysis applied to the data.
o Variation from one year to the next is a constant process of (mainly) gradual change
o Shifts between broad generic sequence types are sudden and dramatic – external forces?
o Individual moves can appear and disappear suddenly or gradually.
Ongoing and future work
o Currently refining/reducing move categories!
o N-gram analysis of move sequences in each cluster time period – our hypothesis is that the most frequent (i.e. dominant) variant will appear late in each period (natural selection)
o Analysis of individual move positions over time – do they change or stay in the same place?
o Lexicogrammatical analysis of patents/moves using MDA.
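The planned n-gram analysis of move sequences could start from something like the sketch below, which counts contiguous runs of n move codes across a set of code strings. It is an assumption about how such an analysis might be set up, not a description of the authors' procedure; the function names and the sample strings are hypothetical.

```python
from collections import Counter


def move_ngrams(sequence, n=2):
    """All contiguous runs of n move codes in one code string."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def ngram_counts(sequences, n=2):
    """Frequency of each move n-gram across a collection of code strings."""
    counts = Counter()
    for sequence in sequences:
        counts.update(move_ngrams(sequence, n))
    return counts


# Invented code strings standing in for one cluster time period.
period = ["fabc", "abc", "fabcd"]
print(ngram_counts(period, n=2).most_common(3))
```

Running the count separately for each cluster time period would show which move transitions dominate in each period.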
Thank you!
References
Bazerman, C. (1988). Shaping written knowledge: The genre and activity of the experimental article in science. Madison: University of Wisconsin Press.
Berkenkotter, C. (2009). Patient tales: Case histories and the uses of narrative in psychiatry. Columbia, SC: University of South Carolina Press.
Biber, D. (1988). Variation across Speech and Writing. Cambridge: Cambridge University Press.
Biber, D. & Conrad, S. (2009). Register, Genre and Style. Cambridge: Cambridge University Press.
Biber, D., Connor, U. & Upton, T.A. (2007). Discourse on the move: Using corpus analysis to describe discourse structure. Amsterdam: John Benjamins.
Bottomley, S. (2014). The British patent system during the industrial revolution 1700–1852: From privilege to property. Cambridge: Cambridge University Press.
Devitt, A. (2004). Writing genres. Carbondale, IL: Southern Illinois University Press.
Gross, A. G., Harmon, J. E., & Reidy, M. (2002). Communicating science: The scientific article from the 17th century to the present. Oxford: Oxford University Press.
Kohnen, T. (2007). ‘From Helsinki through the centuries: the design and development of English diachronic corpora.’ In P. Pahta, I. Taavitsainen, T. Nevalainen & J. Tyrkko (Eds.), Towards Multimedia in Corpus Studies (Studies in Variation, Contacts and Change in English 2). Helsinki: VARIENG. http://www.helsinki.fi/varieng/series/volumes/02/kohnen/
Martin, J.R. (2005). ‘Analysing genre: functional parameters.’ In J.R. Martin & F. Christie (Eds.), Genre and Institutions: Social Processes in the Workplace and School. London: Cassell, 3-39.
Miller, C. (1984). Genre as social action. Quarterly Journal of Speech, 70, 151-76.
Nuvolari, A. & Tartari, V. (2011). ‘Bennet Woodcroft and the value of English patents, 1617–1841.’ Explorations in Economic History 48: 97-115.
Rissanen, M. (2000). ‘The world of English historical corpora: From Cædmon to computer age.’ Journal of English Linguistics 28/1: 7-20.
Swales, J. M. (1990). Genre Analysis: English in academic and research settings. Cambridge: Cambridge University Press.
Swales, J. M. (2004). Research Genres: Exploration and applications. Cambridge: Cambridge University Press.
Tardy, C.M. & Swales, J.M. (2014). ‘Genre analysis.’ In K.P. Schneider and A. Barron (Eds.), Pragmatics of discourse (pp. 165-187). Berlin, Germany: Walter de Gruyter.
Upton, T.A. & Cohen, M.A. (2009). An approach to corpus-based discourse analysis: The move analysis as example. Discourse Studies, 11, 585-605.