Studying Internet Text
Mike Scott, University of Liverpool
28 October 2005, ICIL 05
Castellón
Internet…
Home was BAMA, the Sprawl, the Boston-Atlanta Metropolitan Axis. Program a map to display frequency data exchange, every thousand megabytes a single pixel on a very large screen. Manhattan and Atlanta burn solid white. Then they start to pulse, the rate of traffic threatening to overload your simulation. Your map is about to go nova. Cool it down. Up your scale. Each pixel a million megabytes. At a hundred million megabytes per second, you begin to make out certain blocks in midtown Manhattan, outlines of hundred-year-old parks ringing the old core of Atlanta.
(William Gibson, Neuromancer, 1984, page 57).
… or Google Earth?
Issues and Questions
The Internet as a Resource
InterNET
Characteristics of networks
Corpus Linguistics (CL) and Internet text
Patterns of interest to the language learner
Internet Map
UK Janet network 2001
another way of viewing it
Networks
Milgram’s experiments (1960s)
160 letters were sent out asking random people in Nebraska & Kansas to forward the letter to a person in Boston, but without the address.
Most of the letters got through, in only about 6 steps.
Networks
Graph Theory
You want to link 50 towns with a road network, but don’t want to build all 1,225 possible roads (50 × 49 ÷ 2).
Erdős proved in 1959 that 98 random roads (8%) will ensure the great majority get linked.
In general, for larger networks, you need only a tiny percentage of the possible links to get a network which works (traffic gets through).
For a network of 6 billion people, you need a fraction of about 0.000000004 of the possible links (0.0000004%), which is about 24 links (acquaintances) per person.
Messages will get through from anyone … to anyone.
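The arithmetic on this slide can be checked with a short sketch. It assumes the standard random-graph result that roughly (n ln n) / 2 random links are enough to connect most of n nodes; the function names are hypothetical, and the simulation is only an illustration, not the proof referred to above.

```python
import math
import random


def possible_links(n):
    """Number of distinct pairs among n nodes: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def threshold_links(n):
    """Assumed connectivity threshold: about (n ln n) / 2 random links."""
    return math.ceil(n * math.log(n) / 2)


def largest_component_share(n, m, rng):
    """Pick m distinct random links among n nodes and return the share
    of nodes that end up in the largest connected group."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    all_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    for a, b in rng.sample(all_pairs, m):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    sizes = {}
    for node in range(n):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n


towns = 50
roads_possible = possible_links(towns)   # 1225
roads_needed = threshold_links(towns)    # 98
share = largest_component_share(towns, roads_needed, random.Random(0))

# Under the same assumption, each of 6 billion people needs about
# ln(6e9) links, a little over 22, the same order as the slide's 24.
links_per_person = math.log(6e9)
```

With 50 towns, `roads_needed / roads_possible` comes to about 8%, matching the figure on the slide.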
Power Law
Nodes and connections obey a “power law”: “each time the number of links doubles, the number of nodes with that many links becomes less by about five times”. (Buchanan 2002: 83)
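Buchanan's wording can be turned into an exponent. As a minimal sketch, assume the number of nodes with k links is proportional to k to the power of minus gamma; "five times fewer for each doubling" then gives gamma = log2(5). The names below are hypothetical.

```python
import math

# If doubling the number of links divides the node count by 5,
# then 2 ** gamma == 5, so gamma is the base-2 logarithm of 5.
gamma = math.log2(5)


def relative_nodes(k, gamma):
    """Relative number of nodes with k links under an assumed
    power law of the form k ** -gamma (unnormalised)."""
    return k ** -gamma


# Nodes with 4 links versus nodes with 8 links: about 5 to 1.
ratio = relative_nodes(4, gamma) / relative_nodes(8, gamma)
```

The exponent comes out a little above 2.3.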
Are words in text anything like these networks?
Internet
a “scale-free” network
“The probability distribution of incoming links to HTML documents… follows a power law, generating a straight line on this logarithmic plot. The outgoing links have a similar distribution. This implies that the WWW is a scale-free network”. (Ball 2004: 480)
Word Frequency lists
Zipf’s rank-frequency distribution of words (Zipf, 1965: 25)
(A) “The James Joyce data; (B) the Eldridge data; (C) ideal curve with slope of negative unity.” (original caption)
Word Frequency lists — BNC
Zipf plot of word frequencies & ranks (Scott & Tribble in press)
Based on whole BNC, nearly 400,000 types
(Plot axes: Frequency, Rank)
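A Zipf plot of this kind can be approximated for any text with a few lines of code. The sketch below is an illustration only: the tokenisation rule and function names are assumptions, and it is not the procedure used to produce the BNC plot. It builds a rank-frequency list and fits the slope of log frequency against log rank; an ideal Zipf curve has a slope of -1 ("negative unity").

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, frequency) tuples, most frequent first.
    Tokenisation is deliberately crude: lower-case letters and apostrophes."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).
    freqs must be in descending order, one value per rank."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# An ideal Zipf distribution: frequency inversely proportional to rank.
ideal = [1000.0 / rank for rank in range(1, 101)]
ideal_slope = loglog_slope(ideal)

sample = rank_frequency("the cat and the dog and the bird")
```

For real text the slope is only roughly -1, and it usually departs from the line at the very top and bottom of the list.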
Corpus Linguistics
Uncertain status as a discipline
Innovative in methodology
Focus on “the language”
relatively unfiltered data
this?
or this?
as opposed to
Internet text
Google
“Google examines more than 8 billion web pages to find the most relevant pages for any query and typically returns those results in less than half a second. No other search engine accesses more of the Internet or delivers more useful information than Google.” (http://www.google.co.uk/corporate/features.html)
But there are more sites
islands
sites not found by web-bots
sites not indexed by web-bots
… so not all the Internet can be seen
The problem: what verb goes with “battle”? hold? fight? win? take? there + be? struggle? combat? pitch?
Dictionaries
OED: “join, give, refuse, accept, offer, do battle”
Oxford Advanced Learner’s 1974: no verbs supplied
Cobuild 1988: examples show “fought” and “do battle”
LTP Dictionary of Selected Collocations
Verbs to the left: engage in, fight, force, go into, join in, lose, take part in, win ~
Verbs to the right: ~ continues, dragged on, ended in stalemate, is in progress, raged
Adj: bitter, bloody, crucial, decisive, fierce, final, hopeless, important, last-ditch, long, long-running, major, mock, pitched, real, relentless, running, successful ~
Phrases: fight a losing ~, outcome of ~
battle
fight battle
Webgetter
Settings: English only, minimum 100 words
Webgetter
In approx. 600,000 words, “battle” occurs nearly 4,000 times, about once every 150 words.
“An epic battle rages between the Forseti and the Muspell as the oceans rise and land disappears. The Forseti compel you to help protect their remaining land by taking charge of the ultimate war machine – the Battle Engine. Whether in walking or in flying mode, you have access to an array of destructive weapons and you receive constant direction from base command. By commanding a device so powerful and advanced, your battlefield decisions will shape the direction of each engagement and, ultimately, the entire war.”
Webgetter results
Collocated verbs in top 100 linked by MI score:
cheats (10 occurrences), as in “Battle engine Aquila cheats” (? is this a verb?)
gaming (9)
fought (43) is number 110
Clusters: “battle was fought” (6)
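For readers unfamiliar with MI (mutual information) scores, here is a minimal sketch of one common formulation: the base-2 logarithm of observed co-occurrences divided by the number expected if the two words were independent. This is an assumption for illustration; it does not adjust for span size, it is not necessarily the exact formula behind the figures on this slide, and the function name and toy sentence are hypothetical.

```python
import math
from collections import Counter


def mi_scores(tokens, node, span=4):
    """MI score for every word found within `span` tokens of `node`.
    MI = log2(observed / expected), where expected is
    freq(node) * freq(word) / corpus size."""
    total = len(tokens)
    freq = Counter(tokens)
    joint = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - span)
        end = min(total, i + span + 1)
        for j in range(start, end):
            if j != i:
                joint[tokens[j]] += 1
    scores = {}
    for word, observed in joint.items():
        expected = freq[node] * freq[word] / total
        scores[word] = math.log2(observed / expected)
    return scores


toy = "the battle was fought and the battle was won by the army".split()
scores = mi_scores(toy, "battle", span=1)
```

MI rewards words that are rare overall but frequent near the node, which is one reason an unexpected item such as "cheats" can rank so highly in game-related web text.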
BNC (written)
In 90 million words, “battle” occurs over 6,000 times, about once every 14,000 words.
Collocated verbs in top 100 linked by MI score:
fought (153) / fighting (93)
rages (5) / raged (12)
waged (10) / waging (12)
ensued (8) / ensuing (13)
defeated (39)
losing (68)
won (152)
commence (5)
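The two corpora differ so much in size that raw counts are hard to compare. A rough sketch of the usual normalisation, occurrences per million words, is given below. It uses the rounded figures from the slides ("nearly 4,000" in about 600,000 words; "over 6,000" in 90 million), so the results are approximations only, and the function name is hypothetical.

```python
def per_million(occurrences, corpus_size):
    """Occurrences normalised to a rate per million words."""
    return occurrences / corpus_size * 1_000_000


# Rounded figures taken from the slides, not exact counts.
web_rate = per_million(4_000, 600_000)       # Webgetter sample
bnc_rate = per_million(6_000, 90_000_000)    # BNC written

# How many times more frequent "battle" is in the web sample.
ratio = web_rate / bnc_rate
```

On these rounded figures "battle" is roughly a hundred times more frequent in the Webgetter sample, which is what one would expect from texts retrieved by searching for that word.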
BNC Written clusters
to do battle (54)
fighting a losing (24)
win the battle (22)
won the battle (22)
fighting a losing battle (21)
to fight a (15)
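Clusters of this kind are repeated word sequences. A minimal sketch of the idea follows, under the simplifying assumption that a cluster is any repeated n-gram containing the node word. The list above also includes sequences without "battle" itself (such as "to fight a"), which suggests clusters there were drawn from whole concordance lines, so this is not the same procedure. Names and the toy sentence are hypothetical.

```python
from collections import Counter


def clusters(tokens, node, sizes=(3, 4), min_freq=2):
    """Count n-grams of the given sizes that contain `node`,
    keeping those that occur at least `min_freq` times."""
    counts = Counter()
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if node in gram:
                counts[" ".join(gram)] += 1
    return [(gram, count) for gram, count in counts.most_common()
            if count >= min_freq]


toy = "they won the battle and we won the battle too".split()
found = clusters(toy, "battle", sizes=(3,), min_freq=2)
```

Raising `min_freq` is the simplest way to filter out one-off sequences in a large corpus.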
Conclusions (1)
The Internet is a powerful linked scale-free network with the capacity of linking nodes efficiently and fast, and is relatively robust
Connections within the Internet have characteristics of a power law
Word frequency lists share these characteristics …
… suggesting that grammar words are like the Google, Yahoo and Microsoft web-sites: extremely often visited …
… but not in themselves informative
and other sites we visit are like lexical words …
… less visited but more informative
Conclusions (2)
The learner wants to know how words collocate
Collocation dictionaries – but not other dictionaries – give useful information
but no examples, or not enough
Internet text is often strangely structured
after all, the Internet is merely a noticeboard
New and often strange text-types or uses of familiar words
Conclusions (3)
The concordance + BNC gives a better view for the language learner, through
concordance lines
collocates
clusters
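A concordance line shows each occurrence of the search word with a little context on either side (key word in context). As a minimal sketch of what a concordancer does, with hypothetical names and a toy sentence rather than BNC data:

```python
def kwic(tokens, node, width=4):
    """Return (left context, node, right context) for each occurrence
    of `node`, with up to `width` words on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


toy = "the army won the battle after a long siege".split()
lines = kwic(toy, "battle", width=2)

for left, node, right in lines:
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>20}  {node}  {right}")
```

Sorting the lines by the word immediately to the left or right of the node is what makes collocates such as "won" or "fought" stand out to a learner.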
References:
Ball, Philip, 2004. Critical Mass. London: Arrow.
Barabási, Albert-László, 2002. Linked: the new science of networks. Cambridge, Mass.: Perseus.
Buchanan, Mark, 2002. Small World: uncovering nature’s networks. London: Weidenfeld & Nicolson.
Faloutsos, Michalis, Petros Faloutsos & Christos Faloutsos, 1999. “On Power-Law Relationships of the Internet Topology” in Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. Cambridge, Mass.: ACM Press. pp. 251-62.
Hill, J. & Lewis, M., 1997. LTP Dictionary of Selected Collocations. Hove: Language Teaching Productions.
Nation, I.S.P., 2001. Learning Vocabulary in Another Language. Cambridge: Cambridge University Press.
Scott, Mike & Chris Tribble (in press). Working with Texts. Amsterdam: Benjamins.
Zipf, G. K., 1965. Human Behavior and the Principle of Least Effort. New York: Hafner. (facsimile of 1949 edition).
http://www.cybergeography.org/atlas/topology.html