Language and the Internet: Assessing Linguistic Bias. Measuring the Information Society, WSIS, Tunis, November 15, 2005. John C. Paolillo, Indiana University


Page 1: Language and the Internet Assessing Linguistic Bias Measuring the Information Society WSIS, Tunis, November 15, 2005 John C. Paolillo, Indiana University

Language and the Internet
Assessing Linguistic Bias

Measuring the Information Society

WSIS, Tunis, November 15, 2005

John C. Paolillo, Indiana University

Page 2

Overview

• Sources of Linguistic Bias
• Linguistic Bias: examples
  – Text Communication
  – Internet Host Names
  – Web Programming
• Global Linguistic Diversity
  – Who bears the costs?
• Conclusions

Page 3

Sources of Linguistic Bias

(Friedman and Nissenbaum 1997)

• Pre-existing
  – originate from outside the technical system
    • National, trans-national and institutional policies
    • Technology companies
• Technical
  – are built into the technical system itself
    • Developers’ language backgrounds, national origins
    • Legacy standards, “backward” compatibility
• Emergent
  – arise in specific contexts of use of a technical system
    • Economics of technology industry (marketing, monopoly power, unstable markets, etc.)
    • Rapid technologization

Page 4

Text Communication

• Requires an encoding and its support
  – Assign code numbers to script characters
    • ASCII (American English)
    • ISO-8859-1 (European Languages)
    • Unicode (most languages, but support is uneven)
  – Support means many things
    • Fonts, rendering, sorting, spell-checking etc.
• Computer-Mediated Communication
  – Web pages, Email, chat, etc.
  – Language use is not uniform in these modes
    • Multilinguals tend to favor different languages for specific purposes
• Represents both technical and emergent biases
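
The difference in coverage between the three encodings named on this slide can be checked directly. The following is a minimal sketch in Python, not part of the original talk; the helper name `encodable` and the sample words are assumptions chosen for illustration.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of `text` has a code in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample words, one per language (each means "language" or similar).
SAMPLES = {
    "English": "language",
    "French": "français",
    "Russian": "язык",
    "Hindi": "भाषा",
}

for name, word in SAMPLES.items():
    # Check the same word against ASCII, ISO-8859-1 and a Unicode encoding.
    support = {enc: encodable(word, enc) for enc in ("ascii", "iso-8859-1", "utf-8")}
    print(name, support)
```

Having a code number is only the first step: as the slide notes, fonts, rendering, sorting and spell-checking must also exist before a language is usable in practice.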

Page 5

Unicode Status: Examples

Language        Unicode  Browser      Script    Pop.
Chinese         yes      good         Chinese   1,240M
English         yes      good         Roman     400M
French          yes      good         Roman     81M
German          yes      good         Roman     82M
Spanish         yes      good         Roman     358M
Finnish         yes      good         Roman     5M
Russian         yes      good (late)  Cyrillic  132M
Arabic          yes      good (late)  Arabic    247M
Hindi           yes      poor         Indic     213M
Sinhala         yes      none         Indic     15M
S. Azerbaijani  no       none         Arabic    26M

Legend: good support, poor support, no support

Page 6

Internet Host Names

• The Domain Name System
  – Uses a 30-year-old 7-bit ASCII standard
    • Now supports Punycode (an ASCII-compatible encoding of Unicode)
    • Imposes a maximum name length
  – Run by ICANN under US Dept of Commerce contract
    • More concerned with trademark protection
    • Host/domain naming is widely abused (e.g. tv domain)
    • Names provided by the DNS are not that useful
• An example of emergent bias
  – Technical origin
  – Economic and political forces amplify and sustain it
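
To make the Punycode point concrete, here is a minimal sketch using Python's built-in `idna` codec, which implements the older IDNA 2003 rules. The function names and the host name `bücher.example` are assumptions for illustration; the 63-octet label limit comes from the DNS specification (RFC 1035), not from the slides.

```python
def to_ascii_hostname(name: str) -> str:
    """Convert an internationalized host name to the ASCII form carried by the DNS."""
    # The "idna" codec applies Punycode per label and adds the "xn--" prefix.
    return name.encode("idna").decode("ascii")


def fits_label_limit(label: str) -> bool:
    """Return True if a single DNS label survives the length check (at most 63 octets)."""
    try:
        label.encode("idna")
        return True
    except UnicodeError:
        # Raised by the codec when a label is empty or too long.
        return False


print(to_ascii_hostname("bücher.example"))  # xn--bcher-kva.example
print(fits_label_limit("a" * 63), fits_label_limit("a" * 64))
```

Because the limit applies to the encoded ASCII form, names in non-Roman scripts use up the available length faster than English names do.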

Page 7

Web Programming and Unicode

• Markup & web scripting languages
  – Unicode is standard
  – Browser support, fonts, etc. lag behind
  – Databases and development environments tend to lack proper Unicode support
  – End-user oriented, not programmer oriented
• All of the most important technologies are Open-Source software (FLOSS)
  – User extensible/modifiable
  – Language localization of these is possible but rare

Page 8

Linguistic Bias in Web Programming

• English is the source language for most programming & markup languages
  – Keywords
  – Operator-argument order
  – Programming constructs, etc.
• Programming as a linguistic act
  – Complex concepts are rendered into text
  – Different languages have different ways of doing this
• Emergent language biases

Page 9

Linguistic Properties of Programming

• LISP
  – Predicates precede their arguments
    • Like Arabic, Celtic, Hebrew, etc.

    (defun fact (x) (if (<= x 0) 1 (* x (fact (- x 1)))))

• PostScript
  – Predicates follow their arguments
    • Like Farsi, Hindi, Japanese, Tamil, Turkish, etc.

    /factorial { dup 1 gt { dup 1 sub factorial mul } if } def
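
The contrast between the two orders can be sketched with a pair of small evaluators. This is an illustration only, written in Python rather than LISP or PostScript; the function names and the sample expression (one multiplication step of a factorial) are assumptions, not taken from the talk.

```python
import operator

# Binary operators shared by both notations.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_postfix(tokens):
    """Evaluate operator-last notation, as in PostScript: 4 5 1 - *"""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(int(tok))
    return stack.pop()


def eval_prefix(tokens):
    """Evaluate operator-first notation, as in LISP without parentheses: * 4 - 5 1"""
    stack = []
    # Scanning right to left lets the same stack technique handle prefix order.
    for tok in reversed(tokens):
        if tok in OPS:
            left = stack.pop()
            right = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(int(tok))
    return stack.pop()


prefix = "* 4 - 5 1".split()   # LISP-style: (* 4 (- 5 1))
postfix = "4 5 1 - *".split()  # PostScript-style: 4 5 1 sub mul
print(eval_prefix(prefix), eval_postfix(postfix))
```

Both token sequences describe the same computation; only the position of the operator relative to its arguments differs, which is the property the slide compares to word order in natural languages.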

Page 10

The Linguistic Digital Divide

• Language issues go beyond content
  – WSIS repeatedly re-affirms principles of
    • Transparency
    • Self-determination
    • Open access to participation for all parties

  These principles cannot be guaranteed unless speakers of different languages can manipulate all aspects of IT use in a way that is native-like

• The linguistic divide has broader consequences
  – Costs are borne in
    • Education: great for non-English speaking people
    • Technical development: small, in comparison
    (there is a trade-off)

Page 11

Language Diversity

Who bears the costs?

Page 12

[Figure: Distribution of language groups by size. x-axis: Population (in thousands), log scale from 0.0001 to 1,000,000; y-axis: Number of groups, 0 to 1400. Source data: www.ethnologue.com]

A typical language group has around 10-50 thousand people
80% of language groups have fewer than 100 thousand members

Page 13

[Figure: Cumulative proportion of world's population. x-axis: Language group population (millions), log scale from 0.0001 to 1000; y-axis: Cumulative proportion, 0 to 1. Source data: www.ethnologue.com]

90% of the world’s population belongs to a language group with at least 1 million people (416 groups)

Many languages with hundreds of millions of speakers lack adequate support

Page 14

[Figure: Worldwide Linguistic Diversity by Region. Regions shown: W Asia, SC Asia, S America, Europe, SE Asia, Oceania, Africa, USA, N America, E Asia. Source data: www.ethnologue.com]

[Figure: Per-Country Linguistic Diversity by Region. Regions shown: USA, E Asia, W Asia, SC Asia, S America, SE Asia, Oceania, Africa, N America, Europe]

Page 15

Conclusions

• Linguistic Bias is manifest in many ways
  – Technical biases are sometimes overt
  – Emergent biases can be subtle
• All potential sources of bias need to be examined and questioned if we are to uphold principles affirmed by WSIS
• Without this effort, the linguistic digital divide will simply amplify existing disparities in wealth and power

Page 16

Page 17

Language Diversity

On The Internet

Page 18

[Figure: Estimated populations of Internet users (millions). x-axis: 1996 to 2005; y-axis: log scale from 1 to 1000. Series: English, Chinese, Japanese, Spanish, German, Korean, French, Italian, Portuguese, Scandinavian, Dutch, Other. Source: Global Reach]

Page 19

Linguistic Diversity

Based on Entropy:

$\text{Diversity} = -2 \sum_i p_i \ln p_i$

Diversity is the long-run per-individual average variance in language category

(similar to log-likelihood)
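
Two limiting cases may help in reading the index. This is a worked illustration, assuming that p_i denotes the proportion of individuals in language category i; the symbol k for the number of categories is introduced here and does not appear in the slides.

```latex
\begin{align*}
\text{One language only } (p_1 = 1):\quad
  & -2 \sum_i p_i \ln p_i = -2\,(1 \cdot \ln 1) = 0 \\
\text{$k$ equally sized languages } (p_i = 1/k):\quad
  & -2 \sum_{i=1}^{k} \frac{1}{k} \ln \frac{1}{k} = 2 \ln k
\end{align*}
```

Under these assumptions the index is 0 when everyone falls in one category and grows with the logarithm of the number of equally represented categories.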

Page 20

[Figure: Linguistic Diversity of Internet Users. x-axis: 1996 to 2005; y-axis: Diversity Index, 0 to 7. Reference levels marked: minimum, maximum]

Page 21

Languages on the Web

Language    Share
English     72%
other       2%
Swedish     1%
Russian     1%
Finnish     1%
German      7%
Japanese    3%
French      3%
Spanish     3%
Portuguese  2%
Chinese     2%
Dutch       1%
Italian     2%

O’Neill, Lavoie and Bennett, 2003
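
As an illustration, the entropy-based index from the Linguistic Diversity slide can be applied to these shares. This is a sketch under stated assumptions, not a calculation from the talk: the percentages are read from the slide as given, "other" is treated as a single category (which understates the true diversity), and the names `WEB_SHARES` and `diversity` are hypothetical.

```python
import math

# Percentage shares as listed on the slide (O'Neill, Lavoie and Bennett, 2003).
WEB_SHARES = {
    "English": 72, "German": 7, "Japanese": 3, "French": 3, "Spanish": 3,
    "Portuguese": 2, "Chinese": 2, "Italian": 2, "other": 2,
    "Swedish": 1, "Russian": 1, "Finnish": 1, "Dutch": 1,
}


def diversity(shares):
    """Entropy-based diversity index: -2 * sum(p_i * ln p_i)."""
    total = sum(shares.values())
    proportions = [s / total for s in shares.values() if s > 0]
    return -2 * sum(p * math.log(p) for p in proportions)


index = diversity(WEB_SHARES)
# Value the index would take if all listed categories had equal shares.
ceiling = 2 * math.log(len(WEB_SHARES))
print(f"diversity = {index:.2f}, equal-share ceiling = {ceiling:.2f}")
```

Comparing the index with its equal-share ceiling gives a rough sense of how far a distribution dominated by one language falls short of an even spread.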

Page 22

[Figure: Internet Hosts. x-axis: 1995-01 to 2003-01; y-axis: Millions, 0 to 200. Source: www.isc.org/ds]

Page 23

[Figure: Host growth by region (millions). x-axis: 1995 to 2002; y-axis: log scale from 0.001 to 100. Series: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa, Other. Source: www.isc.org/ds]

Page 24

[Figure: Users per host by region. x-axis: 1998 to 2001; y-axis: log scale from 0.1 to 100. Series: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa. Source: www.isc.org/ds, ITU]

Page 25

[Figure: Proportion of random .com hosts. Categories: United States, Canada, Netherlands, Australia, Unknown, United Kingdom, Hong Kong, Israel]

[Figure: Proportion of random .net hosts. Categories: United States, Australia, Netherlands, Unknown, Canada, Germany, Japan]

Page 26

[Figure: User growth by region (millions). x-axis: 1998 to 2001; y-axis: log scale from 1 to 1000. Series: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa. Source: ITU]

Page 27

[Figure: Proportion of Internet hosts by region. x-axis: 1995 to 2002; y-axis: 0 to 0.6. Series: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa, Other. Source: www.isc.org/ds]

Page 28

[Figure: Host growth by region (millions). x-axis: 1995 to 2002; y-axis: log scale from 0.001 to 1. Series: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa, Other. Source: www.isc.org/ds]

Page 29

[Figure: Internet hosts per thousand inhabitants by region. x-axis: 1995 to 2002; y-axis: log scale from 0.001 to 1000. Series: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa. Source: www.isc.org/ds, UNPD]

Page 30

[Figure: Users per thousand inhabitants by region. x-axis: 1998 to 2001; y-axis: log scale from 1 to 1000. Series: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa. Source: ITU, UNPD]

Page 31

[Figure: User growth by region (millions). x-axis: 1998 to 2001; y-axis: log scale from 1 to 1000. Series: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa. Source: ITU]

Page 32

www.isc.org/ds, ITU

Page 33

1067 Random Hosts (all domains)