Transcript

Bringing Data Science to the Speakers of Every Language

Robert Munro, PhD
CEO, Idibon

idibon

About me: technology and global development

CEO, Idibon
  Global text analytics in 50+ languages
  Working with leaders in industry & social good

Industry: CTO / CIO
  Energy infrastructure in Liberia and Sierra Leone
  Global epidemic tracking
  Crowdsourcing and natural language processing for disaster response

Other
  Ph.D. in NLP from Stanford
  Bicycled 20+ countries

Recommendations for language processing for social good

Look beyond English
  There is inherent benefit in understanding and supporting speakers of every language

Employ people in those languages
  Crowdsourced workers speak 100s of languages, and want to use them

Embrace the variation
  You can’t rely on consistent spellings, but you can learn to model the diversity

How many languages are in the connected world?

[Chart: # of languages by year]

How has this changed?

Putting a phone in the hands of everyone on the planet is the easy part

Understanding everyone is going to be more complicated

Every human communication this year

Source: Ethnologue, Nationalencyklopedin

7% of our communications are digital; most communication is still direct spoken language

If every online picture were worth a thousand words, pictures would double the size of social media.

Every picture

Every 3 months, the world's text messages exceed the word count of every book.

Every book. Ever.

Source: Google Books

Print communication is smaller than anything shown.
Ditto any one social network.

The Twitter “firehose” is about the size of the dot above the “i” in English.

Beyond the processing capacity of most organizations.
Might not be a representative sample of all human activity for your area of interest.

There are more than 6,000 other languages.
Only the top 1% are shown.

No language from the Americas made the cut.

Quechua

Email spam would be larger than every block except spoken Mandarin.

Source: Mashable

Short messages (SMS and IM) make up 2% of the world’s communications.
The largest and most linguistically diverse form of written communication that has ever existed.

Number of PhDs focused on processing large volumes of short messages in low-resource languages? 1

If the Facebook “like” were a one-word language, it would be in the top 5% of languages by word count.

Your browser probably won't show Sundanese script.

Sundanese speakers outnumber the populations of New York, London, Tokyo and Moscow. Combined.

You misread “Sundanese” as “Sudanese”, which is a variety of Arabic.

We have a blind spot for knowing about the existence of languages.

This is the breakdown of languages that most of our data is moving towards

January 12, 2010

An earthquake struck Haiti on January 12, 2010

Most local services failed, but most cell-towers remained functional.

Messages start streaming in

Mission 4636

Message translated, categorized & geolocated
Location is refined & actionable items are identified

“Fanm gen tranche pou fè yon pitit nan Delmas 31”

Undergoing children delivery Delmas 31

18.495746829274168, -72.31849193572998

Emergency

(18.4957, -72.3185)
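As a rough illustration of what each processed message carries after translation, categorization and geolocation, the record could be sketched as below. The `Report` class, its field names and the `short_location` helper are assumptions made for this example; they are not the data model Mission 4636 actually used. The values are the ones shown on the slide.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Report:
    """Hypothetical record for one processed message (illustrative only)."""
    original: str                   # message as received
    translation: str                # volunteer translation, kept verbatim
    category: str                   # label assigned by a volunteer
    location: Tuple[float, float]   # (latitude, longitude)

    def short_location(self, places: int = 4) -> Tuple[float, float]:
        """Round the coordinates for display."""
        lat, lon = self.location
        return (round(lat, places), round(lon, places))


report = Report(
    original="Fanm gen tranche pou fè yon pitit nan Delmas 31",
    translation="Undergoing children delivery Delmas 31",
    category="Emergency",
    location=(18.495746829274168, -72.31849193572998),
)
print(report.category, report.short_location())
```

Keeping the original text next to the translation lets a later worker refine the location or category without losing what the sender actually wrote.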

Global collaboration

2,000 volunteers, transferred to paid workers in Haiti

Lopital Sacre-Coeur ki nan vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la.

“Sacre-Coeur Hospital which located in this village of Okap is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital.”

Local knowledge

Workers collaborating to find locations:

Dalila: I need Thomassin Apo please

Apo: Kenscoff Route: Lat: 18.495746829274168, Long:-72.31849193572998

Apo: This Area after Petion-Ville and Pelerin 5 is not on Google Map. We have no streets name

‘here’ = anywhere

Feedback from responders: "just got emergency SMS, child delivery, USCG are acting, and, the GPS coordinates of the location we got from someone of your team were 100% accurate!"

(18.4957, -72.3185)

Apo

Dalila

Haiti responders

The ability for someone to make a real-time difference at any other place in the world:

Apo: I know this place like my pocket

Dalila: thank God u was here

How do we automate processing the world’s data?

English
  Generations of standardization in spelling and simple morphology
  Whole words suitable as features for NLP systems

Most other languages
  Relatively complex morphology
  Less standardization in (observed) spellings
  More dialectal variation

Haitian Kreyòl

No widespread standard spellings
  More or less French spellings
  More or less phonetic spellings

Frequent words (esp. pronouns) are shortened and compounded

Regional slang / abbreviations

mèsi, mesi, mèci, merci

Cap-Haïtien, Kapayisyen

The extent of the subword variation

>30 spellings of ‘odwala’ (‘patient’) in Chichewa
>50% of the variants of ‘odwala’ occur only once in the data used here:

Affixes and incorporation
  ‘kwaodwala’ -> ‘kwa + odwala’
  ‘ndiodwala’ -> ‘ndi odwala’ (official ‘ngodwala’ not present)

Phonological/Orthographic
  ‘odwara’ -> ‘odwala’
  ‘ndiwodwala’ -> ‘ndi (w) odwala’
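The four ‘odwala’ examples can be reproduced with a very small rule-based sketch. The rule table, the prefix list and the function names below are assumptions for illustration and cover only these examples; the talk describes modeling the variation from data, which this hand-written version does not attempt.

```python
import re
from typing import List

# Hypothetical rules covering only the slide's four examples; a real system
# would learn such variation from data rather than hard-code it.
SPELLING_RULES = [(re.compile(r"r"), "l")]  # orthographic: 'odwara' -> 'odwala'
PREFIXES = ("kwa", "ndi")                    # prefixes seen in the examples
STEM = "odwala"                              # 'patient'


def normalize(token: str) -> str:
    """Lower-case the token and apply each spelling rule in turn."""
    token = token.lower()
    for pattern, replacement in SPELLING_RULES:
        token = pattern.sub(replacement, token)
    return token


def segment(token: str) -> List[str]:
    """Split a known prefix from the stem, dropping any linking letters between them."""
    token = normalize(token)
    for prefix in PREFIXES:
        if token.startswith(prefix) and token[len(prefix):].endswith(STEM):
            return [prefix, STEM]
    return [token]


for word in ["kwaodwala", "ndiodwala", "odwara", "ndiwodwala"]:
    print(word, "->", " + ".join(segment(word)))
```

Mapping every variant onto the same stem is what lets a rare spelling, seen only once in the data, still count as evidence for ‘odwala’.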

Chichewa

The word ‘odwala’ (‘patient’) in 600 text messages in Chichewa and the English translations

Modeling the variation gives accurate results

ndimmafuna manthwala (‘I currently need medicine’)
ndimafuna mantwala
ndi-ma-fun-a man-twala
[one line not recoverable from the source]
ndi -fun man-twala (‘I need medicine’)
Category = ‘Request for aid’

ndi kufuni mantwara (‘my want of medicine’)
ndi kufuni mantwala
ndi-ku-fun-i man-twala
[one line not recoverable from the source]
ndi -fun man-twala (‘I need medicine’)
Category = ‘Request for aid’

1) Normalize spellings
2) Segment
3) Identify predictors

1 in [?] classification errors with raw messages
1 in [?] classification error post-processing, [...] scale
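Once messages are normalized and segmented, the third step on the slide (identify predictors) can be illustrated by asking how strongly each morpheme is associated with a category such as ‘Request for aid’. The sketch below uses a simple per-morpheme share; this measure, the function name and the toy labels are all assumptions made for illustration, not the model behind the reported results.

```python
from collections import Counter
from typing import Dict, List, Tuple

Message = Tuple[List[str], bool]  # (morphemes, is it a request for aid?)


def predictor_scores(messages: List[Message]) -> Dict[str, float]:
    """Share of the messages containing each morpheme that are in the category."""
    seen: Counter = Counter()
    positive: Counter = Counter()
    for morphemes, in_category in messages:
        for morpheme in set(morphemes):  # count each morpheme once per message
            seen[morpheme] += 1
            if in_category:
                positive[morpheme] += 1
    return {m: positive[m] / seen[m] for m in seen}


# Toy data: morphemes come from the slides, labels are invented for illustration.
toy_messages: List[Message] = [
    (["ndi", "fun", "man", "twala"], True),
    (["ndi", "odwala"], True),
    (["ndi", "fun"], False),
]

scores = predictor_scores(toy_messages)
for morpheme in sorted(scores, key=scores.get, reverse=True):
    print(f"{morpheme}: {scores[morpheme]:.2f}")
```

In this toy set the frequent prefix ‘ndi’ scores lower than the stems, which is the intuition behind predicting from subword units rather than whole, variably spelled words.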

Comparison with English

[Chart: Accuracy (Micro-F) by percent of training data]

Taking it to the world

The benefits of understanding everyone

Human diseases eradicated in the last [?]5 years: smallpox

Increase in air travel in the last [?]5 years:

Reports of ‘strange new illnesses’ pre-date official records

H1N1 (Swine Flu): months (10% of world infected)
HIV: decades (35 million infected)
H5N1 (Bird Flu): weeks (>50% fatal)

…but the reports are in 1000s of languages

[?]0% of ecological diversity
[?]0% of linguistic diversity

[Map: H5N1]

Crowdsourcing, big data, and expert analysts

Most information is in plain language: multiple skill and processing strategies required.

Digital Disease Discovery

Big Data machine learning: extraction, filtering & prioritization
Global monitoring
Safer world
Analysts: several domain experts
Crowdsourcing: thousands of native-language speakers
Reports: millions per day, many languages, much noise

The impact of scalable monitoring

Found historical signals that pre-dated Google Flu Trends by 3 weeks, CDC by 5, on CNN

How can we filter/model media-driven amplification?

“I’m Jacqui Jeras with today’s cold and flu report ... across the mid-Atlantic states, a little bit of an increase”

January 4, 2008 CNN Weather

The impact of scalable monitoring

Tracked Ebola in Uganda 5 days before World Health Organization.

“we were able to pull in much richer data from a larger number of sources, so we knew not just how many people were infected, but what kind of transport they took when they went from their village to the hospital in the nearest main town.”

Robert Munro, UN General Assembly on Big Data and Global Development.

Age + gender + village + Went to Hospital. This is personally identifying & could lead to persecution. Open data would have a negative impact.

The impact of scalable monitoring

Tracked E. coli outbreak in Germany 2 days before ECDC.

How do we motivate information processing?

Margins are small: only for-profit big data and crowdsourcing can have a sustained impact

Idibon’s current work

Hurricane Sandy
  Idibon’s CTO ran FEMA’s Aerial Damage Assessments.
  We have >1,000,000 manual tags on communications.

MIT Humanitarian Response Lab
  Identifying reports about supply-line interruptions.
  Research data from a combination of crowdsourcing and natural language processing

Recommendations for language processing for social good

Look beyond English
  There is inherent benefit in understanding and supporting speakers of every language

Employ people in those languages
  Crowdsourced workers speak 100s of languages, and want to use them

Embrace the variation
  You can’t rely on consistent spellings, but you can learn to model the diversity

Thank you!

Robert Munro, PhD
CEO, Idibon

