Bringing data science to the speakers of every …
TRANSCRIPT
idibon!
About me: technology and global development
CEO, Idibon: global text analytics in 50+ languages; working with leaders in industry & social good
Industry, CTO / CIO: energy infrastructure in Liberia and Sierra Leone; global epidemic tracking; crowdsourcing and natural language processing for disaster response
Other: Ph.D. in NLP from Stanford; bicycled through 20+ countries

Recommendations for language processing for social good
Look beyond English: there is inherent benefit in understanding and supporting speakers of every language.
Employ people in those languages: crowdsourced workers speak 100s of languages, and want to use them.
Embrace the variation: you can’t rely on consistent spellings, but you can learn to model the diversity.

How many languages are in the connected world?
[Chart: # of languages by year]
How has this changed?
Putting a phone in the hands of everyone on the planet is the easy part.
Understanding everyone is going to be more complicated.

If every online picture were worth a thousand words, pictures would double the size of social media.
Every picture.

Every 3 months, the world's text messages exceed the word count of every book.
Every book. Ever.
Source: Google Books

The Twitter “firehose” is about the size of the dot above the “i” in English.
Beyond the processing capacity of most organizations.
Might not be a representative sample of all human activity for your area of interest.

Short messages (SMS and IM) make up 2% of the world’s communications.
The largest and most linguistically diverse form of written communication that has ever existed.
Number of PhDs focused on processing large volumes of short messages in low-resource languages: 1

If the Facebook “like” were a one-word language, it would be in the top 5% of languages by word count.

Sundanese speakers outnumber the populations of New York, London, Tokyo and Moscow. Combined.

You misread “Sundanese” as “Sudanese”, which is a variety of Arabic.
We have a blind spot for the existence of languages.

January 12, 2010
An earthquake struck Haiti on January 12, 2010.
Most local services failed, but most cell-towers remained functional.

Mission 4636
Messages translated, categorized & geolocated
Location is refined & actionable items are identified
“Fanm gen tranche pou fè yon pitit nan Delmas 31”
Translation: Undergoing children delivery Delmas 31
Location: 18.495746829274168, -72.31849193572998
Category: Emergency
Refined location: (18.4957, -72.3185)

“Lopital Sacre-Coeur ki nan vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la.”
“Sacre-Coeur Hospital which located in this village of Okap is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital.”
Local knowledge !
Workers collaborating to find locations:
Dalila: I need Thomassin Apo please
Apo: Kenscoff Route: Lat: 18.495746829274168, Long:-72.31849193572998
Apo: This Area after Petion-Ville and Pelerin 5 is not on Google Map. We have no streets name
‘here’ = anywhere
Feedback from responders: "just got emergency SMS, child delivery, USCG are acting, and, the GPS coordinates of the location we got from someone of your team were 100% accurate!"
[Map labels: Apo, Dalila, Haiti responders (18.4957, -72.3185)]
The ability for someone to make a real-time difference at any other place in the world:
Apo: I know this place like my pocket
Dalila: thank God u was here

English
Generations of standardization in spelling and simple morphology
Whole words suitable as features for NLP systems
Most other languages
Relatively complex morphology
Less (observed) standardized spellings
More dialectal variation

Haitian Kreyòl
No standard (widespread) spellings
More or less French spellings
More or less phonetic spellings
Frequent words (esp. pronouns) are shortened and compounded
Regional slang / abbreviations

Haitian Kreyòl
mèsi, mesi, mèci, merci
Cap-Haïtien, Kapayisyen

The extent of the subword variation
>30 spellings of odwala (‘patient’) in Chichewa
>50% of the variants of ‘odwala’ occur only once in the data used here:
Affixes and incorporation
‘kwaodwala’ -> ‘kwa + odwala’
‘ndiodwala’ -> ‘ndi odwala’ (official ‘ngodwala’ not present)
Phonological/Orthographic
‘odwara’ -> ‘odwala’
‘ndiwodwala’ -> ‘ndi (w) odwala’
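
The variant-to-stem mappings above can be sketched in a few lines of Python. This is only an illustration, not the method behind the results in this talk: the rules are hand-written for the four examples on the slide, and the names `STEM`, `PREFIXES` and `segment` are hypothetical. Replacing every 'r' with 'l' is a deliberately crude stand-in for a learned model of the spelling variation.

```python
STEM = "odwala"
PREFIXES = ("kwa", "ndi")  # the two affixes named on the slide


def segment(word):
    """Split a spelling variant into [prefix, stem] or [stem]."""
    # Orthographic variation: 'odwara' -> 'odwala' (crude global r -> l rule).
    w = word.lower().replace("r", "l")
    parts = []
    # Affixes and incorporation: peel off one known prefix, if present.
    for prefix in PREFIXES:
        if w.startswith(prefix) and len(w) > len(prefix):
            parts.append(prefix)
            w = w[len(prefix):]
            break
    # Optional glide between prefix and stem: 'ndi (w) odwala'.
    if parts and w == "w" + STEM:
        w = STEM
    parts.append(w)
    return parts


for variant in ["kwaodwala", "ndiodwala", "odwara", "ndiwodwala"]:
    print(variant, "->", " + ".join(segment(variant)))
```

With rules like these, all four variants end in the same stem, so a downstream classifier sees one feature for ‘odwala’ rather than several spellings that each occur only once.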

Chichewa
The word odwala (‘patient’) in 600 text-messages in Chichewa and the English translations

Modeling the variation gives accurate results
ndimmafuna manthwala (“I currently need medicine”)
ndimafuna mantwala
ndi-ma-fun-a man-twala
!"#6$%6$%!6%/&'!()*'+'
ndi -fun man-twala (“I need medicine”)
Category = “Request for aid”
ndi kufuni mantwara (“my want of medicine”)
ndi kufuni mantwala
ndi-ku-fun-i man-twala
!"#6@'6$%!6#&'!()*'+'
ndi -fun man-twala (“I need medicine”)
Category = “Request for aid”
1) Normalize spellings
2) Segment
3) Identify predictors
1 in [?] classification errors with raw messages
1 in [?] classification error post-processing, [?] scale
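
The three steps on this slide (normalize spellings, segment, identify predictors) can be sketched as below for the two example messages. This is a minimal sketch under stated assumptions, not the system that produced the error rates above: the substitution rules, the morpheme list and the set of dropped segments are all hypothetical and hand-written to cover only these two messages, and which segments count as predictors is inferred from the slide's example.

```python
# Hypothetical hand-written rules; they cover only the two example messages.
SPELLING_RULES = [("mm", "m"), ("thw", "tw"), ("r", "l")]
MORPHEMES = sorted(
    ["ndi", "man", "ma", "ku", "fun", "twala", "a", "i"],
    key=len, reverse=True,            # try the longest morpheme first
)
NON_PREDICTIVE = {"ma", "ku", "a", "i"}  # assumed: tense markers, final vowels


def normalize(message):
    """Step 1: collapse spelling variants with simple substitutions."""
    text = message.lower()
    for old, new in SPELLING_RULES:
        text = text.replace(old, new)
    return text


def segment(token):
    """Step 2: greedy left-to-right split into known morphemes."""
    segments, rest = [], token
    while rest:
        match = next((m for m in MORPHEMES if rest.startswith(m)), None)
        if match is None:             # unknown material: keep it whole
            segments.append(rest)
            break
        segments.append(match)
        rest = rest[len(match):]
    return segments


def predictors(message):
    """Step 3: keep the segments assumed to carry the message's meaning."""
    segs = [s for tok in normalize(message).split() for s in segment(tok)]
    return [s for s in segs if s not in NON_PREDICTIVE]


a = predictors("ndimmafuna manthwala")
b = predictors("ndi kufuni mantwara")
print(a == b, a)  # True ['ndi', 'fun', 'man', 'twala']
```

Both spellings reduce to the same predictors, which is why a classifier trained on one can assign the other to the same category.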

The benefits of understanding everyone
Human diseases eradicated in the last [?]5 years: smallpox
Increase in air travel in the last [?]5 years:

Reports of ‘strange new illnesses’ pre-date official records
H1N1 (Swine Flu): months (10% of world infected)
HIV: decades (35 million infected)
H5N1 (Bird Flu): weeks (>50% fatal)

…but the reports are in 1000s of languages
[?]0% of ecological diversity; [?]0% of linguistic diversity

Crowdsourcing, big data, and expert analysts
Most information is in plain language: multiple skill and processing strategies required.

Digital Disease Discovery
Reports: millions per day; many languages, much noise
Big Data machine learning: extraction, filtering & prioritization
Crowdsourcing: thousands of native-language speakers
Analysts: several domain experts
Global monitoring
Safer world

The impact of scalable monitoring
Found historical signals that pre-dated Google Flu Trends by 3 weeks, CDC by 5 (on CNN)
How can we filter/model media-driven amplification?
“I’m Jacqui Jeras with today’s cold and flu report ... across the mid-Atlantic states, a little bit of an increase”
January 4, 2008 CNN Weather

The impact of scalable monitoring
Tracked Ebola in Uganda 5 days before World Health Organization.
“we were able to pull in much richer data from a larger number of sources, so we knew not just how many people were infected, but what kind of transport they took when they went from their village to the hospital in the nearest main town.”
Robert Munro, UN General Assembly on “Big Data and Global Development”
Age, gender, village, Went to Hospital. This is personally identifying & could lead to persecution. Open data would have a negative impact.

The impact of scalable monitoring
Tracked E-Coli Outbreak in Germany 2 days before ECDC.
How do we motivate information processing? Margins are small: only for-profit big data and crowdsourcing can have a sustained impact

Idibon’s current work
Hurricane Sandy: Idibon’s CTO ran FEMA’s Aerial Damage Assessments. We have >1,000,000 manual tags on communications.
MIT Humanitarian Response Lab: identifying reports about supply-line interruptions. Research data from a combination of crowdsourcing and natural language processing.

Recommendations for language processing for social good
Look beyond English: there is inherent benefit in understanding and supporting speakers of every language.
Employ people in those languages: crowdsourced workers speak 100s of languages, and want to use them.
Embrace the variation: you can’t rely on consistent spellings, but you can learn to model the diversity.