biomiss: language diversity of computing
The Language Diversity of Computing
Or, how to talk with a computer.
Jeremy Yang (Mgr., Systems & Programming)
Translational Informatics Div.
Dept. of Internal Medicine
University of New Mexico
BioMISS -- Thursday, Oct 15, 2015
Language Diversity Examples
Python Perl Fortran C R
C++ Java Basic SQL Sparql
XML XSD XPath URLs bash
HTML HTTP ASCII UTF-8 regex
Scala ICD-10 Ruby OWL RDF
A Working Definition of “Language”
● Coherent symbology (symbolic system)
Languages: Some major advances
Timeline, 1950-2010:
● FORTRAN (1953)
● COBOL (1960)
● C (1969)
● SQL (1979)
● C++ (1979)
● Perl (1987)
● Python (1989)
● HTML (1990)
● Java (1995)
● XML (1997)
● RDF (1999)
● Sparql (2008)
Language merit vs. elitism
Why do we care about languages?
● Compatibility
● Efficiency
● Usability
● Knowledge representation
● Intelligence
● Evolution
Naturellement!
℅ Prof Harald Sack, Hasso Plattner Institute, U. Potsdam, MOOC: “Semantic Web Technologies”
Programming paradigms
Object Oriented
● classes
● instances
● methods
● ~ nouns
Functional
● functions
● routines
● parameters
● ~ verbs
Programming paradigms are language paradigms.
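A minimal Python sketch of the noun/verb contrast above. The Molecule class and the molecular_weight function are hypothetical names chosen for illustration, and the atomic masses are rounded values, so treat this as a sketch rather than a chemistry library.

```python
# Rounded atomic masses (g/mol); illustrative values only.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "O": 15.999}


# Object-oriented style: the noun (Molecule) comes first and owns its methods.
class Molecule:
    def __init__(self, name, atom_counts):
        self.name = name
        self.atom_counts = atom_counts  # e.g. {"H": 2, "O": 1}

    def weight(self):
        """Sum the masses of all atoms held by this instance."""
        return sum(ATOMIC_MASS[el] * n for el, n in self.atom_counts.items())


# Functional style: the verb (molecular_weight) comes first and takes data as a parameter.
def molecular_weight(atom_counts):
    """Sum the masses of all atoms in the given element-to-count mapping."""
    return sum(ATOMIC_MASS[el] * n for el, n in atom_counts.items())


water = Molecule("water", {"H": 2, "O": 1})
print(round(water.weight(), 3))                      # method called on an instance
print(round(molecular_weight({"H": 2, "O": 1}), 3))  # function applied to a value
```

Both calls compute the same number; the difference is whether the code is organized around the thing (the instance) or around the action (the function).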
Object Oriented Example:
CDK = Chemistry Development Kit
Open source Java package & API
Computers have “evolved” from numerical calculators to knowledge processors.
Knowledge representation and processing via language!
Italian Music Terms
Choice of language should be guided by the domain.
Q: So what is the problem?
A: Language gaps
CODE
JARGON
MEANING
“Interpretation”
MATH
Q: So what is the problem?
A: Standards (so many!)
“Why can’t my iPhone talk to my ...”
● TV
● Audio system
● Car
● Medical records
Q: So what is the problem?
A: Language shapes, empowers, limits thought. (Sapir-Whorf Hypothesis, aka Linguistic Relativity)
Q: So what is the problem?
A: Abstraction
● Overgeneralizing
● Reality is concrete!
● But: abstraction organizes knowledge
● (a feature, not a bug!)
“We think in generalities, but we live in detail.” -- Alfred North Whitehead
Abstraction: Shakespeare quotes
“Full of sound and fury, signifying nothing.”
Abstraction: Ray Hudson Quotes
“On to this one quicker than a jackrabbit on a hot date. Look at this finish! That is beyond world class.”
“Braver than a matador in a pink tutu he was.”
“Racing Santander’s butcher men tried to hack down Xavi. Xavi dancing over the combine harvesters that are coming after him.”
“He could make an onion cry.” (on Lionel Messi)
“Where the insane becomes the routine with this man. He is nothing less than a ball whisperer.”
“You campaign in poetry. You govern in prose.” - Mario Cuomo
But maybe all language is poetic.
Languages of Biomedical Knowledge
Which cirrhosis? Specificity?
http://apps.who.int/classifications/icd10
Translation and mapping terms
story
history
Our project: Illuminating the Druggable Genome (IDG)
$4.9M
Illuminating the Druggable Genome Knowledge Management Center (IDG-KMC)
Translational Informatics Division
Chief: Tudor Oprea, MD, PhD
IDG-KMC Workflow
IDG-KMC Collaborator Network
Slide ℅ Tudor Oprea
Heterogeneous data integration. Language diversity.
IDG-KMC Language Challenge: Case #1: Drug Nomenclature
http://pasilla.health.unm.edu/tomcat/drugdb
IDG-KMC Language Challenge: Case #2: Disease Nomenclature
ICD
● The International Classification of Diseases (ICD) is the standard diagnostic tool for epidemiology, health management and clinical purposes.
● WHO
● Clinical emphasis
● Procedures (CM)
● EMR
● Versions
Disease Ontology
● The mission of the Disease Ontology (DO) is to provide an open source ontology for the integration of biomedical data that is associated with human disease.
● Academic network
● Research emphasis
● Community driven
● Continual updates
Disease nomenclature
● Nosology, classification, ontology
● 17k codes in ICD-9. 155k codes in ICD-10.
● Implicit: Disease model of medicine
My recent Dx: Otitis
Disease vs. Condition vs. Symptom vs. Phenotype
℅ WebMD
IDG KMC: Gene expression vs. Tissues; Different sources, tissue terms.
IDG-KMC: TCRD - Target Central Research Db
+------------+------------+--------+------+------------------------------------------------------------------+--------+-------+
| doid       | Disease    | zscore | conf | Protein                                                          | idgfam | tdl   |
+------------+------------+--------+------+------------------------------------------------------------------+--------+-------+
| DOID:13189 | Gout       |  3.512 |  1.8 | Alpha-protein kinase 1                                           | Kinase | Tbio  |
| DOID:13189 | Gout       |  3.214 |  1.6 | Serine/threonine-protein kinase SIK1                             | Kinase | Tchem |
| DOID:13189 | Gout       |  2.922 |  1.5 | Melanocortin receptor 3                                          | GPCR   | Tchem |
| DOID:13189 | Gout       |  2.797 |  1.4 | Taste receptor type 2 member 30                                  | GPCR   | Tbio  |
| DOID:13189 | Gout       |  2.576 |  1.3 | Taste receptor type 2 member 16                                  | GPCR   | Tbio  |
| DOID:13189 | Gout       |  2.379 |  1.2 | Hepatocyte nuclear factor 4-gamma                                | NR     | Tbio  |
| DOID:13189 | Gout       |  2.441 |  1.2 | Tyrosine-protein kinase SYK                                      | Kinase | Tchem |
| DOID:13189 | Gout       |  1.948 |  1.0 | cGMP-dependent protein kinase 2                                  | Kinase | Tchem |
| DOID:13189 | Gout       |  1.798 |  0.9 | Pannexin-1                                                       | IC     | Tbio  |
| DOID:13189 | Gout       |  1.517 |  0.8 | Taste receptor type 2 member 38                                  | GPCR   | Tbio  |
| DOID:13189 | Gout       |  1.565 |  0.8 | Transient receptor potential cation channel subfamily A member 1 | IC     | Tclin |
| DOID:13189 | Gout       |  1.531 |  0.8 | Transient receptor potential cation channel subfamily V member 1 | IC     | Tclin |
| DOID:13189 | Gout       |  1.388 |  0.7 | Adenosine kinase                                                 | Kinase | Tchem |
| DOID:13189 | Gout       |  1.427 |  0.7 | Interleukin-1 receptor-associated kinase 1                       | Kinase | Tchem |
| DOID:13189 | Gout       |  1.375 |  0.7 | Transient receptor potential cation channel subfamily M member 3 | IC     | Tbio  |
| DOID:13189 | Gout       |  1.255 |  0.6 | Free fatty acid receptor 4                                       | GPCR   | Tchem |
| DOID:13189 | Gout       |  1.231 |  0.6 | P2X purinoceptor 2                                               | IC     | Tbio  |
| DOID:13189 | Gout       |  1.198 |  0.6 | Proto-oncogene tyrosine-protein kinase Src                       | Kinase | Tclin |
| DOID:13189 | Gout       |  1.108 |  0.6 | Tribbles homolog 1                                               | Kinase | Tbio  |
| DOID:13189 | Gout       |  1.093 |  0.5 | Activin receptor type-1B                                         | Kinase | Tchem |
| DOID:13189 | Gout       |  1.048 |  0.5 | Transient receptor potential cation channel subfamily V member 2 | IC     | Tbio  |
+------------+------------+--------+------+------------------------------------------------------------------+--------+-------+
Disease-gene associations via literature text mining.
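The gout table above can be summarized programmatically. This is a minimal sketch, assuming the rows have been exported as Python tuples; only five rows are copied from the table, and the names rows and count_by are hypothetical, not part of TCRD.

```python
from collections import Counter

# (protein, idgfam, tdl, zscore) copied from a few rows of the gout table above.
rows = [
    ("Alpha-protein kinase 1", "Kinase", "Tbio", 3.512),
    ("Serine/threonine-protein kinase SIK1", "Kinase", "Tchem", 3.214),
    ("Melanocortin receptor 3", "GPCR", "Tchem", 2.922),
    ("Pannexin-1", "IC", "Tbio", 1.798),
    ("Transient receptor potential cation channel subfamily V member 1", "IC", "Tclin", 1.531),
]


def count_by(rows, index):
    """Count rows by the value found at the given column position."""
    return Counter(row[index] for row in rows)


family_counts = count_by(rows, 1)  # grouped by idgfam
tdl_counts = count_by(rows, 2)     # grouped by tdl
print(family_counts)
print(tdl_counts)
```

Grouping by idgfam or tdl in this way only works because those two columns use a small controlled vocabulary, unlike the free-text disease and drug names discussed earlier.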
Text mining, named entity recognition, term frequency
Natural language processing, Google, Watson, Siri, and the state of the art
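Dictionary-based entity recognition combined with term frequency can be sketched in a few lines of Python. The dictionary, the sample sentence, and the function names below are all hypothetical and chosen for illustration; production text mining relies on much larger vocabularies and on disambiguation, which this sketch does not attempt.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary mapping lowercase surface terms to entity types.
DICTIONARY = {
    "gout": "Disease",
    "otitis": "Disease",
    "pannexin-1": "Protein",
}


def tokenize(text):
    """Lowercase the text and split it into tokens, keeping internal hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def tag_entities(text):
    """Return (term, entity type, frequency) for each dictionary term found."""
    counts = Counter(tokenize(text))
    return [(term, DICTIONARY[term], counts[term])
            for term in DICTIONARY if counts[term] > 0]


# Made-up sentence, used only to exercise the functions.
sample = "Pannexin-1 was mentioned alongside gout; gout was the focus."
for term, entity_type, frequency in tag_entities(sample):
    print(term, entity_type, frequency)
```

Exact string matching is the simplest possible recognizer; it misses synonyms and spelling variants, which is the nomenclature problem raised in the drug and disease cases above.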
Language Diversity of Computers
Final Thought:
“Can we talk?”*
℅ Joan Rivers, 1933-2014