transcriptomics and lexico-syntactic analysis

Transcriptomics and Lexico-syntactic Analysis

(Yet another meaning of the TLA homonym)

Lars Juhl Jensen

EMBL Heidelberg

A brief history of TLA

• The joke started at the E-BioSci/ORIEL Annual Workshop

– Barend and I (among a few others) gave somewhat provocative talks

– We afterwards discussed the homonym problem caused by excessive use of acronyms

– I told the TLA joke ...

• I got invited to give a talk here

– Thanks!

– I was asked for the title of my talk and TLA was suggested

– Which meant I had to do some serious research on TLA ...

• http://www.acronymfinder.com

– Three Letter Abbreviation

– Three Letter Acronym

– Telemetry Link Adapter

– Telephone Link Adapter

– Temporary Lodging Allowance

– Temporary Lodging Assistance

– Tennessee Library Association

– Term Loan A

– Terminal Low Altitude

– Texas Library Association

– Theater of the Living Arts

– Thin Layer Activation

– Three Letter Agency

– ...


The context – what I actually work on

(When I’m not telling other people to work on my IE project)

• Construction and analysis of cross-species networks of functional associations

• This network will contain many types of edges

– Assignment of orthologous/homologous genes

– In silico links derived from genomic context (protein fusion, phylogenetic profiles, and genomic co-localization)

– Links supported by similar gene expression profiles

– Protein interaction data from large-scale screens

– Literature-derived links extracted from Medline abstracts

• To do this we must be able to resolve the gazillion different names and identifiers for each gene in each species

“Biologists would rather share their toothbrush than share a gene name”

• Lists of synonymous identifiers and names were compiled from

– SWISS-PROT/TrEMBL

– SGD, WormBase, and FlyBase

– BLAST search against UniGene

• Several types of identifiers

– Various database identifiers and accession numbers

– Gene symbols and gene names

• Lack of standardization

– 8+ identifiers per yeast gene

– Many names refer to unrelated genes in different species

The synonyms and orthologs lists can be downloaded from: http://www.bork.embl.de/synonyms

Number of uniquely resolvable names for each species

Species         | Proteins | Names   | Ratio | Species-specific | SWALL   | UniGene
A. thaliana     | 25,957   | 138,976 | 5.3   | -                | 118,818 | 20,158
C. elegans      | 20,348   | 110,602 | 5.4   | 45,835           | 65,749  | 18,214
D. melanogaster | 16,871   | 103,208 | 6.1   | 22,707           | 77,757  | 14,072
H. sapiens      | 27,936   | 200,130 | 7.1   | -                | 181,186 | 18,944
M. musculus     | 20,006   | 132,577 | 6.6   | -                | 116,712 | 15,865
S. cerevisiae   | 6,210    | 48,291  | 7.7   | 18,702           | 40,038  | -
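What "uniquely resolvable" could mean in practice is sketched below: a name counts for a species only if it points to exactly one gene within that species. This is a minimal illustration, not the actual pipeline; the function name, the record layout, the case-insensitive matching, and all identifiers other than the CDC28/YBR160W pair are assumptions made for the example.

```python
from collections import defaultdict

def uniquely_resolvable(synonyms):
    """Return {species: {name: gene}}, keeping only unambiguous names.

    synonyms: iterable of (species, name, gene) records (hypothetical layout).
    """
    genes_by_name = defaultdict(set)
    for species, name, gene in synonyms:
        # Assumption: names are compared case-insensitively.
        genes_by_name[(species, name.lower())].add(gene)

    resolved = defaultdict(dict)
    for (species, name), genes in genes_by_name.items():
        if len(genes) == 1:  # the name points to exactly one gene
            resolved[species][name] = next(iter(genes))
    return dict(resolved)

# Toy records; GENE_X, GENE_Y, GENE_Z and the name ABC1 are made up.
synonyms = [
    ("S. cerevisiae", "CDC28", "YBR160W"),
    ("S. cerevisiae", "CDK1", "YBR160W"),
    ("S. cerevisiae", "ABC1", "GENE_X"),
    ("S. cerevisiae", "ABC1", "GENE_Y"),  # ambiguous within yeast: dropped
    ("H. sapiens", "ABC1", "GENE_Z"),     # same name, other species: kept
]
resolved = uniquely_resolvable(synonyms)
```

Resolving per species is the design choice that matters here: a name shared by unrelated genes in different species stays usable as long as the species is known.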

Orthographic variations of gene names

• The list of gene names is automatically expanded to include the most common orthographic variations

– A hyphen can be replaced by a space

– “p” can be used as a postfix on gene names to signify proteins

• This orthographic expansion gives rise to ~280,000 gene/protein names in yeast alone

• There is still quite a lot of orthographic variation missed

– Multi-word gene names often cause trouble as word order can sometimes change

– Greek letters in names also give some problems
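The two expansion rules named above (hyphen to space, "p" postfix) can be sketched in a few lines. This is only an illustration of those two rules; the function name is hypothetical, "ABC-1" is a made-up gene name, and the real expansion presumably handles more cases.

```python
def expand_name(name):
    """Generate orthographic variants of a gene name (sketch of two rules only)."""
    variants = {name}
    # Rule 1: a hyphen can be replaced by a space.
    if "-" in name:
        variants.add(name.replace("-", " "))
    # Rule 2: a "p" postfix signifies the protein product.
    variants |= {variant + "p" for variant in variants}
    return variants

plain = expand_name("YAP1")       # {"YAP1", "YAP1p"}
hyphenated = expand_name("ABC-1")  # made-up name with a hyphen
```

Because the rules multiply (each hyphen variant also gets a postfix form), even two rules grow the name list quickly, which is consistent with the large number of names reported for yeast.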

Retraining TreeTagger for Medline abstracts

• The English parameter file distributed with TreeTagger was trained on the UPenn Treebank

• We retrained TreeTagger on the manually annotated GENIA 3.0 corpus (466,179 tokens) adding gene names to the dictionary

• Performance of the two taggers was evaluated on 55,166 tokens not used during training

• Retraining eliminated more than half of all tagging errors

Tagging is really easy ... compared to extracting the information you are after

• Many ways to write the same thing

– A activates the transcription of B

– B transcription is induced by A

– A is a transcriptional activator of B

– Overexpression of A increases B mRNA levels

– Transcription is enhanced when A binds to the B promoter

– The B promoter contains an A UAS

• Multiple pieces of information and negations in a sentence

– A is a transcriptional activator of B, C, D, E, and F

– B was not suppressed by A

– The A transcription factor affects B but not C

– C phosphorylation of A leads to increased expression of B
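To make the difficulty concrete, here is a deliberately naive sketch that handles exactly one of the phrasings above, expands a coordinated target list, and flags a negation. The regular expressions and function names are assumptions for illustration; they are not the patterns used in the actual system, which works on chunked, tagged text rather than raw strings.

```python
import re

def split_coordination(text):
    """Split 'B, C, D, E, and F' or 'B and C' into individual names."""
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", text.strip())
    return [part for part in parts if part]

def extract_activations(sentence):
    """Return (agent, relation, target, negated) tuples for ONE phrasing only."""
    match = re.match(
        r"(\w+) is (not )?a transcriptional activator of (.+?)\.?$", sentence
    )
    if match is None:
        return []
    agent = match.group(1)
    negated = match.group(2) is not None
    targets = split_coordination(match.group(3))
    return [(agent, "activates", target, negated) for target in targets]

listed = extract_activations("A is a transcriptional activator of B, C, D, E, and F")
negated = extract_activations("A is not a transcriptional activator of B")
missed = extract_activations("B transcription is induced by A")
```

The third call returns nothing although the sentence states the same kind of fact, which is the point of the slide: every additional phrasing needs its own pattern.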

A mini-ontology of transcription regulation

Entities (boxes)

• generic (gray)

• regulator (yellow)

• activator (red)

• repressor (green)

• target (blue)

Relations (arrows)

• is-a (black)

• part-of (blue)

Events (arrows)

• creates (green)

• binds (red)

[Figure: ontology diagram. Node labels: Gene, Transcript, Gene product, Stable RNA, Promoter, Upstream regulatory element, Upstream activating sequence, Upstream repressing sequence, mRNA, Protein, Transcription regulator, Transcription activator, Transcription repressor, Coding sequence, rRNA, tRNA]
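A mini-ontology like this can be held in a small lookup table so that a noun chunk's category can be tested against a more general one. The sketch below is hedged heavily: the is-a edges are guesses from the node names alone, not a transcription of the arrows in the diagram, and the function names are hypothetical.

```python
# Guessed is-a edges (child -> parent); NOT read off the slide's diagram.
IS_A = {
    "Transcription activator": "Transcription regulator",
    "Transcription repressor": "Transcription regulator",
    "Upstream activating sequence": "Upstream regulatory element",
    "Upstream repressing sequence": "Upstream regulatory element",
    "mRNA": "Transcript",
    "Stable RNA": "Transcript",
    "rRNA": "Stable RNA",
    "tRNA": "Stable RNA",
}

def ancestors(term):
    """Return every category reachable from term by following is-a edges."""
    found = []
    while term in IS_A:
        term = IS_A[term]
        found.append(term)
    return found

def is_a(term, category):
    """True if term equals category or is a (transitive) kind of it."""
    return term == category or category in ancestors(term)

trna_chain = ancestors("tRNA")
```

With such a table, a pattern written for "Transcription regulator" would also fire on chunks categorized as activators or repressors.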

Parsing abstracts to identify relationships between genes/proteins

• Sentence and word boundaries are identified using Tokenizer

• Our retrained TreeTagger is used for tagging part-of-speech

• Abstracts are chunked with a custom CASS grammar to identify noun and verb chunks

• Noun chunks are categorized according to a mini-ontology

• Lexico-syntactic patterns are used to identify event chunks

SRN1      NNPG  NXPGSG  EVSUPVA
can       MD            |
suppress  SUPV          |
rna2      NNPG  NXPGPL  |
rna3      NNPG  |       |
rna4      NNPG  |       |
rna5      NNPG  |       |
rna6      NNPG  |       |
and       CC    |       |
rna8      NNPG  |       |
singly    RB
or        CC
in        IN
pairs     NNS

TIGERSearch is used for searching and browsing the large processed text corpus

Patterns recognize sentences in both active and passive voice
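How one pattern can cover both voices is sketched below on a flat list of (token, tag) pairs. This is a toy simplification, not the CASS grammar: the verb lexicon, the function name, and the token-window logic are assumptions, and NNPG is used as the gene tag only because it appears on gene names in the chunker output shown earlier.

```python
GENE_TAG = "NNPG"  # assumed to mark gene names, as in the chunker output
RELATIONS = {      # tiny hypothetical verb lexicon
    "activates": "activation", "activated": "activation",
    "induces": "activation", "induced": "activation",
    "represses": "repression", "repressed": "repression",
}
AUXILIARIES = {"is", "are", "was", "were"}

def extract_events(tagged):
    """Return (regulator, relation, target) triples from (token, tag) pairs."""
    events = []
    for i, (token, _tag) in enumerate(tagged):
        relation = RELATIONS.get(token.lower())
        if relation is None:
            continue
        # Nearest gene on each side of the verb.
        left = next((t for t, tag in reversed(tagged[:i]) if tag == GENE_TAG), None)
        right = next((t for t, tag in tagged[i + 1:] if tag == GENE_TAG), None)
        if left is None or right is None:
            continue
        # Passive voice: auxiliary before the verb and "by" right after it.
        passive = (
            i > 0
            and tagged[i - 1][0].lower() in AUXILIARIES
            and i + 1 < len(tagged)
            and tagged[i + 1][0].lower() == "by"
        )
        regulator, target = (right, left) if passive else (left, right)
        events.append((regulator, relation, target))
    return events

active = extract_events([("LRE1", "NNPG"), ("represses", "VBZ"), ("CTS1", "NNPG")])
passive = extract_events([
    ("CTS1", "NNPG"), ("is", "VBZ"), ("repressed", "VBN"),
    ("by", "IN"), ("LRE1", "NNPG"),
])
```

Both calls yield the same triple, with LRE1 as the regulator; in the passive case the roles of the left and right gene are simply swapped.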

Typical results are shown

• “The expression of an FLR1 lacZ reporter construct is strongly induced by the overexpression of either CAP1 or YAP1, indicating that the FLR1 gene is transcriptionally regulated by the CAP1 and YAP1 proteins”

• “In addition , the mot3Delta mutation caused a partial derepression of the Mig1 Tup1 Ssn6 repressed SUC2 gene, but not the alpha2 Mcm1 Tup1 Ssn6 repressed STE2 gene”

• “We demonstrate here that overexpression of LRE1 represses CTS1 whereas deletion of LRE1 induces the expression of CTS1”

We can only wish that all biologists mention their results twice

• “The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1 , which binds to dissimilar DNA sequences in UAS 1 of CYC1 and the UAS of CYC7”

– 2 correct regulatory interactions identified

– The second mention of the same interactions is missed

• “A disruption of the SKO1 gene causes a partial derepression of SUC2 , indicating that SKO1 is a negative regulator of the SUC2 gene”

– We correctly interpret their results and extract an activation

– But we miss the following sentence where the results are explicitly interpreted for us?!

Two out of three is not bad at all

• “Multicopy MEU1 expression suppressed the constitutive ADH2 expression caused by cre2 1. Disruption of MEU1 reduced endogenous ADH2 expression about twofold but had no effect on cell viability or growth”

– Correctly identifies the suppression twice (despite the tokenizer accidentally joining two sentences!)

– We cannot handle overlapping events that are not fully embedded

• “Sin3p negatively regulates the INO1 , CHO1 , CHO2 and OPI3 genes while Ume6p negatively regulates the INO1 gene and positively regulates the other genes”

– Correctly identifies 4 negative regulations from two events

– Misses 3 positive regulations of “the other genes”

Life is unfair

• “The Uga43p factor negatively regulates GZF3 expression and vice versa”

– Be happy for what you got ...

• “With wild type CDC28 , filament formation induced by CLN1 overexpression was markedly decreased in a SWE1 deletion”

– ... even if you got it by pure luck

Why not extract phosphorylations while we are at it?

• “Loss of Hnt1 enzyme activity also leads to hypersensitivity to mutations in Ccl1 , Tfb3 , and Kin28 , which constitute the TFIIK kinase subcomplex of general transcription factor TFIIH and to mutations in Cak1 , which phosphorylates Kin28”

• “Consistent with the proposed model , Pkc1p selectively phosphorylates Bck1p in vitro and Mpk1p protein kinase activity requires a functional BCK1 gene”

Using text mining of Medline abstracts to support predicted regulatory interactions

• By applying the scheme just described to all Medline abstracts, a set of regulatory interactions in multiple species is obtained

• We will use it to classify protein associations derived from

– Microarray gene expression

– Chromatin IP data

– Physical protein interaction screens (e.g. Y2H and TAP)

– Cross-species analysis of genomic context (STRING)

• To integrate all of these different data sources, the list of synonymous gene names and identifiers is again needed, as different data sets use different identifiers

Microarrays 101

• The level of expression in two samples can be compared for all genes simultaneously

• Each spot contains either cDNA or short probes specific to one gene

• The amount of labeled mRNA from a sample that hybridizes to each spot is measured as a fluorescence intensity

• Spotted microarrays are quite cheap compared to GeneChips

Non-linear normalization of intensities and correction for spatial effects

[Figure panels: Downloaded SMD data; After intensity normalization; Spatial bias estimate; After spatial normalization]

Combining arrays from multiple experiments into one gene expression matrix

• For each species, all arrays in SMD are merged

– To ensure comparable data, all arrays were re-normalized

– A matrix is constructed with each row being a gene and each column an array

• This integration is complicated by the lack of consistency in the choice of gene identifiers even within SMD

– To deal with this, a very large list of synonymous gene names and identifiers was compiled based on SGD, WormBase, FlyBase, SWISS-PROT, and UniGene

– Such a list is also very useful for integrating the expression data with protein-protein interaction data, STRING, and text-mining of Medline abstracts
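The merging step can be sketched as follows, assuming each array is a mapping from whatever identifier it happens to use to a normalized value, and the synonyms list is a mapping from identifier to one canonical gene. All identifiers are made up, the function name is hypothetical, and averaging duplicate spots and marking missing genes with None are assumptions, not documented choices.

```python
def build_expression_matrix(arrays, synonyms):
    """Merge arrays into a genes x arrays matrix via a synonyms table.

    arrays:   list of {identifier: value}, one dict per array
    synonyms: {identifier: canonical gene name}
    """
    genes = sorted(
        {synonyms[ident] for array in arrays for ident in array if ident in synonyms}
    )
    matrix = []
    for gene in genes:
        row = []
        for array in arrays:
            # All spots on this array that resolve to the same gene.
            values = [v for ident, v in array.items() if synonyms.get(ident) == gene]
            # Assumption: average duplicates, None when the gene is absent.
            row.append(sum(values) / len(values) if values else None)
        matrix.append(row)
    return genes, matrix

# Toy data; every identifier here is hypothetical.
synonyms = {"ORF001": "geneA", "geneA": "geneA", "GA1": "geneA",
            "ORF002": "geneB", "geneB": "geneB"}
arrays = [
    {"ORF001": 1.0, "ORF002": -0.5},
    {"geneA": 0.75},
    {"GA1": 1.25, "geneB": -0.25, "ORF002": -0.5, "unknown_id": 2.0},
]
genes, matrix = build_expression_matrix(arrays, synonyms)
```

Identifiers that cannot be resolved (here "unknown_id") are silently dropped in this sketch; a real pipeline would probably want to log them.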

“And now we cluster correlated expression profiles ... no, wait a second!”

• Traditional clustering of genes with correlated expression profiles is not well suited for inferring functional links

• No appropriate distance measure

– All arrays are not of the same quality

– Not all experiments will be equally useful for inferring function

– The arrays are not all from mutually independent experiments

• Multi-functional proteins

– If (A,B) are in the same cluster and (A,C) also cluster together, (B,C) will by definition be in the same cluster

– Functional relations for the pairs (A,B) and (A,C) do not necessarily imply a functional relation for (B,C) if A has two or more functions
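The transitivity problem can be shown with a toy example. The sketch below treats a hard clustering as the connected components of the pairwise links (a union-find illustration chosen for brevity, not a method from this talk); the function name and gene labels are hypothetical.

```python
def clusters_from_links(items, links):
    """Group items into clusters: linked items always share a cluster."""
    parent = {item: item for item in items}

    def find(item):
        # Follow parent pointers to the cluster representative.
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for a, b in links:
        parent[find(a)] = find(b)

    grouped = {}
    for item in items:
        grouped.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in grouped.values())

# A is linked to B (through one function) and to C (through another).
links = [("A", "B"), ("A", "C")]
clusters = clusters_from_links(["A", "B", "C", "D"], links)
```

B and C end up in the same cluster although no link between them was ever observed, whereas a list of pairwise links would keep (A,B) and (A,C) without asserting (B,C).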

Singular value decomposition – letting the data speak for themselves

• Singular value decomposition is run on the gene expression matrix

– Defines an ordered set of non-correlated basis vectors

– Each singular vector is a linear combination of arrays

• The first singular vectors effectively average over related arrays

– Finds replicate arrays including dye-swaps

– Adjacent arrays in time series and related experiments are combined

• The last vectors mainly contain noise, e.g. replicate differences
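In symbols, and only as a sketch: assume the expression matrix X has one row per gene and one column per array (m genes, n arrays, rank r), with x_j denoting column j, i.e. array j. The notation is introduced here for illustration and does not come from the slides.

```latex
\[
X = U \Sigma V^{\top}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0
\]
\[
u_k = \frac{1}{\sigma_k} X v_k
    = \frac{1}{\sigma_k} \sum_{j=1}^{n} v_{jk}\, x_j ,
\qquad
u_k^{\top} u_l = \delta_{kl}
\]
```

The second line is the sense in which each singular vector u_k is a linear combination of arrays: the weights v_jk say how much array j contributes. Replicate arrays would be expected to receive similar weights and dye-swaps weights of opposite sign, so the leading vectors act like averages over related arrays, while the vectors with the smallest singular values are left with what the replicates disagree on.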

[Figure: singular vectors 1 to 8. Labels: RNA stability, Starvation, Heat-shock, Salt treatment, Polysomes, Sporulation]

Inferring functional links from projections of genes onto singular vectors

• Analyze each singular vector

– Do 1D density estimation of expression ratio projections for genes of known function

– 2D density estimation for pairs of functionally related genes

– Use Bayes’ law for estimating log-odds of functional link given a pair of projections

• Different types of regulation

– Up- vs. down-regulation

– Anti-correlated expression

• The log-odds from the first N singular vectors are summed
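One way to read the recipe above is sketched below. It swaps in plain histograms with pseudocounts for whatever density estimator was actually used, and it treats the product of two 1D densities as the "no link" model; both are assumptions, as are the function names, the bin width, the prior term, and the toy data.

```python
import math
from collections import Counter

def make_scorer(projections, linked_pairs, width=1.0, pseudo=1.0):
    """Build a per-vector scorer from histogram density estimates.

    projections:  {gene: projection onto one singular vector}
    linked_pairs: list of (gene_a, gene_b) known to be functionally related
    """
    def bin_of(x):
        return math.floor(x / width)

    bins = {bin_of(x) for x in projections.values()}
    single = Counter(bin_of(x) for x in projections.values())  # 1D histogram
    paired = Counter()                                         # 2D histogram
    for a, b in linked_pairs:
        ba, bb = bin_of(projections[a]), bin_of(projections[b])
        paired[(ba, bb)] += 1
        paired[(bb, ba)] += 1  # a link has no direction
    n_single = sum(single.values()) + pseudo * len(bins)
    n_paired = sum(paired.values()) + pseudo * len(bins) ** 2

    def score(xa, xb):
        """Log ratio of linked-pair density to independent background."""
        ba, bb = bin_of(xa), bin_of(xb)
        p_linked = (paired[(ba, bb)] + pseudo) / n_paired
        p_background = ((single[ba] + pseudo) / n_single) * (
            (single[bb] + pseudo) / n_single
        )
        return math.log(p_linked / p_background)

    return score

def link_log_odds(scorers, proj_a, proj_b, prior_log_odds=0.0):
    """Sum the per-vector log ratios over the first N singular vectors."""
    return prior_log_odds + sum(
        s(xa, xb) for s, xa, xb in zip(scorers, proj_a, proj_b)
    )

# Toy data for a single singular vector; gene names are made up.
vector1 = {"g1": 2.3, "g2": 2.6, "g3": 2.1, "g4": -1.5, "g5": -1.2, "g6": 0.4}
known_links = [("g1", "g2"), ("g1", "g3"), ("g4", "g5")]
scorer = make_scorer(vector1, known_links)
same_side = link_log_odds([scorer], [2.3], [2.6])
opposite = link_log_odds([scorer], [2.3], [-1.5])
```

Because the 2D histogram is kept over pairs of bins rather than over a single distance, known pairs with anti-correlated projections would raise the score for that combination too. Summing over vectors treats them as independent evidence, which is an approximation.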

Proteins linked to the human mitotic checkpoint protein BUB1

Identifier | Description | Comments
Q8WVP0     | Kinesin-like 5 | Mitotic kinesin-like protein 1
CDN3_HUMAN | Cyclin-dependent kinase inhibitor 3 |
HMG2_HUMAN | High mobility group protein 2 |
FXM1_HUMAN | Forkhead box protein M1 | Phosphorylated in M-phase
BASP_HUMAN | Brain acid soluble protein 1 | Associated with "growth cones"
MYBB_HUMAN | Myb-related protein B | Phosphorylated by Cdk2 during S-phase
O14731     | Membrane-associated kinase | Cell cycle regulated kinase, inhibits Cdc2
TP2A_HUMAN | DNA topoisomerase II |
MPI1_HUMAN | M-phase inducer phosphatase 1 |
PMC1_HUMAN | Polymyositis/scleroderma autoantigen 1 |
Q8N324     | Hypothetical protein | Contains a PRY and a SPRY domain
CGA2_HUMAN | Cyclin A2 |
Q9NZJ0     | L2DTL protein | Contains six WD40 repeats
Q15003     | HCAP-H protein |
KF14_HUMAN | Kinesin-like protein KIF14 |
Q96SE4     | Kinesin-like protein 2 |
KNS2_HUMAN | Kinesin-like protein 2 |
WEE1_HUMAN | Wee1-like protein kinase | May act as a negative regulator of entry into mitosis
CHK1_HUMAN | Serine/threonine-protein kinase Chk1 | Involved in cell cycle arrest
NEK2_HUMAN | Serine/threonine-protein kinase NEK2 | Involved in mitotic regulation
CKS1_HUMAN | Cyclin-dependent kinases regulatory subunit |
O14980     | CRM1 protein | Cell cycle-dependent expression

Tha-tha-tha-that’s all folks!

• I believe that literature mining methods are very useful and much needed in the field of biology

– It is simply not possible to read through all potentially interesting papers being written on any but the most narrowly defined topics

– However, it is not very useful if used alone

• Information extraction is particularly important for interpreting high-throughput experiments

– The data sets do by nature ask global/broad questions

– Yet they should be interpreted in the context of current knowledge

• We are working on several fronts on making a system that allows a unified overview of both data and literature

The STRING web service

• Relies on genomic context analysis of 110 species with a total of 440,000 genes

• Most of these are prokaryotes (8 eukaryotic genomes)

• Contains orthologous group assignment for 80% of the genes (50% for eukaryotes)

• Of those, 70% have links with >75% accuracy

STRING is accessible at: http://www.bork.embl.de/STRING

Honestly – it’s not my fault!

• The text mining people at EML

– Jasmin Saric

– Isabel Rojas

• The STRING team

– Christian von Mering

– Berend Snel

– Martijn Huynen

– Daniel Jaeggi

– Steffen Schmidt

– Peer Bork

• Microarray normalization

– Chris Workman

• PROPHECIES web service

– Julien Lagarde

• Web resources

– www.bork.embl.de/STRING (soon moving to string.embl.de)

– www.bork.embl.de/synonyms


Questions?
