Literature Mining and Systems Biology Lars Juhl Jensen EMBL

Post on 14-Jan-2016

Page 1: Literature Mining and Systems Biology

Literature Mining and Systems Biology

Lars Juhl Jensen

EMBL

Page 2: Literature Mining and Systems Biology

Why?

Page 3: Literature Mining and Systems Biology

Overview

• Information retrieval: finding the papers

• Entity recognition: identifying the substance(s)

• Information extraction: formalizing the facts

• Text mining: finding nuggets in the literature

• Integration: combining text and biological data

Page 4: Literature Mining and Systems Biology

Status

• IR, ER, and simple IE methods are fairly well established

• Advanced NLP-based IE systems are rapidly being improved

• Methods for text mining and text/data integration are still in their infancy

Page 5: Literature Mining and Systems Biology

Example

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 6: Literature Mining and Systems Biology

Information retrieval

• Ad hoc information retrieval
  - The user enters a query/a set of keywords
  - The system attempts to retrieve the relevant texts from a large text corpus (typically Medline)

• Text categorization
  - A training set of texts is created in which texts are manually assigned to classes (often only yes/no)
  - A machine learning method is trained to classify texts
  - This method can subsequently be used to classify a much larger text corpus

Page 7: Literature Mining and Systems Biology

Ad hoc IR

• These systems are very useful since the user can provide any query
  - The query is typically Boolean (yeast AND cell cycle)
  - A few systems instead allow the relative weight of each search term to be specified by the user

• The art is to find the relevant papers even if they do not actually match the query
  - Ideally our example sentence should be retrieved by the query “yeast cell cycle” although none of these words are mentioned

Page 12: Literature Mining and Systems Biology

Automatic query expansion

• In a typical query, the user will not have provided all relevant words and variants thereof

• By automatically expanding queries with additional search terms, recall can be improved
  - Stemming removes common endings (yeast / yeasts)
  - Thesauri can be used to expand queries with synonyms and/or abbreviations (yeast / S. cerevisiae)
  - The next logical step is to use ontologies to make complex inferences (yeast cell cycle / Cdc28)
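The first two expansion steps above (stemming and thesaurus lookup) can be sketched in a few lines of Python. This is only an illustration: the `THESAURUS` table, the `stem` rule, and the function names are hypothetical, the suffix stripping is far cruder than a real stemmer, and the ontology-based inference step is not covered.

```python
# Hypothetical toy thesaurus; a real system would load a curated synonyms list.
THESAURUS = {
    "yeast": {"s. cerevisiae", "saccharomyces cerevisiae"},
}

def stem(word):
    """Crude plural stripping (an assumption, not a real stemming algorithm)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def expand_query(terms):
    """Return the set of search terms after stemming and synonym expansion."""
    expanded = set()
    for term in terms:
        base = stem(term.lower())
        expanded.add(base)
        # Add every synonym or abbreviation known for the stemmed term.
        expanded |= THESAURUS.get(base, set())
    return expanded
```

Calling `expand_query(["Yeasts"])` yields the stemmed term plus the two toy synonyms, which would then be OR-ed together in the actual search.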

Page 14: Literature Mining and Systems Biology

Document similarity

• The similarity of two documents can be defined based on their word content
  - Each document can be represented by a word vector
  - Words should be weighted based on their frequency and background frequency
  - The most commonly used scheme is tf*idf weighting

• Document similarity can be used in ad hoc IR
  - Rather than matching the query against each document only, the N most similar documents are also considered
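As a rough sketch of the word-vector idea, the following Python computes tf*idf weights and the cosine similarity between two documents. The slide does not say which tf*idf variant or similarity measure is meant; relative term frequency, a plain logarithmic idf, and cosine similarity are assumptions made here, and the function names are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Map each tokenized document to a {word: tf*idf} dictionary."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Words that occur in every document receive an idf of zero, which is how background frequency removes uninformative terms from the comparison.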

Page 15: Literature Mining and Systems Biology

Document clustering

• Unsupervised clustering algorithms can be applied to a document similarity matrix
  - All pairwise document similarities are calculated
  - Clusters of “similar documents” can be constructed using one of numerous standard clustering methods

• Practical uses of document clustering
  - The “related documents” function in PubMed
  - Logical organization of the documents found by IR
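One of the simplest standard methods is single-linkage clustering cut at a fixed similarity threshold, sketched below. The choice of method and the threshold parameter are assumptions for illustration; the slide does not name a specific algorithm.

```python
def cluster_by_threshold(similarity, threshold):
    """Single-linkage clustering: documents end up in the same cluster if they
    are connected by a chain of pairs with similarity >= threshold."""
    n = len(similarity)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

The input is the full pairwise similarity matrix, so the cost grows quadratically with the number of documents.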

Page 16: Literature Mining and Systems Biology

Text categorization

• These systems are a lot less flexible than ad hoc systems but can attain better accuracy
  - Works on a pre-defined set of document classes
  - Each class is defined by manually assigning a number of documents to it

• Method
  - Rules may be manually crafted based on a very small set of manually classified documents
  - Statistical machine learning methods can be trained on a large number of classified documents

Page 17: Literature Mining and Systems Biology

Example

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Hints in the text
  - Strong: Cdc28 and Swe1 (“cell cycle” and “yeast”)
  - Weaker: mitotic cyclin, Clb2, and Cdk1 (“cell cycle”)

Page 18: Literature Mining and Systems Biology

Machine learning

• Input features
  - Word content or bi-/tri-grams
  - Part-of-speech tags
  - Filtering (stop words, part-of-speech)
  - Singular value decomposition

• Training
  - Support vector machines are best suited
  - Choice of kernel function
  - Separate training and evaluation sets, cross validation
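The feature extraction and data splitting steps can be sketched without any machine learning library. The example below builds unigram and bigram counts after stop-word filtering and produces cross-validation folds; it does not train an SVM. The stop-word list, the whitespace tokenizer, and the function names are illustrative assumptions.

```python
from collections import Counter

# Illustrative stop-word list only; real systems use much longer lists.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "as", "this"}

def ngram_features(text, max_n=2):
    """Count word n-grams (n = 1..max_n) after removing stop words."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    features = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features

def k_fold_indices(n_items, k):
    """Yield (train, test) index lists; item i is in the test set of fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test
```

Each document's feature dictionary would then be turned into a vector and passed to the classifier, with every fold used once for evaluation.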

Page 19: Literature Mining and Systems Biology

Entity recognition

• An important but boring problem
  - The genes/proteins/drugs mentioned within a given text must be identified

• Recognition vs. identification
  - Recognition: find the words that are names of entities
  - Identification: figure out which entities they refer to
  - Recognition without identification is of limited use

Page 20: Literature Mining and Systems Biology

Example

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Entities identified
  - S. cerevisiae proteins: Clb2 (YPR119W), Cdc28 (YBR160W), Swe1 (YJL187C), and Cdc5 (YMR001C)

Page 21: Literature Mining and Systems Biology

Recognition

• Features
  - Morphological: mixes letters and digits or ends in -ase
  - Context: followed by “protein” or “gene”
  - Grammar: should occur as a noun

• Methodologies
  - Manually crafted rule-based systems
  - Machine learning (SVMs)

• But what can it be used for?
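The morphological and context features listed on this slide are easy to compute per token. The sketch below is a hypothetical feature function of the kind that could feed either a rule-based system or a classifier; the grammar feature is left out because it would require a part-of-speech tagger, and the feature names are made up for illustration.

```python
import re

def recognition_features(tokens, index):
    """Return simple name-recognition features for tokens[index]."""
    token = tokens[index]
    next_token = tokens[index + 1].lower() if index + 1 < len(tokens) else ""
    return {
        # Morphological: contains at least one letter and at least one digit.
        "mixes_letters_and_digits": bool(
            re.search(r"[A-Za-z]", token) and re.search(r"\d", token)
        ),
        # Morphological: typical enzyme suffix.
        "ends_in_ase": token.lower().endswith("ase"),
        # Context: immediately followed by a cue word.
        "followed_by_cue_word": next_token in {"protein", "gene"},
    }
```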

Page 22: Literature Mining and Systems Biology

Identification

• A good synonyms list is the key
  - Combine many sources
  - Curate to eliminate stop words

• Flexible matching to handle orthographic variation
  - Case variation: CDC28, Cdc28, and cdc28
  - Prefixes: myc and c-myc
  - Postfixes: Cdc28 and Cdc28p
  - Spaces and hyphens: cdc28 and cdc-28
  - Latin vs. Greek letters: TNF-alpha and TNFA
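One way to implement flexible matching is to reduce every name, both in the synonyms list and in the text, to a canonical key before lookup. The sketch below covers the five kinds of variation listed above using deliberately simple rules. The rules, the `GREEK` table, and the function names are assumptions; the four gene identifiers are the ones given in the earlier example slide.

```python
import re

GREEK = {"alpha": "a", "beta": "b", "gamma": "g"}  # illustrative subset

def canonical_key(name):
    """Collapse orthographic variants of a name onto one lookup key."""
    key = name.lower()                       # case: CDC28 -> cdc28
    key = re.sub(r"^c-", "", key)            # prefix: c-myc -> myc
    for word, letter in GREEK.items():
        key = key.replace(word, letter)      # Greek: tnf-alpha -> tnf-a
    key = re.sub(r"[\s\-]", "", key)         # spaces/hyphens: cdc-28 -> cdc28
    key = re.sub(r"(?<=\d)p$", "", key)      # postfix: cdc28p -> cdc28
    return key

# Toy synonyms list using the identifiers from the example slide.
SYNONYMS = {
    "YPR119W": ["Clb2"],
    "YBR160W": ["Cdc28"],
    "YJL187C": ["Swe1"],
    "YMR001C": ["Cdc5"],
}
INDEX = {
    canonical_key(name): gene_id
    for gene_id, names in SYNONYMS.items()
    for name in names
}

def identify(mention):
    """Return the identifier for a mention, or None if it is unknown."""
    return INDEX.get(canonical_key(mention))
```

Aggressive normalization like this raises recall but also creates new collisions, which is one reason the curated synonyms list matters.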

Page 23: Literature Mining and Systems Biology

Disambiguation

• The same word may mean many different things
  - Entity names may also be common English words (hairy) or technical terms (SDS)
  - Protein names may refer to related or unrelated proteins in other species (cdc2)

• The meaning can be resolved from the context
  - ER can distinguish between names and common words
  - Disambiguating non-unique names is a hard problem
  - Ambiguity between orthologs can safely be ignored

Page 26: Literature Mining and Systems Biology

Co-occurrence extraction

• Relations are extracted for co-occurring entities
  - Relations are always symmetric
  - The type of relation is not given

• Scoring the relations
  - More co-occurrences: more significant
  - Ubiquitous entities: less significant
  - Same sentence vs. same paragraph

• Simple, good recall, poor precision
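A minimal sketch of sentence-level co-occurrence scoring follows. The slide gives no scoring formula, so the one used here (pair count divided by the geometric mean of the two entity counts) is purely an assumption chosen to reward repeated co-occurrence and penalize ubiquitous entities.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_scores(sentences):
    """Score symmetric relations from per-sentence sets of entity names."""
    pair_counts = Counter()
    entity_counts = Counter()
    for entities in sentences:
        entity_counts.update(entities)
        # Sorting makes (a, b) and (b, a) the same key: relations are symmetric.
        for a, b in combinations(sorted(entities), 2):
            pair_counts[(a, b)] += 1
    # Assumed score: co-occurrences normalized by how common each entity is.
    return {
        (a, b): count / (entity_counts[a] * entity_counts[b]) ** 0.5
        for (a, b), count in pair_counts.items()
    }
```

Distinguishing same-sentence from same-paragraph co-occurrence could be added by running the function at both levels and weighting the two scores differently.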

Page 27: Literature Mining and Systems Biology

Example

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Relations
  - Correct: Clb2–Cdc28, Clb2–Swe1, Cdc28–Swe1, and Cdc5–Swe1
  - Wrong: Clb2–Cdc5 and Cdc28–Cdc5

Page 29: Literature Mining and Systems Biology

Categorization of relations

• Extracting specific types of relations
  - Text categorization methods can be used to identify sentences that mention a certain type of relation
  - Filtering can be done before or after relation extraction

• Well suited for database curation
  - Text categorization can be reused
  - High recall is most important
  - Curators can compensate for the lack of precision

Page 30: Literature Mining and Systems Biology

Relation extraction by NLP

• Information is extracted based on parsing and interpreting phrases or full sentences
  - Good at extracting specific types of relations
  - Handles directed relations

• Complex, good precision, poor recall
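The precision/recall trade-off can be illustrated with a much simpler stand-in for a parser: a single surface pattern for active-voice phosphorylation, restricted to known entity names. This is not the NLP approach itself, only a hypothetical sketch; the pattern, the entity set, and the function name are assumptions.

```python
import re

# Entity names taken from the example sentence used throughout the slides.
ENTITIES = {"Clb2", "Cdc28", "Swe1", "Cdc5"}

# One hand-written pattern: "<agent> [directly] phosphorylated <target>".
PATTERN = re.compile(
    r"(?P<agent>\w+) (?:directly )?phosphorylated (?P<target>\w+)"
)

def extract_phosphorylation(sentence):
    """Return directed (kinase, substrate) pairs matched by the pattern."""
    relations = []
    for match in PATTERN.finditer(sentence):
        agent, target = match.group("agent"), match.group("target")
        # Keep only matches where both slots are recognized entities.
        if agent in ENTITIES and target in ENTITIES:
            relations.append((agent, target))
    return relations
```

On "Cdc28 directly phosphorylated Swe1" the pattern returns the directed pair (Cdc28, Swe1), but on the full example sentence it returns nothing, because the parenthetical "(Cdk1 homolog)" separates the agent from the verb. Handling such constructions is what the parsing step is for.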

Page 31: Literature Mining and Systems Biology

Example

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Relations
  - Complex: Clb2–Cdc28
  - Phosphorylation: Clb2→Swe1, Cdc28→Swe1, and Cdc5→Swe1

Page 32: Literature Mining and Systems Biology

An NLP architecture

• Tokenization
  - Entity recognition with synonyms list
  - Word boundaries (multi words)
  - Sentence boundaries (abbreviations)

• Part-of-speech tagging
  - TreeTagger trained on GENIA

• Semantic labeling
  - Dictionary of regular expressions

• Entity and relation chunking
  - Rule-based system implemented in CASS

Page 33: Literature Mining and Systems Biology

• Semantic labeling
  - Gene and protein names
  - Cue words for entity recognition
  - Cue words for relation extraction

• Named entity chunking
  - A CASS grammar recognizes noun chunks related to gene expression: [nxgene The GAL4 gene]

• Relation chunking
  - Our CASS grammar also extracts relations between entities: [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

Page 34: Literature Mining and Systems Biology

[expression_repression_active Btk regulates the IL-2 gene]

[dephosphorylation_nominal Dephosphorylation of Syk and Btk mediated by SHP-1]

[phosphorylation_nominal phosphorylation of Shc by the hematopoietic cell-specific tyrosine kinase Syk]

[phosphorylation_nominal the phosphorylation of the adapter protein SHC by the Src-related kinase Lyn]

[phosphorylation_active Lyn also participates in [phosphorylation the tyrosine phosphorylation and activation of syk]]

[phosphorylation_active Lyn, [negation but not Jak2] phosphorylated CrkL]

[expression_repression_active IL-10 also decreased [expression mRNA expression of IL-2 and IL18 cytokine receptors]]

[expression_activation_passive [expression IL-13 expression] induced by IL-2 + IL-18]

Page 36: Literature Mining and Systems Biology

Mining text for nuggets

• New relations can be inferred from published ones
  - This can lead to actual discoveries if no person knows all the facts required for making the inference
  - Combining facts from disconnected literatures

• Swanson’s pioneering work
  - Fish oil and Raynaud's disease
  - Magnesium and migraine
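The inference pattern behind this kind of discovery is often described as A–B–C: if A is linked to B in one literature and B to C in another, but A and C are never mentioned together, then A–C is a candidate new relation. The sketch below implements only that bare logic on a list of term pairs; the function name and the intermediate term in the usage note are illustrative assumptions, and real systems add ranking and filtering.

```python
def infer_hidden_links(links):
    """Given undirected (term, term) links, return a dictionary that maps each
    unlinked pair (a, c) to the set of intermediate terms connecting them."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    inferred = {}
    for b, terms in neighbours.items():
        for a in terms:
            for c in terms:
                # a < c avoids reporting each pair twice; skip known links.
                if a < c and c not in neighbours[a]:
                    inferred.setdefault((a, c), set()).add(b)
    return inferred
```

For example, the two links ("fish oil", "blood viscosity") and ("blood viscosity", "Raynaud's disease") produce one candidate pair, Raynaud's disease and fish oil, with "blood viscosity" as the connecting term.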

Page 39: Literature Mining and Systems Biology

Trends

• Most similar to existing data mining approaches
  - Although all the detailed data is in the text, people may have missed the big picture

• Temporal trends
  - Historical summaries
  - Forecasting

• Correlations
  - “Customers who bought this item also bought …”

Page 40: Literature Mining and Systems Biology

Time

Page 41: Literature Mining and Systems Biology

Buzzwords

Page 42: Literature Mining and Systems Biology

Correlations

• “Customers who bought this item also bought …”

• Protein networks
  - “Proteins that regulate expression …”
  - “Proteins that control phosphorylation …”
  - “Proteins that are phosphorylated …”

• Co-author networks

Page 43: Literature Mining and Systems Biology

Transcriptional networks

[Figure residue: labels “Regulates” and “Regulated”; values 3279, 83, and 3592]

P < 9 × 10^-9

Page 44: Literature Mining and Systems Biology

Signaling pathways

[Figure residue: labels “Phosphorylates” and “Phosphorylated”; values 1127, 44, and 3704]

P < 2 × 10^-7

Page 45: Literature Mining and Systems Biology

Integration

• Automatic annotation of high-throughput data
  - Loads of fairly trivial methods

• Protein interaction networks
  - Can unify many types of interactions
  - Powerful as exploratory visualization tools

• More creative strategies
  - Identification of candidate genes for genetic diseases
  - Linking genes to traits based on species distributions

Page 51: Literature Mining and Systems Biology

RCCs

Page 52: Literature Mining and Systems Biology

Disease candidate genes

• Rank the genes within a chromosomal region to which a disease has been mapped

• Methods
  - G2D: Gene → Function → Chemical → Phenotype → Disease; uses MEDLINE but not the text
  - BITOLA: Gene → Words → Disease (similar to ARROWSMITH)
  - Hide and co-workers: Gene → Tissue → Disease

Page 53: Literature Mining and Systems Biology

G2D

Page 55: Literature Mining and Systems Biology

Genotype–phenotype

• Genes can be linked to traits by comparing the species distributions of both
  - Mainly works for prokaryotes
  - Traits are represented by keywords

• Finding the species profiles
  - Gene profiles are found by sequence similarity
  - Keyword profiles are based on co-occurrence with the species name in MEDLINE
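Comparing species distributions amounts to comparing two sets of species: those in which the gene is found and those whose names co-occur with the keyword. The sketch below ranks keywords for one gene by the Jaccard index of the two sets. The slide does not say how profiles are compared, so the Jaccard index is an assumption, and the function names are illustrative.

```python
def profile_similarity(gene_species, keyword_species):
    """Jaccard index of two species sets (an assumed similarity measure)."""
    union = gene_species | keyword_species
    if not union:
        return 0.0
    return len(gene_species & keyword_species) / len(union)

def rank_keywords(gene_species, keyword_profiles):
    """Sort keywords by how well their species profile matches the gene's."""
    return sorted(
        keyword_profiles,
        key=lambda keyword: profile_similarity(gene_species, keyword_profiles[keyword]),
        reverse=True,
    )
```

A keyword whose species set matches the gene's exactly scores 1.0 and is ranked first; keywords that share no species with the gene score 0.0.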

Page 58: Literature Mining and Systems Biology

Annotation

• Many experiments result in groups of related genes
  - ER is used to find the associated abstracts
  - The frequency of each word is counted in the abstracts
  - Background frequencies of all words are pre-calculated
  - A statistical test is used to rank the words

• The same strategy can be applied to find MeSH terms associated with a gene cluster

• Most people prefer using GO annotation instead
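The word-ranking step can be sketched with a one-sided hypergeometric test, which asks how surprising it is to see a word in at least k of the n abstracts for the gene group, given that it occurs in K of the N abstracts overall. The slide does not name the statistical test, so the hypergeometric test is an assumption, as are the function and parameter names; counts here are numbers of abstracts containing the word, with the gene-group abstracts assumed to be part of the background.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n abstracts are drawn from N, K of which contain the word."""
    upper = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 for impossible terms, so no extra bounds are needed.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def rank_words(cluster_counts, background_counts, n_cluster, n_background):
    """Return (p-value, word) pairs, most significantly enriched word first."""
    scored = [
        (hypergeom_pvalue(k, n_cluster, background_counts[word], n_background), word)
        for word, k in cluster_counts.items()
    ]
    return sorted(scored)
```

With many words tested at once, the raw p-values would need a multiple-testing correction before being interpreted as significance levels.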

Page 59: Literature Mining and Systems Biology

Outlook

• Literature mining will not be made obsolete by <insert your favorite new technology here>
  - Repositories are always made too late
  - There will always be new types of relations
  - Semantically tagged XML may replace ER (hopefully!)
  - Semantically tagged XML will never tag everything

• Specific IE problems will become obsolete
  - Protein function
  - Physical protein interactions

Page 60: Literature Mining and Systems Biology

Permission denied

• Open access
  - Literature mining methods cannot retrieve, extract, or correlate information from text unless it is accessible
  - Restricted access is already the primary problem

• Standard formats
  - Getting the text out of a PDF file is not trivial
  - Many journals now store papers in XML format

• Where do I get all the patent text?!

Page 61: Literature Mining and Systems Biology

Innovation

• The basic tools are now in place for IR, ER, and IE
  - Development was driven by computational linguists

• Text- and data-mining
  - Biologists are needed
  - Collaboration with linguists

• Lack of innovation
  - Very few new ideas
  - Text should be combined with other data

Page 62: Literature Mining and Systems Biology

Acknowledgments

• EML Research
  - Jasmin Saric
  - Isabel Rojas

• EMBL Heidelberg
  - Peer Bork
  - Miguel Andrade
  - Michael Kuhn
  - Rossitza Ouzounova
  - Jan Korbel
  - Tobias Doerks

Page 63: Literature Mining and Systems Biology

Exercises

Lars Juhl Jensen

EMBL

Page 64: Literature Mining and Systems Biology

Entity recognition

• iHOP
  - http://www.pdg.cnb.uam.es/UniPub/iHOP/

• Ideas
  - Compare iHOP vs. PubMed for finding papers related to a particular gene
  - Use iHOP to construct a small literature-based network

Page 65: Literature Mining and Systems Biology

Information extraction

• Relation extraction
  - iProLINK (http://pir.georgetown.edu/iprolink/)
  - PreBIND (http://prebind.bind.ca)
  - PubGene (http://www.pubgene.org)

• Ideas
  - Check how complex a sentence iProLINK can handle
  - Check how well PreBIND can discriminate between physical and other interactions (other interactions can be found with PubGene, ProLinks, or STRING)

Page 66: Literature Mining and Systems Biology

Text mining

• ARROWSMITH
  - http://arrowsmith.psych.uic.edu

• Ideas
  - Fish oil and Raynaud's disease
  - Magnesium and migraine
  - Arginine and somatomedin C
  - Estrogen and Alzheimer's disease

Page 67: Literature Mining and Systems Biology

Integration 1

• Protein networks
  - STRING (http://string.embl.de)
  - ProLinks (http://dip.doe-mbi.ucla.edu/pronav/)

• Ideas
  - Use both tools to find functions for proteins of known and unknown function
  - Use STRING to construct a network for a set of proteins
  - Try to reproduce the Ssn3–Msn2–Hsp104 link

Page 68: Literature Mining and Systems Biology

Integration 2

• Finding candidate disease genes
  - G2D (http://www.ogic.ca/projects/g2d_2/)
  - BITOLA (http://www.mf.uni-lj.si/bitola/)

• Ideas
  - Take a look at the G2D results for some diseases where you know which types of genes would be sensible to suggest
  - Compare the results with BITOLA (if you have the patience to figure out their interface!)