Welcome to BCB4003/CS4803 BCB503/CS583 Biological and Biomedical Database Mining

Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL AND BIOMEDICAL DATABASE MINING

Upload: karen-stephenson

Post on 30-Dec-2015


Page 1: Welcome  to BCB4003/CS4803 BCB503/CS583  Biological and Biomedical Database Mining

Prof. Carolina Ruiz

Computer Science Department

Bioinformatics and Computational Biology Program

WPI

WELCOME TO

BCB4003/CS4803

BCB503/CS583

BIOLOGICAL AND BIOMEDICAL DATABASE MINING

Page 2

WHY THIS COURSE?

Biological and Biomedical Research Problems

Central dogma: DNA → (transcription) → RNA → (translation) → Protein

• Genome (1980s-1990s): sequencing, sequence analysis, …

• Transcriptome (mid 1990s-2000s): gene expression, DNA/RNA microarrays

• Proteome (1990s-2000s): protein structure, protein-protein interactions, protein pathways

• Biological function (2000s)

• Applications (2000s): organism-organism interactions, organism-environment interactions, genome-wide association studies, cancer therapies, drug development

Page 3

THIS ALL HAS GENERATED …

• Data
  • Massive datasets and databases of sequence, gene, gene expression, protein, biological function, clinical information, …

• Text
  • Annotations in data sources, abstracts (e.g., Medline), research articles, medical literature (e.g., PubMed, NCBI Bookshelf, Google Scholar), patient records, …

• Ontologies
  • Descriptions of terms and their relationships (e.g., Gene Ontology)

Page 4

CURRENT CHALLENGES

• To make sense of and put to use all this information.

• How? Computational tools and techniques are needed to help humans integrate, summarize, understand, and take advantage of accumulated information
  • Data mining
  • Text mining
  • Data and text mining together

Page 5

WHAT IS DATA [TEXT] MINING? OR MORE GENERALLY, KNOWLEDGE DISCOVERY IN DATABASES (KDD)

“Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [text]” (Fayyad et al., 1996)

• Raw Data [Text] → Data [Text] Mining → Patterns

• Types of patterns
  • Analytical Patterns (rules, decision trees)
  • Statistical Patterns (data distribution)
  • Visual Patterns

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases." AI Magazine, pp. 37-54, Fall 1996.

Page 6

DATA MINING METHODS IN BIOINFORMATICS

• Clustering

• Sequence Mining

• Bayesian Methods

• Expectation Maximization (EM)

• Gibbs Sampling

• Hidden Markov Models

• Kernel methods

• Support Vector Machines
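As an illustration of the first method in the list, clustering, here is a minimal k-means sketch in plain Python. The expression profiles are made-up numbers, the function names are hypothetical, and seeding the centroids with the first k points is a simplification chosen only to keep the example deterministic; it is not presented as the course's own implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of profiles."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def kmeans(points, k, iterations=20):
    """Basic k-means; centroids start at the first k points (an assumption)."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical expression levels of four genes under three conditions.
profiles = [(1.0, 1.2, 0.9), (5.0, 5.1, 4.8), (0.8, 1.0, 1.1), (5.2, 4.9, 5.0)]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

With this toy input, the low-expression genes and the high-expression genes end up in separate clusters. Real analyses would normally use a library implementation with better initialization and more than one restart.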

Page 7

TEXT MINING IN BIOINFORMATICS

• Document indexing

• Information retrieval

• Lexical analysis (Sentence tokenization, Word tokenization, Stemming, Stop word removal)

• Semantic analysis

• Query processing

• Text classification

• Text clustering

• Text summarization

• (Semi-) Automatic curation of literature repositories

• Knowledge discovery from text, hypothesis generation
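The lexical analysis steps named in the list above (sentence tokenization, word tokenization, stop word removal, stemming) can be sketched in a few lines of plain Python. The stop-word list, the regular expressions, and the suffix-stripping "stemmer" below are deliberately crude assumptions for illustration; a real pipeline would use a proper tokenizer and an established stemming algorithm such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to", "by"}


def split_sentences(text):
    """Naive sentence tokenization: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Word tokenization: lowercase runs of letters/digits, hyphens allowed inside."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


def crude_stem(word):
    """Strip a few common English suffixes. This is NOT the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    return [
        [crude_stem(w) for w in tokenize(s) if w not in STOP_WORDS]
        for s in split_sentences(text)
    ]


example = "BRCA1 is expressed in breast tissue. Mutations are linked to cancers."
tokens = preprocess(example)
print(tokens)
```

The resulting token lists are the kind of input that later steps in the list, such as indexing, classification, and clustering, typically start from.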

Page 8

DATA/TEXT MINING PROCESS (KDD)

[Figure: process diagram. Stages and the artifacts passed between them:]

• information sources

• data management (databases, data warehouses) → data

• data “pre”-processing (noisy/missing data, feature selection) → cleaned data

• data analysis / data mining (analytical, statistical, visual) → models

• model/pattern evaluation (quantitative, qualitative) → “good” model

• model/patterns deployment on new data (prediction, decision support)
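To make the stages of this process concrete, the sketch below walks a toy dataset through pre-processing, mining, evaluation, and deployment. Every record, function name, and the midpoint-threshold "model" are hypothetical choices made for illustration under simplifying assumptions; they stand in for whatever data and mining technique a real project would use.

```python
# Hypothetical records: (expression_level, label); None marks a missing value.
raw = [
    (2.1, "healthy"), (None, "healthy"), (7.9, "disease"),
    (8.4, "disease"), (1.8, "healthy"), (7.2, "disease"),
]


def impute_missing(records):
    """Pre-processing: replace missing values with the mean of observed ones."""
    observed = [x for x, _ in records if x is not None]
    fill = sum(observed) / len(observed)
    return [(fill if x is None else x, y) for x, y in records]


def fit_threshold(records):
    """Mining: learn a one-feature rule, the midpoint between the class means."""
    healthy = [x for x, y in records if y == "healthy"]
    disease = [x for x, y in records if y == "disease"]
    return (sum(healthy) / len(healthy) + sum(disease) / len(disease)) / 2


def predict(threshold, x):
    """Deployment: apply the learned rule to one value."""
    return "disease" if x > threshold else "healthy"


def accuracy(threshold, records):
    """Evaluation: fraction of records the rule labels correctly."""
    hits = sum(predict(threshold, x) == y for x, y in records)
    return hits / len(records)


# Hold out the last two records (which have no missing values) for evaluation.
train, test = impute_missing(raw[:4]), raw[4:]
threshold = fit_threshold(train)
test_accuracy = accuracy(threshold, test)
new_prediction = predict(threshold, 6.5)  # a "new data" value
print(threshold, test_accuracy, new_prediction)
```

If the evaluation step were unsatisfactory, the process would loop back to pre-processing or mining with different choices before any model is deployed.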

Page 9

PUTTING IT ALL TOGETHER …

• Data / Text / Information Integration
  • Mining over data and text combined

• Visualization

• Other real-world issues
  • Developing tools and techniques that are efficient, scalable, and user friendly

Page 10

INTERDISCIPLINARY: TECHNIQUES COME FROM MULTIPLE FIELDS

• Biology and Biomedicine
  • Contributes domain knowledge

• Machine Learning (AI)
  • Contributes (semi-)automatic induction of empirical laws from observations and experimentation

• Statistics
  • Contributes language, framework, and techniques

• Pattern Recognition
  • Contributes pattern extraction and pattern matching techniques

• Natural Language Processing (AI) / Computational Linguistics
  • Contributes text analysis techniques

• Databases
  • Contributes efficient data storage, data cleansing, and data access techniques

• Data Visualization
  • Contributes visual data displays and data exploration

• High Performance Computing
  • Contributes techniques to efficiently handle complexity

• Signal Processing

• Image Processing …

Page 11

QUESTIONS?

* Images in this presentation were downloaded from Google images