

Machine Learning and Bioinformatics: An Introduction

Byoung-Tak Zhang

Center for Bioinformation Technology (CBIT) &

Biointelligence Laboratory

School of Computer Science and Engineering

Seoul National University

Email: [email protected]

http://cbit.snu.ac.kr/

http://bi.snu.ac.kr/

(c) 2002 SNU Biointelligence Lab and Center for Bioinformation Technology (CBIT)


Outline

What is machine learning? What techniques are available?

What is bioinformatics? What problems are worth solving?

What machine learning techniques are suitable for which bioinformatics problems?

What does this tutorial course aim at?


Machine Learning


A Definition of Learning

The improvement of behavior on some performance task through acquisition of knowledge based on partial task experience.


What Is the Learning Problem?

Learning = improving with experience at some task

♦ Improve over task T

♦ With respect to performance measure P

♦ Based on experience E

Example: Learn to play checkers

♦ T: Play checkers

♦ P: % of games won in world tournament

♦ E: opportunity to play against self


Teacher and Learner: Learning System

[Figure: learning system with an Environment, a Teacher, and a Learner (Student). Labels: problem x, solution d or reward r, learner output y, feedback.]


Machine Learning: Three Types

Supervised Learning

♦ Estimate an unknown mapping from known input-output pairs

♦ Learn $f_{\mathbf{w}}$ from training set $D = \{(\mathbf{x}, d)\}$ s.t. $f_{\mathbf{w}}(\mathbf{x}) = y = d = f(\mathbf{x})$

♦ Classification: y is discrete

♦ Regression: y is continuous

Unsupervised Learning

♦ Only input values are provided

♦ Learn $f_{\mathbf{w}}$ from $D = \{(\mathbf{x})\}$ s.t. $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{x}$

♦ Compression

♦ Clustering

Reinforcement Learning

♦ Input + reward r are provided sequentially with possible delay

♦ Learn $f_{\mathbf{w}}$ from $D = \{(\mathbf{x}, r(\mathbf{x}, y))\}$ s.t. $f_{\mathbf{w}}(\mathbf{x}) = \max_{y} \{ r(\mathbf{x}, y) + \lambda f_{\mathbf{w}}(\sigma(\mathbf{x}, y)) \}$

♦ Maximize the total reward
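As an illustration of the supervised case only, the following minimal sketch fits a one-dimensional linear model to input-output pairs by least squares. The model form f_w(x) = w0 + w1·x, the function names, and the toy data are assumptions made for this example; the slides do not prescribe a particular model.

```python
def fit_linear(pairs):
    """Least-squares fit of f_w(x) = w0 + w1 * x to a training set D = {(x, d)}."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_d = sum(d for _, d in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxd = sum((x - mean_x) * (d - mean_d) for x, d in pairs)
    w1 = sxd / sxx               # slope
    w0 = mean_d - w1 * mean_x    # intercept
    return w0, w1


def predict(w, x):
    """Evaluate the learned mapping y = f_w(x)."""
    w0, w1 = w
    return w0 + w1 * x


# Hypothetical training set generated from d = 2x + 1 (regression: y is continuous).
D = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w = fit_linear(D)
print(w, predict(w, 4.0))
```

Thresholding the continuous output (for example, y > 5) would turn the same learned mapping into a discrete classifier.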


Why Machine Learning?

Recent progress in algorithms and theory

Growing flood of online data

Computational power is available

Budding industry

Three niches for machine learning

Data mining: using historical data to improve decisions

♦ Medical records -> medical knowledge

Software applications we can’t program by hand

♦ Autonomous driving

♦ Speech recognition

Self-customizing programs

♦ Newsreader that learns user interests


Brief History of Machine Learning

1950’s: Samuel’s checker player

1960’s: Neural networks, perceptron; pattern recognition; learning in the limit theory; Minsky & Papert

1970’s: Symbolic concept induction; Winston’s arch learner; knowledge acquisition bottleneck; Quinlan’s ID3; Michalski’s AQ and soybean diagnosis results; scientific discovery with BACON; mathematical discovery with AM

1980’s: Continued progress on decision-tree and rule learning; explanation-based learning; speedup learning; utility problem; analogy; resurgence of connectionism (PDP, ANN); Valiant’s PAC learning; experimental evaluation

1990’s: Data mining; adaptive software agents & IR; reinforcement learning; theory refinement; inductive logic programming; voting, bagging, boosting, and stacking; learning Bayesian networks


Possible Uses of Machine Learning

configuration and design

planning and scheduling

language understanding

vision and speech

execution and control

diagnostic reasoning


Metaphors and Methods for Machine Learning

Metaphors: Neurobiology, Biological Evolution, Heuristic Search, Statistical Inference, Memory and Retrieval

Methods: Connectionist Learning, Genetic Learning, Tree / Rule Induction, Case-Based Learning, Probabilistic Induction


Methods in Machine Learning (1/2)

Symbolic Learning

♦ Version Space Learning

♦ Case-Based Learning

Neural Learning

♦ Multilayer Perceptrons (MLPs)

♦ Self-Organizing Maps (SOMs)

♦ Support Vector Machines (SVMs)

Evolutionary Learning

♦ Evolution Strategies

♦ Evolutionary Programming

♦ Genetic Algorithms

♦ Genetic Programming


Methods in Machine Learning (2/2)

Probabilistic Learning

♦ Bayesian Networks (BNs)

♦ Helmholtz Machines (HMs)

♦ Latent Variable Models (LVMs)

♦ Generative Topographic Mapping (GTM)

Other Machine Learning Methods

♦ Decision Trees (DTs)

♦ Reinforcement Learning (RL)

♦ Boosting Algorithms

♦ Mixture of Experts (ME)

♦ Independent Component Analysis (ICA)


Example Applications Where ML Has Been Used (1/2)

Banking & Investment

♦ Credit card fraud

♦ Delinquent accounts

♦ Authorization of purchases

♦ Predict stock market

Health Care

♦ Disease diagnosis

♦ Managing resources

♦ Look for causal relationships between environment and disease

Marketing

♦ Credit card applications

♦ Use past buying habits to predict likelihood of customer purchasing some new product

Textual data mining


Example Applications Where ML Has Been Used (2/2)

Manufacturing - Process control

Bioinformatics

Astronomy

Chemistry

Speech recognition

Machine learning methods applied to signal and image processing

Human Resources - Evaluating job performance

Insurance

Bioinformatics


What Is DNA?

AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCCGTTGCTTCG

GCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGCCGGAGACCCCAACAC

GAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCTCTTGGTTCCGG

CATGCAATCAGTCCCGTTGCTTCGGCACTGTCTGAAAGCGCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCCG

TTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCG

CCGGGGCTATTGTACCCGTTGCTTCGGATCTCTTGGGGATCTCTTGGTTCCGGCATGCAATCAGTCCCGTTGCTTCGGC

ACTGTCTGAAAGCGCCTTTGGGCCCAACCTCCCACCGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGG

CGGCCGCCGGGGGCACTGTCTGAAAGCTCGGCCGCC


Molecular Biology: Central Dogma


Molecular Biology: Flow of Information

DNA -> RNA -> Protein -> Function

DNA: ACTGG AAGCTTATC

Protein: Phe Cys Lys Cys Asp Cys Arg Ser Ala Leu
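The flow from DNA to RNA to protein can be sketched in a few lines. This is a minimal illustration, assuming the input is a coding-strand DNA string that starts at a start codon; the codon table below is deliberately partial (only the codons used in the hypothetical example sequence, taken from the standard genetic code), and the function names are invented for this sketch.

```python
# Partial codon table from the standard genetic code.
# Only the codons needed for the example below are listed (an assumption of this sketch).
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "UGU": "Cys",
    "AAA": "Lys", "GAU": "Asp", "UAA": "Stop",
}


def transcribe(dna):
    """Coding-strand DNA -> mRNA: every T becomes U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA codon by codon until a stop codon is reached."""
    residues = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        residues.append(amino_acid)
    return residues


dna = "ATGTTTTGTAAAGATTAA"   # hypothetical example sequence
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)
```

A full implementation would need all 64 codons; a codon missing from this partial table raises a `KeyError` rather than being silently skipped.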


Human Genome Project

Genome Health Implications

• A New Disease Encyclopedia

• New Genetic Fingerprints

• New Diagnostics

• New Treatments

Goals

• Identify the approximately 40,000 genes in human DNA

• Determine the sequences of the 3 billion bases that make up human DNA

• Store this information in databases

• Develop tools for data analysis

• Address the ethical, legal and social issues that arise from genome research


Bioinformatics

What is bioinformatics?

Bioinformatics is a new term referring to the discipline that employs computers to store, retrieve, analyze and assist in understanding biological information.

The application of information technology and computer science to the study of biological systems.

The analysis of the massive (and constantly increasing) amount of genetic information.

Sophisticated computer technologies to enable discovery in all fields of life sciences.


Topics in Bioinformatics

Structure analysis

♦ Protein structure comparison

♦ Protein structure prediction

♦ RNA structure modeling

Pathway analysis

♦ Metabolic pathway

♦ Regulatory networks

Sequence analysis

♦ Sequence alignment

♦ Structure and function prediction

♦ Gene finding

Expression analysis

♦ Gene expression analysis

♦ Gene clustering


Areas and Workflow of Bioinformatics

[Figure: Structural Genomics, Functional Genomics, Proteomics, and Pharmacogenomics, shown with DNA sequence data and Microarray (Biochip) data on top of the Infrastructure of Bioinformatics.]


Bioinformatics as Information Technology

[Figure: Bioinformatics at the center, surrounded by related information technology areas.]

Information Retrieval: Biomedical text analysis

Database: GenBank, SWISS-PROT

Hardware: Supercomputing

Agent: Information filtering, Monitoring agent

Machine Learning: Clustering, Rule discovery, Pattern recognition

Algorithm: Sequence alignment


Why Bioinformatics?

Bioinformatics Market Size

Market segment                                            Size (1998)     Size (2002)
E-based business-to-business market                       $800 million    $100.0 billion
Business-to-business biomedical information market        $300 million    $1.0 billion
Pharmacogenomics data gathering and analysis alliances    $1.0 billion    $3.5 billion
Biochip-based data gathering and analysis alliances       $500 million    $4.0 billion

Sources: Cognia (www.cognia.com); Biovista (www.biovista.com)


Bioinformatics as Business

Software offering

♦ Data visualization

♦ Data management

♦ Gene and protein analysis

♦ Data filtering and transformation

♦ Clustering and classification

♦ Tools supporting laboratory experiment

Data offering

♦ DNA sequence data

♦ Gene expression data

♦ Protein data

♦ Medical genetics data

♦ Biological text data

Business structure offering

♦ Networking and service solution

♦ Supercomputer

♦ High performance storage system


Current Trend

Collaboration for Research and Development: University, Research Institute, Companies, Bioinformatics Center

Integration of multiple data sources

Description of causal relationships

Simulation of biological processes

Prediction of anomaly

Generation of hypotheses

Literature summary for automatic data collection


Major Companies and Their Areas of Interest

[Table: companies in bioinformatics against areas of interest; the individual cell marks are not recoverable.]

Companies: Molecular Mining, Lion Bioscience, Celera Paracel Inc., Informax, Silicon Genetics, Rosetta Inpharmatics, eBioinformatics, DoubleTwist, Compugen

Areas: Web-portal, Pathway Analysis, SNP Analysis, Protein Analysis, EST-Clustering, Microarray Analysis


IT Companies Offering Bioinformatics Products

Companies and Description

Silicon Graphics: SGI offers visual computing and high-performance computer systems. SGI systems support a wide variety of bioinformatics software applications.

IBM: IBM is conducting research into high value-added data mining and protein structure determination methods. IBM offers a variety of enterprise-wide IT solutions for the life science market, and recently initiated a collaboration with NetGenics.

Agilent Technologies: In 1999, Agilent entered into a strategic collaboration with Rosetta Inpharmatics to make and sell gene expression analysis systems, including hardware and software.

Compaq: Compaq has a major strategic alliance with Celera to provide integrated bioinformatics hardware, software, networking and service solutions.

Sun Microsystems: Sun systems support a wide variety of bioinformatics software applications.


Bioinformatics Tools

Problems and Tools or Databases

Sequence Alignment: BLAST, FASTA

Multiple Sequence Alignment: Clustal W, Macaw

Pattern Finding: GRAIL, FGENEH, tRNAscan-SE, NNPP, eMOTIF, PROSITE, ChloroP

Structure Prediction: Bend.it, RNA Draw, NNPREDICT, SWISS-MODEL

DNA Microarray: GeneX, GOE, MAT, GeNet

Hardware for Proteomics: 2D Gel, MALDI-TOF


Machine Learning in Bioinformatics


General Scheme for Bio Data Mining

Data preprocessing:

- Normalization

- Discretization

- Gene selection

Learning:

- Greedy search

- EM algorithm

- Classification

- Clustering

- Dependency Analysis
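The three preprocessing steps named above can be sketched on a toy expression table. Everything specific here is an assumption for illustration: genes are ranked by variance (one of many possible selection criteria), normalization is a per-gene z-score, and discretization uses a symmetric threshold of 0.5; the gene names and values are invented.

```python
from statistics import mean, pstdev, pvariance


def normalize(values):
    """Z-score normalization of one gene's expression profile."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s if s > 0 else 0.0 for v in values]


def discretize(values, threshold=0.5):
    """Map normalized values to -1 (low), 0 (baseline), or 1 (high)."""
    return [1 if v > threshold else -1 if v < -threshold else 0 for v in values]


def select_genes(expression, k):
    """Keep the k genes with the largest variance across samples."""
    ranked = sorted(expression, key=lambda g: pvariance(expression[g]), reverse=True)
    return ranked[:k]


# Hypothetical expression levels: gene -> values over four samples.
expression = {
    "geneA": [1.0, 5.0, 1.0, 5.0],
    "geneB": [3.0, 3.1, 2.9, 3.0],
    "geneC": [2.0, 4.0, 6.0, 8.0],
}

# Selection runs on raw values, because z-scoring gives every gene unit variance.
selected = select_genes(expression, 2)
processed = {g: discretize(normalize(expression[g])) for g in selected}
print(selected, processed)
```

The discretized profiles are the kind of input that a downstream learning step (classification, clustering, or dependency analysis) would consume.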


Machine Learning Applied to Bioinformatics

Sequence Alignment

♦ Simulated Annealing

♦ Genetic Algorithms

Structure and Function Prediction

♦ Hidden Markov Models

♦ Multilayer Perceptrons

♦ Decision Trees

Molecular Clustering and Classification

♦ Support Vector Machines

♦ Nearest Neighbor Algorithms

Expression (DNA Chip Data) Analysis

♦ Self-Organizing Maps

♦ Bayesian Networks


Example: Gene Finding

[Figure: a gene region with Upstream, Open Reading Frame, and Downstream segments, and the corresponding mRNA.]
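As a baseline for what gene finding looks for, the sketch below scans one strand of a DNA string for open reading frames, here taken to mean an ATG start codon followed in the same frame by a stop codon. This is a simplified, rule-based illustration and not a learning method; the function name, the `min_codons` parameter, and the example sequence are assumptions of this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) positions of ORFs on the given strand.

    An ORF runs from an ATG to the first in-frame stop codon.
    `end` is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # look for the next ORF in this frame
    return orfs


seq = "CCATGAAATTTTAGGG"   # hypothetical example sequence
orfs = find_orfs(seq)
print(orfs, [seq[s:e] for s, e in orfs])
```

A real gene finder would also scan the reverse complement and, for eukaryotic genes, has to deal with introns, which is where learned models become useful.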


General Learning Scheme

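Training set: AATGCGTACCTCATACGACCACAACGAATGAATATGATGT………

Test set: TCGACTACGAGCCTCATCGACGAACGAATGAATATGATGT………

[Figure: the training set is the input to Learning (Model Construction), whose output is the Prediction Method; the test set is the input to the Prediction Method, which produces the Output.]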


Neural Networks in GRAIL

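Training

Known Sequence: CATATTCAAGAATTGAAGCGTGTAGTCCTGACTTGAGAGCTGTAGATGACGTGCTTATATGTTC………………………..

Preprocessing: the known sequence is converted into feature vectors x1, x2, x3, …, xn with target values t1, t2, t3, …, tn (each 1 or 0).

Features: Coding potential value, GC Composition, Length, Donor, Intron vocabulary

Example feature vectors:

0.7 0.8 0.1 0.3 … 0.9 0.2

0.4 0.2 0.6 0.1 … 0.4 0.5

0.2 0.9 0.3 0.1 … 0.8 0.3

0.6 0.3 0.2 0.8 … 0.2 0.4

Network: Input Layer, Hidden Layer, Output Layer, connected by Weights; the output o indicates Exon.

$E_d(\vec{w}) \equiv \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$

$w_i \leftarrow w_i + \Delta w_i, \quad \Delta w_i = -\eta \frac{\partial E}{\partial w_i}$

Testing

Unknown Sequence: ATGACGTACGATCCCGTGACGGTGACGTGAGCTGACGTGCCGTCGTAGTAATTTAGCGTGA………………………..

x = 0.6 0.3 0.2 0.8 … 0.2 0.4, f(x) = ?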
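The weight update on this slide can be illustrated with a deliberately reduced model. The sketch below trains a single sigmoid unit by gradient descent on the squared error E = 1/2 (t - o)^2; it is not GRAIL's actual network, which has a hidden layer and its own sequence-derived features. The two-feature examples, the learning rate `eta`, and the epoch count are assumptions chosen so the toy problem is easy to separate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output(w, x):
    """Unit output o for input x; w[0] is the bias weight."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def train_unit(examples, eta=0.5, epochs=2000):
    """Per-example gradient descent: w_i <- w_i + delta_w_i, delta_w_i = -eta * dE/dw_i."""
    w = [0.0] * (len(examples[0][0]) + 1)
    for _ in range(epochs):
        for x, t in examples:
            o = output(w, x)
            # For E = 1/2 (t - o)^2 and a sigmoid unit:
            # dE/dw_i = -(t - o) * o * (1 - o) * x_i
            delta = (t - o) * o * (1.0 - o)
            w[0] += eta * delta                 # bias input is fixed at 1
            for i, xi in enumerate(x):
                w[i + 1] += eta * delta * xi
    return w


# Hypothetical (feature vector, target) pairs: target 1 = exon, 0 = non-exon.
examples = [
    ((0.9, 0.8), 1), ((0.8, 0.9), 1),
    ((0.1, 0.2), 0), ((0.2, 0.1), 0),
]
w = train_unit(examples)
print(output(w, (0.85, 0.9)), output(w, (0.15, 0.1)))
```

Extending this to a multilayer network means propagating the same error signal back through the hidden layer, which is the backpropagation algorithm.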


Tutorial Overview

Method: Application

Neural Networks: Gene Finding

Decision Trees: Biological Text Mining

Hidden Markov Models: Protein Structure Prediction

Clustering Algorithms: Gene Expression Analysis

Self-Organizing Maps: Gene Expression Analysis

Bayesian Networks: Gene-Drug Dependency

Evolutionary Computation: Molecular Diagnosis


More Information

SNU CSE Biointelligence Laboratory

♦ http://bi.snu.ac.kr/

=> Courses

=> Tutorials

Center for Bioinformation Technology (CBIT)

♦ http://cbit.snu.ac.kr/

=> Seminars

=> Journal Club