Machine Learning and Bioinformatics: An Introduction
Byoung-Tak Zhang
Center for Bioinformation Technology (CBIT) &
Biointelligence Laboratory
School of Computer Science and Engineering
Seoul National University
Email: [email protected]
http://cbit.snu.ac.kr/
http://bi.snu.ac.kr/
(c) 2002 SNU Biointelligence Lab and Center for Bioinformation Technology (CBIT)
Outline
What is machine learning? What techniques are available?
What is bioinformatics? What problems are worth solving?
What machine learning techniques are suitable for which bioinformatics problems?
What does this tutorial course aim at?
Machine Learning
A Definition of Learning
The improvement of behavior on some performance task through acquisition of knowledge based on partial task experience.
What Is the Learning Problem?
Learning = improving with experience at some task
♦ Improve over task T
♦ With respect to performance measure P
♦ Based on experience E
Example: Learn to play checkers
♦ T: Play checkers
♦ P: % of games won in world tournament
♦ E: opportunity to play against self
Teacher and Learner: Learning System
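Environment: poses Problem x
Teacher: returns Solution d or Reward r
Learner (Student): maps x to output y
Feedback: comparison of the teacher's signal with y, returned to the learner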
Machine Learning: Three Types
Supervised Learning
♦ Estimate an unknown mapping from known input-output pairs
♦ Learn f_w from training set D = {(x, d)} s.t. $f_{\mathbf{w}}(\mathbf{x}) = y = d = f(\mathbf{x})$
♦ Classification: y is discrete
♦ Regression: y is continuous
Unsupervised Learning
♦ Only input values are provided
♦ Learn f_w from D = {(x)} s.t. $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{x}$
♦ Compression
♦ Clustering
Reinforcement Learning
♦ Input + reward r are provided sequentially with possible delay
♦ Learn f_w from D = {(x, r(x, y))} s.t. $f_{\mathbf{w}}(\mathbf{x}) = \max_{y} \{ r(\mathbf{x}, y) + \lambda \sigma(f_{\mathbf{w}}(\mathbf{x}, y)) \}$
♦ Maximize the total reward
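The three settings differ mainly in what the training data D contains. The sketch below is only an illustration under simple assumptions, not a method from the slides: a least-squares line stands in for supervised learning, 1-D two-means clustering for unsupervised learning, and an average-reward estimate per action for the reinforcement setting. All function names and data are hypothetical.

```python
def fit_line(pairs):
    """Supervised: fit f_w(x) = w0 + w1 * x to pairs (x, d) by least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_d = sum(d for _, d in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxd = sum((x - mean_x) * (d - mean_d) for x, d in pairs)
    w1 = sxd / sxx
    w0 = mean_d - w1 * mean_x
    return w0, w1


def two_means(xs, iterations=10):
    """Unsupervised: cluster 1-D inputs into two groups (k-means with k = 2)."""
    c0, c1 = min(xs), max(xs)
    for _ in range(iterations):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        if g0:
            c0 = sum(g0) / len(g0)
        if g1:
            c1 = sum(g1) / len(g1)
    return c0, c1


def best_action(experience):
    """Reinforcement (simplified): pick the action with the highest mean reward
    seen so far in a list of (action, reward) observations."""
    totals, counts = {}, {}
    for action, reward in experience:
        totals[action] = totals.get(action, 0.0) + reward
        counts[action] = counts.get(action, 0) + 1
    return max(totals, key=lambda a: totals[a] / counts[a])


if __name__ == "__main__":
    print(fit_line([(0, 1), (1, 3), (2, 5)]))
    print(two_means([1.0, 1.2, 0.8, 5.0, 5.2, 4.8]))
    print(best_action([("a", 1.0), ("b", 0.0), ("a", 0.0), ("b", 2.0)]))
```

The reinforcement part ignores states and delayed reward, so it is closer to a bandit problem than to the sequential formulation on the slide.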
Why Machine Learning?
Recent progress in algorithms and theory
Growing flood of online data
Computational power is available
Budding industry
Three niches for machine learning
Data mining: using historical data to improve decisions
♦ Medical records -> medical knowledge
Software applications we can’t program by hand
♦ Autonomous driving
♦ Speech recognition
Self-customizing programs
♦ Newsreader that learns user interests
Brief History of Machine Learning
1950’s: Samuel’s checker player
1960’s: Neural networks, perceptron; pattern recognition; learning in the limit theory; Minsky & Papert
1970’s: Symbolic concept induction; Winston’s arch learner; knowledge acquisition bottleneck; Quinlan’s ID3; Michalski’s AQ and soybean diagnosis results; scientific discovery with BACON; mathematical discovery with AM
1980’s: Continued progress on decision-tree and rule learning; explanation-based learning; speedup learning; utility problem; analogy; resurgence of connectionism (PDP, ANN); Valiant’s PAC learning; experimental evaluation
1990’s: Data mining; adaptive software agents & IR; reinforcement learning; theory refinement; inductive logic programming; voting, bagging, boosting, and stacking; learning Bayesian networks
Possible Uses of Machine Learning
♦ Configuration and design
♦ Planning and scheduling
♦ Language understanding
♦ Vision and speech
♦ Execution and control
♦ Diagnostic reasoning
Metaphors and Methods for Machine Learning
Metaphor → Method
Neurobiology → Connectionist Learning
Biological Evolution → Genetic Learning
Heuristic Search → Tree / Rule Induction
Memory and Retrieval → Case-Based Learning
Statistical Inference → Probabilistic Induction
Methods in Machine Learning (1/2)
Symbolic Learning
♦ Version Space Learning
♦ Case-Based Learning
Neural Learning
♦ Multilayer Perceptrons (MLPs)
♦ Self-Organizing Maps (SOMs)
♦ Support Vector Machines (SVMs)
Evolutionary Learning
♦ Evolution Strategies
♦ Evolutionary Programming
♦ Genetic Algorithms
♦ Genetic Programming
Methods in Machine Learning (2/2)
Probabilistic Learning
♦ Bayesian Networks (BNs)
♦ Helmholtz Machines (HMs)
♦ Latent Variable Models (LVMs)
♦ Generative Topographic Mapping (GTM)
Other Machine Learning Methods
♦ Decision Trees (DTs)
♦ Reinforcement Learning (RL)
♦ Boosting Algorithms
♦ Mixture of Experts (ME)
♦ Independent Component Analysis (ICA)
Example Applications Where ML Has Been Used (1/2)
Banking & Investment
♦ Credit card fraud
♦ Delinquent accounts
♦ Authorization of purchases
♦ Predict stock market
Health Care
♦ Disease diagnosis
♦ Managing resources
♦ Look for causal relationships between environment and disease
Marketing
♦ Credit card applications
♦ Use past buying habits to predict likelihood of customer purchasing some new product
Textual data mining
Example Applications Where ML Has Been Used (2/2)
Manufacturing - Process control
Bioinformatics
Astronomy
Chemistry
Speech recognition
Machine learning methods applied to signal and image processing
Human Resources - Evaluating job performance
Insurance
Bioinformatics
What Is DNA?
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCCGTTGCTTCG
GCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGCCGGAGACCCCAACAC
GAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCTCTTGGTTCCGG
CATGCAATCAGTCCCGTTGCTTCGGCACTGTCTGAAAGCGCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCCG
TTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCG
CCGGGGCTATTGTACCCGTTGCTTCGGATCTCTTGGGGATCTCTTGGTTCCGGCATGCAATCAGTCCCGTTGCTTCGGC
ACTGTCTGAAAGCGCCTTTGGGCCCAACCTCCCACCGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGG
CGGCCGCCGGGGGCACTGTCTGAAAGCTCGGCCGCC
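For computational purposes, a sequence like the one above is simply a string over the alphabet A, C, G, T. The following sketch shows a few elementary operations on such a string; the function names are illustrative, and the input is assumed to contain only those four letters.

```python
from collections import Counter

# Watson-Crick base pairing used to build the complementary strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def base_counts(seq):
    """Count how often each base occurs in the sequence."""
    return Counter(seq.upper())


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq):
    """Sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    fragment = "AACCTGCGGA"  # first ten bases of the sequence shown above
    print(base_counts(fragment))
    print(gc_content(fragment))
    print(reverse_complement(fragment))
```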
Molecular Biology: Central Dogma
Molecular Biology: Flow of Information
DNA → RNA → Protein → Function
DNA: ACTGG AAGCTTATC
Protein: Phe Cys Lys Cys Asp Cys Arg Ser Ala Leu
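The DNA → RNA → Protein flow can be mimicked on strings. The sketch below assumes the standard genetic code and reads the coding strand directly; the example DNA is hypothetical, chosen so that its first codons encode Phe-Cys-Lys-Cys-Asp-Cys, and the input is assumed to contain only A, C, G, T.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# "*" marks a stop codon; amino acids use one-letter codes.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate coding-strand DNA codon by codon until a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    dna = "TTTTGTAAATGTGATTGT"  # hypothetical example sequence
    print(transcribe(dna))
    print(translate(dna))
```

Real transcription works from the template strand and real genes contain introns, so this is only a string-level approximation of the flow.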
Human Genome Project
Genome Health Implications
• A New Disease Encyclopedia
• New Genetic Fingerprints
• New Diagnostics
• New Treatments
Goals
• Identify the approximately 40,000 genes in human DNA
• Determine the sequences of the 3 billion bases that make up human DNA
• Store this information in databases
• Develop tools for data analysis
• Address the ethical, legal and social issues that arise from genome research
Bioinformatics
What is Bioinformatics?
Bioinformatics is a new term referring to the discipline that employs computers to store, retrieve, analyze and assist in understanding biological information.
♦ The application of information technology and computer science to the study of biological systems
♦ The analysis of the massive (and constantly increasing) amount of genetic information
♦ Sophisticated computer technologies to enable discovery in all fields of life sciences
Topics in Bioinformatics
Structure analysis
♦ Protein structure comparison
♦ Protein structure prediction
♦ RNA structure modeling
Pathway analysis
♦ Metabolic pathway
♦ Regulatory networks
Sequence analysis
♦ Sequence alignment
♦ Structure and function prediction
♦ Gene finding
Expression analysis
♦ Gene expression analysis
♦ Gene clustering
Areas and Workflow of Bioinformatics
Areas: Structural Genomics, Functional Genomics, Proteomics, Pharmacogenomics
Sequence data:
AGCTAGTTCAGTACA
TGGATCCATAAGGTA
CTCAGTCATTACTGC
AGGTCACTTACGATA
TCAGTCGATCACTAG
CTGACTTACGAGAGT
Microarray (Biochip)
Infrastructure of Bioinformatics
Bioinformatics as Information Technology
Bioinformatics draws on the following areas of information technology:
♦ Database: GenBank, SWISS-PROT
♦ Information Retrieval: Biomedical text analysis
♦ Algorithm: Sequence alignment
♦ Machine Learning: Clustering, Rule discovery, Pattern recognition
♦ Agent: Information filtering, Monitoring agent
♦ Hardware: Supercomputing
Why Bioinformatics?
Bioinformatics Market Size
Market segment                                            Size (1998)     Size (2002)
E-based business-to-business market                       $800 million    $100.0 billion
Business-to-business biomedical information market        $300 million    $1.0 billion
Pharmacogenomics data gathering and analysis alliances    $1.0 billion    $3.5 billion
Biochip-based data gathering and analysis alliances       $500 million    $4.0 billion
Sources: Cognia (www.cognia.com); Biovista (www.biovista.com)
Bioinformatics as Business
Software offering
♦ Data visualization
♦ Data management
♦ Gene and protein analysis
♦ Data filtering and transformation
♦ Clustering and classification
♦ Tools supporting laboratory experiment
Data offering
♦ DNA sequence data
♦ Gene expression data
♦ Protein data
♦ Medical genetics data
♦ Biological text data
Business structure offering
♦ Networking and service solution
♦ Supercomputer
♦ High performance storage system
Current Trend
Collaboration for Research and Development
♦ University
♦ Research Institute
♦ Companies
♦ Bioinformatics Center
Integration of multiple data sources
Description of causal relationships
Simulation of biological processes
Prediction of anomaly
Generation of hypotheses
Literature summary for automatic data collection
Major Companies and Their Areas of Interest
Areas: Web-portal, Pathway Analysis, SNP Analysis, Protein Analysis, EST-Clustering, Microarray Analysis
Companies in Bioinformatics: Molecular Mining, Lion Bioscience, Celera Paracel Inc., Informax, Silicon Genetics, Rosetta Inpharmatics, eBioinformatics, DoubleTwist, Compugen
[Table: each company is marked (o) against its areas of interest; the marks cannot be matched to cells in this transcript.]
IT Companies Offering Bioinformatics Products
Companies / Description
Silicon Graphics: SGI offers visual computing and high-performance computer systems. SGI systems support a wide variety of bioinformatics software applications.
IBM: IBM is conducting research into high value-added data mining and protein structure determination methods. IBM offers a variety of enterprise-wide IT solutions for the life science market, and recently initiated a collaboration with NetGenics.
Agilent Technologies: In 1999, Agilent entered into a strategic collaboration with Rosetta Inpharmatics to make and sell gene expression analysis systems, including hardware and software.
Compaq: Compaq has a major strategic alliance with Celera to provide integrated bioinformatics hardware, software, networking and service solutions.
Sun Microsystems: Sun systems support a wide variety of bioinformatics software applications.
Bioinformatics Tools
Problems / Tools or Databases
Sequence Alignment: BLAST, FASTA
Multiple Sequence Alignment: Clustal W, Macaw
Pattern Finding: GRAIL, FGENEH, tRNAscan-SE, NNPP, eMOTIF, PROSITE, ChloroP
Structure Prediction: Bend.it, RNA Draw, NNPREDICT, SWISS-MODEL
DNA Microarray: GeneX, GOE, MAT, GeNet
Hardware for Proteomics: 2D Gel, MALDI-TOF
Machine Learning in Bioinformatics
General Scheme for Bio Data Mining
Data preprocessing:
- Normalization
- Discretization
- Gene selection
Learning:
- Greedy search
- EM algorithm
…
…
- Classification
- Clustering
- Dependency Analysis
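The three preprocessing steps named above can be sketched in a few lines. This is a minimal illustration under assumed choices, since the slide does not specify them: z-score normalization, thresholding at zero for discretization, and variance ranking for gene selection. The gene names and values are hypothetical.

```python
from statistics import mean, pstdev, pvariance


def normalize(values):
    """Z-score normalization: zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def discretize(values, threshold=0.0):
    """Binary discretization: 1 if above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def select_genes(expression, k):
    """Keep the k genes whose expression varies most across samples.

    `expression` maps a gene name to its list of expression values.
    """
    ranked = sorted(expression,
                    key=lambda gene: pvariance(expression[gene]),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    expression = {"g1": [1, 1, 1], "g2": [0, 5, 10], "g3": [1, 2, 3]}
    kept = select_genes(expression, k=2)
    for gene in kept:
        print(gene, discretize(normalize(expression[gene])))
```

The preprocessed values would then be passed to a learning step such as clustering or classification.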
Machine Learning Applied to Bioinformatics
Sequence Alignment
♦ Simulated Annealing
♦ Genetic Algorithms
Structure and Function Prediction
♦ Hidden Markov Models
♦ Multilayer Perceptrons
♦ Decision Trees
Molecular Clustering and Classification
♦ Support Vector Machines
♦ Nearest Neighbor Algorithms
Expression (DNA Chip Data) Analysis
♦ Self-Organizing Maps
♦ Bayesian Networks
Example: Gene Finding
Upstream | Open Reading Frame | Downstream
mRNA
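The simplest computational view of gene finding is a scan for open reading frames. The sketch below is a rule-based baseline rather than a learned model: it assumes an ORF starts at ATG and ends at the first in-frame stop codon, and it searches only the forward strand. The function name and the `min_codons` parameter are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ORFs on the forward strand.

    `start` is the index of the A in ATG; `end` is one past the stop codon,
    so dna[start:end] is the ORF including its stop codon. `min_codons`
    is the minimum number of codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAATTTTAGGG"  # hypothetical example sequence
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])
```

Learned gene finders go further by using statistical signals, such as coding potential, to distinguish real genes from chance ORFs.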
General Learning Scheme
Training set: AATGCGTACCTCATACGACCACAACGAATGAATATGATGT………
Test set: TCGACTACGAGCCTCATCGACGAACGAATGAATATGATGT………
Learning (Model Construction): training set (input) → Prediction Method (output)
Prediction: test set (input) → Prediction Method → Output
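The train-then-test scheme can be made concrete with a small sequence classifier. The sketch below assumes a nearest-centroid model over dinucleotide (2-mer) frequencies, which is one possible prediction method and not one named on the slide. The labels and sequences are hypothetical toy data.

```python
from collections import Counter


def kmer_profile(seq, k=2):
    """Relative frequency of each k-mer in the sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def distance(p, q):
    """Squared Euclidean distance between two sparse profiles."""
    return sum((p.get(key, 0.0) - q.get(key, 0.0)) ** 2
               for key in set(p) | set(q))


def train(training_set, k=2):
    """Learning (model construction): one mean profile per class label."""
    sums, counts = {}, Counter()
    for seq, label in training_set:
        acc = sums.setdefault(label, Counter())
        for kmer, freq in kmer_profile(seq, k).items():
            acc[kmer] += freq
        counts[label] += 1
    return {label: {kmer: v / counts[label] for kmer, v in acc.items()}
            for label, acc in sums.items()}


def predict(model, seq, k=2):
    """Prediction: label of the closest class profile."""
    profile = kmer_profile(seq, k)
    return min(model, key=lambda label: distance(model[label], profile))


def accuracy(model, test_set, k=2):
    """Fraction of held-out sequences that are labeled correctly."""
    hits = sum(predict(model, seq, k) == label for seq, label in test_set)
    return hits / len(test_set)


if __name__ == "__main__":
    training_set = [("GCGCGGCC", "gc"), ("CCGGCGCG", "gc"),
                    ("ATATTAAT", "at"), ("TTAATATA", "at")]
    test_set = [("GGCCGCGC", "gc"), ("AATTATAT", "at")]
    model = train(training_set)
    print(accuracy(model, test_set))
```

Keeping the test set out of `train` is the point of the scheme: accuracy is measured only on sequences the model has not seen.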
Neural Networks in GRAIL
Known Sequence: CATATTCAAGAATTGAAGCGTGTAGTCCTGACTTGAGAGCTGTAGATGACGTGCTTATATGTTC………………………..
Preprocessing: known sequences are converted into feature vectors x1, x2, x3, …, xn with targets t1, t2, t3, …, tn
Feature vectors:
0.7 0.8 0.1 0.3 … 0.9 0.2
0.4 0.2 0.6 0.1 … 0.4 0.5
…
0.2 0.9 0.3 0.1 … 0.8 0.3
0.6 0.3 0.2 0.8 … 0.2 0.4
Targets: 1, 0, 0, 1
Input features: Coding potential value, GC Composition, Length, Donor, Intron vocabulary
Network: Input Layer → Hidden Layer → Output Layer (Exon), connected by Weights
Training:
$E_d(\vec{w}) \equiv \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$
$w_i \leftarrow w_i + \Delta w_i, \quad \Delta w_i = -\eta \frac{\partial E}{\partial w_i}$
Testing:
Unknown Sequence: ATGACGTACGATCCCGTGACGGTGACGTGAGCTGACGTGCCGTCGTAGTAATTTAGCGTGA………………………..
x = 0.6 0.3 0.2 0.8 … 0.2 0.4 → f(x) = ?
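The training rule can be illustrated on the smallest possible network. The sketch below applies gradient descent on the squared error ½(t − o)² to a single sigmoid unit. It is not GRAIL's actual network, which has a hidden layer, and the two-feature examples, learning rate, and epoch count are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output(weights, x):
    """Unit output o; weights[0] is the bias, weights[1:] match the inputs."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


def error(weights, examples):
    """Sum over examples of the squared error 1/2 * (t - o)^2."""
    return 0.5 * sum((t - output(weights, x)) ** 2 for x, t in examples)


def train(examples, eta=0.5, epochs=2000):
    """Per-example gradient descent: w_i <- w_i - eta * dE/dw_i."""
    n_inputs = len(examples[0][0])
    weights = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for x, t in examples:
            o = output(weights, x)
            # For a sigmoid unit, -dE/dw_i = (t - o) * o * (1 - o) * x_i.
            delta = (t - o) * o * (1.0 - o)
            weights[0] += eta * delta
            for i, xi in enumerate(x, start=1):
                weights[i] += eta * delta * xi
    return weights


if __name__ == "__main__":
    # Hypothetical feature vectors with targets (1 = exon, 0 = not exon).
    examples = [([0.9, 0.8], 1), ([0.8, 0.9], 1),
                ([0.1, 0.2], 0), ([0.2, 0.1], 0)]
    weights = train(examples)
    print(error([0.0, 0.0, 0.0], examples), error(weights, examples))
    print(output(weights, [0.7, 0.9]))
```

In a multilayer network the same update is applied to every weight, with the hidden-layer gradients obtained by backpropagation.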
Tutorial Overview
Method: Application
Neural Networks: Gene Finding
Decision Trees: Biological Text Mining
Hidden Markov Models: Protein Structure Prediction
Clustering Algorithms: Gene Expression Analysis
Self-Organizing Maps: Gene Expression Analysis
Bayesian Networks: Gene-Drug Dependency
Evolutionary Computation: Molecular Diagnosis
More Information
SNU CSE Biointelligence Laboratory
♦ http://bi.snu.ac.kr/
=> Courses
=> Tutorials
Center for Bioinformation Technology (CBIT)
♦ http://cbit.snu.ac.kr/
=> Seminars
=> Journal Club