notes #1
TRANSCRIPT
1/4/2010 TCSS588A Isabelle Bichindaritz 1
Introduction to class
Outline
• Introduction to class
• Introduction to machine learning / data mining
• Introduction to the Life Sciences
• Example and importance of microarray data
Introduction to Class
• This class focuses on learning how to apply data mining to biological and medical fields to solve some of their problems.
• Does not require prior knowledge in the application areas.
• Does not require prior knowledge in machine learning and/or data mining.
Introduction to Class
• Data mining specialized in:
– Statistical data analysis and inference: SPSS, R language
– Clustering: SPSS, GenePattern
– Machine learning: RapidMiner
– Classification: RapidMiner, R language
• Requirement: use biological datasets and/or medical datasets.
• Seattle area has many renowned research institutes.
Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
The Human Genome Project
• The Human Genome Project
Data Mining Motivation: “Necessity is the Mother of Invention”
• Data explosion problem
– Automated data collection tools and mature database technology
lead to tremendous amounts of data stored in databases, data
warehouses and other information repositories
• We are drowning in data, but starving for knowledge!
• Solution: Data warehousing and data mining
– Data warehousing and on-line analytical processing
– Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
What Is Data Mining?
• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
• Alternative names and their “inside stories”:
– Data mining: a misnomer?
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• What is not data mining?
– (Deductive) query processing
– Expert systems or small ML/statistical programs (although these are often a part of data mining)
What Is Data Mining?
• Data mining (knowledge discovery in databases) is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories.
• Machine learning and knowledge discovery are interested in the process of discovering knowledge that may be structurally or semantically more complex: models, graphs, new theorems or theories … in particular to assist scientific discovery.
Data Mining: A KDD Process
– Data mining: the core of the knowledge discovery process.
[Figure: the KDD process. Databases → data cleaning and data integration → data warehouse → selection → task-relevant data → data mining → pattern evaluation]
Machine Learning Functionalities (1)
• Concept description: characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Association (correlation and causality)
– Multi-dimensional vs. single-dimensional association
– age(X, “20..29”) ∧ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]
– contains(T, “computer”) → contains(T, “software”) [1%, 75%]
– Diaper → Beer [0.5%, 75%]
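The bracketed numbers on the association rules above are support and confidence. A minimal Python sketch of how they could be computed is shown here; the function names and the tiny `baskets` dataset are made up for illustration and do not come from the slides. Support is taken as the fraction of transactions containing all items of the rule, and confidence as support(antecedent and consequent) divided by support(antecedent).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Hypothetical market baskets, invented for this example only.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]

print(support(baskets, {"diaper", "beer"}))       # 2 of 4 baskets
print(confidence(baskets, {"diaper"}, {"beer"}))  # 2 of the 3 diaper baskets
```

A rule such as Diaper → Beer would typically be reported only if both values exceed user-chosen minimum thresholds.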
Machine Learning Functionalities (2)
• Classification and Prediction
– Finding models (functions) that describe and distinguish classes or concepts for future prediction
– E.g., classify countries based on climate, or classify cars based on gas mileage
– Presentation: decision-tree, classification rule, neural network
– Prediction: Predict some unknown or missing numerical values
• Cluster analysis
– Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
– Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
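The clustering principle above can be checked numerically for a given grouping. The sketch below assumes Euclidean distance as an inverse proxy for similarity (small distance means high similarity); that choice, the function names, and the sample points are illustrative assumptions, not part of the lecture.

```python
from itertools import combinations
from math import dist


def mean_pair_distance(pairs):
    """Average Euclidean distance over a list of point pairs."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def intra_and_inter_distance(points, labels):
    """Return (mean distance within clusters, mean distance between clusters)."""
    indexed = list(zip(points, labels))
    intra = [(p, q) for (p, lp), (q, lq) in combinations(indexed, 2) if lp == lq]
    inter = [(p, q) for (p, lp), (q, lq) in combinations(indexed, 2) if lp != lq]
    return mean_pair_distance(intra), mean_pair_distance(inter)


# Hypothetical 2-D points forming two visibly separate groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]

good = intra_and_inter_distance(points, [0, 0, 1, 1])  # groups match the data
bad = intra_and_inter_distance(points, [0, 1, 0, 1])   # groups cut across it
print(good, bad)
```

A grouping that respects the principle gives a small intra-cluster distance and a large inter-cluster distance; the second labeling does the opposite.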
Machine Learning Functionalities (3)
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the data
– It can be considered as noise or exception but is quite useful in fraud detection and rare-event analysis
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
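As one illustration of outlier analysis, the sketch below flags values whose z-score exceeds a threshold. The z-score rule, the threshold of 2.0, and the sample readings are assumptions chosen for the example; the slides do not prescribe a specific method.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical measurements with one value far from the rest.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
print(z_score_outliers(readings))
```

Because a single extreme value inflates both the mean and the standard deviation, this simple rule can miss outliers in small samples; more robust statistics are often preferred in practice.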
Are All the “Discovered” Patterns Interesting?
• A data mining or machine learning system/query may generate thousands of patterns; not all of them are interesting.
– Suggested approach: Human-centered, query-based, focused mining
• Interestingness measures: A pattern is interesting if it is easily
understood by humans, valid on new or test data with some degree of
certainty, potentially useful, novel, or validates some hypothesis that a
user seeks to confirm
• Objective vs. subjective interestingness measures:
– Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
– Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty,
actionability, etc.
Can We Find All and Only Interesting Patterns?
• Find all the interesting patterns: Completeness
– Can a data mining or machine learning system find all the interesting
patterns?
– Association vs. classification vs. clustering
• Search for only interesting patterns: Optimization
– Can a data mining or machine learning system find only the interesting
patterns?
– Approaches
• First generate all the patterns and then filter out the uninteresting ones.
• Generate only the interesting patterns—mining query optimization
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the confluence of database technology, statistics, machine learning, visualization, information science, and other disciplines]
Data Mining: Classification Schemes
• General functionality
– Descriptive data mining
– Predictive data mining
• Different views, different classifications
– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
Architecture of a Typical Data Mining System
[Figure: layered architecture, from top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; data cleaning, data integration and filtering; databases and data warehouse. A knowledge base sits alongside these layers.]
Introduction to the Life Sciences
• What is human DNA?
– DNA stands for DeoxyriboNucleic Acid
– DNA stores the genetic material, organized in chromosomes, in each cell nucleus
– DNA is transcribed into RNA, which then moves out of the nucleus (transcription)
– RNA stands for RiboNucleic Acid
– RNA is translated into proteins in the cytoplasm by an organelle called a ribosome (translation)
– DNA → RNA → proteins
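The DNA → RNA → protein flow above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the input is the coding strand (so transcription reduces to replacing T with U), the standard genetic code is used, translation starts at the first base, and the function names are invented for the example.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Return the mRNA matching a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon until a stop codon ('*') or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]  # KeyError on non-RNA symbols
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna, translate(mrna))
```

Real transcription reads the template strand and real translation begins at a start codon within the transcript; both details are left out here to keep the example short.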
Introduction to the Life Sciences
[Figure: DNA is transcribed into mRNA, rRNA and tRNA (transcription); translation at the ribosome produces protein]
Introduction to the Life Sciences
• Gene expression products are any molecular compounds produced from genes (e.g., RNA)
• Genes are expressed by being transcribed into RNA, and this transcript may then be translated into protein.
Introduction to the Life Sciences
• DNA and RNA are composed of
– Nucleotides (the monomers of nucleic acids), each carrying one base:
• Pyrimidines: cytosine (C) (DNA & RNA), thymine (T) (DNA), uracil (U) (RNA)
• Purines: adenine (A) (DNA & RNA), guanine (G) (DNA & RNA)
– Oses (sugars): ribose for RNA, deoxyribose for DNA
Introduction to the Life Sciences
• A succession of nucleotides composes a single strand of DNA
• Two strands of DNA pair themselves in the 3-D shape of a double helix, where bases are paired (bp = base pair)
• Pairing of the bases (A=T, G≡C) provides the hydrogen bonds that hold the double helix together.
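Base pairing means one strand determines the other. The sketch below derives the partner strand (the reverse complement, read 5’ to 3’) and counts hydrogen bonds using two bonds per A=T pair and three per G≡C pair; the function names are illustrative.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand):
    """Return the complementary strand, written in its own 5' to 3' direction."""
    return strand.upper().translate(COMPLEMENT)[::-1]


def hydrogen_bond_count(strand):
    """Total hydrogen bonds holding `strand` to its complementary strand."""
    return sum(HYDROGEN_BONDS[base] for base in strand.upper())


print(reverse_complement("ATGC"))
print(hydrogen_bond_count("ATGC"))
```

Sequences richer in G and C have more hydrogen bonds per base pair, which is one reason GC-rich DNA is harder to separate into single strands.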
Introduction to the Life Sciences
Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
Introduction to the Life Sciences
• Genes
– A gene is a part of the genome that can be transcribed
– A gene may encode a protein or RNA sequence
– Genes are separated by non coding regions
– Genes are concentrated in certain regions of the genome rich in G and C
– Regions rich in A and T tend to contain few genes
– Between the two, CpG islands (repetition of C and G) separate coding regions from non coding ones
– Non coding regions can be parts of genes
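Since gene-rich regions are described above as rich in G and C, a first screening step is often to measure GC content along a sequence. The sketch below computes GC content and the observed/expected CpG ratio, a statistic commonly used when looking for CpG islands; the ratio, the function names, and the window size are assumptions added for illustration, not material from the slides.

```python
def gc_content(seq):
    """Fraction of bases in `seq` that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_observed_expected(seq):
    """Observed CG dinucleotide count divided by the count expected from base composition."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


def gc_by_window(seq, window=4):
    """GC content of consecutive non-overlapping windows of length `window`."""
    return [gc_content(seq[i:i + window]) for i in range(0, len(seq), window)]


print(gc_content("ATGC"))
print(cpg_observed_expected("CGCG"))
print(gc_by_window("ATATGCGC"))
```

Windows used in practice are far longer than the toy value here, typically hundreds of bases.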
Introduction to the Life Sciences
• Genomes: diversity, size, structure
– Profound diversity of the genomes of living organisms
– DNA (cells), DNA or RNA (phage, virus)
– Direction: from 5’ to 3’ of molecule (double stranded DNA), or both directions (single stranded)
– Genome organized or not in chromosomes
– Human genome: 23 pairs of chromosomes (22 pairs of autosomes plus the sex chromosomes), 3 billion base pairs, an estimated 30,000 genes
– Genomes of other species vary in size and number of genes
– Human genome has only about twice as many genes as a primitive worm
– GenBank database
Introduction to the Life Sciences
• Proteomes
– The proteome is the set of proteins that can be expressed from a genome
– Determination of:
• Sequence of encoding genes
• Location of the genes
• Function of protein encoding genes
• Different biochemical states (phosphorylation, glycosylation, co-enzymes…)
Introduction to the Life Sciences
• Gene ontologies
– Gene Ontology Consortium
• Dynamic controlled vocabulary to describe
– Molecular function (Ex: DNA polymerase, …)
– Biological process (Ex: DNA synthesis, respiration, …)
– Cellular component (Ex: nucleus, ribosome, …)
Principles of Bioinformatics
• Biological information
– Molecules at the basis of life can be represented as digital symbol strings (DNA, RNA, …)
– Digital symbols (monomers) constitute an alphabet
– Unique representation
– Importance of probabilistic models
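To make the ideas of an alphabet and a probabilistic model concrete, the sketch below treats DNA as a string over {A, C, G, T}, estimates base probabilities from a training sequence, and scores another sequence by its log-likelihood. The independent-bases model, the pseudocount, and all names are simplifying assumptions for illustration; models used in practice (Markov chains, hidden Markov models) capture dependencies between positions.

```python
import math
from collections import Counter

DNA_ALPHABET = frozenset("ACGT")


def is_valid_dna(seq):
    """True if every symbol of `seq` belongs to the DNA alphabet."""
    return set(seq.upper()) <= DNA_ALPHABET


def base_frequencies(seq, pseudocount=1.0):
    """Estimate base probabilities from a valid DNA string, smoothed so none is zero."""
    counts = Counter(seq.upper())
    total = len(seq) + pseudocount * len(DNA_ALPHABET)
    return {b: (counts[b] + pseudocount) / total for b in sorted(DNA_ALPHABET)}


def log_likelihood(seq, freqs):
    """Log-probability of `seq` if each base is drawn independently from `freqs`."""
    return sum(math.log(freqs[base]) for base in seq.upper())


model = base_frequencies("AAAA")
print(model)
print(log_likelihood("AA", model), log_likelihood("TT", model))
```

Under this model a sequence resembling the training data receives a higher log-likelihood than one that does not.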
Principles of Bioinformatics
• Database annotation quality
– In addition to natural noise, data are distorted by people’s annotations (curation of the data)
– Resulting error is very significant
– Reasons:
• Storage of positions in a sequence, not content
• Difficulty of storing content
– Need to check the data
Principles of Bioinformatics
• Database redundancy
– Different representations: RNA, cDNA (the corresponding complementary DNA)
– Different methods: single-pass sequence, multi-fold repetition of a sequence
– Different fragments: pre-mRNA can lead to several levels of splicing in cDNA, alternative splicing
– Redundancy is a source of error:
• Bias of over-represented fragments for closely related segments
• Bias of over-represented fragments for correlations
• Overestimated prediction performance if input and output are related
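One simple way to limit the bias from over-represented fragments is to drop sequences that are nearly identical to one already kept. The sketch below is a greedy filter under strong assumptions: identity is measured position by position without alignment, and the 0.8 threshold is arbitrary. The function names are hypothetical, and dedicated redundancy-reduction tools rely on proper sequence alignment.

```python
def identity(a, b):
    """Fraction of positions where `a` and `b` agree, relative to the longer sequence."""
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def remove_near_duplicates(sequences, threshold=0.8):
    """Keep a sequence only if it is less than `threshold` identical to every kept one."""
    kept = []
    for seq in sequences:
        if all(identity(seq, other) < threshold for other in kept):
            kept.append(seq)
    return kept


# Hypothetical fragments: the second differs from the first at one position.
fragments = ["ACGTACGT", "ACGTACGA", "TTTTCCCC"]
print(remove_near_duplicates(fragments))
```

The result depends on input order, since the first member of each near-duplicate group is the one that survives.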
Principles of Bioinformatics
• Database redundancy
– Better to clean the data first
– Data mining cleaning methods apply
– Difficult to differentiate between truly analogous sequences and related ones
– A sequence profile describes amino acid variation in a family of sequences
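A sequence profile can be represented as one frequency table per aligned position. The sketch below assumes the family is already aligned to equal length with no gaps; the function names and the four-member toy family are invented for illustration.

```python
from collections import Counter


def sequence_profile(aligned):
    """Per-position residue frequencies for equal-length aligned sequences."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same aligned length")
    profile = []
    for column in zip(*aligned):  # one column per alignment position
        counts = Counter(column)
        profile.append({res: n / len(aligned) for res, n in counts.items()})
    return profile


def consensus(profile):
    """Most frequent residue at each position of the profile."""
    return "".join(max(column, key=column.get) for column in profile)


# Hypothetical family of short aligned peptide fragments.
family = ["MKV", "MRV", "MKI", "MKV"]
profile = sequence_profile(family)
print(profile)
print(consensus(profile))
```

Positions with a single dominant residue are conserved across the family, while positions with spread-out frequencies show where variation is tolerated.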
Principles of Bioinformatics
• Main bioinformatics questions
– Determine the exact transition between coding and non coding regions of genes
– Find genes in prokaryotes and eukaryotes
– Determine transcription initiation and termination
– Sequence clustering and cluster topology
– Protein structure prediction
– Protein function prediction
– Protein family classification
Principles of Bioinformatics
• Question
– Propose questions pertinent for bioinformatics
– Propose questions pertinent for medical informatics