notes #1

1/4/2010 TCSS588A Isabelle Bichindaritz

Introduction to class


Outline

• Introduction to class

• Introduction to machine learning / data mining

• Introduction to the Life Sciences

• Example and importance of microarray data


Introduction to Class

• This class focuses on learning how to apply data mining to biological and medical fields to solve some of their problems.

• Does not require prior knowledge in the application areas.

• Does not require prior knowledge in machine learning and/or data mining.


Introduction to Class

• Data mining specialized in

– Statistical data analysis and inference – SPSS, R-language

– Clustering – SPSS, Gene Pattern

– Machine learning – Rapid Miner

– Classification – Rapid Miner, R-language

• Requirement: use biological datasets and/or medical datasets.

• Seattle area has many renowned research institutes.


Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001


The Human Genome Project



Data Mining Motivation: “Necessity is the Mother of Invention”

• Data explosion problem

– Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

• We are drowning in data, but starving for knowledge!

• Solution: Data warehousing and data mining

– Data warehousing and on-line analytical processing

– Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases


What Is Data Mining?

• Data mining (knowledge discovery in databases):

– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

• Alternative names and their “inside stories”:

– Data mining: a misnomer?

– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

• What is not data mining?

– (Deductive) query processing.

– Expert systems or small ML/statistical programs, although these are often a part of data mining


What Is Data Mining?

• Data mining (knowledge discovery in databases) is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories.

• Machine learning and knowledge discovery are interested in the process of discovering knowledge that may be structurally or semantically more complex: models, graphs, new theorems or theories … in particular to assist scientific discovery.


Data Mining: A KDD Process

– Data mining: the core of the knowledge discovery process.

[Figure: the KDD process: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]


Machine Learning Functionalities (1)

• Concept description: Characterization and discrimination

– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

• Association (correlation and causality)

– Multi-dimensional vs. single-dimensional association

– age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]

– contains(T, “computer”) → contains(T, “software”) [1%, 75%]

– Diaper → Beer [0.5%, 75%]
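The bracketed numbers after each association rule are its support and confidence. As a minimal sketch, assuming transactions are represented as sets of items (the basket data and function names below are hypothetical, chosen only for illustration), the two measures can be computed like this:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    sup_a = support(transactions, antecedent)
    if sup_a == 0:
        return 0.0  # rule never applies, so confidence is undefined; report 0
    return support(transactions, both) / sup_a


# Hypothetical toy baskets, not data from the course.
baskets = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "bread"}),
    frozenset({"milk", "bread"}),
]

# Rule: diaper -> beer
rule_support = support(baskets, {"diaper", "beer"})
rule_confidence = confidence(baskets, {"diaper"}, {"beer"})
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

Support measures how often the whole rule occurs in the data, while confidence measures how often the right-hand side holds when the left-hand side does.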


Machine Learning Functionalities (2)

• Classification and Prediction

– Finding models (functions) that describe and distinguish classes or concepts for future prediction

– E.g., classify countries based on climate, or classify cars based on gas mileage

– Presentation: decision-tree, classification rule, neural network

– Prediction: Predict some unknown or missing numerical values

• Cluster analysis

– Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns

– Clustering based on the principle: maximizing the intra-class similarity and minimizing the inter-class similarity
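To make the clustering principle concrete, here is a small sketch using k-means (Lloyd's algorithm). K-means is one common clustering technique, not one the slides name, and the house coordinates and starting centroids below are hypothetical. Each point is assigned to its nearest centroid, which keeps members of a cluster close to each other and far from other clusters.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Hypothetical house locations forming two obvious groups.
houses = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = kmeans(houses, centroids=[houses[0], houses[3]])
print(labels)
```

Fixing the initial centroids makes the run deterministic; practical implementations usually pick them at random and repeat the run several times.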


Machine Learning Functionalities (3)

• Outlier analysis

– Outlier: a data object that does not comply with the general behavior of the data

– It can be considered as noise or exception but is quite useful in fraud detection and rare events analysis

• Trend and evolution analysis

– Trend and deviation: regression analysis

– Sequential pattern mining, periodicity analysis

– Similarity-based analysis

• Other pattern-directed or statistical analyses
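For the outlier analysis bullet above, one simple way to flag objects that do not comply with the general behavior of the data is a z-score test. This is only a sketch under assumptions: the threshold of 2 standard deviations and the readings are hypothetical, and many other outlier definitions exist.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population standard
    deviations away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical measurements with one value far from the rest.
readings = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 50.0]
flagged = zscore_outliers(readings)
print(flagged)
```

Because the mean and standard deviation are themselves pulled by extreme values, this test suits small illustrations better than heavily contaminated data.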


Are All the “Discovered” Patterns Interesting?

• A data mining or machine learning system/query may generate thousands of patterns; not all of them are interesting.

– Suggested approach: Human-centered, query-based, focused mining

• Interestingness measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm

• Objective vs. subjective interestingness measures:

– Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.

– Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.


Can We Find All and Only Interesting Patterns?

• Find all the interesting patterns: Completeness

– Can a data mining or machine learning system find all the interesting patterns?

– Association vs. classification vs. clustering

• Search for only interesting patterns: Optimization

– Can a data mining or machine learning system find only the interesting patterns?

– Approaches

• First generate all the patterns and then filter out the uninteresting ones.

• Generate only the interesting patterns: mining query optimization


Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the confluence of database technology, statistics, machine learning, visualization, information science, and other disciplines]


Data Mining: Classification Schemes

• General functionality

– Descriptive data mining

– Predictive data mining

• Different views, different classifications

– Kinds of databases to be mined

– Kinds of knowledge to be discovered

– Kinds of techniques utilized

– Kinds of applications adapted


Architecture of a Typical Data Mining System

[Figure: architecture layers, bottom to top: databases and data warehouse (fed through data cleaning, data integration, and filtering); database or data warehouse server; data mining engine; pattern evaluation; graphical user interface; plus a knowledge base]


Introduction to the Life Sciences

• What is human DNA?

– DNA stands for DeoxyriboNucleic Acid

– DNA stores the genetic material, organized in chromosomes, in each cell nucleus

– DNA is transcribed into RNA in the nucleus (transcription); messenger RNA then moves out of the nucleus

– RNA stands for RiboNucleic Acid

– RNA is translated into proteins by a cytoplasmic structure called a ribosome (translation)

– DNA → RNA → proteins



[Figure: DNA is transcribed into RNA (mRNA, rRNA, tRNA); at the ribosome, RNA is translated into protein]
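The DNA → RNA → protein flow can be sketched on strings. This is a simplified illustration, not a bioinformatics tool: it assumes the input is the coding strand (so transcription reduces to replacing T with U), and the codon table is deliberately partial, holding only the standard-genetic-code entries needed for the hypothetical example sequence.

```python
# Partial codon table (standard genetic code); "*" marks a stop codon.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "GCU": "A",  # alanine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop
}


def transcribe(coding_strand):
    """Coding-strand DNA to mRNA: same sequence with T replaced by U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")  # "?" = not in table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCTTGGTAA"  # hypothetical sequence
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)
```

A full implementation would carry all 64 codons and would look for the start codon rather than assume the reading frame begins at position 0.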


Introduction to the Life Sciences

• Gene expression products are any molecular compounds produced from genes (ex: RNA)

Genes are expressed by being transcribed into RNA, and this transcript may then be translated into protein.

http://en.wikipedia.org/wiki/File:Genetic_code.svg




• DNA and RNA are composed of

– Nucleotides (the building blocks of nucleic acids), whose bases are:

• Pyrimidines: Cytosine (C) (DNA & RNA), Thymine (T) (DNA), Uracil (U) (RNA)

• Purines: Adenine (A) (DNA & RNA), Guanine (G) (DNA & RNA)

– Sugars (ribose for RNA, deoxyribose for DNA)



• A succession of nucleotides composes a single strand of DNA

• Two strands of DNA pair in the 3-D shape of a double helix, where bases are paired (bp = base pair)

• Pairing of the bases (A=T, G≡C) provides the chemical (hydrogen) bonds responsible for the double helix shape.
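Because A pairs with T and G pairs with C, one strand determines the other. The following minimal sketch derives the partner strand from a given strand; the function names and the example sequence are hypothetical, and only the four unambiguous DNA letters are handled.

```python
# Translation table implementing the pairing rules A<->T and G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Base-by-base partner of `strand`, in the same left-to-right order."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand):
    """Partner strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]


forward = "ATGC"  # hypothetical fragment
paired = complement(forward)
partner = reverse_complement(forward)
print(paired, partner)
```

The reversal reflects that the two strands of the helix run in opposite directions, so the partner strand reads 5' to 3' from the other end.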





• Genes

– A gene is a part of the genome that can be transcribed (and, for protein-coding genes, translated)

– A gene may encode a protein or RNA sequence

– Genes are separated by non-coding regions

– Genes are concentrated in certain regions of the genome rich in G and C

– Regions rich in A and T are gene-poor

– Between the two, CpG islands (repetition of C and G) separate coding regions from non-coding ones

– Non-coding regions can be parts of genes (e.g., introns)
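Since gene-rich regions tend to be rich in G and C, GC content along a sequence is a simple first signal to compute. The sketch below measures it in fixed windows; the window size, step, and sequence are hypothetical, and real gene finding relies on far more than this one statistic.

```python
def gc_content(sequence):
    """Fraction of bases in `sequence` that are G or C."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(sequence, window, step):
    """GC content of each full window, as (start position, fraction) pairs."""
    return [
        (start, gc_content(sequence[start:start + window]))
        for start in range(0, len(sequence) - window + 1, step)
    ]


genome_fragment = "GCGCGCATATAT"  # hypothetical: GC-rich half, then AT-rich half
profile = windowed_gc(genome_fragment, window=6, step=6)
print(profile)
```

Windows with a high fraction would be candidate gene-dense regions under the rule of thumb above, and windows with a low fraction would be candidate gene-poor regions.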



• Genomes, diversity, size, structure

– Profound diversity among the genomes of living organisms

– DNA (cells), DNA or RNA (phage, virus)

– Direction: from 5’ to 3’ along the molecule (double-stranded DNA), or both directions (single-stranded)

– Genome organized or not in chromosomes

– Human genome: 23 pairs of chromosomes (22 autosomal pairs plus the sex chromosomes), about 3 billion bases, about 30,000 genes

– Genomes of other species vary in size and number of genes

– Human genome has only about twice as many genes as a primitive worm

– GenBank database



• Proteomes

– The proteome is the set of proteins that can be expressed from a genome

– Determination of:

• Sequence of encoding genes

• Location of the genes

• Function of protein encoding genes

• Different biochemical states (phosphorylation, glycosylation, co-enzymes…)



• Gene ontologies

– Gene ontology consortium

• Dynamic controlled vocabulary to describe

– Molecular function (Ex: DNA polymerase, …)

– Biological process (Ex: DNA synthesis, respiration, …)

– Cellular component (Ex: nucleus, ribosome, …)


Principles of Bioinformatics

• Biological information

– Molecules at the basis of life can be represented as digital symbol strings (DNA, RNA, …)

– Digital symbols (monomers) constitute an alphabet

– Unique representation

– Importance of probabilistic models



• Database annotation quality

– In addition to natural noise, data are distorted by people’s annotations (curation of the data)

– Resulting error is very significant

– Reasons:

• Storage of positions in a sequence, not content

• Difficulty of storing content

– Need to check the data



• Database redundancy

– Different representations: RNA, cDNA (the corresponding complementary DNA)

– Different methods: single-pass sequence, multi-fold repetition of a sequence

– Different fragments: pre-mRNA can lead to several levels of splicing in cDNA, alternative splicing

– Redundancy is a source of error:

• Bias of over-represented fragments for closely related segments

• Bias of over-represented fragments for correlations

• Overestimated prediction if input and output are related



• Database redundancy

– Better to clean the data first

– Data mining cleaning methods apply

– Difficulty of differentiating between truly analogous sequences and related ones

– Sequence profile describes amino acid variation in a family of sequences



• Main bioinformatics questions

– Determine the exact transition between coding and non-coding regions of genes

– Find genes in prokaryotes and eukaryotes

– Determine transcription initiation and termination

– Sequence clustering and cluster topology

– Protein structure prediction

– Protein function prediction

– Protein family classification



• Question

– Propose questions pertinent for bioinformatics

– Propose questions pertinent for medical informatics
