Post on 25-Dec-2015

Bio-Medical Informatics

Instructor: Hanif Yaghoobi

Website: site444703.44.webydo.com

E-mail: Hyiautcourse@gmail.com

My personal mail: hanifeyaghoobi@gmail.com

About this Course

• Activities during the semester (5 points):
– Homework
– MATLAB exercises

• Final project (3 points)

• Final exam (12 points)

Shortliffe

“Medical informatics is the rapidly developing scientific field that deals with resources, devices and formalized methods for optimizing the storage, retrieval and management of biomedical information for problem solving and decision making.”

Edward Shortliffe, MD, PhD, 1995

Organisms

• Classified into two types:

• Eukaryotes: contain a membrane-bound nucleus and organelles (plants, animals, fungi,…)

• Prokaryotes: lack a true membrane-bound nucleus and organelles (single-celled, includes bacteria)

• Not all single-celled organisms are prokaryotes!

Cells

• Complex system enclosed in a membrane

• Organisms are unicellular (bacteria, baker’s yeast) or multicellular

• Humans:
– 60 trillion cells
– 320 cell types

Example Animal Cell (image source: www.ebi.ac.uk/microarray/biology_intro.htm)

DNA Basics – cont.

• DNA in eukaryotes is organized into chromosomes.

Chromosomes

• In eukaryotes, the nucleus contains one or several double-stranded DNA molecules organized as chromosomes

• Humans:
– 22 pairs of autosomes
– 1 pair of sex chromosomes

Human Karyotype (image source: http://avery.rutgers.edu/WSSP/StudentScholars/Session8/Session8.html)

Image source: www.biotec.or.th/Genome/whatGenome.html

What is DNA?

• DNA: Deoxyribonucleic Acid

• Each strand is a chain of nucleotides (an oligomer or polynucleotide); two complementary strands pair to form the double-stranded molecule

• 4 different nucleotides, named after their bases:
– Adenine (A)
– Cytosine (C)
– Guanine (G)
– Thymine (T)

Nucleotide Bases

• Purines (A and G)

• Pyrimidines (C and T)

• The difference is in the base structure: purines have two fused rings, pyrimidines have one

Image source: www.ebi.ac.uk/microarray/biology_intro.htm

DNA

The Central Dogma

[Figure: Genome → Transcriptome → Proteome, connected by transcription and translation. Labels: protein synthesis, cell function, gene expression level.]

Genome

• The chromosomal DNA of an organism

• The number of chromosomes and the genome size vary considerably from one organism to another

• Genome size and number of genes do not necessarily determine organism complexity

Genome Comparison

ORGANISM                              CHROMOSOMES (haploid)   GENOME SIZE (bp)   GENES
Homo sapiens (Humans)                 23                      3,200,000,000      ~30,000
Mus musculus (Mouse)                  20                      2,600,000,000      ~30,000
Drosophila melanogaster (Fruit Fly)   4                       180,000,000        ~18,000
Saccharomyces cerevisiae (Yeast)      16                      14,000,000         ~6,000
Zea mays (Corn)                       10                      2,400,000,000      ???
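The claim that genome size and gene count do not track complexity can be checked with simple arithmetic on the table. The following is a minimal Python sketch that uses the figures exactly as listed on the slide; the names `genomes` and `genes_per_megabase` are illustrative, and the unknown corn gene count is kept as `None`.

```python
# Figures copied from the Genome Comparison table: (genome size in bp, gene count).
# None marks the entry the slide leaves unknown.
genomes = {
    "Homo sapiens": (3_200_000_000, 30_000),
    "Mus musculus": (2_600_000_000, 30_000),
    "Drosophila melanogaster": (180_000_000, 18_000),
    "Saccharomyces cerevisiae": (14_000_000, 6_000),
    "Zea mays": (2_400_000_000, None),
}

def genes_per_megabase(size_bp, genes):
    """Return the approximate gene density in genes per million base pairs."""
    if genes is None:
        return None
    return genes / (size_bp / 1_000_000)

density = {name: genes_per_megabase(*row) for name, row in genomes.items()}
for name, d in density.items():
    print(name, "unknown" if d is None else round(d, 1))
```

With these numbers, yeast packs far more genes per megabase than human, which illustrates why raw size is a poor proxy for complexity.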

DNA Basics – cont.

• The DNA in each chromosome can be read as a discrete signal over the alphabet {a, t, c, g}. (For example: atgatcccaaatggaca…)
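To process such a symbolic signal numerically, it first has to be mapped to numbers. One common choice, assumed here and not stated on the slide, is four binary indicator sequences, one per base. A minimal Python sketch (the function name `indicator_sequences` is illustrative):

```python
def indicator_sequences(dna):
    """Map a DNA string to four 0/1 sequences, one per base."""
    dna = dna.lower()
    return {base: [1 if ch == base else 0 for ch in dna] for base in "atcg"}

# The example fragment from the slide.
signal = indicator_sequences("atgatcccaaatggaca")
print(signal["a"])
```

At every position exactly one of the four sequences is 1, so the original string can be recovered from them.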

DNA Basics – cont.

• In genes (protein-coding regions), when proteins are assembled from amino acids, the nucleotides (letters) are read as triplets (codons). Each codon specifies one amino acid for protein synthesis (there are 20 amino acids), except the stop codons, which end translation.
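Reading triplets can be sketched in a few lines of Python. The table below is the standard genetic code written in the usual T, C, A, G ordering, with `*` marking a stop codon; the function names `codons` and `translate` are illustrative, and the input is assumed to be the coding strand starting in frame.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated TTT, TTC, TTA, TTG, TCT, ... GGG.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codons(dna):
    """Split a DNA string into complete triplets, dropping any leftover bases."""
    dna = dna.upper()
    return [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]

def translate(dna):
    """Translate a DNA string into one-letter amino acid codes ('*' = stop)."""
    return "".join(CODON_TABLE[c] for c in codons(dna))

print(translate("ATGCTCAGCGTGACCTCA"))
```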

DNA Basics – cont.

• There are 6 ways of converting the DNA signal into a codon signal, called the reading frames (3 offsets × 2 strand directions).
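The six reading frames can be enumerated directly: three offsets on the given strand and three on its reverse complement. A minimal Python sketch, with illustrative function names, applied to the fragment CATTGCCAGT used on the slides:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def six_frames(dna):
    """Return {(strand, offset): list of codons} for all six reading frames."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = [
                seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)
            ]
    return frames

frames = six_frames("CATTGCCAGT")
for key, value in frames.items():
    print(key, value)
```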

…CATTGCCAGT…

DNA Basics – Cont.

…CATTGCCAGT…

Start: ATG

Stop: TAA, TGA, TAG

[Figure: a gene composed of exons separated by introns]

Understanding Genome Sequences

~3,289,000,000 characters:

aattgtgctctgcaaattatgatagtgatctgtatttactacgtgcatat attttgggccagtgaatttttttctaagctaatatagttatttggacttt tgacatgactttgtgtttaattaaaacaaaaaaagaaattgcagaagtgt tgtaagcttgtaaaaaaattcaaacaatgcagacaaatgtgtctcgcagt cttccactcagtatcatttttgtttgtaccttatcagaaatgtttctatg tacaagtctttaaaatcatttcgaacttgctttgtccactgagtatatta tggacatcttttcatggcaggacatatagatgtgttaatggcattaaaaa taaaacaaaaaactgattcggccgggtacggtggctcacgcctgtaatcc cagcactttgggagatcgaggagggaggatcacctgaggtcaggagttac agacatggagaaaccccgtctctactaaaaatacaaaattagcctggcgt ggtggcgcatgcctgtaatcccagctactcgggaggctgaggcaggagaa tcgcttgaacccgggagcggaggttgcggtgagccgagatcgcaccgttg cactccagcctgggcgacagagcgaaactgtctcaaacaaacaaacaaaa aaacctgatacatggtatgggaagtacattgtttaaacaatgcatggaga tttaggttgtttccagtttttactggcacagatacggcaatgaatataat tttatgtatacattcatacaaatatatcggtggaaaattcctagaagtgg aatggctgggtcagtgggcattcatattgagaaattggaaggatgttgtc aaactctgcaaatcagagtattttagtcttaacctctcttcttcacaccc ttttccttggaagaaagctaaatttagacttttaaacacaaaactccatt ttgagacccctgaaaatctgggttcaaagtgtttgaaaattaaagcagag gctttaatttgtacttatttaggtataatttgtactttaaagttgttcca

. . .

Goal: Identify components encoded in the DNA sequence

Open Reading Frame

• A protein-encoding DNA sequence consists of a sequence of three-letter codons

• Starts with the START codon (ATG)

• Ends with a STOP codon (TAA, TAG, or TGA)

ATG CTC AGC GTG ACC TCA . . . CAG CGT TAA

 M   L   S   V   T   S  . . .  Q   R  STP

Finding Open Reading Frames

Try all possible starting points
• 3 possible offsets
• 2 possible strands

A simple algorithm finds all ORFs in a genome
• Many of these are spurious (not real genes)
• How do we focus on the real ones?
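The simple algorithm can be sketched as a scan of all six frames for an ATG followed in frame by a stop codon. This is a minimal Python illustration under simplifying assumptions, not the method of any particular tool: only the first ATG before each stop is kept, and `min_codons` is a hypothetical length filter added for this sketch.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def find_orfs(dna, min_codons=1):
    """Return (strand, start, end, sequence) for each ATG...stop run.

    start and end index the strand that was scanned ('-' is the reverse
    complement). min_codons counts codons including start and stop.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", dna.translate(COMPLEMENT)[::-1])):
        for offset in range(3):
            start = None
            for i in range(offset, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, seq[start:i + 3]))
                    start = None                   # close it at the stop codon
    return orfs

print(find_orfs("CCATGCTCAGCTAAGG"))
```

Raising `min_codons` removes short hits, which is one crude way to discard some spurious ORFs.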

Using Additional Genomes

Basic premise: “What is important is conserved”

Evolution = Variation + Selection
– Variation is random
– Selection reflects function

Idea:
• Instead of studying a single genome, compare related genomes
• A real open reading frame will be conserved

Phylogenetic Tree of Yeasts

[Figure: phylogenetic tree of the species S. cerevisiae, S. paradoxus, S. mikatae, S. bayanus, C. glabrata, S. castellii, K. lactis, A. gossypii, K. waltii, D. hansenii, C. albicans, Y. lipolytica, N. crassa, M. graminearum, M. grisea, A. nidulans, and S. pombe, with a "~10M years" annotation. Source: Kellis et al., Nature 2003]

Evolution of Open Reading Frame

S. cerevisiae   ATGCTCAGCGTGACCTCA . . .
S. paradoxus    ATGCTCAGCGTGACATCA . . .
S. mikatae      ATGCTCAGGGTGACA--A . . .
S. bayanus      ATGCTCAGG---ACA--A . . .

Figure labels: conserved positions; variable positions; a deletion; a frame shift changes the interpretation of the downstream sequence
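The four aligned sequences on this slide can be examined column by column. The Python sketch below marks the conserved positions and flags gap runs whose length is not a multiple of three, since only those shift the reading frame. Treating each gap run in isolation is a simplifying assumption of this sketch (two nearby gaps could cancel out), and the function names are illustrative.

```python
import re

alignment = {
    "S. cerevisiae": "ATGCTCAGCGTGACCTCA",
    "S. paradoxus":  "ATGCTCAGCGTGACATCA",
    "S. mikatae":    "ATGCTCAGGGTGACA--A",
    "S. bayanus":    "ATGCTCAGG---ACA--A",
}

def conserved_columns(seqs):
    """Return one boolean per column: True where all sequences share a base."""
    return [len(set(col)) == 1 and "-" not in col for col in zip(*seqs)]

def frame_shifting_gaps(seq):
    """Return (position, length) of gap runs whose length is not a multiple of 3."""
    return [(m.start(), len(m.group()))
            for m in re.finditer(r"-+", seq) if len(m.group()) % 3 != 0]

mask = conserved_columns(alignment.values())
shifts = {name: frame_shifting_gaps(seq) for name, seq in alignment.items()}
print(sum(mask), "conserved columns out of", len(mask))
print(shifts)
```

In this fragment the three-base deletion in S. bayanus keeps the frame, while the two-base gap near the end does not.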

Examples

[Figure: two example alignments, a spurious ORF and a confirmed ORF. Labels: frame shift, sequencing error, ATG not conserved, conserved, variable. Source: Kellis et al., Nature 2003]

A greedy algorithm for finding conserved ORFs is surprisingly effective (> 99% accuracy) on verified yeast data

Defining Conservation

Naïve approach
• Consensus between all species

Problem:
• Rough grained
• Ignores distances between species
• Ignores the tree topology

Goal:
• More sensitive and robust methods
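The naïve consensus score can be written as the fraction of species that share the most common character in each column. The Python sketch below is one possible reading of that approach; counting a gap as an ordinary character is an assumption made here, and the example sequences are the aligned yeast fragments from the earlier slide.

```python
from collections import Counter

def consensus_fraction(column):
    """Fraction of sequences that carry the most common character in a column."""
    return Counter(column).most_common(1)[0][1] / len(column)

def column_scores(seqs):
    """Score every alignment column; 1.0 means all species agree."""
    return [consensus_fraction(col) for col in zip(*seqs)]

seqs = [
    "ATGCTCAGCGTGACCTCA",  # S. cerevisiae
    "ATGCTCAGCGTGACATCA",  # S. paradoxus
    "ATGCTCAGGGTGACA--A",  # S. mikatae
    "ATGCTCAGG---ACA--A",  # S. bayanus
]
scores = column_scores(seqs)
print(scores)
```

Every species carries the same weight in this score, however closely related it is to the others, which is exactly the limitation the slide lists.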

[Figure: example alignment columns of A, C, G, and T, marked as conserved or variable, each with a percent-conservation value; the conserved example is labeled 100%]

Bioinformatics – an area of emerging knowledge

• Each cell of the body contains the whole DNA of the individual (about 40,000 genes in the human genome, each comprising from 50 to a million base pairs: A, T, C or G)

• The central dogma of molecular biology: DNA -> RNA -> proteins

• Transcription: DNA (about 5%) -> mRNA
– DNA -> pre-RNA -> splicing -> mRNA (only the exons)

• Translation: mRNA -> proteins
– Proteins make cells alive and specialised (e.g. blue eyes)
– Genome -> proteome

N.Kasabov, 2003
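The chain DNA -> pre-RNA -> splicing -> mRNA can be illustrated on a toy sequence. In this Python sketch the gene and the exon coordinates are made up for illustration, and transcription is simplified to rewriting the coding strand with U in place of T.

```python
def transcribe(dna):
    """Write the coding-strand DNA as RNA by replacing T with U."""
    return dna.upper().replace("T", "U")

def splice(pre_rna, exons):
    """Join the exon slices; exons is a list of (start, end) half-open intervals."""
    return "".join(pre_rna[start:end] for start, end in exons)

# Made-up toy gene: exon "ATGGCC", intron "GTAAGTTTTCAG", exon "CTGTAA".
dna = "ATGGCCGTAAGTTTTCAGCTGTAA"
exons = [(0, 6), (18, 24)]        # hypothetical exon coordinates

pre_rna = transcribe(dna)
mrna = splice(pre_rna, exons)
print(pre_rna)
print(mrna)
```

Only the exons remain in the mRNA, which is then read codon by codon during translation.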

Bioinformatics

• The area of science concerned with the development and application of methods, tools and systems for storing and processing biological information to facilitate knowledge discovery.

• Interdisciplinary: information and computer science, molecular biology, biochemistry, genetics, physics, chemistry, health and medicine, mathematics and statistics, engineering, social sciences.

• Biology, Medicine -> Information Science -> IT, Clinics, Pharmacy

• Links to health informatics, clinical DSS, pharmaceutical industry

N.Kasabov, 2003

N.Kasabov, 2003

Bioinformatics: challenging problems for computer and information sciences

• Discovering patterns (features) in DNA and RNA sequences (e.g. genes, promoters, ribosome binding sites (RBS), splice junctions)

• Analysis of gene expression data and predicting protein abundance

• Discovery of gene networks: genes that are co-regulated over time

• Protein discovery and protein function analysis

• Predicting the development of an organism from its DNA code (?)

• Modeling the full development (metabolic processes) of a cell (?)

• Implications: health; social,…

N.Kasabov, 2003

Problems in Computational Modeling for Bioinformatics

• An abundance of genome data, RNA data, protein data and metabolic pathway data is now available (see http://www.ncbi.nlm.nih.gov), and this is just the beginning of computational modeling in bioinformatics

• Complex interactions:
– between proteins, genes, DNA code
– between the genome and the environment
– much yet to be discovered

• Stability and repetitiveness: Genes are relatively stable carriers of information.

• Many sources of uncertainty:
– Alternative splicing
– Mutation in genes caused by: ionising radiation (e.g. X-rays), chemical contamination, replication errors, viruses that insert genes into host cells, aging processes, etc.
– Mutated genes express differently and cause the production of different proteins

• It is extremely difficult to model dynamic, evolving processes

Bioinformatics: Important Challenges

[Figure: DNA sequence {A,T,C,G} → transcription → gene expression level (microarray) → translation → protein sequence (e.g. KMLSLLMARTYW). Challenges: gene prediction, gene function, protein function, protein 3D structure. Data source: public databases.]

Gene Expression

Microarray
• What can it be used for?
• How does it work?
• What are the advantages?

An Example Application

Microarrays can be used for:
• Comparison of transcription levels between two cells

Examples of comparisons:
• Cells from a young mouse vs. cells from an old mouse
• Drug efficacy: treated cells vs. untreated cells

How it works: based on hybridization

[Figure: DNA probes anchored to the chip surface pair base by base with complementary mRNA. The probe ACTTGACC (complementary DNA TGAACTGG) binds the mRNA UGAACUGG; the probe ACTTAACC binds the mRNA UGAAUUGG.]
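Hybridization rests on base pairing between a DNA probe and the mRNA, which a short check makes concrete. In this Python sketch the two sequences are assumed to be written aligned position by position, as on the slide, and only perfect matches count; real hybridization tolerates some mismatches, which this illustration ignores.

```python
# DNA probe base -> the RNA base it pairs with.
PAIRS = {"A": "U", "C": "G", "G": "C", "T": "A"}

def hybridizes(probe, mrna):
    """True if every probe base pairs with the mRNA base at the same position."""
    return len(probe) == len(mrna) and all(
        PAIRS[p] == r for p, r in zip(probe, mrna)
    )

print(hybridizes("ACTTGACC", "UGAACUGG"))
print(hybridizes("ACTTGACC", "UGAAUUGG"))
```

A single differing base is enough to break the perfect match, which is the idea behind point-mutant controls.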

Probes and the printing process

[Figure: a print head with pins transfers probes from microtiter plates onto slides (100)]

23/2/2008

Microarray Technology

[Figure: probe (on chip), sample (labelled), and the resulting pseudo-colour image. Image from Jeremy Buhler]

Experimental design
• Track what’s on the chip
– which spot corresponds to which gene
• Duplicate experimental spots
– reproducibility
• Controls
– DNAs spotted on glass: positive probe (induced or repressed), negative probe (bacterial genes on human chip)
– oligos on glass or synthesised on chip (Affymetrix): point mutants (hybridisation plus/minus)

Images from scanner
• Resolution
– standard 10 µm [currently, max 5 µm]
– 100 µm spot on chip = 10 pixels in diameter
• Image format
– TIFF (tagged image file format), 16 bit (65,536 levels of grey)
– 1 cm x 1 cm image at 16 bit = 2 MB (uncompressed)
– other formats exist, e.g. SCN (used at Stanford University)
• Separate image for each fluorescent sample
– channel 1, channel 2, etc.
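The scanner figures follow from simple arithmetic, which the short Python sketch below reproduces. It assumes the resolution and spot size are given in micrometres and that the image is stored uncompressed; the function name and its parameters are illustrative.

```python
def scan_geometry(side_cm=1.0, resolution_um=10.0, bits_per_pixel=16, spot_um=100.0):
    """Return (pixels per side, image size in bytes, spot diameter in pixels)."""
    pixels_per_side = round(side_cm * 10_000 / resolution_um)  # 1 cm = 10,000 µm
    size_bytes = pixels_per_side ** 2 * bits_per_pixel // 8
    spot_pixels = spot_um / resolution_um
    return pixels_per_side, size_bytes, spot_pixels

print(scan_geometry())   # 1 cm x 1 cm at 10 µm and 16 bit
print(2 ** 16)           # number of grey levels in a 16-bit image
```

A 1 cm side at 10 µm per pixel gives 1000 x 1000 pixels, and at 2 bytes per pixel that is 2,000,000 bytes, matching the 2 MB on the slide.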

Images in analysis software
• The two 16-bit images (cy3, cy5) are compressed into 8-bit images
• Goal: display fluorescence intensities for both wavelengths using a 24-bit RGB overlay image
• RGB image:
– Blue values (B) are set to 0
– Red values (R) are used for cy5 intensities
– Green values (G) are used for cy3 intensities
• Qualitative representation of results
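The overlay construction can be sketched with plain Python lists. Dropping the low byte is assumed here as the 16-bit to 8-bit compression; analysis packages may use a different scaling, and the function names are illustrative.

```python
def to_8bit(value16):
    """Compress a 16-bit intensity (0..65535) to 8 bits by dropping the low byte."""
    return value16 >> 8

def overlay(cy3, cy5):
    """Build an RGB image (rows of (R, G, B) tuples) from two 16-bit images."""
    return [
        [(to_8bit(red), to_8bit(green), 0) for green, red in zip(row3, row5)]
        for row3, row5 in zip(cy3, cy5)
    ]

# Tiny 2 x 2 example images with made-up intensities.
cy3 = [[65535, 0], [65535, 256]]
cy5 = [[65535, 65535], [0, 512]]
rgb = overlay(cy3, cy5)
print(rgb)
```

Equal strong signals in both channels give yellow, a cy5-only signal gives red, and a cy3-only signal gives green.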

Images: examples

[Figure: cy3 image, cy5 image, and pseudo-color overlay]

Spot color   Signal strength       Gene expression
yellow       Control = perturbed   unchanged
red          Control < perturbed   induced
green        Control > perturbed   repressed
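The colour rule can be expressed as a comparison of the two intensities. In practice "equal" needs a tolerance, so this Python sketch uses a log2 ratio with a two-fold threshold; that threshold (`fold`) is a hypothetical choice for illustration, and both intensities are assumed to be positive.

```python
import math

def classify_spot(control, perturbed, fold=2.0):
    """Return (spot color, expression call) from control and perturbed intensities."""
    ratio = math.log2(perturbed / control)
    limit = math.log2(fold)
    if ratio >= limit:
        return "red", "induced"        # control < perturbed
    if ratio <= -limit:
        return "green", "repressed"    # control > perturbed
    return "yellow", "unchanged"       # control roughly equal to perturbed

print(classify_spot(1000, 1000))
print(classify_spot(500, 4000))
print(classify_spot(4000, 500))
```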

Data : DNA Microarray

[Figure: expression of gene 1, gene 2, and gene 3 measured by assays over time; x-axis: time (min), 0 to 60 in steps of 10]

Data Required: Gene Expression Matrix

      t1   t2   t3   t4
g1    0    1    2    1
g2    1    2    1    0
g3    0    1    1    1
g4    1    2    1    0
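Once the data are in matrix form, genes with similar profiles over time can be found by comparing rows. The Python sketch below uses Pearson correlation as the similarity measure, which is one common choice and an assumption here; the values are the ones shown in the matrix, and the function names are illustrative.

```python
import math

# Rows of the gene expression matrix: one list of values (t1..t4) per gene.
expression = {
    "g1": [0, 1, 2, 1],
    "g2": [1, 2, 1, 0],
    "g3": [0, 1, 1, 1],
    "g4": [1, 2, 1, 0],
}

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / norm

def most_similar(gene):
    """Return the other gene whose profile correlates best with the given one."""
    others = (g for g in expression if g != gene)
    return max(others, key=lambda g: pearson(expression[gene], expression[g]))

print(most_similar("g2"))
print(round(pearson(expression["g1"], expression["g3"]), 3))
```

Genes g2 and g4 have identical profiles, so they come out as perfectly correlated, the kind of pair a co-regulation analysis would group together.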

Data Required: Gene Expression Matrix

      a1   a2   a3   a4
g1    0    3    1    1
g2    1    2    1    0
g3    0    1    1    1
g4    1    2    1    0

Snap Shot

      t1   t2   t3   t4
g1    0    1    2    1
g2    1    2    1    0
g3    0    1    1    1
g4    1    2    1    0

Time series

• World Health Organization
