Sequence analysis of CpG islands reveals possible functional correlation between genes and its CpG island sequence Henry Hyun-il Paik Bioinformatics, School of Informatics Indiana University

Post on 18-Dec-2015

Sequence analysis of CpG islands reveals possible functional correlation between genes and its CpG island sequence

Henry Hyun-il Paik

Bioinformatics, School of Informatics

Indiana University

Outline

• What CpG islands are

• The Known Relations between CpG islands and Genes

• Motivation and Goal

• Data set

• Procedures

• Results

• Discussion

What are CpG islands?

• CpG dinucleotides are rare in mammalian DNA

• In mammals, DNA methylation occurs almost exclusively at CpG sites

• Methylated cytosines may be converted to thymine by deamination over evolutionary time
– CpG → TpG

• CpG islands are short stretches of DNA with a higher frequency of the CG dinucleotide

• They are usually not methylated

What are CpG islands?

• Definition from Gardiner-Garden & Frommer
– At least 200 bases long
– G+C content: > 50%
– Observed CpG/expected CpG ratio: >= 0.6

• Definition from Takai & Jones
– Longer than 500 bp
– G+C content: > 55%
– Observed CpG/expected CpG ratio: >= 0.65
– With this definition, CpG islands are more likely to be associated with the 5’ regions of genes, and most Alu repeats are excluded

• There are about 29,000 such regions in the human genome
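
The criteria above can be checked mechanically for a single candidate sequence. The following is a minimal sketch, assuming the usual Gardiner-Garden & Frommer form of the ratio, Obs/Exp = (number of CpG × N) / (number of C × number of G), where N is the sequence length. The function names `cpg_stats` and `is_cpg_island` are illustrative and do not come from the slides.

```python
def cpg_stats(seq):
    """Return (length, G+C fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    # "CG" cannot overlap with itself, so str.count gives the true CpG count.
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Assumed formula: Obs/Exp = (#CpG * N) / (#C * #G); 0.0 if C or G is absent.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return n, gc_fraction, obs_exp


def is_cpg_island(seq, min_len=200, min_gc=0.50, min_obs_exp=0.6):
    """Apply length, G+C, and Obs/Exp thresholds (defaults: Gardiner-Garden & Frommer)."""
    n, gc_fraction, obs_exp = cpg_stats(seq)
    return n >= min_len and gc_fraction > min_gc and obs_exp >= min_obs_exp
```

Passing stricter thresholds, for example `is_cpg_island(seq, min_len=500, min_gc=0.55, min_obs_exp=0.65)`, approximates the Takai & Jones definition. Scanning a whole genome would additionally need a sliding window, which this sketch leaves out.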

What CpG islands are?

CpG islands & Genes

• CpG islands located in the promoter regions of genes can play important roles in gene silencing

• Housekeeping genes
– Almost all housekeeping genes are associated with at least one CpG island
– These CpG islands start 5’ of the transcription start site and cover one or more exons and introns

• Tissue-specific genes
– About 40% of tissue-specific genes are associated with islands
– The position of these islands is not as strongly biased toward the transcription start site as in the housekeeping genes

CpG islands & Genes

• Not all CpG islands are associated with genes
– Ioshikhes & Zhang determined features that discriminate promoter-associated from non-associated CpG islands

• There are methylation-prone and methylation-resistant CpG islands
– Feltus et al. found patterns that discriminate methylation-prone from methylation-resistant CpG islands

CpG islands & Genes

[Figure: CpG island (CpGi) positions relative to a gene: promoter CpG islands at the 5’ end, CpG islands in the gene body, and CpG islands at the 3’ end]

Motivation and Objective

• Our project was inspired by these ideas

• The mechanical definition is used as it is
– At least 200 bases long
– G+C content: > 50%
– Observed CpG/expected CpG ratio: >= 0.6

• We tried to find the “semantic meaning” of CpG islands: the correlation between CpG islands and gene functions

• Are there any significant CpGi patterns related to gene function?

Motivation and Objective

[Diagram: Gene 1 with its CpG island (CpGi 1); Gene 2 with its CpG island (CpGi 2)]

We assume that gene 1 and gene 2 have similar functions

1) Then the gene 1 and gene 2 sequences are probably similar.

2) Our goal is to find CpGi patterns shared by genes that have similar functions

Data Set

• Reference: Larsen F., Gundersen G., Lopez R., Prydz H. CpG islands as gene markers in the human genome. Genomics 13:1095-1107 (1992)

• Total number of entries: 1711

• Entries with no islands: 1212

• Entries with islands: 499

• Total number of islands: 928

• The length of CpG islands
– Average size of islands: 465 bp
– Shortest detectable island: 200 bp
– Largest island: 3340 bp

Expression of gene | Number | Number associated with islands
Widespread | 217 | 216 (99%)
Limited | 719 | 261 (36%)

A Snapshot of the Data Set

Procedures

• Clustering: FASTA all-to-all comparison, then clustering by BAG

• Motif (pattern) discovery & search for each cluster: MEME, then MAST

• Database search with CpG island patterns: BLAST

Clustering

• We use a clustering program, BAG, by Sun Kim

• Each CpG island is compared against all CpG islands using FASTA; the comparison results are the input to BAG

• BAG builds clusters based on sequence similarity
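
To illustrate the general idea of turning all-to-all comparison scores into clusters, here is a minimal sketch. It is not BAG's algorithm, which the slides do not describe. It substitutes plain connected-components (single-linkage) clustering with a union-find structure, and the input format of `(id_a, id_b, score)` tuples, the score cutoff, and the name `cluster_by_similarity` are all assumptions made for illustration.

```python
def cluster_by_similarity(pairs, threshold):
    """Group sequence IDs: two IDs share a cluster if a chain of pairs
    scoring >= threshold connects them (single-linkage, NOT BAG itself)."""
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, score in pairs:
        root_a, root_b = find(a), find(b)  # registers both IDs even if not linked
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    # Sort members and clusters so the output is deterministic.
    return sorted((sorted(members) for members in clusters.values()),
                  key=lambda members: members[0])


# Hypothetical pairwise scores for five CpG islands "a".."e".
example_pairs = [("a", "b", 300), ("b", "c", 250), ("c", "d", 40), ("d", "e", 500)]
example_clusters = cluster_by_similarity(example_pairs, threshold=100)
```

Single-linkage clustering can chain loosely related sequences into one large cluster, which is one reason a dedicated tool would be preferred over this sketch.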

Motif Discovery & Search

• MEME discovers patterns for each cluster

• To assess the significance of a pattern, MAST searches all CpG islands with it

• The E-value shows how significant the pattern is and how often it occurs

• Profiles are made to represent each cluster
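
As a rough illustration of what a profile for a cluster can look like, the sketch below counts bases per column over pre-aligned, equal-length motif occurrences and derives a consensus sequence. It does not reproduce MEME's motif discovery or MAST's scoring. The function names and the toy sites are hypothetical, and the tie-breaking rule (alphabetical) is an arbitrary choice.

```python
from collections import Counter


def position_counts(sites):
    """Count bases at each column of equal-length, pre-aligned motif occurrences."""
    if not sites:
        raise ValueError("at least one site is required")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(site[i].upper() for site in sites) for i in range(width)]


def consensus(counts):
    """Most frequent base per column; ties are broken alphabetically."""
    return "".join(min(col, key=lambda base: (-col[base], base)) for col in counts)


# Toy, made-up sites standing in for motif occurrences within one cluster.
toy_sites = ["ACGT", "ACGA", "ACCT"]
toy_counts = position_counts(toy_sites)
toy_consensus = consensus(toy_counts)
```

A count matrix keeps the per-position variation that a single consensus string discards, so it is the more informative of the two representations.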

Motif Discovery & Search

BLAST

• All of GenBank was searched with the CpG island profile, not with the gene sequence

• We see how efficiently the profile finds genes that have similar functions

• This tests the validity of the profile

Results

• Of 115 clusters in total, 26 have members with similar gene function

• These 26 clusters fall into two categories depending on CpGi location
– 18 clusters have CpGi’s in the coding region
– 8 clusters have CpGi’s in the promoter region

Results

• One example of a CpGi in the gene body

• Cluster #18: Human heat-shock protein HSP70B' gene
– MEME
– MAST
– Profile sequence: ATCATCGCCAACGACCAGGGCAACCGCACCACCCCCAGCTACGTGGCCTT
– BLAST

Results

• One example of a promoter CpGi

• Cluster #25: Human gene for creatine kinase B
– MEME
– MAST
– Profile sequence: GAGGAGTCCTACGAAGTGTTCAAGGATCTCTTCGACCCCATCATTGAGGA
– BLAST

Gene & CpG islands in promoter region

Cluster | Description | Acc No.
7 | Human MAGE-4a antigen (MAGE4a) gene | U10687.1 U10687.3 U10687.4 U10687.2 U10687.5
14 | Aldose reductase gene | M59856.1 L14440.1
25 | Human creatine kinase | M60806.1 X15334.1
72_73 | Human metallothionein gene | M10942.1(arti) J03910.1 M13003.1 K01383.1
79_80 | Human gene for neurofilament subunit | X05608.1(arti) X15306.1 Y00067.2
85 | Phenylethanolamine N-methyltransferase gene | J03280.1 X52730.1
92 | Human U1 small nuclear RNA pseudogene | M14387.1 M28010.1 M28011.1
96 | Human trichohyalin (TRHY) gene | L09190.1 L09190.3

Gene & CpG islands in CDS

Cluster | Description | Acc No.
9 | alpha 2 adrenergic receptor gene | D13538.1 M23533.2 M34041.1 M67439.1(arti) M83181.1 M28269.1 X13556.1
10 | actin gene | M19283.2 M20543.2
13 | alkaline phosphatase gene | J03252.1 J03930.1 M31008.2
18 | Human heat shock protein | M19645.1 ARTI M59830.1 M11717.1 X51757.1
32 | Neurophysin gene | X62890.1 M11166.1 M11186.1
41 | Human v-erbA related ear-2 gene | X12794.1 X12795.1
52 | histone H1 (H1F4) gene | X57130.1 M60748.1 X57129.1
53 | histone H3 gene | X57128.1 M60746.1 M26150.1
54 | Human histone H4 (H4) gene | X60482.1 X60483.1 X60484.1 X00091.1 X00038.1 M16707.1 M60749.1 X60487.1 X67081.1 X60486.1
56 | serotonin receptor gene | K02405.1 K02773.1 ARTI K01499.1 X02228.1 M77285.1
58 | Human histone H2b gene | M60751.1 X57985.1 X00088.1
59 | Human histone H2a gene | M60752.1 X00089.1
64 | Human heat shock protein | X03901.1 L39370.1
69 | proto oncogene (JUN) | J04111.1 M29039.1
87 | Human beta-tubulin pseudogene | X00734.5 J00315.1
90 | H.sapiens gene for 28S rRNA V8 region | X69341.1 X69358.1 X69357.1 M11167.1
91 | Human POU domain factor (Brn-3a) gene | U10063.1 U10061.1

Discussion

• The BLAST results suggest that CpG islands in both the promoter region and the CDS are good markers for gene sequences

• Although promoter CpG islands are few in number, they represent their clusters well

• Since many CpG islands tend to cover exons, they can be used to identify transcripts

• More data are needed to support this result and to derive generic patterns

Acknowledgement

• Dr. Sun Kim

• Dr. Paul Ma

• Arvind

• Bioperl community

Comments & Questions