Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Peter K. Rogan, Ph.D.

St. Jude’s Children’s Research Hospital

Memphis, TN

May 15, 2003

• Information theory provides general solutions to the problem of how to recognize members of a group of related nucleic acid (or protein) sequences.

Background

• The average information of a related set of sequences, Rsequence, represents the total sequence conservation:

$R_{sequence} = \sum_{l} \left( 2 - \left[ -\sum_{b \in \{a,c,g,t\}} f(b,l) \log_2 f(b,l) + e(n(l)) \right] \right)$ (bits per site)

f(b,l) is the frequency of each base b at position l,

e(n(l)) is a correction for the small sample size n at position l

Schneider et al. J. Mol. Biol. 1984
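The Rsequence definition above can be sketched in a few lines of Python. This is a minimal illustration, not the Delila implementation: the toy alignment is hypothetical, and e(n) is taken to be the large-sample approximation 3 / (2 ln(2) n) for a four-letter alphabet, which is an assumption (an exact correction may be preferred for small n).

```python
import math
from collections import Counter

BASES = "acgt"

def small_sample_correction(n):
    """Assumed approximation of e(n) for 4 bases: (4 - 1) / (2 ln(2) n)."""
    return 3.0 / (2.0 * math.log(2) * n)

def rsequence(sites):
    """Sum over positions l of 2 - [H(l) + e(n(l))], in bits."""
    total = 0.0
    for l in range(len(sites[0])):
        counts = Counter(site[l] for site in sites)
        n = sum(counts[b] for b in BASES)
        # Uncertainty H(l) = -sum_b f(b,l) log2 f(b,l); unobserved bases add 0.
        h = -sum((counts[b] / n) * math.log2(counts[b] / n)
                 for b in BASES if counts[b] > 0)
        total += 2.0 - (h + small_sample_correction(n))
    return total

# Hypothetical toy alignment (not real binding sites).
toy_sites = ["gtaagt", "gtgagt", "gtaaga", "gtaagt",
             "gtatgt", "gtgagt", "gtaagg", "gtaagt"]
print(f"Rsequence = {rsequence(toy_sites):.2f} bits")
```

A fully conserved position contributes 2 - e(n) bits, and a position with equal base frequencies contributes slightly less than 0 bits after the correction.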

Sequence Logo

Conservation and diversity among related binding sites can be visualized using a sequence logo.

The area under the logo is Rsequence, the average information of the binding site.

Definition of Individual Information

• The individual information, Ri, of a single member of a sequence family is the dot product of that sequence vector and a weight matrix, Riw(b,l) (the Ri(b,l) matrix), based on the base frequencies at each position of the sequence.

$R_i(j) = \sum_{l} \sum_{b=a}^{t} s(b,l,j)\, R_{iw}(b,l)$ (bits per site j)

s(b,l,j) is 1 if base b occurs at position l of sequence j, and 0 otherwise.

Distribution of Individual Information for related binding sites

The average of the set of Ri values for a family of sequences is Rsequence.
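As a sketch of how individual information relates to Rsequence, the following Python builds a weight matrix from a hypothetical toy alignment and checks that the mean Ri of the aligned sites equals the average information. The weight formula 2 - [-log2 f(b,l) + e(n(l))], the approximation used for e(n), and the choice of minus infinity for bases never observed at a position are assumptions made for illustration.

```python
import math
from collections import Counter

BASES = "acgt"

def riw_matrix(sites):
    """Weight matrix: 2 - [-log2 f(b,l) + e(n(l))] for each base and position."""
    matrix = []
    for l in range(len(sites[0])):
        counts = Counter(site[l] for site in sites)
        n = sum(counts[b] for b in BASES)
        e_n = 3.0 / (2.0 * math.log(2) * n)  # assumed approximation of e(n)
        matrix.append({
            # Bases never seen at this position get -inf (a simplification).
            b: (2.0 + math.log2(counts[b] / n) - e_n) if counts[b] else float("-inf")
            for b in BASES
        })
    return matrix

def ri(site, matrix):
    """Individual information: add the weight of the base found at each position."""
    return sum(column[base] for base, column in zip(site, matrix))

# Hypothetical toy alignment (not real binding sites).
toy_sites = ["gtaagt", "gtgagt", "gtaaga", "gtaagt",
             "gtatgt", "gtgagt", "gtaagg", "gtaagt"]
matrix = riw_matrix(toy_sites)
ri_values = [ri(site, matrix) for site in toy_sites]
mean_ri = sum(ri_values) / len(ri_values)
print(f"mean Ri = {mean_ri:.2f} bits")
```

Because each site selects exactly one weight per position, averaging Ri over the sites weights each matrix entry by its base frequency, which reproduces the Rsequence sum.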

Second law of thermodynamics

$k_B T \ln 2 \le -q / R$

q: heat dissipated; T: temperature; R: information

HLH protein bound to WT DNA: q < 0 => R > 0

DNA mutation or unrelated sequence: q > 0 => R < 0
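To give a sense of scale for the second-law bound, this small Python sketch evaluates kB T ln 2 at body temperature. The temperature of 310.15 K and the helper names are illustrative assumptions. The second function applies only to the binding case above (q < 0, R > 0).

```python
import math

K_B = 1.380649e-23        # Boltzmann constant, J/K (exact SI value)
AVOGADRO = 6.02214076e23  # Avogadro constant, 1/mol (exact SI value)

def min_energy_per_bit(temperature_kelvin):
    """Lower bound kB * T * ln(2) on heat dissipated per bit gained, in joules."""
    return K_B * temperature_kelvin * math.log(2)

def max_information(heat_joules, temperature_kelvin):
    """Upper bound on R in bits when heat is released (q < 0): -q / (kB T ln 2)."""
    return -heat_joules / min_energy_per_bit(temperature_kelvin)

body_temp = 310.15  # 37 deg. C, chosen for illustration
e_bit = min_energy_per_bit(body_temp)
print(f"{e_bit:.3e} J per bit, {e_bit * AVOGADRO / 1000:.2f} kJ/mol per bit")
```

Under these assumptions, a site that gains R bits must dissipate at least R times this amount of heat.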

Among related sequences having a common function, functional sites can be distinguished from non-sites with the sequence walker. (E. coli Fis protein)

Sequence Walker Definition

Sequence Walker Application I

The matrix can be scanned along a “test sequence” until...

Ri = - 6.7 bits at position 179 of the sequence. The Z score is -5.4.

Sequence Walker Application II

… a green bar indicates a potential binding site

Ri = 9.2 bits at position 180 of the sequence. The Z score is 0.3.
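The scanning step can be sketched as sliding the weight matrix along the test sequence and reporting Ri at every position. The three-position matrix, the test sequence, and the mean and standard deviation passed to the Z score below are all hypothetical. The Z score is assumed here to be the distance of Ri from the mean Ri of the model's sites, in standard deviations.

```python
def scan(sequence, matrix):
    """Slide the weight matrix along the sequence; return (start, Ri) per window."""
    width = len(matrix)
    return [
        (start, sum(matrix[l][sequence[start + l]] for l in range(width)))
        for start in range(len(sequence) - width + 1)
    ]

def z_score(ri, mean_ri, sd_ri):
    """Assumed definition: standard deviations between Ri and the model mean."""
    return (ri - mean_ri) / sd_ri

# Hypothetical 3-position weight matrix (bits) that favors "atg".
toy_matrix = [
    {"a": 1.5, "c": -2.0, "g": -1.0, "t": -3.0},
    {"a": -2.5, "c": -3.0, "g": -1.5, "t": 1.8},
    {"a": -1.0, "c": -2.0, "g": 1.6, "t": -2.5},
]

# Positions with Ri > 0 are reported as potential binding sites.
hits = [(pos, ri) for pos, ri in scan("ccatgcatg", toy_matrix) if ri > 0]
for pos, ri in hits:
    print(f"position {pos}: Ri = {ri:.1f} bits")
```

Windows with Ri at or below zero are treated as non-sites, which is the distinction the walker displays graphically.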

mRNA splicing

[Figure: a gene with exons 1 and 2 (5’ to 3’), with donor and acceptor sites marked. Transcription followed by splicing yields mature mRNA or alternative mRNA.]

Splice Site Model Building

•We extracted coordinates of unique donor and acceptor splice sites of known genes from the given strand of the 10/7/00 Human Genome Working Draft.

•Valid splice junctions were evaluated by information theory (Ri > 0) and the Ri(b,l) matrix was computed.

•This process was iterated (~ 10 cycles) until all sites evaluated with the matrix had Ri > 0.
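The iteration described above can be sketched as follows: build a matrix, discard candidate sites with Ri at or below zero, and repeat until every remaining site scores above zero. This is an illustrative sketch only. The function names, the toy candidate list, and the approximation used for e(n) are assumptions, and the real procedure operated on genome coordinates rather than a short list of strings.

```python
import math
from collections import Counter

def build_matrix(sites):
    """Weight matrix 2 - [-log2 f(b,l) + e(n)] from the current site set."""
    n = len(sites)
    e_n = 3.0 / (2.0 * math.log(2) * n)  # assumed approximation of e(n)
    matrix = []
    for l in range(len(sites[0])):
        counts = Counter(site[l] for site in sites)
        matrix.append({b: 2.0 + math.log2(c / n) - e_n for b, c in counts.items()})
    return matrix

def score(site, matrix):
    """Ri of one site under the current matrix."""
    return sum(column[base] for base, column in zip(site, matrix))

def refine(sites, max_cycles=10):
    """Drop sites with Ri <= 0 and rebuild until all remaining sites have Ri > 0."""
    for cycle in range(1, max_cycles + 1):
        matrix = build_matrix(sites)
        kept = [s for s in sites if score(s, matrix) > 0]
        if not kept:
            raise ValueError("no sites left with Ri > 0")
        if len(kept) == len(sites):
            return sites, matrix, cycle
        sites = kept
    return sites, build_matrix(sites), max_cycles

# Hypothetical candidates: eight donor-like sites plus two misaligned junctions.
candidates = ["gtaagt"] * 8 + ["cccacc", "tgcctc"]
refined, final_matrix, cycles = refine(candidates)
print(f"{len(refined)} of {len(candidates)} sites kept after {cycles} cycles")
```

In this toy case the two misaligned entries score below zero in the first cycle and the model converges in the second.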

Parameters                       Acc (+ strand)  Acc (- strand)  Acc_total  Acc (1992)
Starting set (n)                 86,068          84,076          170,144    1,744
Refined model (n)                53,985          54,101          108,079    1,744
Site coordinates                 [-25, 2]        [-25, 2]        [-25, 2]   [-25, 2]
Rsequence (bits)                 7.45            7.41            7.42       8.87
Standard deviation (bits)        3.47            3.47            3.47       4.58
Ri of consensus sequence (bits)  22.93           22.78           22.88      21.68

Parameters                       Don (+ strand)  Don (- strand)  Don_total  Don (1992)
Starting set (n)                 86,221          84,229          170,450    1,799
Refined model (n)                56,286          55,491          111,772    1,799
Site coordinates                 [-3, 6]         [-3, 6]         [-3, 6]    [-10, 10]
Rsequence (bits)                 6.73            6.74            6.74       8.01
Standard deviation (bits)        2.36            2.33            2.34       3.29
Ri of consensus sequence (bits)  11.80           11.80           11.79      15.18

Semi-automated Splice site Model Refinement

• ~ 1/3 of exon-intron junctions are misaligned in the draft, owing to the rapid alignment procedures used (e.g., BLAT).

Splice junction logos: (+) strand

Ri analysis of sequence variation at binding sites

• Effects of mutations

• Effects of polymorphisms

• Detection of cryptic sites

• Relationship between information content and phenotype

Comparison of the binding energies of normal and variant splice junctions:

$G_{wt} / G_{v} = 2^{\Delta R_i}$

where ΔRi = the difference between the respective Ri values,
Gwt = free energy of the natural binding site,
Gv = free energy of the variant binding site.

The fold difference in binding the normal vs. the variant site is Gwt/Gv.
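The relationship above reduces to a one-line calculation. The sketch below uses hypothetical Ri values of 8.0 and 5.0 bits. The function name is an assumption chosen for illustration.

```python
def fold_change(ri_wt, ri_variant):
    """Fold difference in binding, 2 ** (Ri of natural site - Ri of variant site)."""
    return 2.0 ** (ri_wt - ri_variant)

# Hypothetical example: a mutation lowers a site from 8.0 to 5.0 bits.
print(f"{fold_change(8.0, 5.0):.1f}-fold weaker binding")

# Each bit lost corresponds to a 2-fold reduction; half a bit to about 1.41-fold.
print(f"{fold_change(5.5, 5.0):.2f}-fold")
```

A negative difference (the variant is stronger than the natural site) gives a value below 1, that is, stronger predicted binding by the variant.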

mRNA splicing mutations (*, ^)

[Figure: a gene with exons 1 and 2 (5’ to 3’), with donor and acceptor sites marked. Mutant forms: leaky or no wild type mRNA, or exon skipping (*; exons 1 and 3), or cryptic splicing (^; exons 1, 2 and 3).]

The minimum information required for donor site recognition

Temperature sensitive mutation in COL3A1 results in 50% exon skipping and Ehlers-Danlos syndrome, Type VII. Splicing is impaired at 39 deg. C and restored at 30 deg. C, which is consistent with weak binding by the U1 spliceosome.

Cryptic splicing mutations

A C->T mutation in exon 3 of the iduronidate synthetase gene activates a cryptic donor site upstream of the natural donor site.

Mechanism of exon recognition

[Figure: mRNA (5’ to 3’) with acceptor and donor binding sites. Labels: U2 splice + U2AF; U1 spliceosome.]

Mechanism of exon recognition: cryptic splicing mutation

[Figure: mRNA (5’ to 3’) with natural acceptor, natural donor and activated cryptic donor binding sites. Labels: U2 splice + U2AF; U1 spliceosome; "either not recognized or to lesser degree"; "recognized".]

Mild (or leaky) splicing mutation

Splicing among 3 common alleles that differ in length in the polymorphic polythymidine tract of the IVS 8 acceptor of the CFTR gene. The shortest allele (top walker) shows 90% skipping of exon 9 and is associated with congenital absence of the vas deferens. Individuals with the two longer alleles have a normal phenotype, although the 7T allele produces less mRNA than the 9T allele.

CFTR Polymorphism (5T, 7T, 9T)


Prediction of clinical phenotypes

• Hereditary non-polyposis colon cancer
• Hemophilia A and B
• Atherosclerosis

The Lynch I form of HNPCC is confined to the colon, but the more severe Lynch II type shows multi-organ involvement. The HNPCC phenotype is hypothesized to be related to the amount of normal and abnormal MLH1 and MSH2 mRNA present predicted from the individual information in mutant splice sites.

Mutant splice sites (n=31) in these genes contained significantly less information than the cognate natural sites. Each of the Lynch I mutations had Ri values >2.4 bits, which is consistent with reduction (not abolition) of mRNA. Lynch I and II phenotypes were distinguishable by their Ri values for all but 3 Lynch II mutations (with 2.4 to 4.8 bits).

Predicting Phenotype of HNPCC Splicing Mutations by Information Analysis

Lynch I mutations

Lynch II mutations

Hypothesis: Ri values will be highest for normal splice sites, intermediate for Lynch I and lowest for Lynch II syndrome.

The medians for these three groups are different and in the expected order, although there are some outliers in the two Lynch mutation groups. The three groups have significantly different Ri values (Kruskal-Wallis χ² (df=2) = 17.9833, P = 0.0001).

Each of the groups is different from the others based on pairwise comparisons with the Wilcoxon rank-sum test:

Group comparison               Corrected rank-sum normal (Z) statistic   P
Lynch I vs. Normal variants    2.68                                      0.0072
Lynch II vs. Normal variants   3.73                                      0.0002
Lynch I vs. Lynch II           2.17                                      0.03

Statistical analysis: HNPCC

Results are consistent with MSH2 -/- and MSH2 +/- transgenic mouse phenotypes. Increased proliferation induces widespread DNA replication errors, which are repaired normally until DNA repair systems are saturated (Cancer Res. 62:2092, 2002).

Mismatch repair machinery is activated by DNA damaging agents (Nature 399:806, 1999; PNAS 96:10704, 1999).

Relating Information Content of F8C and F9 Splicing Mutations and Bleeding Phenotypes

To predict severity of hemophilia, mutations in the factor VIII (F8C) or factor IX (F9) genes were analyzed for changes in Ri:

The receiver operating characteristic curve discriminated mildly or moderately reduced from severely reduced protein activity for values 2.4 bits or Ri < 7 bits (P=.001).

Using these thresholds:
- 91% of mutations with severely reduced protein expression were correctly identified (n=45; P< 0.001).
- 86% of mutations associated with severe bleeding and all mutations with moderate bleeding symptoms were correctly identified (n=22; P< .0009).

Information Content of Splicing Mutations in Lipid Metabolizing Genes vs. Phenotype

                         Phenotype*
Ri value cutoff (bits)   Dyslipidemia               Reduction in protein level or activity
                         Mild   Average   Severe    Mild   Average   Severe
> 2.4                    2/5    3/5       0/5       2/3    1/3       0/3

*Fraction is the number of mutations in category / total number above or below 2.4 bits. Mutant genes included APOAII, APOB, APOCII, APOE, CBS, CETP, LCAT, LIPA, LDLR, and LPL.

Generating information models of eukaryotic transcription factor cis-regulatory binding sites

Unique challenges:

• Variant sequences are not obvious
• Requires experimental determination and validation
• Effect of ascertainment bias
  – in published sites
  – in SELEX-generated sites
• Binding of the protein does not necessarily signify that it activates (or represses) transcription

(A) Mutation in the CCAAT box of the A-gamma globin gene results in 1.4-fold increased expression of fetal globin mRNA into adulthood. The CCAAT box protein binding site is strengthened by 0.5 bits (or 1.41-fold) over wild type. (B) The binding site logo and distribution of Ri values of 171 binding sites in the Transfac Database (www.biobase.de) are indicated. Models of NF-E2, GATA1, and GATA2 protein binding sites were also constructed, but sites were not found in this interval (not shown).

Greek Hereditary Persistence of Fetal Hemoglobin(HBGA, -119G>A)

6.8 bits

7.3 bits

The Transcription Factor Binding Site Problem: Bias in Models Derived from TRANSFAC data towards Consensus Sequences*

*Consensus sequences have the strongest binding, but are often not representative of the majority of sites.

Model development strategy

Refinement of the Pregnane X Receptor (PXR/RXRα) binding site model

Initial PXR/RXR Model. Published PXR/RXR binding sites (n=15; and flanking sequences) were multiply aligned by minimization of uncertainty. The -2 to +20 interval contained most of the information, was consistent with published binding studies, and was therefore used to define the site.

Competition Curves for Novel PXREs Identified by Model 1

To quantify the relative affinity of PXR/RXR, band density was plotted versus pmol competitor to determine the concentration of competitor required to deplete PXR/RXRα binding to the CYP3A4 proximal PXRE by 50%. Relative binding was normalized to the band intensity of the reactions with no added competitor as 100%.

Comparison of predicted and measured binding affinities for novel PXR/RXRα sites identified with the initial model

Predicted fold differences in binding were closer to densitometrically-determined differences when these weaker sites were added in Model 2.

                             Ri (bits)           Minimum theoretical     Observed change
         Position                                change in affinity      in affinity
Gene     (relative to ATG)   Model 1   Model 2   Model 1   Model 2       (EMSA)
CYP3A4   -270                17.3      18.0
CYP2B6   -8572               15.0      17.9      4.92      1.07          4.4
UGT1A3   -6930               10.9      17.2      84.4      1.74          4.4
UGT1A3   -8040               10.7      16.5      97.0      2.83          3.7
UGT1A6   -9216               9.9       14.3      168.9     13.0          29.6

[A further column showed the PXRE sequence walker derived from Model 2 for each site.]
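The tabulated minimum theoretical changes in affinity can be reproduced from the Ri values. The sketch below assumes that each change is 2 raised to the difference between the Ri of the CYP3A4 site and the Ri of the listed site under the same model. That assumption matches the tabulated numbers to rounding. The variable names are illustrative.

```python
# Reference site: (gene, position, Ri under Model 1, Ri under Model 2).
REFERENCE = ("CYP3A4", -270, 17.3, 18.0)

# (gene, position, Ri model 1, Ri model 2,
#  tabulated fold model 1, tabulated fold model 2, observed EMSA fold)
ROWS = [
    ("CYP2B6", -8572, 15.0, 17.9, 4.92, 1.07, 4.4),
    ("UGT1A3", -6930, 10.9, 17.2, 84.4, 1.74, 4.4),
    ("UGT1A3", -8040, 10.7, 16.5, 97.0, 2.83, 3.7),
    ("UGT1A6", -9216, 9.9, 14.3, 168.9, 13.0, 29.6),
]

def predicted_fold(ri_reference, ri_site):
    """Assumed relation: fold change in affinity = 2 ** (difference in Ri)."""
    return 2.0 ** (ri_reference - ri_site)

predictions = []
closer_with_model_2 = 0
for gene, pos, ri1, ri2, fold1, fold2, observed in ROWS:
    p1 = predicted_fold(REFERENCE[2], ri1)
    p2 = predicted_fold(REFERENCE[3], ri2)
    predictions.append((p1, p2))
    # Count sites where Model 2 lands nearer the EMSA measurement.
    if abs(p2 - observed) < abs(p1 - observed):
        closer_with_model_2 += 1
    print(f"{gene} {pos}: model 1 {p1:.2f}, model 2 {p2:.2f}, observed {observed}")

print(f"Model 2 is closer for {closer_with_model_2} of {len(ROWS)} sites")
```

The comparison uses the absolute difference in fold change. A comparison on a log scale would be an equally reasonable choice.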

(A) Alignment of published + validated PXREs

(B) Histogram (C) Sequence logo

Model 2 Characteristics

Scans of CYP3A4 and CYP2B6 promoters

Each promoter was scanned with PXR/RXR model 2. Ri values are plotted versus the position of the PXRE in the CYP3A4 gene or the CYP2B6 gene. Ri values of sites on the antisense strand are shown upside down. Previously characterized PXR binding sites identified by the model are indicated in color.

Activation of the CYP2B6 Distal PXRE

Transient transfections with CYP2B6 and control CYP3A4 PXRE fusion constructs. Rifampin induced luciferase activity 4- to 5-fold in cells cotransfected with an expression plasmid for human PXR and CYP2B6-dPXRE(2X)-luc, and 2- to 3-fold in cells cotransfected with CYP3A4-pPXRE(2X)-luc. Rifampin had no effect on luciferase activity in cells transfected with the enhancerless reporter.

Average luciferase activity ± SD of three replicates from 3 independent transfections is shown.

PXR/RXR Model 3

Weaker binding sites from well established PXR/RXRα target gene promoters (Ri < Rsequence) were validated and introduced into Model 3.

Novel validated binding sites in Model 4

Site name Site name - Ri(b,l) matrix Ri

CYP3A4-pPXRE(0/10G) NG_000004.a148729g.a148739g 15.1

CYP3A-dNR1(0/10G) NG_000004.t141178c.t141168c 16.8

CYP3A7-dNR2(0/10G) NG_000004.a190205g.a190215g 17.6

CYP2B6-dPXRE(10G) CYP2B6.a1446g 16.2

UGT1A3b(0/10G) AF297093.t137695c.t137685c 18.3

UGT1A3a(0/10G) AF297093.a138805g.a138815g 14.9

GSTM1(0/10G) AC000031.6.a1959g.a1969g 12.0

UGT1A1gtNR1(0/10G) AF297093.1.t171676c.t171666c 7.1

UGT1A1b(0/10G) AF297093.1.t165761c.t165751c 14.0

FMO4b(10G) AL031274.1.a57947g 11.0

catalase(0/10G) AL035079.14.t43503g.a43513g 14.6

NOS2A(1A) chr17_27002541-27012540.c8336t 12.9

NOS2A(11A) chr17_27002541-27012540.c8326t 10.5

MAOBd(0/10G) Z95125.t36576c.t36566c 11.1

These 14 binding sites are not present in the Nov 02 human genome draft!

Possible significance of novel sites

• Not present in reference sequence, but they are polymorphisms or mild mutations
  – Advantage is that binding is not abrogated, but reduced, i.e., the gene is less PXR/RXR responsive.
  – Possible “wobble” code for regulatory elements

• Ancestral binding sequence present in primate lineage
  – PXR/RXR mutation rate is slower than that of the cis-regulatory element; protein retains ability to recognize sequences that are no longer present
  – This could explain why heterologous cross-species transfections are faithfully regulated.

Development of a Xenobiotic biosensor based on the information theory-derived optimal site

[Figure: luciferase activity (LU) of the PXREv2-OPT(2X)-luc and CYP3A4-pPXRE(2X)-luc reporters after treatment with DMSO or 10 µM rifampin.]

HepG2 cells were transiently transfected with 100 ng luciferase reporter, 5 ng pRL-CMV and 25 ng pSG5-hPXRΔATG with Lipofectamine Plus. After treatment for 24 hours with 10 µM rifampin or 0.1% DMSO (solvent), cells were harvested and Dual-luciferase assays were performed. Results are the average of three separate wells transfected and treated in parallel.

Architecture of the Delila Genome System

Performance metrics

Histogram of binding site strengths for sites in genome scan >10 bits

Delila-Genome Visualization Tools

Visualization of successive genome scans of PXR/RXRα binding site models

Monitoring PXR/RXR refinement through complete genome promoter scans

Development and Experimental Refinement of NFkB p65/p50 Binding Site Model

Panel 1. Logos for NFkB p50/p65 binding sites. (A) Model 2 based on 55 published and 8 experimentally determined binding sites. (B) Model 3 based on 55 published and 20 experimentally determined binding sites. Insets are histogram distributions of Ri values of sites comprising each model.

CYP2D6 Promoter Mutation Analysis of NFkB p65/p50 Binding Site

CYP2D6:

“C allele” 3.3 bits

“G allele” -0.8 bits

The -1496C allele contains a weak p50/p65 site (–1495 to –1508; Ri = 3.3 bits) that is abolished (Ri < 0) in the G variant. These alleles each also contain p50 homodimer binding sites on opposite strands; however, the C allele is predicted to bind with greater affinity (3.5 vs. 2.7 bits; 1.6 fold difference). The higher CYP2D6 activity observed for the –1496G allele may be due to reduced binding and repression of CYP2D6 expression by NF-kB p50 homodimers.

Future efforts

• Automate binding site validation

• Genomic signature of PXR/RXRα – target genes

• (Hypothesis-based microarray studies of ligand-induced gene expression)

Automated binding site validation: microtiter plate immunoassay

• Covalently link reference oligo to plate
• Bind synthetic PXR/RXRα ± competitor oligo*
• Bind 1° RXRα (or PXR) antibody
• Detect with 2° antibody/HRP
• (Automated with Biomek 2000 workstation)

*Competitor oligos are detected in PXR/RXRα target genes and exhibit Ri values that are within ±2 bits of the reference oligo.

Genomic analysis to identify genes regulated by transcription factors:

• Requires robust binding site model
• Genomic signature should delineate differences between regulated and constitutively expressed genes:
  – Define promoter interval
  – Binding site strength
  – Densities of sites
  – Organization of sites

“NF-kB binding sites” in gene promoters

[Figure: sites in genes regulated by NF-kB and in unregulated genes, plotted versus position (-10000 to 0). Legend: Ri-reg, Ri-unreg.]

NF-kB Binding Sites in Upregulated Genes

[Figure: NF-kB binding sites for promoters of upregulated genes scanned by model 3, plotted versus position (-400 to 0 bp). Legend: IFN-beta, E-Selectin, Lymphotoxin, TNF-alpha, GM-CSF, Urokinase. Ri = 4.0.]

“NF-kB binding sites” in genes not known to be regulated by NF-kB

[Figure: sites plotted versus position (-400 to 0 bp). Ri = 1.3.]

Criteria for scanning chromosomes 21/22 with NF-kB Model 3:

• Average information threshold of >4 bits. Of 548 promoter intervals (400 bp each), the mean Ri value of sites exceeded the threshold in 138 promoters on the transcribed strand and 137 on the antisense strand. 37% of the genes on chromosome 21 would be NF-kB targets!! Also, multiple weak binding sites with low Ri values can falsely exclude genes containing strong binding sites. This genomic signature has very LOW specificity.

• Eliminate promoters with only weak binding sites (Ri < Rsequence). This signature identifies a smaller set of genes: 11 and 19, respectively, on chromosomes 21 and 22. Several expected cytokine genes are not identified with this genomic signature. These criteria introduce bias towards the consensus sequence (or an incomplete model). This approach appears to lack adequate sensitivity.
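The two criteria above can be contrasted with a short sketch. The Ri values for the two promoters and the Rsequence value of 10.0 bits are hypothetical and chosen only to show how each rule can fail. The function names are assumptions.

```python
from statistics import mean

def passes_mean_threshold(site_ri, threshold=4.0):
    """Criterion 1: mean Ri of all sites in the promoter exceeds the threshold."""
    return bool(site_ri) and mean(site_ri) > threshold

def has_strong_site(site_ri, rsequence):
    """Criterion 2: keep the promoter only if some site has Ri >= Rsequence."""
    return any(ri >= rsequence for ri in site_ri)

HYPOTHETICAL_RSEQUENCE = 10.0  # bits, illustrative value only

# One strong site diluted by several weak ones: excluded by the mean rule.
promoter_a = [14.0, 1.0, 1.0, 1.0, 1.0]
# Only moderate sites: passes the mean rule but has no strong site.
promoter_b = [5.0, 5.0, 4.5]

for name, sites in (("A", promoter_a), ("B", promoter_b)):
    print(name,
          "mean rule:", passes_mean_threshold(sites),
          "strong-site rule:", has_strong_site(sites, HYPOTHETICAL_RSEQUENCE))
```

Promoter A shows how weak sites can falsely exclude a gene with a strong site under the averaging rule, and promoter B shows how the strong-site rule discards genes whose sites are all weaker than Rsequence.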

Genomic signature determination for PXR/RXR with machine learning approach

[Flowchart. Labels: promoter region input (true positives and true negatives for training/validation; unknowns for prediction); frequency distribution of binding strengths; distances from TSS; Markov Cluster Algorithm; clusters of sites; hybrid neural network; positive/negative prediction; experimental confirmation (positive, negative); genome scan.]

Predictions of Binding Strength Network

• Network Input: Frequency distributions of binding sites based on 5 bit-wide bins

• Trained with 15 PXR/RXR responsive and 15 non-responsive promoter regions

• Results of testing 9 positive and 22 negative promoter regions:
  – <TP,FP,TN,FN> = <7,4,18,2>
  – Sensitivity = 77.8%
  – Specificity = 81.8%
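The reported sensitivity and specificity follow directly from the confusion counts. A minimal check in Python, with function names chosen for illustration:

```python
def sensitivity(tp, fn):
    """Fraction of truly responsive promoters that were predicted positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of non-responsive promoters that were predicted negative."""
    return tn / (tn + fp)

tp, fp, tn, fn = 7, 4, 18, 2  # counts reported for the test set
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"Sensitivity = {100 * sens:.1f}%")
print(f"Specificity = {100 * spec:.1f}%")
```

The denominators are the 9 positive and 22 negative test promoters, respectively.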

In conclusion...

•Genetic variation in binding sites can be comprehensively modeled by information theory.

•Information is related to binding energy and can be used to rank order binding strengths.

•Beware of experimental bias towards strong binding sites. Information theory can be used to develop and refine binding site models that are representative of the range of binding strengths found in the genome.

•Robust binding site models are a prerequisite for accurate mutation/polymorphism analysis and for comprehensive identification of binding sites in the genome.

Contributors

Children’s Mercy Hospital and Clinics:
• Sashidar Gadiraju, Stan Svojanovsky
• J. Steven Leeder, Carrie Vyhlidal, Ivy Hurwitz

SICE, University of Missouri-Kansas City:
• Deendayal Dinakarpandian, Saumil Mehta

St. Jude’s Children’s Research Hospital:
• Erin Schuetz

University of Hamburg:
• Yskert von Kodolitsch

NCI:
• Tom Schneider

Support

Merck Genome Research Foundation
PHS ES10855-02
