gene structure - duke university structure previous lectures ... (same code used in all organisms,...

14
99 GENE STRUCTURE Previous lectures have detailed the chemistry of the DNA molecule, the genetic material, as well as the mechanisms for replicating and maintaining the integrity of the DNA. We now want to understand the functional aspects of DNA as the genetic material. How does this DNA molecule function as a gene, what is a gene, and how can we consider mutation in the context of the gene? The Genetic Code The DNA molecule is a linear array of nucleotides. In most cases, we think of the purpose of the genetic information as storing the information as a code for a linear array of amino acids that constitute a protein. It is also true, that some genes encode RNA molecules that function as RNAs and do not encode proteins. Examples include the ribosomal RNAs that form the ribosome structure as well as playing a catalytic role in protein synthesis, the transfer RNAs (tRNAs) that carry specific amino acids to the translation machinery, and small nuclear RNAs (snRNAs) that play a critical role in the processing of mRNA precursors by splicing. Most genes, however, function to encode proteins. Twenty amino acids are utilized in the synthesis of proteins. Therefore, since there are only four possible nucleotides, a single nucleotide cannot code for an amino acid. Pairs of two would also be insufficient. However, a group of three nucleotides would give the potential for 64 specific codons (4 x 4 x 4), thus sufficient to code for all amino acids. This is in fact the code - a triplet code such that every three nucleotides encodes one amino acid. Properties of the genetic code: - Read in groups of three - Unambiguous (a triplet codon specifies a unique amino acid) - Degenerate (more than one codon specifies an amino acid) - Stop codons

Upload: lyxuyen

Post on 31-Mar-2018

228 views

Category:

Documents


4 download

TRANSCRIPT

99

GENE STRUCTURE

Previous lectures have detailed the chemistry of the DNA molecule, the genetic material,

as well as the mechanisms for replicating and maintaining the integrity of the DNA. We now

want to understand the functional aspects of DNA as the genetic material. How does this DNA

molecule function as a gene, what is a gene, and how can we consider mutation in the context of

the gene?

The Genetic Code

The DNA molecule is a linear array of nucleotides. In most cases, we think of the purpose of

the genetic information as storing the information as a code for a linear array of amino acids that

constitute a protein. It is also true, that some genes encode RNA molecules that function as

RNAs and do not encode proteins. Examples include the ribosomal RNAs that form the

ribosome structure as well as playing a catalytic role in protein synthesis, the transfer RNAs

(tRNAs) that carry specific amino acids to the translation machinery, and small nuclear RNAs

(snRNAs) that play a critical role in the processing of mRNA precursors by splicing.

Most genes, however, function to encode proteins. Twenty amino acids are utilized in the

synthesis of proteins. Therefore, since there are only four possible nucleotides, a single

nucleotide cannot code for an amino acid. Pairs of two would also be insufficient. However, a

group of three nucleotides would give the potential for 64 specific codons (4 x 4 x 4), thus

sufficient to code for all amino acids. This is in fact the code - a triplet code such that every

three nucleotides encodes one amino acid.

Properties of the genetic code:

- Read in groups of three

- Unambiguous (a triplet codon specifies a unique amino acid)

- Degenerate (more than one codon specifies an amino acid)

- Stop codons

100

The genetic code is universal (same code used in all organisms, both prokaryotic and

eukaryotic) with one exception - there are a few differences in the code used in mitochondria.

? UGA is not a stop signal but codes for trypophan. The mit tRNAtrp recognizes both UGG and

UGA, obeying traditional wobble rules.

? Internal methionine is encoded by both AUG and AUA; initiating methionines are specified

by AUG, AUA, AUU, and AUC.

? AGA and AGG are not arginine codons but are stop codons. Thus, there are four stop codons

(UAA, UAG, AGA, and AGG) in the mitochondrial code.

A mutation is any heritable change in the genetic material, resulting in an alteration of the DNA

sequence. Mutations are usually considered in the context of a change that alters gene function

and thus the phenotype of the organism. Understanding the molecular mechanisms responsible

for mutations, either simple changes in DNA sequence or more drastic deletions, insertions, or

rearrangements of DNA material, as well as the mechanisms responsible for recognizing and

101

correcting these alterations, is of central importance to the understanding of disease

mechanisms.

The Nature of Mutations –Mutations, which can alter the coding properties of a DNA segment,

are of several types:

A. Substitution mutations convert one type of base pair into another. G-C to A-T and A-T to

G-C changes are referred to as transition mutations (replacement of a purine to pyrimidine

base pair by a purine to pyrimidebase pair). G-C to C-G, G-C to T-A, A-T to T-A, and A-T to

C-G are called transversions (replacement of a purine-pyrimidine base pair by a pyrimidine-

purine base pair). Although transitions are more common than transversions, both kinds of

mutations occur as a consequence of replication errors, both can result from chemical

damage to DNA, and both have been implicated as causative factors in inherited genetic

disease and cancer. Single nucleotide changes can change the codon to that of another amino

acid, thus altering the protein. In addition, such changes can also create a stop codon.

B. Small insertions/deletions comprise a second relatively common class of mutation.

Genetic changes of this sort involve insertion or loss of a small number of contiguous base

pairs (one to several hundred). Repetitive runs of a mono, di-, or trinucleotide sequence are

extremely prone to insertion/deletion mutation, an effect that has been attributed to slippage

of template and primer strands during replication:

102

Repeat elements like (CA) n shown above or the (A) n element, which contains a run of

adenine residues on one strand paired with a run of thymine bases on the other, are very

common in human chromosomes. For example, about 50,000 (CA) n repeats are distributed

throughout the human genome, with each repeat element typically containing 10 to 60

copies of the (CA) dinucleotide (eg., n = 10 to 60). Due to their propensity to slip during

DNA biosynthesis, repetitive sequences are particularly prone to mutation. As described

below, a high incidence of mutations in (CA) n microsatellite sequences is a valuable

diagnostic for certain human malignancies.

Deletions or insertions will result in a frameshift if it is not a multiple of three base pairs.

DNA Sequence Protein Sequence Type of Mutation

ATG AAA TTT TGT CGT AAA MET LYS PHE CYS ARG LYS Wild type

ATG AAc TTT TGT CGT AAA MET asn PHE CYS ARG LYS Missense

ATG AAC TTT TGa CGT AAA MET ASN PHE stop Nonsense

ATG AAA T TGT CGT AAA MET LYS leu ser stop Deletion/Frameshift

Definition of a Gene

Traditionally, a gene has been defined as either the unit of heredity or defined as that

portion of a chromosome encoding a functional RNA or protein. Although these two views

generally coincide in the case of prokaryotic genes, the situation is much more complex in the

case of a eukaryotic gene. A prokaryotic gene is relatively simple in structure, including the

coding sequence to specify the synthesis of a protein and a minimal amount of regulatory

sequence to control the expression of the gene. In contrast, a eukaryotic gene can be vastly more

complex and can occupy large regions of chromosomes. This is due to the fact that most

eukaryotic genes, particularly those in mammalian cells, are discontinuous. That is, coding

regions are often separated by non-coding sequence. More importantly, the regulatory sequences

that are responsible for the expression of the gene can be complex and separated by large

distances from the actual gene sequence. Since a gene must ultimately be defined in a phenotypic

sense, then the expression of the gene is critical - phenotype is determined not only by the

sequence of a particular protein but also by the ability of a given cell to express that protein. In

consideration of all of these issues, the definition of a functional gene would be those DNA

sequences necessary to achieve the normal expression of the gene product.

103

Obviously, the ability to precisely define a gene is critical for an understanding of the

basis of gene function, including the nature of sequences important for the normal, regulated

expression of the gene. In addition, a knowledge of the gene structure and sequence is critical

for evaluating and understanding the molecular basis for gene mutation that underlies a disease

state.

In addition, we will see later that a knowledge of the characteristics of a gene, including

those sequences that define open reading frames, splice site signals that define exon/intron

junctions, and the sequences that constitute transcription regulatory signals, is critical in the

search for an unknown gene

Isolation and Study of a Eukaryotic Gene:

How does one go about studying the structure and characteristics of a particular gene.

Since there are approximately 3 x 109 base pairs in the human genome, and any given gene may be

no more than 104 base pairs, analysis of the total population of human DNA is impossible.

Clearly, a gene must be isolated apart from the total DNA and amplified to allow a detailed study.

Early studies of gene structure and function in eukaryotic cells made use of animal viruses,

particularly the so-called DNA tumor viruses including adenovirus and polyomaviruses, as a

mechanism to isolate and study individual genes. Viruses simply represent a relatively small set

of genes packaged in a protein coat. The ability to isolate and purify viruses thus provided a

mechanism to isolate pure populations of specific genes. Since these viruses make use of

cellular activities for the expression of the viral genes, the basic aspects of viral gene structure

and function generally reflect that found for cellular genes. Thus, the initial studies of these

viruses laid much of the groundwork for subsequent analysis of cellular genes.

With the advent of molecular cloning through recombinant DNA procedures, it then

became feasible to isolate individual eukaryotic genes. Cloning allows both the purification

(isolation) of the gene, away from all others, as well as the amplification of the gene to provide

sufficient material to carry out biochemical analyses.

We have already discussed the procedures involved in the generation of a library of

clones and the methods for detection of a particular clone by hybridization with a radioactive

DNA probe of complementary sequence. But, where does the probe come from? Let's take the

globin gene as an example. Standard biochemical procedures for protein purification can yield

sufficient amounts of pure preparations of the globin protein to carry out analysis. Protein

chemistry can then generate limited amino acid sequence from the purified protein. The amino

acid sequence can then be used to predict the nucleotide sequence of the gene based on a

knowledge of the genetic code. There will be some ambiguities since the code is degenerate (an

amino acid can be encoded by multiple codons) but a mixture of DNA probes could be

104

synthesized, one of which would correspond to the actual globin gene sequence. This mixture of

probes could then be employed to screen the library of clones to identify the one that carries the

globin gene.

An alternative approach would be to generate an antibody that is specific to the globin

protein by injection of the purified protein into rabbits or mice. Once an antibody has been

generated, it can then be used to screen an expression library for a clone that encoded a portion

of the globin protein. In this case, the expression library, likely in the form of a bacteriophage

lambda vector, would generate fusion proteins within the phage-infected cells in which a portion

of the ß-galactosidase protein was fused to random sequences that were cloned. If one of these

carried the globin sequence, and if this was a portion of the globin protein that was recognized by

the antibody, then by screening plaques with the antibody followed by a secondary procedure that

would detect antibody bound to the filters, essentially the same procedure used for Western blot

analysis, one could identify those clones in the library that carried DNA sequences encoding

globin protein.

Discontinuous Nature of a Eukaryotic Gene

105

The ability to isolate a gene such as that encoding the ß-globin protein finally allowed

detailed molecular analyses to be undertaken of the structure of a eukaryotic gene. This led to

the startling discovery that most eukaryotic genes are discontinuous. That is, sequences that are

found in the messenger RNA are not contiguous in the DNA. This fact was initially discovered

with the DNA virus adenovirus by hybridization of a viral messenger RNA to the viral DNA

genome and finding that segments of the DNA were looped out due to the fact that the sequences

in the RNA were not contiguous with the DNA. Subsequent studies with a variety of cellular

genes revealed that this event was a common property of most genes. Moreover, the analysis of

cellular genes revealed that actual coding sequences were interrupted by intervening sequences.

Shown above is an analysis of the structure of the ovalbumin gene, relative to the mRNA product,

as observed by forming a hybrid between the mRNA and the genomic DNA and then examining

the resulting structure in an electron microscope (top figure). The DNA has been denatured and

the single stranded portion can be seen extending at the ends. The schematic shown in the middle

106

represents a tracing of the electron micrograph to indicate the structures. The double stranded

regions hybridized to the

mRNA (designated 1 - 7) represent genomic sequences that are preserved in the mRNA whereas

the looped out regions (A through G) define genomic sequence not present in the mRNA

sequence. The deduced structure of the gene, showing exons and introns, is depicted at the

bottom.

This analysis, and a previous one using adenovirus, defined the exon/intron structure of

eukaryotic genes and the fact that the mRNA was processed from an initial precursor.

Structure of a Typical Eukaryotic Gene

As a result of the analysis of a large number of eukaryotic genes, essential features of

gene organization have become evident. Most striking, when compared to the organization and

structure of prokaryotic genes, is the extreme complexity of the genes in eukaryotic cells,

particularly higher eukaryotes. This often involves a large array of discontinuous segments that

must be assembled into the final mRNA product as well as a complex array of regulatory

sequences that govern the transcription of the gene.

Exons - sequences in the gene that are found in the functional mRNA. Includes coding sequence

but may also include non-coding sequence. The beginning of the first exon defines the site of

initiation of transcription since there is no processing of the 5' end of the primary transcript. The

end of the final exon defines the site of cleavage of the primary transcript at what is known as the

107

polyadenylation site, creating the mature mRNA 3' terminus. Transcription does not terminate at

this position but continues some distance downstream.

Introns - intervening sequences in the gene that are removed in the formation of the functional

mRNA. Usually includes non-coding sequence but there are instances of alternative processing

where sequences can be both introns and exons.

This arrangement can vary from relative simple (two exons separated by one intervening

intron sequence) to extremely complex whereby a very large number of exons form the final

mRNA. For instance, the dystrophin gene, that which is mutated in Duschenne muscular

dystrophy, comprises at least 70 exons and more than one million base pairs of DNA.

With respect to considerations of mutation frequency, one might suspect that the very

large domains of certain genes contributes to an increased frequency of mutation, by creating a

larger target size for mutation. Although much of the size is due to intervening sequence, the

mutation of which would have no consequence, it is also possible that an actively transcribed

domain would be more susceptible to mutagenic events.

Complexity of Gene Organization in Metazoan Organisms

As a result of the ability to analyze and study gene structure and organization in higher eukaryotic

systems, including humans, it is now apparent that a much increased complexity of gene

organization as well as gene structure can be seen when compared to the lower eukaryotic

organisms such as yeast. One example is seen in the generation of large, multi-gene families

such as the globin gene cluster shown below.

In addition, it is also evident that the exon-intron structure of various genes is highly

complex. Certainly, in some cases it provides flexibility in gene expression. Alternative splicing

can create distinct gene products from one loci. It is also possible that this is a mechanism for

108

evolution of function. In many cases, exons appear to encode protein "domains" - independently

functioning units of a protein. Thus, through a process of exon shuffling, new proteins could be

created from parts of others.

Unequal crossing over as a mechanism for gene duplication as well as exon

shuffling

The analysis of eukaryotic gene structure has revealed the evolution of gene families

resulting from gene duplication and subsequent evolution of sequence. It is likely that unequal

crossing over during meiosis, involving inappropriate pairing via repetitive DNA sequences, is

responsible for generating this complexity. As an example, consider the globin locus as shown in

the figure below. In this case, the two chromosome homologues mis-align as a result of pairing

of a repeated sequence known as Alu that is found frequently along each chromosome. A

recombination event then leads to the formation of a chromosome with a duplication of the

globin gene and a second chromosome in which the globin gene has been deleted. The gamete

receiving the chromosome with the deleted region will not give rise to a viable progeny but the

other chromosome will now carry two copies of the globin gene.

The same event can give rise to changes in gene structure. For instance, consider a region

of chromosomal DNA that contains five functional sequence blocks (genes or exons) as shown

109

above. Normally, precise alignment of the two chromosomal pairs would insure that crossing

over events would not change the overall organization. If, however, the two chromosomes were

to mis-align as a result of pairing between repeated sequences that might be found interspersed in

the chromosome (solid circles), then unequal crossing over would leave one product with a

duplication of a gene (or exon) and the other with a deletion.

The fact that intron sequences, as well as intergenic sequences, are essentially non-functional

means that there is great flexibility in the capacity to re-arrange genes and exons due to

recombination. That is, there is no requirement for a precise break and rejoining event if the

basic functional unit (entire gene or exon with necessary splice signals) is maintained.

110

Comparison of the sequences/structure of the albumin and alpha-fetoprotein genes

illustrates the process of gene evolution

Cloning of the albumin and alpha-fetoprotein genes, and subsequent DNA sequence

analysis, has revealed considerable sequence similarities between the two genes and also within

each gene. In particular, the sequence and arrangement of exons 3-6 is similar to that of exons 7-

10 and exons 11-14 for both genes. As such, it has been proposed that a primordial gene may

have given rise to the present day albumin and alpha-fetoprotein genes as the result of an initial

triplication of this group of four exons followed then by a duplication to create the related genes.

111

Functional Immunoglobulin Genes and Related T Cell Receptor Genes Are

Created in Somatic Cells by Genomic Rearrangements

Virtually every cell in the human body or any organism possesses the exact same

complement of genes. Thus, even though the gene encoding albumin is only expressed in the

liver, the exact same gene is also found in the brain. Clearly, the control of gene expression is a

critical event in determining cell phenotype and we will return to the basis for this regulation

later. There are, nevertheless, at least three exceptions to the rule that every cell contains the

same set of genes. First, the random events of somatic mutation generate changes in genes that

will be unique to the cell in which the mutation occurred. This is an ongoing process that for the

most part is of no consequence. In certain instances, however, it does alter the function of a gene

- perhaps the best example is the somatic mutation within the variable region of the

immunoglobulin genes that generates a large part of the diversity of the antibody repertoire.

Second, mutations, deletions, or chromosome alterations that affect key cellular

regulatory genes can initiate and contribute to the process of oncogenesis. Mutations can

activate oncogenes to a constitutively active, unregulated state. Deletions can eliminate genes

that normally function to negatively control cellular proliferation. Chromosome rearrangements

can generate novel, chimeric genes that possess unique properities.

Third, the genes encoding the immunoglobulin molecules as well as the T cell receptors

must undergo somatic recombination events in order to create a functional gene. Indeed, these

recombination events, together with enhanced somatic mutation of the genes, are responsible for

creating the tremendous diversity of the immunological response. Moreover, as indicated above,

112

somatic mutations within the V and J segments of the immunoglobulin genes, as well as errors in

the recombination events, also contribute to the generation of antibody diversity.

Experimental evidence for immunoglobulin gene rearrangement:

A Southern blot assay of germ cell DNA or embryo DNA versus DNA from an antibody-

producing B cell, in this case a B cell tumor (myeloma), reveals a distinct difference in the

organization of the immunoglobulin heavy chain gene as reflected by differences in the Southern

blot analyses as shown in the figure below.

For instance, if one was analyzing the structure of the gene in the region of the J segments, where

recombination was occurring, a Southern could detect a change in the genomic organization as

indicated by a new band. If this same assay was carried out using any other tissue source other

than from a B cell, the same pattern of immunoglobulin gene arrangement as found in the germ

line would be obtained. Moreover, if one assayed for virtually any other gene sequence one

would find the same pattern in the germ line as found in a somatic cell. Thus, there is a B cell

specific alteration in the structure of the immunoglobulin DNA.