
Chapter 2

Protein Sorting and Transport

We are our proteins.

Russell F. Doolittle

The cell is the basic unit of life, and proteins are the biological workhorses of the cell. For a protein to perform its function correctly, it must be located in its intended organelle. This chapter briefly discusses the biology of protein localization, i.e. protein sorting and translocation. Background knowledge of the cell, its organelles and amino acids is essential for understanding protein sorting, so the chapter begins with this background information. It then gives a general discussion of proteins and how they are synthesized in the cell, followed by a detailed description of protein sorting and translocation, which covers the major organelles of protein localization and how the organelles recognize their native proteins. The chapter also mentions the wet-lab techniques used to identify protein localization, and ends by highlighting the need for computational prediction of protein subcellular localization.

2.1 Background Biology

Knowledge of the cell, the different types of cells and amino acids is indispensable for understanding protein sorting and translocation. Proteins localize to various organelles or locations in the cell, or move out of the cell as secretory proteins. The organelles include the nucleus, chloroplast, mitochondrion, endoplasmic reticulum (ER), Golgi apparatus and peroxisome, among others. The next subsection describes the cell, the different types of cells and the organelles in the cell.

2.1.1 Cell and its Organelles

According to cell theory, one of the basic principles of biology, the cell is the fundamental unit of structure, function and organization in living organisms. The hereditary information is contained within the cell in the form of deoxyribonucleic acid (DNA), and this information is passed from cell to cell during cell division. A typical human cell is about 10 µm in size, and the human body contains on the order of tens of trillions of cells (a figure of around 100 trillion is often quoted).

Different Types of Cell

Life on earth can be classified into prokaryotes and eukaryotes according to differences in cell structure. Prokaryotes are unicellular organisms such as bacteria, whereas eukaryotes are often multicellular organisms such as plants and animals. A prokaryotic cell is simpler than a eukaryotic cell, and the main difference is the lack of a well-defined nucleus in prokaryotes. Eukaryotic cells are so called because they have a true nucleus, whose boundary is defined by the nuclear membrane. In prokaryotes, the genetic material (DNA) is concentrated in a region called the nucleoid, which is not enclosed by a membrane. Eukaryotic DNA is linear and is complexed with proteins called histones, whereas prokaryotic DNA is usually circular. The DNA content of prokaryotes is only around 1 × 10^6 to 5 × 10^6 base pairs. Eukaryotes have much more DNA, ranging from 1.5 × 10^7 to 5 × 10^9 base pairs.

The cytoplasm of eukaryotic cells contains many large and complex organelles. An organelle has its own boundary of lipid membrane, which separates it from the rest of the cell and thereby allows it to perform a specialized function. Prokaryotes lack membrane-bound organelles such as the Golgi apparatus, lysosomes, peroxisomes, mitochondria and chloroplasts. The presence of membrane-bound organelles makes the eukaryotic cell more complex. The membrane-bound structure of the organelles enhances the efficiency of cellular functions by restricting them to well-defined compartments, thus limiting the distance over which molecules must communicate and move. The eukaryotic cell is much bigger, typically 10-100 micrometers in diameter, compared with the prokaryotic cell, which is typically about 1 micrometer in diameter. Prokaryotic ribosomes are smaller than eukaryotic ribosomes. The elaborate cytoskeleton that gives structure to the eukaryotic cell is not found in prokaryotes. In prokaryotes, cell division happens in simple steps by binary fission. In eukaryotes, cell division is of two types, mitosis and meiosis, which are complex multi-stage processes [23].

Within the eukaryotes, cell structure differs between plant and animal cells. A plant cell has a cell wall made of cellulose, which is intricately cross-linked with fibers of other carbohydrate molecules. This structure allows each cell to withstand the increased internal pressure from osmosis when the plant absorbs water. Animal cells do not have rigid cell walls, which allows them to take up a variety of shapes. The chloroplasts in the plant cell are the site of photosynthesis; they are absent from animal cells. In the chloroplast, carbon dioxide is converted into sugar as part of photosynthesis. This is the reverse of energy production in mitochondria, where sugar is broken down to carbon dioxide to release energy. The vacuoles present in plant cells are large compared with those in animal cells. Plant cells communicate through pores in their cell walls that link neighbouring cells and pass information between them. Communication between animal cells takes place through an analogous system of gap junctions [6].

Organelles in the Cell

Organelles are membrane-bound subunits that perform specific functions. A typical animal cell is depicted in Figure 2.1. The most important organelle in a eukaryotic cell is the nucleus. It is the storehouse of the hereditary information, the DNA. The nucleus is surrounded by a double membrane, and communication with the cytosol happens through nuclear pores in this membrane. The DNA is the same in all cells of the body of an organism, but the genes in the DNA of each cell are expressed only according to the requirements of that cell: depending on the cell type, some genes are turned on and others off. At the time of cell division, the DNA condenses into chromosomes. The nucleolus, inside the nucleus, builds ribosomal subunits, which move out of the nucleus into the cytoplasm.

Organelles: 1. Nucleus 2. Nucleolus 3. Ribosome 4. Vesicle 5. Rough endoplasmic reticulum 6. Golgi apparatus 7. Cytoskeleton 8. Smooth endoplasmic reticulum 9. Mitochondria 10. Vacuole 11. Cytoplasm 12. Lysosome 13. Centrioles within centrosome. Source: Wikipedia

Figure 2.1: Typical animal (eukaryotic) cell, showing subcellular components.

Ribosomes are the site of protein synthesis. The mRNA copied from the DNA sequence of a gene leaves the nucleus through the nuclear pores and binds to a ribosome. At the ribosome, the mRNA is translated codon by codon with the help of tRNA, which brings in the amino acid corresponding to each codon. Peptide bonds are formed between these amino acids, building up the protein as a linear chain. Depending on the translocation pathway of the protein being synthesized, ribosomes may attach themselves to the ER.

The ER is a network of membranes extending throughout the cytoplasm of the eukaryotic cell. It consists of tubular membranes and flattened sacs, or cisternae, which appear to be interconnected. The internal space enclosed by the ER is called the lumen. The ER is continuous with the outer membrane of the nuclear envelope. The ER is of two types, rough and smooth. Ribosomes attach to the rough ER, which is involved in protein translocation and sorting. The smooth ER is involved in the synthesis of lipids and steroids. The Golgi apparatus is a stack of flattened membrane sacs and works closely with the ER in protein sorting. Vesicles that bud off the ER are received by the Golgi complex. Their contents are further processed in the Golgi and packaged for onward transport in vesicles that bud off the Golgi complex.

Lysosomes store hydrolases, enzymes capable of digesting molecules such as proteins, carbohydrates and fats. Lysosomes are common in animal cells but rare in plant cells. Peroxisomes, which are present in both plant and animal cells, resemble lysosomes in size but differ in internal structure. Peroxisomes protect the cell from the toxic hydrogen peroxide that it produces. Vacuoles are membrane-bound organelles used for temporary storage and transport of molecules. In plant cells, the central vacuole maintains turgor pressure. Mitochondria and chloroplasts have a double-membrane boundary and their own DNA. The mitochondrion is the powerhouse of the cell, generating ATP molecules. Muscle cells contain a large number of mitochondria because of their high demand for energy. Chloroplasts are found in plant cells but not in animal cells and are the site of photosynthesis. The cytoskeleton is the cellular skeleton that provides a dynamic structure to the cell. It plays an important role in maintaining cell shape, enabling cellular motion and supporting intracellular transport.

In all the organelles discussed above, the biological functions are performed by proteins. The next section describes the amino acids, which are the building blocks of proteins.

2.1.2 Amino Acids

All proteins are polymers of alpha-amino acids. Alpha-amino acids have the general formula H2NCHRCOOH, where R is an organic substituent. The carbon atom next to the carboxyl group is called the alpha carbon. In alpha-amino acids, both the amino and carboxyl groups are attached to the alpha carbon. The various alpha-amino acids differ in the side chain (R group) attached to their alpha carbon.

The physicochemical properties of an amino acid are defined by its side chain. These properties influence its interactions with other amino acids, both within a single protein and between proteins, which in turn determine the biological activity of the protein. One example of a physicochemical property is hydrophobicity, the tendency of a molecule to avoid water. The hydrophobicity of an amino acid is determined by the polarity of its side chain. Hydrophobic side chains cannot form hydrogen bonds with water and tend to be buried within the hydrophobic core of the protein or within the lipid portion of a membrane. The distribution of hydrophilic and hydrophobic amino acids plays an important role in determining the tertiary and quaternary structure of the protein.

The amino acids that are encoded by the standard genetic code and used for protein synthesis are called proteinogenic or standard amino acids. They are alanine, cysteine, aspartic acid, glutamic acid, phenylalanine, glycine, histidine, isoleucine, lysine, leucine, methionine, asparagine, proline, glutamine, arginine, serine, threonine, valine, tryptophan and tyrosine.

Alanine is very abundant and versatile. It is non-polar but not particularly hydrophobic. Since it is neutral, it can be located in both hydrophilic and hydrophobic regions of the protein. The alanine side chain is inert and is thus rarely directly involved in protein function. Cysteine is usually classified as a hydrophobic amino acid. Within extracellular proteins, it is frequently involved in disulphide bonds. Aspartic acid and glutamic acid are negatively charged, polar amino acids. Being charged and polar, they prefer to be on the surface of proteins, exposed to an aqueous environment. When buried within the protein they are frequently involved in salt-bridges, where they pair with a positively charged amino acid to create stabilising hydrogen bonds that can be important for protein stability. Phenylalanine is an aromatic, hydrophobic amino acid and prefers to be buried in protein hydrophobic cores. Its aromatic side chain allows phenylalanine to take part in stacking interactions with other aromatic side chains. The phenylalanine side chain is fairly non-reactive and is thus rarely directly involved in protein function. Glycine has a single hydrogen atom as its side chain and therefore has great conformational flexibility. It can reside in parts of protein structures, such as tight turns, that are inaccessible to other amino acids. Histidine is a polar amino acid and is one of the most common amino acids in protein active sites.

Isoleucine is an aliphatic, hydrophobic amino acid. The isoleucine side chain is very non-reactive and is thus rarely directly involved in protein function. Lysine is a positively charged, polar amino acid and is involved in salt-bridges. Leucine is an aliphatic, hydrophobic amino acid and prefers to be buried in protein hydrophobic cores. It is more common in alpha helices than in beta strands of protein secondary structure. Methionine is a hydrophobic amino acid and also prefers to be buried in protein hydrophobic cores. Asparagine is a polar amino acid and generally prefers to be on the surface of proteins. It is frequently present in protein active or binding sites. Proline plays important roles in molecular recognition, particularly in intracellular signaling. Glutamine is a polar amino acid and generally prefers to be on the surface of proteins, exposed to an aqueous environment. Glutamines are frequently found in protein active or binding sites, since the polar side chain is good for interactions with other polar or charged atoms.

Arginine is a positively charged, polar amino acid and is involved in salt-bridges. Serine is a polar amino acid. It can reside either within the interior of a protein or on the protein surface. Its small size makes it a good candidate for turns on the protein surface, where it is possible for the serine side-chain hydroxyl oxygen to form a hydrogen bond with the protein backbone. Serines are quite common in protein functional centres. The hydroxyl group is fairly reactive, being able to form hydrogen bonds with a variety of polar substrates. Threonine is a slightly polar amino acid. Threonines can reside either within the interior of a protein or on the protein surface, and are frequently found in protein functional centres. As in serine, the hydroxyl group is fairly reactive and can form hydrogen bonds with a variety of polar substrates. Valine is an aliphatic, hydrophobic amino acid and is often buried in protein hydrophobic cores. Tryptophan is an aromatic, hydrophobic amino acid that prefers to be buried in protein hydrophobic cores. Being aromatic, it is involved in stacking interactions with other aromatic side chains. The proteinogenic amino acids, with their one-letter and three-letter codes, are listed in the appendix.
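As a rough illustration of how side-chain hydrophobicity can be turned into a number, the sketch below averages a per-residue hydropathy value over a sequence. It assumes the Kyte-Doolittle hydropathy scale, which is one common choice and is not a scale prescribed by this chapter; the function name `mean_hydropathy` and the example peptides are hypothetical.

```python
# Kyte-Doolittle hydropathy values, keyed by one-letter amino acid code.
# Assumption: this particular scale is used only for illustration;
# positive values indicate hydrophobic side chains.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(sequence: str) -> float:
    """Return the average hydropathy of a one-letter amino acid sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[residue] for residue in seq) / len(seq)


if __name__ == "__main__":
    # Hypothetical peptides: one aliphatic/aromatic, one fully charged.
    for peptide in ("LIVFLIVF", "DEKRDEKR"):
        print(peptide, round(mean_hydropathy(peptide), 2))
```

Different scales rank some residues differently (tryptophan, for example, scores slightly negative here although it is described above as hydrophobic), so the numbers should be read as relative indications only.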

Among these twenty amino acids, a subset are called essential amino acids because they cannot be synthesized by the human body and must be obtained from food. In humans, the essential amino acids are lysine, isoleucine, phenylalanine, leucine, methionine, tryptophan, threonine, valine and histidine; arginine is usually regarded as conditionally essential, being required in the diet mainly during growth. The remaining standard amino acids are nonessential in the sense that the body can synthesize them as needed. There are also a large number of non-standard or non-proteinogenic amino acids, which are either not found in proteins or not coded for by the standard genetic code.

The next section is a general description of proteins which are the vital

molecules in the cell.

2.2 Proteins

Proteins are the most abundant macromolecules in the cell. They are the workhorses that carry out vital biological functions, playing critical roles in growth, in giving structure to the cell and in the maintenance of tissues. Proteins serve as enzymes, hormones, antibodies, structural proteins, storage proteins and transport proteins, to name a few. Enzymes catalyse biochemical reactions and are vital to metabolism. Hormones such as insulin, oxytocin and somatotropin are messenger proteins that carry signals to coordinate various activities. Antibodies are proteins that defend the body against antigens. Structural proteins such as keratin, collagen, actin and elastin are fibrous and stringy, and provide structure, stiffness and rigidity to otherwise fluid biological components. Storage proteins such as ovalbumin and casein store amino acids. Transport proteins such as hemoglobin and cytochromes are carrier proteins that move molecules from one place to another around the body.

Proteins are made of amino acids connected by peptide bonds between the carboxyl and amino groups of adjacent amino acid residues. One end of this amino acid chain has a free amino group and is called the amino terminus or N-terminus. The other end, with a free carboxyl group, is called the carboxyl terminus or C-terminus. The amino acid sequence is the order in which amino acid residues appear in the protein, and it is written starting from the N-terminus and ending at the C-terminus. The linear order of the amino acids in a protein or peptide constitutes the primary structure of the protein. Proteins cannot perform their intended function at the level of primary structure alone. They fold to form secondary, tertiary and quaternary structure. Secondary structure is formed by hydrogen bonds between backbone groups of the amino acids in the polypeptide. Secondary structures are regularly repeating local structures, and multiple secondary structures can be present in a single protein. Alpha helices and beta sheets are examples of secondary structure. Tertiary structure is the three-dimensional structure into which the polypeptide chain folds, either spontaneously or with the assistance of chaperones. The function of a protein depends on its tertiary structure. When a protein is denatured, its tertiary structure is disrupted and the protein loses its activity. The tertiary structure is the spatial arrangement of secondary structures, stabilized by hydrophobic interactions, salt bridges, hydrogen bonds and disulfide bonds, and influenced by post-translational modifications. In quaternary structure, separate polypeptide chains, known as subunits, join together to form a complex.

The sequence of amino acids in a protein is determined by the three-letter codons in the messenger RNA (mRNA) from which the protein was translated. The sequence of codons in the mRNA is, in turn, determined by the sequence of the DNA from which the mRNA was transcribed. The protein-coding portions of DNA are known as genes. Thus, the instructions that define a protein are written in the genes, which reside in the nucleus.

The next section discusses how proteins are synthesized in the cell.

2.3 Protein Biosynthesis

Protein synthesis is the process by which cells build proteins. It is a multi-step process: the genetic information in a gene is transcribed into mRNA, and this information is then translated, with the help of tRNA, into a protein at the ribosome. The details of protein biosynthesis differ between prokaryotes and eukaryotes.

The nucleus stores the genetic information, which is the set of instructions for generating proteins. The genetic information is written in deoxyribonucleic acid (DNA), a long molecule made up of nucleotides. There are four kinds of nucleotide, distinguished by their bases: adenine, thymine, cytosine and guanine. The long DNA molecule is packed into chromosomes, which are the carriers of the hereditary information. Certain parts of the DNA contain biologically meaningful instructions for forming biomolecules. These parts are known as genes. When a protein is needed in the cell, the gene that encodes that particular protein has to be expressed. In gene expression, the double helix of the DNA opens up and the gene is copied into mRNA. Copying is possible because of complementary base pairing between nucleotides. The mRNA undergoes several processing steps, such as the splicing out of introns, the intervening non-coding regions of the gene. The mRNA then travels out of the nucleus to the ribosomes in the cytoplasm. Proteins are made at the ribosome with the help of tRNAs. A ribosome is made of a small and a large subunit, which assemble around the mRNA. The first step in translation is initiation, the binding of the ribosome to the mRNA. The information in the mRNA is decoded into amino acids according to the rules of the trinucleotide genetic code. A table of the trinucleotide codons and their corresponding amino acids is given in the appendix. In the next step, called elongation, each triplet codon is recognized by a tRNA with a matching anticodon, and the corresponding amino acid is added to the growing polypeptide chain. When a stop codon is reached, translation terminates and the completed polypeptide chain is released.
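The decoding steps described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, assuming the standard genetic code, a coding-strand DNA input and translation that begins at the first AUG; the function names `transcribe` and `translate` and the example sequence are hypothetical, and splicing and other processing steps are ignored.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position.
# "*" marks the three stop codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Copy a DNA coding strand into mRNA by replacing T with U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the mRNA."""
    mrna = mrna.upper()
    start = mrna.find("AUG")  # simplification: initiation at the first AUG
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: termination
            break
        protein.append(amino_acid)  # elongation: add one residue
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCAAGTAA")  # hypothetical coding sequence
    print(mrna, translate(mrna))
```

The protein is built from the N-terminus to the C-terminus, matching the order in which amino acid sequences are written.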

The next section gives a detailed description of how a protein synthesized on the ribosome reaches its target location.

2.4 Protein Sorting and Transport

For a cell to function properly, each of its numerous proteins must be localized to the correct organelle, such as the chloroplast, mitochondrion or lysosome. Hormone receptor proteins must be delivered to the plasma membrane for the cell to recognize hormones, and specific ion-channel and transporter proteins are needed in the membrane for the cell to import or export the corresponding ions and small molecules. Enzymes such as RNA and DNA polymerases must be targeted to the nucleus for gene expression and DNA replication. Proteolytic enzymes and catalase must go to lysosomes and peroxisomes, respectively, to function properly. Protein hormones must be directed to the cell surface and secreted. The process of directing each newly made protein to its particular destination is critical to the organization and functioning of eukaryotic cells, and is referred to as protein targeting or protein sorting [24].

2.4.1 Protein Sorting

Except for a small number of proteins coded in the genomes of mitochondria and chloroplasts, most of the proteins in a cell are encoded by nuclear DNA and are synthesized on ribosomes in the cytosol. For proper functioning, these proteins have to be distributed to their correct destinations in the cell. In 1999, Günter Blobel was awarded the Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell." The sorting signals are present in the primary amino acid sequence, mostly at the N-terminus. For further sorting within the organelle, additional targeting information may be located in a secondary targeting sequence, placed either adjacent to the original targeting sequence or in other regions of the protein.

Proteins are translocated to their target location either cotranslationally or posttranslationally. In cotranslational translocation, translocation starts while the protein is still being synthesized on the ribosome. Proteins targeted to the ER, Golgi apparatus, plasma membrane, lysosome, vacuole and extracellular space use the SRP-dependent pathway and are translocated cotranslationally. The N-terminal signal sequence of these proteins is recognized by a signal recognition particle (SRP) while the protein is being translated on a free ribosome, and synthesis pauses. The complex of ribosome and nascent protein is then transferred to an SRP receptor on the ER. There, the nascent protein is inserted into the translocon, a channel that passes through the ER membrane. Transfer of the ribosome-mRNA complex from the SRP to the translocon opens the gate of the translocon and allows translation to resume. In secretory proteins, the signal sequence is cleaved from the polypeptide by signal peptidase once it has been translocated into the ER. Within the ER, chaperones help the protein to fold correctly. From the ER, proteins are transported in vesicles to the Golgi apparatus, where they are further processed and sorted for transport to endosomes, lysosomes or the plasma membrane, or for secretion from the cell. Proteins resident in the ER carry ER retention signals that keep them in the ER.

Most of the proteins targeted to mitochondria, chloroplasts, the nucleus and peroxisomes are translocated posttranslationally. In contrast to the cotranslationally translocated proteins, these proteins are translated entirely on free ribosomes in the cytosol. Once translation is complete, they are released into the cytosol. These proteins, which enter the non-secretory pathway, are sorted to their destination on the basis of their targeting signal [25]. For mitochondrial and chloroplast proteins, the targeting signal is cleaved off once the protein has reached its destination. The targeting sequence for mitochondrial proteins, the mitochondrial targeting peptide (mTP), has 3-5 nonconsecutive Arg or Lys residues, often with Ser and Thr, at the N-terminus of the polypeptide chain; Glu and Asp residues are generally absent. For the chloroplast transit peptide (cTP), no common sequence motif has been found, but the N-terminal region is generally rich in Ser, Thr and small hydrophobic amino acid residues and poor in Glu and Asp residues. For peroxisomal proteins, the sorting signal is generally found at the extreme C-terminus, usually as Ser-Lys-Leu, and it is not cleaved off after the protein reaches its destination. Proteins destined for the nucleus have a distributed sorting signal that is likewise not cleaved off after sorting. The nuclear localization signal usually consists of one cluster of about 5 basic amino acids, or two smaller clusters of basic residues separated by around 10 amino acids.
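The sequence features listed above can be expressed as simple pattern checks. The sketch below is only a heuristic illustration of those descriptions, not a localization predictor: the function names, the 25-residue N-terminal window, the 8-12 residue spacer and the cluster size of two basic residues in the bipartite pattern are all assumptions chosen for the example, and the requirement that the basic residues be nonconsecutive is not checked.

```python
import re

# One cluster of 5 basic residues (Lys or Arg).
MONOPARTITE_NLS = re.compile(r"[KR]{5}")
# Two smaller basic clusters separated by "around 10" residues.
# Assumption: cluster size 2 and a spacer of 8-12 residues.
BIPARTITE_NLS = re.compile(r"[KR]{2}.{8,12}[KR]{2}")


def has_pts1(sequence: str) -> bool:
    """Peroxisomal signal: Ser-Lys-Leu at the extreme C-terminus."""
    return sequence.upper().endswith("SKL")


def has_nls(sequence: str) -> bool:
    """Nuclear localization signal: basic cluster(s) anywhere in the chain."""
    seq = sequence.upper()
    return bool(MONOPARTITE_NLS.search(seq) or BIPARTITE_NLS.search(seq))


def looks_like_mtp(sequence: str, window: int = 25) -> bool:
    """Mitochondrial-like N-terminus: 3-5 Arg/Lys and no Glu/Asp.

    Assumption: the N-terminal region is taken as the first `window` residues.
    """
    n_term = sequence.upper()[:window]
    basic = sum(n_term.count(residue) for residue in "KR")
    acidic = sum(n_term.count(residue) for residue in "DE")
    return 3 <= basic <= 5 and acidic == 0


if __name__ == "__main__":
    # Hypothetical sequences, constructed only to exercise each check.
    print(has_pts1("MAAAASKL"))
    print(has_nls("MAAAAPKKKRKVAAAA"))
    print(looks_like_mtp("MLSRAAAKAASR" + "A" * 20 + "EEDD"))
```

Checks like these produce many false positives and negatives on real proteins, which is one reason dedicated computational prediction methods are needed.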

In the next section the major protein localization sites are discussed.

2.4.2 Major Locations

Proteins are sorted to their locations with the help of an address signal present in their primary structure. Each organelle has a mechanism to identify its own proteins. This section describes the important protein localization sites, such as the nucleus, mitochondrion, chloroplast and peroxisome, as well as the secretory pathway.

Endoplasmic Reticulum

The endoplasmic reticulum is the first branching point in protein sorting. Figure 2.2 shows the nucleus, ER and Golgi apparatus in a eukaryotic cell. Most of the proteins targeted for secretion or to the Golgi apparatus, plasma membrane, vacuole or lysosome are translated on ribosomes bound to the endoplasmic reticulum and enter the ER cotranslationally. Only a few proteins enter the ER posttranslationally. Translation starts on free ribosomes in the cytosol. Synthesis continues until the sorting signal, which is present at the N-terminus, emerges from the ribosome. This sorting signal is recognized by the signal recognition particle (SRP). The SRP binds to the sorting signal and translation pauses. The complex of SRP, ribosome, polypeptide chain and mRNA moves to the ER, and the polypeptide chain enters the ER through the translocon. The translocon is a protein complex with several components used in protein translocation: the SRP receptor binds the SRP; the ribosome receptor binds the ribosome and holds it in the correct position; the pore protein forms the channel through which the growing polypeptide enters the ER lumen; and the signal peptidase cuts off the signal once it has entered the ER. After the SRP and the ribosome are bound by the SRP receptor and the ribosome receptor respectively, GTP binds to the complex of SRP and SRP receptor and translation resumes. This causes the transfer of the signal sequence into the channel of the pore protein. The GTP is then hydrolysed and the SRP is released. While the sorting signal remains bound at the pore protein, the polypeptide grows into a loop and translocates into the ER lumen. When polypeptide synthesis is finished, the signal peptidase cleaves off the sorting signal, releasing the polypeptide into the ER lumen. After this, the ribosome detaches from the ER and dissociates into its subunits, and the mRNA is released. Inside the ER, the polypeptide chains are folded into their native forms, usually with the help of molecular chaperones, which control the quality of protein folding [23].

Integral membrane proteins of the plasma membrane or of the membranes of the ER, Golgi apparatus and lysosome are first inserted into the membrane of the ER. These proteins do not enter the lumen cotranslationally but are anchored in the ER membrane by membrane-spanning α-helices that stop the transfer of the growing polypeptide chain across the membrane.

1. Nucleus 2. Nuclear pore 3. Rough endoplasmic reticulum (RER) 4. Smooth endoplasmic reticulum 5. Ribosome on the rough ER 6. Proteins that are transported 7. Transport vesicle 8. Golgi apparatus 9. Cis face of the Golgi apparatus 10. Trans face of the Golgi apparatus 11. Cisternae of the Golgi apparatus. Source: Wikipedia

Figure 2.2: Nucleus, ER and Golgi apparatus in a eukaryotic cell

Proteins travel along the secretory pathway in transport vesicles, which bud from the membrane of one organelle and then fuse with the membrane of another. Proteins are exported from the ER in vesicles that bud from the transitional ER and carry their cargo through the ER-Golgi intermediate compartment and then to the Golgi apparatus. Proteins resident in the ER have a retention signal at their C-terminus that brings them back to the ER even if they are exported from it. Two such retention signals are KDEL (Lys-Asp-Glu-Leu) and KKXX (two lysine residues followed by any two amino acids), both located at the C-terminus of the sequence. If the signal is removed from an ER protein, the protein is transported to the Golgi and then moves out of the cell. The ER retention signals do not prevent ER proteins from being packaged and exported from the ER. Instead, these signals retrieve the ER proteins from the Golgi apparatus or the ER-Golgi intermediate compartment and return them to the ER by a recycling pathway: specific recycling receptors bind to the retention signals and bring the proteins back to the ER. There are many retention signals other than KDEL and KKXX, but they are not well characterized.
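Because both retention signals named above sit at the very end of the chain, they can be checked by looking at the last four residues. The following is a minimal sketch under that reading; the function name `er_retention_signal` and the example sequences are hypothetical, and other, less well characterized signals are not covered.

```python
from typing import Optional


def er_retention_signal(sequence: str) -> Optional[str]:
    """Return "KDEL" or "KKXX" if the C-terminus carries that motif, else None."""
    seq = sequence.upper()
    if seq.endswith("KDEL"):  # Lys-Asp-Glu-Leu at the C-terminus
        return "KDEL"
    # KKXX: two lysines followed by any two residues at the C-terminus.
    if len(seq) >= 4 and seq[-4:-2] == "KK":
        return "KKXX"
    return None


if __name__ == "__main__":
    # Hypothetical sequences, constructed only to exercise each branch.
    for protein in ("MAAAKDEL", "MAAAKKAA", "MAAAA"):
        print(protein, er_retention_signal(protein))
```

A motif match only suggests possible ER retrieval; the same four residues elsewhere in the sequence are deliberately ignored.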

Golgi Apparatus

The Golgi apparatus is composed of flattened membrane-enclosed sacs called cisternae and associated vesicles. It is a main center for protein sorting. It receives proteins from the ER, processes them further and sorts them to their target location: lysosomes, endosomes, the plasma membrane or the extracellular space. Proteins from the ER enter at the cis face of the Golgi, which is convex in shape and oriented towards the nucleus. They are transported through the Golgi and exit from its concave trans face. The proteins that function within the Golgi have to be retained rather than exported. All proteins known to be retained in the Golgi complex are associated with the Golgi membrane, and their retention signals are present in the transmembrane domain. This prevents these proteins from being packaged into the transport vesicles that leave the trans-Golgi network.

Membranes

Most eukaryotic membrane proteins are inserted into the ER membrane by the same translocon complex that is used for protein secretion. Translocation proceeds until it is interrupted by a stop-transfer sequence, also called a membrane anchor sequence, which leaves the protein inserted in the membrane. These membrane proteins are thought to follow the same targeting model as secretory proteins. In contrast to secretory proteins, however, the first transmembrane domain acts as the signal sequence and targets them to the ER membrane. This results in the translocation of the amino terminus of the protein into the ER lumen.

Transmembrane proteins span the entire membrane. Their transmembrane regions are either α-helices or β-barrels. α-helical proteins are the major category of membrane proteins and are often found in the inner membranes of bacterial cells, the plasma membrane of eukaryotes and in outer membranes. β-barrel proteins are found in the outer membranes of Gram-negative bacteria, the cell wall of Gram-positive bacteria, and the outer membranes of mitochondria and chloroplasts. No common localization signal has been observed for membrane proteins. Helical transmembrane proteins are usually identified from the distribution of hydrophobic amino acids, since transmembrane regions are significantly more hydrophobic than an average stretch of sequence. The length of a transmembrane region varies with the angle between the helix and the membrane and with the kind of membrane in which the protein resides; it is usually 14 to 36 residues. Cell membrane proteins are usually identified by the skewed distribution of charges between the inner and outer loops [26].
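The idea of spotting helical transmembrane regions from the distribution of hydrophobic residues can be sketched as a sliding-window average. The text does not name a hydrophobicity scale, so the Kyte-Doolittle scale, the 19-residue window and the 1.6 cutoff used below are assumptions chosen for illustration, as are the function names.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the chapter names none).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window; unknown residues score 0."""
    values = [KD.get(res, 0.0) for res in seq.upper()]
    if len(values) < window:
        return []
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start indices (0-based) of windows whose mean reaches the threshold."""
    profile = hydropathy_profile(seq, window)
    return [i for i, score in enumerate(profile) if score >= threshold]
```

Applied to an invented sequence with a 20-residue hydrophobic core flanked by charged residues, such as `"DEKR" * 5 + "LIVAL" * 4 + "DEKR" * 5`, the windows that lie inside the core are flagged, while a purely charged sequence yields no candidates. Window length and cutoff would need tuning for real data, given the 14 to 36 residue range quoted above.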

Extracellular Proteins

Extracellular proteins, or secreted proteins, are fundamental to intercellular communication in multicellular organisms. Their extracellular accessibility makes them ideal targets for protein therapeutics: virtually all protein-based therapeutic drugs on the market target secreted and cell-surface proteins. Secreted proteins and a majority of cell-surface proteins possess an N-terminal address signal known as the signal peptide [27]. The signal peptide (SP) is about 20-25 residues long. The enzyme signal peptidase (SPase) cleaves off the signal peptide during export. Small, apolar residues such as alanine are found at positions -1 and -3 relative to the cleavage site. The N-terminal region of the signal peptide is usually positively charged. The central region is hydrophobic, with leucine as its most common amino acid. The cleavage site region is usually populated with small residues [26, 28, 29].

Secretion happens through different pathways, the most important of which are the SRP-dependent (Signal Recognition Particle) pathway [30, 31] and the SRP-independent pathway. In the SRP-dependent pathway, the nascent polypeptide chain is recognized by SRP, translation is paused and the translation complex is brought to the SRP receptor. There translation resumes and the polypeptide chain is translocated through the Sec machinery. The SRP-independent pathway (known as the Sec-dependent pathway in prokaryotes) involves post-translational translocation and employs many proteins, together with ATP hydrolysis, to identify the signal peptide and translocate the protein. In prokaryotes, the ΔpH or TAT (twin-arginine translocation) pathway is also used for secretion [32, 33]. It needs no ATP but requires a pH gradient across the membrane. Proteins transported via this route contain a twin-arginine motif in the N-terminal part of the signal peptide, and their signal peptides are longer than those of other pathways [26].
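Two of the signal peptide features described above, the small residues at positions -1 and -3 and the positively charged N-terminal region, can be expressed as simple checks. The sketch below makes several assumptions that are not in the text: the set of residues counted as small (A, G, S), the length of the N-terminal region (five residues), the function names, and the convention that `cleavage_index` is the 0-based index of the first residue of the mature protein.

```python
SMALL = set("AGS")  # assumed set of small residues; the text only names alanine


def follows_minus3_minus1_rule(seq, cleavage_index):
    """True if positions -1 and -3 relative to the cleavage site are small.

    cleavage_index is the 0-based index of the first mature residue, so
    position -1 is seq[cleavage_index - 1] and -3 is seq[cleavage_index - 3].
    """
    seq = seq.upper()
    if cleavage_index < 3 or cleavage_index > len(seq):
        return False
    return seq[cleavage_index - 1] in SMALL and seq[cleavage_index - 3] in SMALL


def n_region_charge(seq, n_len=5):
    """Net charge (K/R minus D/E) of the first n_len residues."""
    region = seq.upper()[:n_len]
    positive = sum(region.count(res) for res in "KR")
    negative = sum(region.count(res) for res in "DE")
    return positive - negative
```

With an illustrative 21-residue signal peptide followed by a few mature residues, `MKKTAIAIAVALAGFATVAQAAPKDN`, the rule holds for a cleavage index of 21 and the first five residues carry a net charge of +2. Shifting the proposed site by one residue places a glutamine at -1 and the check fails, which is how such a rule can help rank candidate cleavage sites.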

Nucleus

The nucleus is known as the control centre of the cell and is the largest organelle in an animal cell. It is the storage place of the genetic material, DNA. A eukaryotic nucleus and its subnuclear locations are shown in Figure 2.3.

Labelled parts: nuclear envelope (outer membrane, inner membrane), chromatin (heterochromatin, euchromatin), ribosomes, nuclear pore. Source: Wikipedia

Figure 2.3: The nucleus of a eukaryotic cell

Proteins are transported into the nucleus posttranslationally and in a folded state. Most nuclear proteins are imported with the help of carrier proteins (e.g. importins). A carrier protein forms a complex with the protein to be imported, and this complex is translocated through the nuclear pore. Inside the nucleus, the complex dissociates and the importin is shuttled back to the cytoplasm for reuse [26]. The address signal for the nucleus is known as the nuclear localization signal (NLS) and is a short stretch of amino acids. Deleting the NLS from a nuclear protein disrupts nuclear import, and adding an NLS to a non-nuclear protein facilitates nuclear import. These properties have been widely used to unravel NLS motifs experimentally [34-36]. An NLS can be present anywhere in the protein sequence. Since NLSs do not share a particular consensus sequence, it is difficult to differentiate an NLS from a non-NLS region [26]. An NLS is usually rich in positively charged residues, some of which bind to carrier proteins such as importins [37]. Mutating these positively charged amino acids disrupts nuclear import. However, there are also NLS motifs with few positive charges, such as glycine-rich motifs, alongside the classical monopartite and bipartite motifs [36, 38]. A monopartite NLS consists of four basic residues and one helix-breaking residue, and a bipartite NLS consists of two clusters of basic residues separated by a spacer of 9-12 amino acids [39, 40]. These patterns are not unique to nuclear proteins, however, and may well be observed in many other proteins [25, 36, 41-43]. Other observed NLSs include the 38-amino-acid M9 sequence and the repeated G-R motif [44], although these signals are in general significantly less frequent than the monopartite and bipartite NLSs. There are also signals for nuclear protein export and retention [26].
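The monopartite and bipartite descriptions above translate into simple pattern scans. The following sketch rests on assumptions that go beyond the text: basic residues are taken to be K and R, helix-breaking residues P and G, a basic cluster is taken to be at least two consecutive basic residues, and the function names are invented. Because these patterns also occur in non-nuclear proteins, a hit should be read as a candidate only.

```python
import re

BASIC = set("KR")           # assumed basic residues
HELIX_BREAKERS = set("PG")  # assumed helix-breaking residues

# Two basic residues, a 9-12 residue spacer, then two more basic residues.
BIPARTITE = re.compile(r"[KR]{2}[A-Z]{9,12}[KR]{2}")


def find_monopartite(seq):
    """Start indices of 5-residue windows with four basic and one
    helix-breaking residue."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 4):
        window = seq[i:i + 5]
        n_basic = sum(res in BASIC for res in window)
        n_breakers = sum(res in HELIX_BREAKERS for res in window)
        if n_basic == 4 and n_breakers == 1:
            hits.append(i)
    return hits


def find_bipartite(seq):
    """Start indices of non-overlapping bipartite-like matches."""
    return [m.start() for m in BIPARTITE.finditer(seq.upper())]
```

On the short test strings `PKKKRKV` and `KRPAATKKAGQAKKKK`, the first scan reports a monopartite-like window at index 0 and the second a bipartite-like match at index 0; a sequence without basic residues gives no hits in either scan.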

Mitochondrion

Mitochondria are known as the powerhouses of the cell, as they generate most of the cell's supply of adenosine triphosphate (ATP) during cellular respiration by breaking down carbohydrates and fatty acids. A typical mitochondrion is shown in Figure 2.4. Mitochondria consist of a smooth outer membrane and an inner membrane separated by an intermembrane space. The inner membrane forms numerous folds known as cristae. The space inside the inner membrane is called the mitochondrial matrix and contains the genetic material of the mitochondrion. The matrix and inner membrane are the major working compartments of the mitochondrion. As sugar is burned for fuel, a mitochondrion shunts various chemicals back and forth across the inner membrane. Even though the mitochondrion has a genome of its own, that genome does not code for the proteins necessary for DNA replication, transcription and translation. All these proteins, together with the proteins required for oxidative phosphorylation and the mitochondrial enzymes, have to be encoded by nuclear DNA and imported into the mitochondrion. The double-membrane structure of the mitochondrion makes protein import a difficult task. Proteins destined for the matrix have to cross two membranes. Proteins destined for other mitochondrial locations have to be re-sorted by a secondary targeting signal once they reach the mitochondrion. The sorting signal of the mitochondrion is known as the mitochondrial targeting peptide (mTP) and is on average 35 amino acids long.

Labelled parts: inner membrane, outer membrane, deoxyribonucleic acid (DNA). Source: Wikipedia

Figure 2.4: Typical mitochondrion

The mTP binds to receptors on the surface of the mitochondrion. These receptors are part of the TOM (Translocase of the Outer Membrane) complex, which directs translocation across the outer membrane. The individual receptors of the TOM complex are TOM20, TOM22 and TOM5. From these receptors, proteins are transferred to the TOM40 pore protein and translocated across the outer membrane. The protein is transported via the GIP complex (general import pore), in an ATP-requiring process, through the outer mitochondrial membrane. The proteins are then transferred to a second protein complex in the inner membrane, the TIM (Translocase of the Inner Membrane) complex, for translocation into the matrix. This translocation requires an electrochemical hydrogen ion gradient across the inner membrane [26]. After the protein enters the mitochondrial matrix, the mTP is cleaved off by the mitochondrial processing peptidase (MPP, also called matrix processing peptidase) [45, 46]. Some mitochondrial matrix proteins are then cleaved again by the mitochondrial intermediate peptidase (MIP), which removes an additional eight or nine residues from the N-terminus [47, 48]. For some proteins, a second, adjacent targeting signal that resembles the signal peptide for secretion is exposed after MPP cleavage. These proteins are re-exported from the matrix to the intermembrane space (IMS), or inserted into the inner membrane, in a process very similar to bacterial protein secretion. Alternatively, translocation across either membrane is halted by a stop-transfer signal, which is specifically recognised by a TOM or TIM component [26, 49, 50], and the protein is subsequently inserted into the outer or inner membrane, respectively.

1. Outer membrane 2. Intermembrane space 3. Inner membrane (1+2+3: envelope) 4. Stroma 5. Thylakoid lumen (inside of thylakoid) 6. Thylakoid membrane 7. Granum (stack of thylakoids) 8. Thylakoid (lamella) 9. Starch 10. Ribosome 11. Plastidial DNA 12. Plastoglobule (drop of lipids). Source: Wikipedia

Figure 2.5: Typical chloroplast

The inner membrane metabolite carrier proteins of mitochondria contain internal localization signals [51]. In mitochondrial targeting peptides (mTPs), Arg, Ala and Ser are over-represented, while negatively charged amino acid residues (Asp and Glu) are rare [51, 52]. Apart from this, there are no obvious features that distinguish an mTP from other N-terminal sequences. The degree of sequence conservation around the cleavage site is also poor, although many mTPs have an arginine at position -2 or -3 relative to the MPP cleavage site [53, 54]. The mTP is reported to form an amphipathic α-helix when bound to the receptor protein but to adopt an extended structure when processed by MPP [55-58].
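The compositional bias described above (Arg, Ala and Ser enriched, Asp and Glu rare, arginine near the cleavage site) can be quantified directly from the N-terminal region. The sketch below is illustrative only: the 35-residue default follows the average mTP length quoted earlier, while the function names, the returned keys and the toy sequence in the usage note are assumptions. `cleavage_index` is the 0-based index of the first residue after the MPP cleavage site.

```python
def mtp_features(seq, n_term=35):
    """Composition summary of the first n_term residues of a sequence."""
    region = seq.upper()[:n_term]
    if not region:
        raise ValueError("sequence is empty")
    length = len(region)
    ras = sum(region.count(res) for res in "RAS")
    acidic = sum(region.count(res) for res in "DE")
    basic = sum(region.count(res) for res in "KR")
    return {
        "ras_fraction": ras / length,        # Arg + Ala + Ser
        "acidic_fraction": acidic / length,  # Asp + Glu
        "net_charge": basic - acidic,
    }


def has_arg_near_cleavage(seq, cleavage_index):
    """True if an arginine sits at position -2 or -3 of the cleavage site."""
    seq = seq.upper()
    for offset in (2, 3):
        pos = cleavage_index - offset
        if 0 <= pos < len(seq) and seq[pos] == "R":
            return True
    return False
```

For the invented 21-residue peptide `MLRRASSALRSAARALSTRAA`, more than half of the residues are Arg, Ala or Ser, no acidic residues are present, the net charge is +5, and a proposed cleavage index of 20 has an arginine at position -2. Since the text stresses that mTPs have few other distinguishing features, such numbers are best treated as weak evidence to be combined with other information.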

Chloroplast

The chloroplast is a double-membrane-bound organelle present in photosynthetic plants and algae. Figure 2.5 shows a typical chloroplast. In addition to the inner and outer membranes of the envelope, chloroplasts have a third, internal membrane system called the thylakoid membrane. The thylakoid membrane forms a network of flattened discs called thylakoids, which are frequently arranged in stacks called grana. Because of this three-membrane structure, the internal organization of chloroplasts is more complex than that of mitochondria. In particular, the three membranes divide chloroplasts into three distinct internal compartments: the intermembrane space between the two membranes of the chloroplast envelope; the stroma, which lies inside the envelope but outside the thylakoid membrane; and the thylakoid lumen [6]. The stroma is the site of the dark reactions, more properly called the Calvin cycle. Even though the chloroplast has a small genome of its own in the stroma, the majority of chloroplast proteins are encoded in the nuclear genome and imported into the organelle post-translationally.

Protein import into chloroplasts generally resembles mitochondrial protein import. Proteins are targeted for import into chloroplasts by N-terminal sequences of 30 to 100 amino acids, called chloroplast transit peptides (cTPs), which direct protein translocation across the two membranes of the chloroplast envelope and are then removed by proteolytic cleavage. The transit peptides are recognized by the translocation complex of the chloroplast outer membrane (the Toc complex), and proteins are transported through this complex across the membrane. They are then transferred to the translocation complex of the inner membrane (the Tic complex) and transported across the inner membrane to the stroma. As in mitochondria, translocation requires energy in the form of ATP. In contrast to mTPs, transit peptides are not positively charged, and the translocation of polypeptide chains into chloroplasts does not require an electric potential across the membrane [6].

Inside the chloroplast, the cTP is cleaved off by the stromal processing peptidase (SPP). cTPs are rich in hydroxylated residues, especially serine, and have a low content of acidic residues [51]. The cTPs of different proteins vary from 20 to 120 residues in length. At the N-terminus of the cTP, there is a conserved alanine next to the initial methionine. A semiconserved motif, V-R-A-(:)-A-A-V, around the SPP cleavage site (denoted by ':') has also been recognized [52]. The signal is not very strong, and several proteins are located to both mitochondria and chloroplasts using identical sorting signals [26, 51, 59, 60].
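Because the V-R-A-(:)-A-A-V motif is only semiconserved, one way to use it is to score every possible cleavage position by how many of the six motif positions it matches, rather than demanding an exact match. The sketch below does this by simple identity counting. The scoring scheme, the function names and the toy sequence are assumptions for illustration and are not the method of the cited work.

```python
MOTIF = "VRAAAV"  # V-R-A | A-A-V, with the cleavage site after the third residue


def score_spp_sites(seq):
    """Map each candidate cleavage index to its number of motif matches.

    A cleavage index c means the cut falls between seq[c - 1] and seq[c],
    so the six residues compared with the motif are seq[c - 3:c + 3].
    """
    seq = seq.upper()
    scores = {}
    for c in range(3, len(seq) - 2):
        window = seq[c - 3:c + 3]
        scores[c] = sum(a == b for a, b in zip(window, MOTIF))
    return scores


def best_spp_site(seq):
    """Candidate cleavage index with the highest score, or None if too short."""
    scores = score_spp_sites(seq)
    if not scores:
        return None
    return max(scores, key=scores.get)


def has_alanine_after_met(seq):
    """True if the residue following the initial methionine is alanine."""
    seq = seq.upper()
    return len(seq) > 1 and seq[0] == "M" and seq[1] == "A"
```

In the invented sequence `MASSTLSSSVRAAAVSTQ`, the full motif occurs once and the best-scoring cleavage index is 12, between the third and fourth motif residues; the sequence also has alanine directly after the initial methionine. Given the weak signal noted in the text, a real predictor would combine such a score with composition and length information.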

Proteins destined for the lumen of the intra-chloroplastic thylakoid compartment normally have a bipartite targeting sequence composed of an N-terminal stroma-targeting cTP followed by a thylakoid lumen transfer peptide (LTP) [61, 62]. There are two different pathways from the chloroplast stroma into the thylakoid lumen: the Sec-dependent pathway and the ΔpH or twin-arginine translocation (TAT) pathway [63]. The signals for the two pathways are very similar, the only significant difference being that TAT pathway proteins contain a twin-arginine (RR) motif in the LTP (KR and RK may also be accepted). The -3, -1 motif found at the SP cleavage site in secreted proteins is also present in LTPs, where it is more strongly conserved [26, 33, 64].
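Since the twin-arginine motif is the only significant difference between the two kinds of LTP, a first-pass classifier can simply look for RR, optionally accepting the KR and RK variants mentioned above. This is a minimal sketch with an invented function name and invented test peptides; it reports overlapping occurrences by using a lookahead.

```python
import re


def twin_arginine_positions(ltp, allow_variants=False):
    """0-based start positions of RR (and, optionally, KR or RK) pairs.

    A zero-width lookahead is used so that overlapping pairs, as in 'RRR',
    are all reported.
    """
    pattern = r"(?=(?:RR|KR|RK))" if allow_variants else r"(?=RR)"
    return [m.start() for m in re.finditer(pattern, ltp.upper())]
```

For the toy peptide `MSRRSFLKAAGA` the strict search returns position 2. For `MSKRSFLKAAGA` the strict search returns nothing, while the permissive search returns position 2, which mirrors the statement that KR and RK may also be accepted.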

Many proteins are needed in both mitochondria and chloroplasts. In general, the targeting peptide of such a dual-targeted protein is intermediate in character between the two specific ones. These targeting peptides have a high content of basic and hydrophobic amino acids and a low content of negatively charged amino acids. They have a lower content of alanine and a higher content of leucine and phenylalanine, and they are more hydrophobic than both mitochondrial and chloroplastic targeting peptides [26].

Peroxisome

The peroxisome is an organelle bounded by a single membrane. Two types of peroxisome targeting signal (PTS) are known: one in the C-terminal region (PTS1) and another in the N-terminal region (PTS2). PTS1 is the predominant signal, and its consensus sequence is -(S/A/C)-(K/R/H)-(L/A). The most common PTS1 is serine-lysine-leucine (SKL). The soluble Pex5 receptor recognizes PTS1-containing proteins, and the Pex5-PTS1-protein complex is then docked to the translocation machinery on the surface of the peroxisome. PTS2 is a bipartite signal with the consensus sequence [R/K]-[L/V/I]-x-x-x-x-x-[H/Q]-[L/A], usually located in the N-terminal region [26]. The next section outlines how the location of proteins is identified experimentally in the wetlabs.
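Both peroxisomal consensus sequences given above map directly onto regular expressions. In this sketch, PTS1 is anchored to the C-terminus and PTS2 is searched only within an N-terminal stretch. The 40-residue search limit, the function names and the toy sequences are assumptions made for illustration.

```python
import re

PTS1 = re.compile(r"[SAC][KRH][LA]$")           # -(S/A/C)-(K/R/H)-(L/A) at the C-terminus
PTS2 = re.compile(r"[RK][LVI][A-Z]{5}[HQ][LA]")  # [R/K]-[L/V/I]-x5-[H/Q]-[L/A]


def has_pts1(seq):
    """True if the last three residues match the PTS1 consensus."""
    return bool(PTS1.search(seq.strip().upper()))


def find_pts2(seq, n_term=40):
    """0-based start of the first PTS2 match within the first n_term
    residues, or None if there is no match."""
    match = PTS2.search(seq.strip().upper()[:n_term])
    return match.start() if match else None
```

The invented sequence `MAGTTESKL` ends in SKL and is therefore flagged as a PTS1 candidate, whereas appending one more residue (`MAGTTESKLA`) moves the tripeptide away from the C-terminus and the check fails. In `MQRLQVVLGHLAG`, the PTS2 pattern matches starting at index 2.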

2.5 Wetlab Techniques

A wide range of experimental methods is used in the wetlabs to identify protein subcellular localization, including immunofluorescence and immunoelectron microscopy, PhoA protein fusions, fluorescent-protein tagging, and Western/SDS-PAGE analysis of subcellular fractions. Even though the output of these methods is highly accurate, they have several limitations: they are costly and time-consuming, and only a few proteins can be tested at a time. One laboratory technique for identifying subcellular localization is transposon-mediated random epitope tagging and plasmid-based expression of epitope-tagged proteins, followed by immunofluorescence. This method has been used for a comprehensive global analysis of protein localization in the budding yeast, Saccharomyces cerevisiae. Its disadvantage is that it can introduce errors in localization, either by interfering with localization signals through the random insertion of tags or by saturating binding sites through the overexpression of proteins. In addition, immunofluorescence adds a cumbersome and costly step to the analysis and may introduce non-specific staining [17, 65]. Another study of subcellular localization in yeast [66] employed oligonucleotide-directed homologous recombination to insert GFP immediately before the stop codon of open reading frames (ORFs), generating yeast strains with proteins tagged at the carboxy terminus. The proteins were expressed under the control of their endogenous promoters and presumably at relatively normal levels. On the other hand, the carboxy-terminal tagging interfered in some cases with protein localization signals such as palmitoylation and farnesylation, which direct proteins to the plasma membrane.

Techniques such as two-dimensional gel electrophoresis and mass spectrometry have frequently been used to analyze localization for a variety of bacterial genomes, including those of pathogenic organisms. A major disadvantage of such subproteome analysis is that fractionating a complex structure like the cell into subcellular compartments is not a trivial task, because of contamination from other cellular compartments and the multiple localization of several proteins. Computational methods for subcellular localization prediction address these problems to a great extent.

2.6 Need for Computational Prediction

Although the subcellular localization of a protein can be determined by various biochemical experiments, such experiments have many practical limitations and are costly and time-consuming. High-throughput genomic techniques have, over the past decade, resulted in the rapid accumulation of genomic and proteomic data in biological databases. For example, Swiss-Prot contained only 3,939 sequence entries in 1986 [19, 20], whereas release 57.14 of 9 February 2010 contained 514,789 entries. This explosive growth of biological databases demands the development of automated, highly accurate methods to reliably annotate the subcellular attributes of uncharacterized proteins.

For proteins that are only predicted from a sequenced genome and have not been isolated as biological molecules, the only available data are the predicted amino acid sequence. In such cases, the features of the protein, including its subcellular localization, can be predicted using computational methods. Such annotation helps to bring out the importance of the protein. Proteins that are isolated and sequenced often lack the N-terminal signal, because the import machinery of the compartment cleaves off the address signal. Even this loss of information need not affect prediction, because most tools use a wide range of biological information derived from the sequence to make their predictions.

Most computational prediction tools are available on the Internet, publicly accessible and free of cost. Since a wide range of tools is available, a biologist can make predictions using different methods to increase the reliability of the result. Organism-specific tools are also available and provide greater accuracy. Performing a prediction before experimental confirmation saves valuable resources. Most tools accept multiple amino acid sequences as input and thus allow high-throughput prediction. The computational methods for localization prediction are reviewed in the next chapter.
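The suggestion that a biologist can combine several tools to increase reliability can be illustrated with a simple majority vote. This is only a sketch of one possible combination rule: the tool names in the usage note are placeholders rather than real programs, and the choice to return no label on a tie is an assumption.

```python
from collections import Counter


def consensus_localization(predictions):
    """Majority vote over per-tool predictions.

    predictions maps a tool name to its predicted location. Returns a tuple
    (location, agreement), where agreement is the fraction of tools that
    voted for the winning location. On a tie for first place the location
    is None.
    """
    if not predictions:
        raise ValueError("no predictions supplied")
    ranked = Counter(predictions.values()).most_common()
    top_label, top_votes = ranked[0]
    agreement = top_votes / len(predictions)
    if len(ranked) > 1 and ranked[1][1] == top_votes:
        return None, agreement
    return top_label, agreement
```

With three hypothetical tools, `{"tool_a": "nucleus", "tool_b": "nucleus", "tool_c": "cytoplasm"}`, the consensus is the nucleus with two of three tools in agreement. A low agreement value, or a tie, marks a protein whose localization is worth confirming experimentally first.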

2.7 Conclusion

Protein sorting and translocation is a complex task involving multiple decisions at multiple stages, and various proteins are involved in the translocation process. As described, no hard and fast rules can be derived for any location, and in many cases the address signals do not share common features. These difficulties can be addressed by computational prediction techniques. To make a computational prediction, the biological features that have a qualitative impact on the biological process have to be observed and quantified. The biology discussed in this chapter serves this purpose.

This chapter discussed the biology of subcellular localization. The background biology of the cell, its organelles and amino acids was explained, followed by details of proteins and their biosynthesis. How proteins are sorted and translocated to various locations was also explained. The wetlab techniques for identifying protein location, and their limitations, were outlined. The need for prediction methods was discussed towards the end of the chapter. The next chapter discusses how to predict the location of a protein computationally and provides a detailed review of existing computational methods and tools for subcellular localization prediction.
