bioinformatics

By A.Arputha Selvaraj

What we are going to talk about

Why we are doing all this DNA sequencing

What genes look like and where they are found

How we can compare sequences between different species

How genes move between species

DNA Sequencing

Bioinformatics is based on the fact that DNA sequencing is cheap, and becoming easier and cheaper very quickly.

The Human Genome Project cost roughly $3 billion and took 12 years (1991-2003).

Sequencing James Watson’s genome in 2007 cost $2 million and took 2 months.

Today, you could get your genome sequenced for about $100,000, and it would take a month.

The Archon X prize: you win $10 million if you can sequence 100 human genomes in 10 days, at a cost of $10,000 per genome.

It is realistic to envision $100 per genome within 10 years: everyone’s genome could be sequenced if they wanted or needed it.

Why it’s useful

All of the information needed to build an organism is contained in its DNA. If we could understand it, we would know how life works.

Preventing and curing diseases like cancer (which is caused by mutations in DNA) and inherited diseases.

Curing infectious diseases (everything from AIDS and malaria to the common cold). If we understand how a microorganism works, we can figure out how to block it.

Understanding genetic and evolutionary relationships between species

Understanding genetic relationships between humans. Projects exist to understand human genetic diversity. Also, sequencing the Neanderthal genome.

Ancient DNA: currently it is thought that under ideal conditions (continuously kept frozen), there is a limit of about 1 million years for DNA survival. So, Jurassic Park will probably remain fiction.

From DNA to Gene

But extracting that information is difficult. Converting a string of ACGT’s into knowledge of how the organism works is hard.

Most of the work is on the computer, with key confirming experiments done in the “wet lab”.

The sequence below contains a gene critical for life: the gene that initiates replication of the DNA. Can you spot it?

We are now going to spend some time on what genes look like and how we can find them.

TTGGAAAACATTCATGATTTATGGGATAGAGCTTTAGATCAAATTGAAAAAAAATTAAGCAAACCTAGTTTTGAAACCTGGCTCAAATCGACAAAAGCTCATGCTTTACAAGGAGACACGCTCATTATTACTGCACCTAATGATTTTGCACGGGACTGGTTAGAATCTAGGTATTCTAATTTAATTGCTGAAACACTTTATGATCTTACGGGGGAAGAGTTAGATGTAAAATTTATTATTCCTCCTAACCAGGCCGAGGAAGAATTCGATATTCAAACTCCTAAAAAGAAAGTCAATAAAGACGAAGGAGCAGAATTTCCTCAAAGCATGCTAAATTCGAAGTATACCTTTGATACATTTGTTATCGGATCTGGAAATCGGTTTGCGCATGCAGCTTCTTTAGCAGTAGCAGAAGCGCCGGCTAAAGCGTATAATCCGCTTTTTATTTACGGGGGAGTAGGATTAGGCAAAACACACTTAATGCACGCCATAGGCCACTATGTGTTAGATCATAATCCTGCCGCGAAAGTCGTGTACTTATCATCTGAAAAATTCACAAACGAGTTTATTAACTCTATTCGTGACAATAAAGCAGTAGAATTCCGCAACAAATACCGTAATGTAGATGTTTTACTGATTGATGATATTCAATTCTTAGCAGGTAAAGAGCAGACACAAGAAGAATTTTTCCATACGTTTAATACGCTTCACGAAGAAAGCAAGCAGATTGTCATCTCAAGTGATCGACCGCCGAAAGAAATTCCTACACTTGAAGATCGACTTCGCTCTCGCTTTGAATGGGGCCTTATTACAGACATCACACCACCAGATTTGGAAACACGAATTGCTATTTTGCGTAAAAAAGCCAAAGCGGACGGCTTAGTTATTCCAAATGAAGTTATGCTTTATATCGCCAATCAGATTGATTCAAATATTAGAGAATTAGAAGGCGCACTTATT

DNA

DNA is just a long string of 4 letters (nucleotides, or bases): Adenine, Guanine, Cytosine, and Thymine, which we will just refer to as A, C, G, and T.

We are skipping lots of details.

Each DNA molecule has 2 strands, with the bases paired in the center.

A on one strand always pairs with T on the other strand; G pairs with C.

The strands run in opposite directions (like roads).

Since the two DNA strands are complementary, there is no need to write down both strands.

Chromosomes and Genes

Each chromosome is a long piece of DNA.

B. megaterium genome is a circle (like most bacteria) of about 5 million bases.

Human chromosomes are 100-200 million bases long. We have 46 chromosomes (2 sets of 23, one set from each parent).

Genes are just regions on that DNA. It is not obvious where genes are if you look at a DNA sequence.

There is a lot of DNA that is not part of genes: in humans, only about 2% of the DNA codes for protein.

Bacteria use more of their DNA: 80% of the B. meg chromosome is genes.

B. meg has about 1 gene per 1000 base pairs (bp) of DNA, or about 5000 genes in total.

Humans have about 25,000 genes. We are far more complicated than bacteria:

regulation of the genes is very complicated in humans

We use the same gene in different ways in different tissues

Genes and Proteins

Most genes code for proteins: each gene contains the information necessary to make one protein.

Proteins are the most important type of macromolecule.

Structure: collagen in skin, keratin in hair, crystallin in eye.

Enzymes: all metabolic transformations (building up, rearranging, and breaking down organic compounds) are done by enzymes, which are proteins.

Transport: oxygen in the blood is carried by hemoglobin; everything that goes in or out of a cell (except water and a few gases) is carried by proteins.

Also: nutrition (egg yolk), hormones, defense, movement

The Genetic Code

Proteins are long chains of amino acids.

There are 20 different amino acids coded in DNA.

There are only 4 DNA bases, so you need 3 DNA bases to code for the 20 amino acids: 2 bases would give only 4 x 4 = 16 combinations, which is too few.

4 x 4 x 4 = 64 possible 3 base combinations (codons).

Each codon codes for one amino acid.

Most amino acids have more than one possible codon.

Genes start at a start codon and end at a stop codon.

3 codons are stop codons (TAA, TAG, and TGA): all genes end at a stop codon.

Start codons are a bit trickier, since they are used in the middle of genes as well as at the beginning.

In eukaryotes, ATG is always the start codon, making Methionine (Met) the first amino acid in all proteins (but in many proteins it is immediately removed).

In prokaryotes, ATG, GTG, or TTG can be used as a start codon. B. meg prefers ATG, but about 30% of the genes start with GTG or TTG.

In bioinformatics, we generally ignore the fact that RNA uses the base uracil (U) in place of T.
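The counting argument on this slide can be checked with a few lines of Python. This is a minimal sketch: the compact 64-letter string is the standard genetic code written in T, C, A, G order, and the variable names are only illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Pair every 3-base combination with its amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
amino_acids = set(CODON_TABLE.values()) - {"*"}

# Group codons by amino acid to see how degenerate the code is.
codons_for = {}
for codon, aa in CODON_TABLE.items():
    codons_for.setdefault(aa, []).append(codon)

print(len(CODON_TABLE))      # 64 codons
print(stop_codons)           # ['TAA', 'TAG', 'TGA']
print(len(amino_acids))      # 20 amino acids
print(codons_for["M"])       # ['ATG']: Met has a single codon
print(len(codons_for["L"]))  # 6: leucine has six codons
```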

Gene Expression

How do you get a protein from a gene?

A two-step process (called the Central Dogma of Molecular Biology).

First, the gene has to be copied (transcribed) into an RNA form.

The RNA copy (messenger RNA) is exactly like the gene itself, except RNA replaces T with U.

Most gene regulation (whether the gene is “on” or “off”) happens here.

Second, the RNA is translated into protein by ribosomes, which are complex RNA/protein hybrid machines.

This happens with the help of transfer RNA molecules, which have one end that matches the 3 base codon and another end that is attached to the proper amino acid.

The ribosome starts at the start codon and moves down the messenger RNA, adding one amino acid at a time to the growing chain. When the ribosome reaches a stop codon, it falls off, releasing the new protein.
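The two steps can be mimicked in a short Python sketch. It is a simplification under stated assumptions: the toy gene is made up, only AUG is treated as a start codon, and the function names are illustrative rather than taken from any library.

```python
from itertools import product

RNA_BASES = "UCAG"  # RNA alphabet: U in place of T
# Standard genetic code in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
RNA_CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(RNA_BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(gene):
    """Step 1: copy the gene into messenger RNA (same letters, U replaces T)."""
    return gene.upper().replace("T", "U")

def translate(mrna):
    """Step 2: start at the first AUG, add one amino acid per codon, stop at a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, no protein
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = RNA_CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # the ribosome falls off at the stop codon
        protein.append(amino_acid)
    return "".join(protein)

gene = "GGATGAAAGCTTAACC"  # hypothetical toy gene, not from B. meg
mrna = transcribe(gene)
print(mrna)             # GGAUGAAAGCUUAACC
print(translate(mrna))  # MKA
```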

Reading Frames

Here we get a bit subtle.

Since codons consist of 3 bases, there are 3 “reading frames” possible on an RNA (or DNA), depending on whether you start reading from the first base, the second base, or the third base.

The different reading frames give entirely different proteins.

Consider ATGCCATC, and refer to the genetic code. (X is junk.)

Reading frame 1 divides this into ATG-CCA-TC, which translates to Met-Pro-X.

Reading frame 2 divides this into A-TGC-CAT-C, which translates to X-Cys-His-X.

Reading frame 3 divides this into AT-GCC-ATC, which translates to X-Ala-Ile.

Each gene uses a single reading frame, so once the ribosome gets started, it just has to count off groups of 3 bases to produce the proper protein.
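The ATGCCATC example can be reproduced in Python. This is a minimal sketch: translate_frame is an illustrative helper, not a library function, and it writes X for any leftover bases that do not make a full codon, as on the slide.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Stop",
}

def translate_frame(dna, frame):
    """Translate reading frame 1, 2, or 3; incomplete codons are written as X."""
    offset = frame - 1
    parts = ["X"] if offset else []  # bases skipped before the first full codon
    for i in range(offset, len(dna), 3):
        codon = dna[i:i + 3]
        if len(codon) == 3:
            parts.append(THREE_LETTER[CODON_TABLE[codon]])
        else:
            parts.append("X")  # trailing bases that do not fill a codon
    return "-".join(parts)

for frame in (1, 2, 3):
    print(frame, translate_frame("ATGCCATC", frame))
# 1 Met-Pro-X
# 2 X-Cys-His-X
# 3 X-Ala-Ile
```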

Open Reading Frames

Ribosomes are very obedient to stop codons: when a stop codon is reached, the protein is finished. Thus, all genes end at the first stop codon in their reading frame.

Since 3 out of the 64 codons are stop codons, random DNA has stop codons very frequently (about one in every 21 codons).

However, genes do something necessary for survival, so natural selection keeps stop codons out of the middle of genes. That is, if a mutation arises that creates a stop codon in the middle of a gene, the organism dies and leaves no descendants.

Open reading frames (ORFs) are regions with no stop codons. All genes reside in long open reading frames.

Note that stop codons in other reading frames have no effect on the gene.

The start codon must occur “upstream” in the same reading frame as the stop codon. It is usually near the beginning of the ORF, but not necessarily the first possible start codon.

Determining the exact start codon is not easy or obvious.

But the first start codon in an open reading frame is always a reasonable guess.

This is a map of the stop codons in all 3 reading frames in a stretch of DNA. The long ORF in reading frame 1 is highlighted in black.
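A stop codon map like this one can be computed directly from the sequence. The sketch below is only an illustration: the 24-base sequence is made up, the function names are hypothetical, and an ORF is taken to be any stop-free stretch within one frame, without looking for a start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_positions(dna, frame):
    """0-based start positions of the stop codons in reading frame 1, 2, or 3."""
    return [i for i in range(frame - 1, len(dna) - 2, 3)
            if dna[i:i + 3] in STOP_CODONS]

def longest_orf(dna, frame):
    """(start, end) of the longest stop-free stretch in a frame; end is exclusive."""
    run_start = frame - 1
    best = (run_start, run_start)
    # Treat the end of the last complete codon as a final boundary.
    frame_end = run_start + 3 * ((len(dna) - run_start) // 3)
    for boundary in stop_positions(dna, frame) + [frame_end]:
        if boundary - run_start > best[1] - best[0]:
            best = (run_start, boundary)
        run_start = boundary + 3  # the next stretch begins after the stop codon
    return best

dna = "ATGAAACCCGGGTTTTAGCATGAC"  # hypothetical toy sequence
for frame in (1, 2, 3):
    print(frame, stop_positions(dna, frame), longest_orf(dna, frame))
# 1 [15] (0, 15)
# 2 [1] (4, 22)
# 3 [20] (2, 20)
```

On real data the same scan would be run over the reverse complement as well, since genes can sit on either strand.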

Gene Placement

Genes can occur on either DNA strand. If they are on the reverse strand, the DNA sequence needs to be reversed and complemented.

In bacteria, most of the DNA is part of a gene. Most long open reading frames (say 100 bp or longer) that don’t overlap other long ORFs contain genes.

Most genes do not overlap each other. Sometimes there are very short overlaps (50 bp or less), especially if the two genes are functionally related.

In bacteria, genes that affect the same biochemical pathway or function are sometimes adjacent to each other on the same DNA strand (not necessarily the same reading frame), allowing them to be co-regulated.

This group of genes is called an “operon”.

Operons are characteristic of bacteria; they are rare in eukaryotes.
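Reversing and complementing a sequence, as needed for genes on the reverse strand, takes one line of Python. A minimal sketch, with reverse_complement as an illustrative name:

```python
# Map each base to its pairing partner: A with T, G with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the other strand, read in its own direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGCCATC"))                      # GATGGCAT
print(reverse_complement(reverse_complement("ATGCCATC")))  # ATGCCATC
```

Applying it twice gives back the original sequence, which is a handy sanity check. Counting both strands, a stretch of DNA has 6 possible reading frames: 3 on each strand.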

Finding Genes

The first job is to find long ORFs, examining the longest ORFs first and putting together a set with minimal overlaps.

It is also necessary to identify potential start codons, with the furthest upstream start codon as the easiest choice.

Then, how do we know that the ORF contains a real gene?

The most definitive way is to match it with a gene known from other species: conservation of a sequence between species strongly suggests that the sequence has a function that is being conserved by natural selection.

We compare protein sequences, not DNA, because protein is more conserved in evolution than DNA.

The organism’s survival depends on the protein being functional, which means having the proper amino acid sequence.

Since the genetic code is degenerate, many different DNA sequences will give identical proteins.

The protein 3-dimensional structure is even more conserved, because it is more closely related to enzyme activity than the amino acid sequence is.

However, we don’t have good ways of determining 3-D structure from a DNA sequence.

Sequence Comparison

So, we compare our ORF sequence to a database of known protein sequences from many species.

BLAST is the standard sequence alignment tool (BLAST = Basic Local Alignment Search Tool).

BLAST is based on the concept that if you compare the same (that is, homologous) protein from many different species, you can see that some amino acids readily substitute for each other and others almost never do.

This is captured in a substitution matrix, which gives a score for each pair of amino acids; the scores are added up over every aligned position in the proteins being compared.
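How a substitution matrix turns an alignment into a score can be sketched as follows. Everything here is an assumption made for illustration: the scores are invented toy values rather than a real matrix (real searches use a full 20 x 20 table), and gaps are not handled.

```python
# Toy scores, invented for illustration only.
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("L", "I"): 2,   # similar amino acids substitute readily
    ("D", "E"): 2,
    ("L", "D"): -4,  # dissimilar amino acids almost never substitute
}
DEFAULT_MATCH = 4      # assumed score for identical pairs not listed above
DEFAULT_MISMATCH = -3  # assumed score for different pairs not listed above

def pair_score(a, b):
    """Score one aligned pair of amino acids; the order of a and b does not matter."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    if (b, a) in TOY_SCORES:
        return TOY_SCORES[(b, a)]
    return DEFAULT_MATCH if a == b else DEFAULT_MISMATCH

def alignment_score(query, subject):
    """Add up the pair scores over every aligned position (no gaps)."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(query, subject))

print(alignment_score("LDLE", "LDLE"))  # 19: identical sequences
print(alignment_score("LDLE", "IELD"))  # 10: similar amino acids at 3 positions
print(alignment_score("LDLE", "DLWL"))  # -14: unrelated sequences
```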

Practical BLAST

BLAST itself is a bit of software that can be run on almost any computer, but the database needed for a good cross-species comparison is quite large.

The database is called “nr” for “non-redundant”, and it contains at least 20 Gb of sequence data.

We are going to use the BLAST service at UniProt, a European consortium that maintains a comprehensive collection of protein sequences: http://www.uniprot.org/

Nearly all are derived from DNA sequences: direct sequencing of proteins is difficult.

Terminology: your sequence, which you paste into the box on the web site, is the query sequence. Sequences in the database that match yours are called subject sequences.

A Sequence to BLAST

This is a more-or-less randomly chosen gene from B. meg. It is 174 amino acids long.

It is written in “fasta” format: the first line starts with > and is immediately followed by an identifier (ORF00135), and then some miscellaneous comments.

After that the sequence is written without spaces or other marks.

>ORF00135 |chromosome 538197-538721 revcomp
MKAKLIQYVYDAECRLFKSVNQHFDRKHLNRFLRLLTHAGGATFTIVIACLLLFLYPSSVAYACAFSLAVSHIPVAIAKKLYPRKRPYIQLKHTKVLENPLKDHSFPSGHTTAIFSLVTPLMIVYPAFAAVLLPLAVMVGISRIYLGLHYPTDVMVGLILGIFSGAVALNIFLT
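Reading a fasta record takes only a few lines of Python. This is a minimal sketch, assuming the header sits on its own line; parse_fasta is an illustrative helper, not part of any library.

```python
def parse_fasta(text):
    """Return a list of (identifier, comments, sequence) tuples."""
    records = []
    identifier = comments = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                records.append((identifier, comments, "".join(chunks)))
            # The identifier follows > immediately; the rest is free-form comments.
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            comments = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # sequence lines may be wrapped over several lines
    if identifier is not None:
        records.append((identifier, comments, "".join(chunks)))
    return records

fasta = """>ORF00135 |chromosome 538197-538721 revcomp
MKAKLIQYVYDAECRLFKSVNQHFDRKHLNRFLRLLTHAGGATFTIVIACLLLFLYPSSVAYACAFSLAVSHIPVAIAKKLYPRKRPYIQLKHTKVLENPLKDHSFPSGHTTAIFSLVTPLMIVYPAFAAVLLPLAVMVGISRIYLGLHYPTDVMVGLILGIFSGAVALNIFLT
"""

identifier, comments, sequence = parse_fasta(fasta)[0]
print(identifier)     # ORF00135
print(comments)       # |chromosome 538197-538721 revcomp
print(len(sequence))  # 174
```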

Results

BLAST Scores

Results are arranged with the best ones on top.

The most important score is the Expect value, or E-value, which can be defined as the number of hits at least this good that a random sequence (with the same length as yours) would be expected to have in the database.

E-values for good hits are usually written something like 3e-42, which is the same as 3 x 10^-42, a very small number.

Bad hits are very common, and they have e-values in a more familiar form: for example, 0.004 or 1.2.

A really good e-value is less than 1e-180, which underflows the computer’s processing capabilities, so it is written as 0.0.

E-values are affected by the length of the query sequence as well as the size of the database, so even perfect matches with short sequences give poor e-values.

In this case we see many hits with good e-values, and the top e-values all are quite similar.

Before we can conclude that our protein is a homologue of the proteins BLAST matches it with, we would like them to have roughly the same length and have a high percentage of identical amino acids.

The lengths of the query and subject sequences should be within 20% of each other.

There should be at least 30% identical amino acids.

In this case we can be quite sure we have a good match.

BLAST also returns a fourth value, the bit score, which we are going to ignore.
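The two rules of thumb from this slide (lengths within 20%, at least 30% identity) are easy to apply in code. A minimal sketch under stated assumptions: the hit values are made up, and measuring the length difference relative to the longer sequence is one possible reading of “within 20%”.

```python
def looks_homologous(query_len, subject_len, percent_identity,
                     max_length_diff=0.20, min_identity=30.0):
    """Apply the length and identity rules of thumb to one BLAST hit."""
    # Assumption: the difference is measured relative to the longer sequence.
    length_diff = abs(query_len - subject_len) / max(query_len, subject_len)
    return length_diff <= max_length_diff and percent_identity >= min_identity

# E-values written like 3e-42 are ordinary floating point numbers.
print(float("3e-42") < 0.004)            # True: the smaller e-value is the better hit

# Hypothetical hits for a 174 amino acid query.
print(looks_homologous(174, 180, 62.0))  # True
print(looks_homologous(174, 420, 62.0))  # False: lengths differ too much
print(looks_homologous(174, 170, 22.0))  # False: too few identical amino acids
```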

Gene Names

Mostly genes are named with the function of their protein.

At some point, some related genes had their function determined through lab work: by examining the effects of mutations in the gene, by isolating and studying the protein produced by the gene, etc.

Enzymes (end in –ase), transport across the cell membrane, genetic information processing (DNA->RNA->protein), structural proteins, sporulation and germination, and more!

Many genes (maybe 1/4 of them in a typical genome) have no known function, although they are found in several different species: these are called conserved hypothetical genes.

Every new genome has some genes that are unique: no matching BLAST hits in the database. Are they real genes? Sometimes there is evidence in the form of messenger RNA, but usually we don’t know. We call them hypothetical genes.

“putative” means that we think we know the gene’s function but we aren’t sure. Putative should be followed by the function name.

More Gene Names

One question of interest: do the names of the top BLAST hits agree with each other?

They should, but there are always annotation errors, and our knowledge of gene function increases over time.

There is also some sloppiness due to different naming conventions practiced by different scientists.

Here we have a classic case of mis-naming. Why is the top hit ribosomal protein S2, with no other hit having this name?

Ribosomal proteins are highly conserved in evolution.

Some checking on my part showed that no homology exists between this gene and the ribosomal protein S2 found in any other Bacillus species.

The other names are similar, although not identical.

What is “PAP2”? A quick Google search shows that it stands for “phosphatidic acid phosphatase”, which fits the other names well.

There is probably some uncertainty about its exact function, given the variety of names and the “family protein” designation in several of them.

Horizontal and Vertical Gene Transfer

We are accustomed to thinking of genes being passed from parent to offspring, always staying within the species, with very occasional splitting of one species into two. This is called vertical gene transfer.

But we know that some genes are transferred across species lines, not by the standard genetic mechanisms. This is called horizontal gene transfer.

It is rare in humans and other higher organisms.

In bacteria, 10% or more of genes have been transferred in horizontally.

B. meg genes that come from vertical descent have other Bacillus species (or another closely related species) as the closest BLAST hit.

Horizontally transferred genes can come from almost anywhere: other bacteria, Archaea, eukaryotes (plants, animals, fungi).

The general mechanisms are well known, including conjugation (direct transfer of DNA between two bacteria), transduction (transfer of DNA using a virus as a carrier), and transformation (the bacteria pick up DNA molecules from their environment).

Bacillus Phylogeny

“Kings Play Chess On Fine Ground Sand”

Bacteria is the domain

Firmicutes is the phylum

Bacilli is the class

Bacillales is the order

Bacillaceae is the family

Bacillus is the genus.

Our Example

Most of the top hits are from various Bacillus species: there is little doubt that this gene is the result of normal, vertical gene flow.

What about “Anoxybacillus flavithermus”?

Click on the accession number to get more information, including its phylogeny.

Taxonomic lineage = Bacteria > Firmicutes > Bacillales > Bacillaceae > Anoxybacillus. Same family as B. meg.

Aligned Sequences

You can see the aligned sequences by clicking on the “Local alignment” diagrams.

Query sequence on top, subject below.

Identical amino acids are in the middle of the alignment, and similar ones have a + sign.

Gaps (regions where one sequence has amino acids not found in the other sequence) are indicated with ---.

This protein is very typical in that the best matches are in the middle of the protein, with fewer identical amino acids near the ends.

Also, the match doesn’t quite make it to the very beginning of the proteins, although they are almost identical in length.

The active site of most enzymes is in the middle.

The ends of proteins are often not well conserved.
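The middle line of an alignment and the percent identity can be computed from two aligned sequences. This is a sketch with a made-up pair of sequences; it marks identical amino acids only (the + for similar ones would need a substitution matrix) and divides by the alignment length, gaps included.

```python
def match_line(query, subject):
    """Middle line of the alignment: the letter where both agree, a space otherwise."""
    return "".join(q if q == s and q != "-" else " "
                   for q, s in zip(query, subject))

def percent_identity(query, subject):
    """Identical positions as a percentage of the alignment length."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return 100.0 * identical / len(query)

query = "MKAKLIQ-YVY"    # hypothetical aligned query, with one gap
subject = "MRAKLLQAYVY"  # hypothetical aligned subject

print(query)
print(match_line(query, subject))  # M AKL Q YVY
print(subject)
print(round(percent_identity(query, subject), 1))  # 72.7
```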

Local Alignment Result

Graphical Overview

Click on Graphical Overview (just under the BLAST box on the left) to get an overview of all the aligned sequences.

The extent of the matching region is shown with the colored boxes, with non-matching regions drawn as a line.

Color indicates percent of identical amino acids.

You can see that mostly our query and the various subjects (matches) line up along almost all of their lengths.

This is a good way to check whether our start site is reasonable.

A few odd ones lower down.

Genes, and pieces of genes, can move to new locations in the genome, fuse with other genes, break apart, etc.

Always subject to natural selection: if the altered gene doesn’t work, the organism will die and we won’t see it.

And of course, sequencing and annotation errors occur.

The Basic Points

1. DNA can be read in 3 different reading frames, a consequence of the genetic code (3 bases = 1 amino acid)

2. Genes are found in long open reading frames, areas where there are no stop codons.

3. BLAST is the tool we use to compare sequences between species

• BLAST scores (e-values) describe the number of hits a random sequence would be expected to have in the database

4. Gene sequences are conserved between species by natural selection

• DNA sequences outside of genes are much less conserved

5. Most genes are transferred vertically, from parent to offspring, but a significant number are transferred horizontally, from unrelated species.

Thank You

Email me : [email protected]
