speaker: sean d. mooney date: august 27, 2015 bioinformatics i date: august 27, 2015


Upload: lee-gardner

Post on 25-Dec-2015


Page 1: Speaker: Sean D. Mooney Date: August 27, 2015 Bioinformatics I Date: August 27, 2015


Bioinformatics I



Learning Objectives

• To have an understanding of:
– Sequence analysis
– Genome assembly and annotation
– Publicly available molecular biology and genetic databases
– Identification of sequence similarity
– Defining function of biological sequences
– Using sequence information to hypothesize function


Bioinformatics Introduction

• Bioinformatics is the intersection of computer science, statistics and one of the following life science disciplines: biology, biochemistry or medicine

[Figure: diagram labeled Computer Science, Statistics, and Bioinformatics]


What is bioinformatics?

• Clinical Informatics
– Systems used to deal with patient data
– Clinical trial management systems, electronic health records, etc.

• Laboratory Informatics
– Systems to deal with scientific instruments and data management
– Connecting instruments together, managing laboratory flow, etc.

• Bioinformatics
– Systems to deal with basic research data
– DNA, proteins, ‘molecular’ things


Sequence Analysis

• DNA sequencing is a common activity, so analysis of biological sequences has become critical to modern molecular biology and genetics.

• What kind of sequences?
– DNA
– Protein
– RNA

• Today we will learn about how these sequences are managed and used by biologists


Storing A Sequence On A Computer

• FASTA Format:

>TITLE SEQUENCE 1

SEQUENCE1VSLLSNNLNKFDEGLALAHFVWIAPLQVALLMGLIWELLQASAFCGLGFLIVLALFQAGL

>TITLE SEQUENCE 2

SEQUENCE2ASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLLGRIIASYDPDNKEERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHHIGMQMRIAMFSLIYKKTLKLSSRVLDKISIGQLVSLLSNNLNKFDEGLALAHFVWIAPLQVALLMGLIWELLQASAFCGLGFLIVLALFQAG

>TITLE SEQUENCE 3

AINKIQDFLQKQEYKTLEYNLTTTEVVMENVTAFWEEGFGELFEKAKQNNNNRKTSNGDDSLFFSNFSLLGTPVLKDINFKIERGQL
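The FASTA layout above (a ">" title line followed by one or more sequence lines) can be read with a few lines of code. The following is a minimal sketch, not a full-featured parser: the function name parse_fasta is hypothetical, and the example sequences are shortened versions of the ones on the slide.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {title: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    title = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A ">" line starts a new record; the rest of the line is its title.
            title = line[1:].strip()
            records[title] = ""
        elif title is not None:
            # Sequence data may be wrapped over several lines; join them.
            records[title] += line
    return records


# Shortened example input (assumption: trimmed from the slide for brevity).
example = """>TITLE SEQUENCE 1
VSLLSNNLNKFDEGLALAHFVW
IAPLQVALLMGLIWELLQ
>TITLE SEQUENCE 3
AINKIQDFLQKQEYKTLEYNL
"""

records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

For real work, an established library reader is usually a safer choice than a hand-written parser.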


Actual FASTA File

>gi|90421312|ref|NM_000492.3| Homo sapiens cystic fibrosis transmembrane conductance regulator (ATP-binding cassette sub-family C, member 7) (CFTR), mRNA
AATTGGAAGCAAATGACATCACAGCAGGTCAGAGAAAAAGGGTTGAGCGGCAGGCACCCAGAGTAGTAGG
TCTTTGGCATTAGGAGCTTGAGCCCAGACGGCCCTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAG
GTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGCTGGACCAGACCAATTTTGAGGAAA
GGATACAGACAGCGCCTGGAATTGTCAGACATATACCAAATCCCTTCTGTTGATTCTGCTGACAATCTAT
CTGAAAAATTGGAAAGAGAATGGGATAGAGAGCTGGCTTCAAAGAAAAATCCTAAACTCATTAATGCCCT
TCGGCGATGTTTTTTCTGGAGATTTATGTTCTATGGAATCTTTTTATATTTAGGGGAAGTCACCAAAGCA
GTACAGCCTCTCTTACTGGGAAGAATCATAGCTTCCTATGACCCGGATAACAAGGAGGAACGCTCTATCG
CGATTTATCTAGGCATAGGCTTATGCCTTCTCTTTATTGTGAGGACACTGCTCCTACACCCAGCCATTTT
TGGCCTTCATCACATTGGAATGCAGATGAGAATAGCTATGTTTAGTTTGATTTATAAGAAGACTTTAAAG
CTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACA
AATTTGATGAAGGACTTGCATTGGCACATTTCGTGTGGATCGCTCCTTTGCAAGTGGCACTCCTCATGGG
GCTAATCTGGGAGTTGTTACAGGCGTCTGCCTTCTGTGGACTTGGTTTCCTGATAGTCCTTGCCCTTTTT


GenBank

• GenBank is the public domain database of biological (RNA, DNA, protein) sequences.

• If you sequence a novel sequence, you can submit it!

• Started at Los Alamos National Laboratory
– Moved to the National Center for Biotechnology Information in 1993
– Stores almost every type of sequence possible


Using GenBank

• You can access GenBank at http://www.ncbi.nlm.nih.gov


GenBank

• Now bigger than 100,000,000,000 bases

• 240,000 named organisms

• More than 60 million records

http://woldlab.caltech.edu/biohub/scipy2006/genbankgrowth.jpg


The need for quality: RefSeq

• GenBank is an uncurated mess…

• RefSeq – the reference sequence project. This is a subset of GenBank, and a curated set of sequences and annotations

• RefSeq IDs have the form of “Letter-Letter-Underscore-Number.” The letter-letter-underscore prefixes are:
– “NM_” – mRNA
– “NP_” – protein
– “NT_” – genomic (automated)
– “XP_” – genomic protein (automated), etc.
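The "Letter-Letter-Underscore-Number" pattern lends itself to a small check. The sketch below is illustrative only: the function name describe_refseq is hypothetical, the prefix table contains just the four prefixes listed on the slide (the real set is larger), and the optional ".N" suffix is the record version.

```python
import re

# Prefix meanings exactly as listed on the slide; not the complete RefSeq set.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein",
    "NT": "genomic (automated)",
    "XP": "genomic protein (automated)",
}

# Two letters, an underscore, digits, and an optional ".version" suffix.
ACCESSION = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def describe_refseq(accession: str) -> dict:
    """Split a RefSeq-style ID into prefix, number and optional version."""
    m = ACCESSION.match(accession)
    if m is None:
        raise ValueError(f"not a Letter-Letter-Underscore-Number ID: {accession!r}")
    prefix, number, version = m.groups()
    return {
        "prefix": prefix,
        "molecule": REFSEQ_PREFIXES.get(prefix, "unknown"),
        "number": number,
        "version": int(version) if version else None,
    }


print(describe_refseq("NM_000492.3"))
print(describe_refseq("NP_000483"))
```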


When you look for a sequence…

• It will come back to you in something like the GenBank file format.


What are the important elements of GenBank format?

LOCUS       NM_000492               6132 bp    mRNA    linear   PRI 07-OCT-2007
DEFINITION  Homo sapiens cystic fibrosis transmembrane conductance regulator
            (ATP-binding cassette sub-family C, member 7) (CFTR), mRNA.
ACCESSION   NM_000492
VERSION     NM_000492.3  GI:90421312
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 6132)
  AUTHORS   Pall,H., Zielenski,J., Jonas,M.M., DaSilva,D.A., Potvin,K.M.,
            Yuan,X.W., Huang,Q. and Freedman,S.D.
  TITLE     Primary sclerosing cholangitis in childhood is associated with
            abnormalities in cystic fibrosis-mediated chloride channel function
  JOURNAL   J. Pediatr. 151 (3), 255-259 (2007)
   PUBMED   17719933
  REMARK    GeneRIF: There is a high prevalence of CFTR-mediated ion transport
            dysfunction in subjects with childhood primary sclerosing
            cholangitis.
..........


What are the important elements of GenBank format?

• Accession: NM_000492
• Version: 3
• GenBank ID (GI): 90421312
• Symbol: CFTR
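The accession, version and GI all sit on the ACCESSION and VERSION lines of the record, so they can be pulled out mechanically. This is a rough sketch that assumes the whitespace-separated layout shown in the record above; parse_genbank_ids is a hypothetical helper, not part of any library.

```python
def parse_genbank_ids(record_text: str) -> dict:
    """Extract accession, version and GI from GenBank header lines."""
    ids = {}
    for line in record_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[0] == "ACCESSION":
            ids["accession"] = fields[1]
        elif fields[0] == "VERSION":
            # e.g. "NM_000492.3": the part after the dot is the version.
            ids["versioned_accession"] = fields[1]
            _, _, version = fields[1].partition(".")
            ids["version"] = int(version) if version else None
            for field in fields[2:]:
                if field.startswith("GI:"):
                    ids["gi"] = field[len("GI:"):]
    return ids


# Header lines copied from the CFTR record shown in the slides.
header = """LOCUS       NM_000492               6132 bp    mRNA    linear   PRI 07-OCT-2007
ACCESSION   NM_000492
VERSION     NM_000492.3  GI:90421312
KEYWORDS    .
"""

print(parse_genbank_ids(header))
```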


What do you need to know?

• The gene ID or symbol is not a cite-able sequence!

• Without the version, the RefSeq ID is not unique!

• The GI is always unique: if a change occurs to the entry, a new GI is issued. When in doubt, use the GI!


FEATURES             Location/Qualifiers
     CDS             133..4575
                     /gene="CFTR"
                     /protein_id="NP_000483.3"
                     /db_xref="GI:90421313"
                     /db_xref="CCDS:CCDS5773.1"
                     /db_xref="GeneID:1080"
                     /db_xref="HGNC:1884"
                     /db_xref="HPRD:03883"
                     /db_xref="MIM:602421"
                     /translation="MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVD
                     SADNLSEKLEREWDRELASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLL
                     GRIIASYDPDNKEERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHHIGMQMRIAMFSLI
                     YKKTLKLSSRVLDKISIGQLVSLLSNNLNKFDEGLALAHFVWIAPLQVALLMGLIWEL
                     LQASAFCGLGFLIVLALFQAGLGRMMMKYRDQRAGKISERLVITSEMIENIQSVKAYC
                     WEEAMEKMIENLRQTELKL......

GenBank files contain annotation


Note that the protein translation is present in the CDS feature! There is also a separate protein record for this entry (NP_000483.3).
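The CDS location (133..4575) and the /translation qualifier can be checked against the mRNA itself. The sketch below takes the first 210 bases from the FASTA file shown earlier, builds the standard genetic code table, and translates from base 133. The function name translate is hypothetical, and GenBank coordinates are assumed to be 1-based and inclusive.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate DNA codon by codon; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# First 210 bases of NM_000492.3, copied from the FASTA slide.
mrna = (
    "AATTGGAAGCAAATGACATCACAGCAGGTCAGAGAAAAAGGGTTGAGCGGCAGGCACCCAGAGTAGTAGG"
    "TCTTTGGCATTAGGAGCTTGAGCCCAGACGGCCCTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAG"
    "GTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGCTGGACCAGACCAATTTTGAGGAAA"
)

cds_start, cds_end = 133, 4575           # from the CDS feature line
cds_length = cds_end - cds_start + 1     # inclusive coordinates
print(cds_length, cds_length % 3)        # a CDS should be a whole number of codons

# Convert the 1-based start to a 0-based Python index before slicing.
protein_start = translate(mrna[cds_start - 1:])
print(protein_start)
```

The translated fragment should agree with the beginning of the /translation qualifier in the record.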


Why so many IDs? Biology!

Gene Symbol (CFTR)

GeneID – Unique to organism (1080)

Transcript ID 1
• Transcript 1 version 1
• Transcript 1 version 2
• Transcript 1 version 3

Transcript ID 2, etc.
• Transcript 2 version 1
• Transcript 2 version 2


The hierarchy of names

• Gene symbol/name is the official alphanumeric name of a gene
– It is defined by the Human Genome Organization (HUGO)
– It does not reliably tell you the species or describe a transcript
– Example: CFTR, ER, AR, BRCA1

• Gene Identifier
– A gene identifier, generally a number, allows you to uniquely link to the gene and species
– Common identifiers:
  • Locus Link or Gene – NIH identifiers
  • Mouse Genome Informatics IDs (MGI) – Mouse identification
  • Ensembl IDs – European gene identifiers

• Transcript (the mRNA product of a gene)
– RefSeq
– GenBank

• Protein
– Next topic!


Protein databases

• The most popular protein database is called UniProt

• UniProt is a superset of Swiss-Prot, at http://uniprot.org

• Example for CFTR: the entry name is CFTR_HUMAN, and the accession has the form Letter+Number (P13569)
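The two UniProt identifiers can be told apart by shape. The sketch below is an illustration under assumptions: the accession pattern is the one given in UniProt's accession-number help as best recalled here, so it should be checked against the current documentation, and split_entry_name is a hypothetical helper that assumes entry names look like MNEMONIC_SPECIES.

```python
import re

# Assumed accession pattern (verify against UniProt's own documentation).
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def is_uniprot_accession(text: str) -> bool:
    """True if text has the shape of a UniProt accession such as P13569."""
    return UNIPROT_ACCESSION.match(text) is not None


def split_entry_name(name: str) -> tuple:
    """Split an entry name such as CFTR_HUMAN into (mnemonic, species)."""
    mnemonic, _, species = name.partition("_")
    return mnemonic, species


print(is_uniprot_accession("P13569"))
print(is_uniprot_accession("NM_000492"))
print(split_entry_name("CFTR_HUMAN"))
```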


Some simple rules for publication

• When publishing genes or proteins:

– When publishing information about a gene, use the HUGO approved symbol. If publishing information specifically about the gene, include the gene ID (preferably the Locus Link or Entrez Gene ID).

– When publishing information about a specific transcript of a gene, use the RefSeq ID. If including specifics about the sequence, include the version.

– UniProt is a great way to reference proteins.


What next?

• We now know how to find and cite a sequence from a database

• We know that genomes, genes, transcripts and proteins are treated differently

• How do we use sequences to do things like predict function?

• Sequence analysis!


Sequence analysis is based on the sequence alignment!

Given two sequences of letters, and a scoring scheme for evaluating matching letters, find the optimal pairing of letters from one sequence to letters of the other sequence. (The sequence analysis slides were adapted from Russ Altman’s bioinformatics course, Stanford University)

Align:
THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
THIS IS A SHORT SENTENCE.

THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
THIS IS A ------SHORT-- SENTENCE--------------.
OR
THIS IS A SHORT---------SENTENCE--------------.


Aligning biological sequences

• DNA (4 letter alphabet + gap)
  TTGACAC
  TTTACAC

• Proteins (20 letter alphabet + gap)
  RKVA--GMAKPNM
  RKIAVAAASKPAV


Statement of Problem

• Given:
– 2 sequences
– scoring system for evaluating match (or mismatch) of two characters
– penalty function for gaps in sequences

• Produce:
– Optimal pairing of sequences that
  • retains the order of characters
  • introduces gaps
  • maximizes total score


Why align sequences?

• Lots of sequences with unknown structure and function. A few sequences with known structure and function.

• If they align, they are similar, maybe due to common descent.

• If they are similar, then they might have the same structure or function.

• If one of them has known structure/function, then alignment to the other yields insight about how the structure or function works.


Alignments have parameters

Exact Matches OK, Inexact Costly, Gaps cheap.
This is a rather longer sentence than the next.
This is a ------------- sentence--------------.

OR

This is a *rather longer sentence than the next.
This is a s---h----o---rtsentence--------------.

Exact Matches OK, Inexact Moderate, Gaps cheap.
This is a rather longer sentence than the next.
This is a --short------ sentence--------------.

Exact Matches cheap, Inexact cheap, Gaps expensive.
This is a rather longer sentence than the next.
This is a short sentence.----------------------
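To see how the parameters decide which alignment wins, one can score two candidate alignments of the same pair of sequences under different settings. This is a toy sketch: the sequences ACT and AGT, the numeric scores, and the function name score_alignment are all made up for illustration, and each gap character is charged the same flat penalty.

```python
def score_alignment(top: str, bottom: str, match: int, mismatch: int, gap: int) -> int:
    """Sum column scores of an already-gapped alignment ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# Two ways to align ACT with AGT.
ungapped = ("ACT", "AGT")    # one mismatch, no gaps
gapped = ("AC-T", "A-GT")    # no mismatches, two gaps

cheap_gaps = {"match": 1, "mismatch": -3, "gap": -1}
costly_gaps = {"match": 1, "mismatch": -1, "gap": -2}

for label, params in (("gaps cheap", cheap_gaps), ("gaps expensive", costly_gaps)):
    print(label,
          "ungapped:", score_alignment(*ungapped, **params),
          "gapped:", score_alignment(*gapped, **params))
```

With cheap gaps the gapped candidate scores higher; with expensive gaps the ungapped one does, which mirrors the sentence examples above.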


Multiple alignment

• Pairwise alignment (two at a time) is much easier than multiple alignment (N at a time). To be discussed later.

This is a rather longer sentence than the next.
This is a short sentence.
This is the next sentence.
Rather long is the next concept.
Rather longer than what is the next concept.


Multiple Alignment

This is a rather longer sentence than --------the next---------.
This is a short-------- sentence-------------------------------.
This is --------------------------------------the next sentence.
----------Rather long is --------the------------- next concept-.
----------Rather longer ---------than what is the next concept-.


Dot Plots To Visualize Sequence Similarity

• Put one sequence along the top row of a matrix.

• Put the other sequence along the left column of the matrix.

• Plot a dot every time there is a match between an element of the row sequence and an element of the column sequence.

• Diagonal lines indicate areas of match.
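The recipe above translates almost directly into code. The following is a minimal text-mode sketch (the function names dot_plot and render are hypothetical) that uses the two DNA sequences from the earlier alignment slide; "*" marks a match and "." a non-match.

```python
def dot_plot(row_seq: str, col_seq: str) -> list:
    """matrix[i][j] is True when col_seq[i] matches row_seq[j]."""
    return [[r == c for r in row_seq] for c in col_seq]


def render(row_seq: str, col_seq: str) -> str:
    """Draw the dot plot with the row sequence on top and the column sequence on the left."""
    lines = ["  " + row_seq]
    for c, row in zip(col_seq, dot_plot(row_seq, col_seq)):
        lines.append(c + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("TTGACAC", "TTTACAC"))
```

Runs of "*" along a diagonal correspond to stretches where the two sequences match position by position.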


Problems with dot matrices

• Rely on visual analysis

• Difficult to find optimal alignments

• Need scoring schemes more sophisticated than “identical match”

• Difficult to estimate significance of alignments


The Dynamic Programming Algorithm

• Sequence alignment generally uses an algorithm called dynamic programming

• Dynamic programming allows for fast, optimal alignment of sequences

• Very informally, dynamic programming finds the best path through a dot plot.
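As an illustration of the idea, here is a compact sketch of global alignment by dynamic programming in the style of Needleman-Wunsch. It is a teaching sketch under assumptions, not a production aligner: the scores (match +1, mismatch -1, gap -2) are arbitrary, the gap penalty is a flat cost per gap character, and ties in the traceback are broken in favor of the diagonal move.

```python
def needleman_wunsch(x: str, y: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_x, aligned_y) for an optimal global alignment."""
    n, m = len(x), len(y)
    # score[i][j] = best score for aligning x[:i] with y[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,   # align x[i-1] with y[j-1]
                score[i - 1][j] + gap,       # gap in y
                score[i][j - 1] + gap,       # gap in x
            )
    # Trace back from the bottom-right corner to recover one optimal path.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, aligned_x, aligned_y = needleman_wunsch("TTGACAC", "TTTACAC")
print(best)
print(aligned_x)
print(aligned_y)
```

The table has one cell per pair of prefixes, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments.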


Substitution Matrices

• The degree of match between two letters can be represented in a matrix.

• Area of active research

• Changing matrix changes alignments
– context-specific matching
– information theoretic interpretation of scores
– modeling evolution with different matrices


A Sample Match Matrix for the amino acids (PAM-250).


Where do matrices come from?

• 1. Manually align protein structures (or, more risky, sequences)

• 2. Look for frequency of amino acid substitutions at structurally nearly constant sites.

• 3. Entry ~ log(freq(observed)/freq(expected))
  • + -> More likely than random
  • 0 -> At random base rate
  • - -> Less likely than random
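The log-odds entry in step 3 can be worked through with made-up numbers. In this sketch the frequencies are hypothetical, the expected frequency of a pair is taken as the product of the two background frequencies, and base-2 logarithms are an arbitrary choice (published matrices apply their own scaling and rounding).

```python
import math


def log_odds(observed_pair_freq: float, freq_a: float, freq_b: float) -> float:
    """log2 of observed pair frequency over the frequency expected by chance."""
    expected = freq_a * freq_b
    return math.log2(observed_pair_freq / expected)


# Hypothetical background frequencies of 0.1 for each of two amino acids,
# so a chance pairing is expected with frequency 0.1 * 0.1 = 0.01.
for observed in (0.02, 0.01, 0.005):
    print(observed, round(log_odds(observed, 0.1, 0.1), 3))
```

A pair seen twice as often as expected gets a positive entry, a pair seen at the chance rate gets roughly zero, and a pair seen half as often gets a negative entry.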


Global vs. Local Alignment

• Global alignment: find alignment in which total score is highest, perhaps at expense of areas of great local similarity.

• Local alignment: find alignment in which the highest scoring subsequences are identified, at the expense of the overall score.


Sequence alignment

• Two types of alignment approaches for DNA and Protein
– Global alignments
  • Originally described with the Needleman-Wunsch algorithm [1970] using dynamic programming
  • Optimally finds an alignment between two sequences
  • ClustalW is software for global alignment
– Local alignments
  • Originally described with the Smith-Waterman algorithm [1981], again using dynamic programming
  • Optimally finds subsequence alignments
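For the local case, a sketch in the style of Smith-Waterman differs from the global recurrence in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scores used here (match +2, mismatch -1, gap -2) are illustrative assumptions, with a flat penalty per gap character; the two sequences are the ones from the Global vs. Local Alignment slide.

```python
def smith_waterman(x: str, y: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_x, aligned_y) for one best local alignment."""
    n, m = len(x), len(y)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(
                0,                           # start a fresh local alignment
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if x[i - 1] == y[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


seq_x = "TTGACACCCTCCCAATTGTA"
seq_y = "ACCCCAGGCTTTACACAT"
local_score, local_x, local_y = smith_waterman(seq_x, seq_y)
print(local_score)
print(local_x)
print(local_y)
```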


Uses of sequence alignment

• Searching databases
• Assembling genome sequences
• Annotation transfer: similar sequences imply similar function
• Constructing multiple sequence alignments of groups of sequences
• Understanding evolutionary relationships between sequences
• DNA/RNA hybridization probe design
• Peptide and protein identification in proteomic experiments

• For more detail on sequence alignment algorithms and their statistics see: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Durbin, Eddy, Krogh and Mitchison (Cambridge University Press)


Global vs. Local Alignment

TTGACACCCTCCCAATTGTA
ACCCCAGGCTTTACACAT

TTGACACCCTCCCAATTGTA---
     ||||      ||  |
-----ACCCCAGGCTTTACACAT

---------TTGACACCCTCCCAATTGTA
         || ||||
ACCCCAGGCTTTACACAT-----------

(from Gribskov/Devereaux page 133)


Local alignment asks a different question than global alignment.

• NOT: Are these two sequences generally the same? (Global question)

• BUT: Do these two sequences contain high scoring subsequences? (Local question)

Local similarities may occur in sequences with different structure or function that share common substructure/subfunction. MOTIFS…


Obtaining a Global Alignment

• ClustalW, http://www.ebi.ac.uk/Tools/clustalw/index.html

• Enter sequences in FASTA format!

• Parameters: penalties for gaps, substitution matrix, etc.


Local Alignment?

• We will learn about local alignment tomorrow.


Summary

• Biological sequences are stored in databases with annotations relevant to their function

• The primary sequence database is GenBank, hosted at the National Institutes of Health

• The providers of these databases have come up with database IDs that we should use when writing papers for genes, transcripts and proteins

• Sequence analysis allows us to compare two sequences through sequence alignments which come in two flavors, local and global