![Page 1: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/1.jpg)
Sequence Analysis (I)
Yuh-Shan Jou (周玉山), jou@ibms.sinica.edu.tw
Institute of Biomedical Sciences, Academia Sinica
![Page 2: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/2.jpg)
Bioinformatics
• Bioinformatics is the application of information technology to analyze, process, and manage biological data.
• Bioinformatics provides computational tools to facilitate the process of Data → Information → Knowledge → Discovery.
Don’t believe everything you see in a database, or even in GenBank! QC is the most important aspect and concern in bioinformatics!
![Page 3: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/3.jpg)
Roadmap to Genomics
[Diagram: roadmap from genome mapping to functional genomics. Recoverable labels:]
• Human Genome Physical Maps
  1. Markers: EST (Expressed Sequence Tag), STS (Sequence Tagged Site), STR (Short Tandem Repeat)
  2. Genomic DNA contigs: cosmid contigs, YAC contigs
• Radiation Hybrid Mapping Panels
• Transcriptional Map of the Human Genome: Database of ESTs (dbEST), cDNA sequencing, full-length cDNAs
• Sequencing of the Human Genome
  1. BAC or PAC contigs
  2. Sequencing technologies
  Shotgun sequencing
• Positional Cloning and Positional Candidate Approaches
• Disease markers for diagnosis
• Diagnosis with GeneChips
  1. Expression patterns
  2. Expression profiles
  3. Microarray of genes
• Functional Genomics
![Page 4: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/4.jpg)
A Vision for the Future of Genome Research
Francis S. Collins (National Human Genome Research Institute, NIH, USA)
Nature 422:835 (2003)
![Page 5: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/5.jpg)
International Sequence Database Collaboration
• GenBank (NCBI, NIH): accessed through Entrez
• EMBL (EBI): accessed through SRS
• DDBJ (CIB, NIG): accessed through getentry
• Each database receives submissions and updates.
![Page 6: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/6.jpg)
www.ensembl.org
![Page 7: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/7.jpg)
http://genome.ucsc.edu
![Page 8: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/8.jpg)
Databases and Scientific Algorithms
Integration Bioinformatics draws on resources stored in different formats:
• Medline (ASN.1)
• Entrez/NCBI (ASN.1)
• PDB (Oracle, 3D images)
• BLAST (FASTA)
• ClustalW (FASTA)
• KEGG (HTML text, binary images)
• OMIM (text file)
• Microarray data (RDBMS, Excel)
![Page 9: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/9.jpg)
Web Access: www.ncbi.nlm.nih.gov
![Page 10: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/10.jpg)
NCBI Web Traffic
Christmas and New Year’s Day
Users per day
![Page 11: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/11.jpg)
The Entrez System: Text Searches
![Page 12: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/12.jpg)
Types of Databases
• Primary Databases
  – Original submissions by experimentalists
  – Content controlled by the submitter
  – Examples: GenBank, SNP, GEO
• Derivative Databases
  – Built from primary data
  – Content controlled by third party (NCBI)
  – Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
![Page 13: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/13.jpg)
Entrez Nucleotides
Primary
• GenBank / EMBL / DDBJ 49,675,750
Derivative
• RefSeq 545,503
• Third Party Annotation 4,544
• PDB 5,561
Total 50,231,358
![Page 14: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/14.jpg)
Entrez Protein: Derivative Databases
GenPept 3,950,968
RefSeq 1,348,072
Third Party Annotation 4,133
Swiss Prot 170,087
PIR 282,821
PRF 12,079
PDB 61,845
Total 5,830,005
BLAST nr total 2,336,522
![Page 15: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/15.jpg)
GenBank Growth
[Figure: The Growth of GenBank. Base pairs (billions) and records (millions), each on a 0 to 50 axis, plotted against date from Jun-82 to Jun-04.]
Release 148: 45.2 million records, 49.4 billion nucleotides
Average doubling time ≈ 14 months*
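The doubling time quoted above implies exponential growth. The following is a minimal sketch of that arithmetic, assuming growth stays exponential with a constant 14-month doubling time (an idealization); the function name `projected_size` is hypothetical and not part of any NCBI tool.

```python
def projected_size(current: float, months_ahead: float,
                   doubling_months: float = 14.0) -> float:
    """Project a database size forward, assuming constant exponential doubling."""
    return current * 2.0 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    # Release 148 held 49.4 billion nucleotides (figure from the slide).
    # 28 months is two doubling periods, so the size quadruples.
    print(projected_size(49.4, 28))
```

The same function applies to the record count (45.2 million), since the slide quotes a single average doubling time.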
![Page 16: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/16.jpg)
Organization of GenBank: Traditional Divisions
Records are divided into 17 divisions: 11 traditional and 6 bulk.
Traditional divisions:
• Direct submissions (Sequin and BankIt)
• Accurate
• Well characterized
PRI (28) Primate
PLN (13) Plant and Fungal
BCT (11) Bacterial and Archaeal
INV (7) Invertebrate
ROD (15) Rodent
VRL (4) Viral
VRT (7) Other Vertebrate
MAM (1) Mammalian
PHG (1) Phage
SYN (1) Synthetic (cloning vectors)
UNA (1) Unannotated
Entrez query: gbdiv_xxx[Properties]
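The `gbdiv_xxx[Properties]` pattern can be filled in mechanically from the three-letter division code. A small sketch, assuming the lowercase form of the code is what goes in place of `xxx` (as in `gbdiv_est[Properties]`); the helper name `division_query` is hypothetical, and the division codes are the 17 listed on these slides.

```python
# Division codes taken from the slides: 11 traditional and 6 bulk.
TRADITIONAL = {"pri", "pln", "bct", "inv", "rod", "vrl", "vrt",
               "mam", "phg", "syn", "una"}
BULK = {"est", "gss", "htg", "sts", "htc", "pat"}


def division_query(code: str) -> str:
    """Build an Entrez query term restricting a search to one GenBank division."""
    code = code.lower()
    if code not in TRADITIONAL | BULK:
        raise ValueError(f"unknown GenBank division: {code!r}")
    return f"gbdiv_{code}[Properties]"


if __name__ == "__main__":
    print(division_query("PLN"))  # term for the Plant and Fungal division
```

The resulting string is only the query term; it would be pasted into an Entrez search box or combined with other terms.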
![Page 17: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/17.jpg)
Organization of GenBank: Bulk Divisions
Records are divided into 17 divisions: 11 traditional and 6 bulk.
Bulk divisions:
• Batch submission (email and FTP)
• Inaccurate
• Poorly characterized
EST (355) Expressed Sequence Tag
GSS (132) Genome Survey Sequence
HTG (62) High Throughput Genomic
STS (5) Sequence Tagged Site
HTC (6) High Throughput cDNA
PAT (17) Patent
Entrez query: gbdiv_xxx[Properties]
![Page 18: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/18.jpg)
File Formats of the Sequence Databases
Each sequence is represented by a text record called a flat file.
• GenBank/GenPept (useful for scientists)
• FASTA (the simplest format)
• ASN.1 & XML (useful for programmers)
![Page 19: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/19.jpg)
A Traditional GenBank Record
LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004
DEFINITION Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds.
ACCESSION AY182241
VERSION AY182241.2 GI:32265057
KEYWORDS .
SOURCE Malus x domestica (cultivated apple)
  ORGANISM Malus x domestica
    Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE 1 (bases 1 to 1931)
  AUTHORS Pechous,S.W. and Whitaker,B.D.
  TITLE Cloning and functional expression of an (E,E)-alpha-farnesene synthase cDNA from peel tissue of apple fruit
  JOURNAL Planta 219, 84-94 (2004)
REFERENCE 2 (bases 1 to 1931)
  AUTHORS Pechous,S.W. and Whitaker,B.D.
  TITLE Direct Submission
  JOURNAL Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab, USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA
REFERENCE 3 (bases 1 to 1931)
  AUTHORS Pechous,S.W. and Whitaker,B.D.
  TITLE Direct Submission
  JOURNAL Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab, USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA
  REMARK Sequence update by submitter
COMMENT On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES Location/Qualifiers
  source 1..1931
    /organism="Malus x domestica"
    /mol_type="mRNA"
    /cultivar="'Law Rome'"
    /db_xref="taxon:3750"
    /tissue_type="peel"
  gene 1..1931
    /gene="AFS1"
  CDS 54..1784
    /gene="AFS1"
    /note="terpene synthase"
    /codon_start=1
    /product="(E,E)-alpha-farnesene synthase"
    /protein_id="AAO22848.2"
    /db_xref="GI:32265058"
    /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
    NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
    EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
    DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
    GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
    LSLLFQPLVN"
ORIGIN
  1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
  61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
  121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
  181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
  241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//
Header
Feature Table
Sequence
The Flatfile Format
![Page 20: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/20.jpg)
The Header
![Page 21: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/21.jpg)
Header: Locus Line
LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004
• Locus name: AY182241
• Length: 1931 bp
• Molecule type: mRNA
• Division: PLN
• Modification date: 04-MAY-2004
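The fields of the LOCUS line can be pulled apart programmatically. The sketch below assumes a whitespace-separated line with exactly the eight tokens seen in the example record; real GenBank files are column-oriented and can carry extra or missing tokens, so this is an illustration rather than a general parser. The names `Locus` and `parse_locus` are hypothetical, and the `topology` field ("linear") is not one of the callouts on the slide.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str        # locus name
    length: int      # sequence length in bp
    molecule: str    # molecule type, e.g. mRNA
    topology: str    # e.g. linear (assumed field name)
    division: str    # GenBank division code, e.g. PLN
    date: str        # modification date


def parse_locus(line: str) -> Locus:
    """Split a simple LOCUS line into its labelled fields."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("unexpected LOCUS line layout")
    return Locus(fields[1], int(fields[2]), fields[4],
                 fields[5], fields[6], fields[7])


if __name__ == "__main__":
    record = parse_locus(
        "LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004")
    print(record.division, record.length)
```

For production work an established parser such as the one in Biopython is the safer choice.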
![Page 22: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/22.jpg)
Header: Database Identifiers
ACCESSION AY182241
VERSION AY182241.2 GI:32265057
• Accession: stable, reportable, universal
• Version: tracks changes in sequence
• GI number: NCBI internal use
![Page 23: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/23.jpg)
Header: Organism
SOURCE Malus x domestica (cultivated apple)
  ORGANISM Malus x domestica
    Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
NCBI-controlled taxonomy
![Page 24: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/24.jpg)
The Feature Table
FEATURES Location/Qualifiers
  source 1..1931
    /organism="Malus x domestica"
    /mol_type="mRNA"
    /cultivar="'Law Rome'"
    /db_xref="taxon:3750"
    /tissue_type="peel"
  gene 1..1931
    /gene="AFS1"
  CDS 54..1784
    /gene="AFS1"
    /note="terpene synthase"
    /codon_start=1
    /product="(E,E)-alpha-farnesene synthase"
    /protein_id="AAO22848.2"
    /db_xref="GI:32265058"
    /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
    NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
    EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
    NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
    LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
    ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
    EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
    KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
    DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
    GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
    LSLLFQPLVN"
• Coding sequence (CDS 54..1784): start (atg), stop (tag)
• Implied protein (/translation)
• GenPept identifiers (/protein_id, /db_xref)
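The CDS location and the start codon can be checked against the sequence with a few lines of arithmetic. This sketch assumes a simple `start..end` location on the forward strand (no joins, complements, or partial markers); the helper name `cds_length` is hypothetical. The 60 bases are the first ORIGIN line of record AY182241 as shown on the slides.

```python
def cds_length(location: str) -> int:
    """Length in nucleotides of a simple 'start..end' feature location (1-based, inclusive)."""
    start, end = (int(x) for x in location.split(".."))
    return end - start + 1


# First ORIGIN line of AY182241 (bases 1 to 60), spaces removed.
first_60 = ("ttcttgtatc ccaaacatct cgagcttctt "
            "gtacaccaaa ttaggtattc actatggaat").replace(" ", "")

cds_start = 54
n_bases = cds_length("54..1784")
n_codons = n_bases // 3                      # includes the stop codon
start_codon = first_60[cds_start - 1:cds_start + 2]

if __name__ == "__main__":
    print(n_bases, n_bases % 3, n_codons, start_codon)
```

A CDS of 1731 bases is 577 codons; dropping the stop codon leaves 576 amino acids for the implied protein.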
![Page 25: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/25.jpg)
The Sequence: 99.99% Accurate
ORIGIN
  1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
  61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
  121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
  181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
  ...
  1741 ggacccacat cctgtcttta ctattccaac ctcttgtaaa ctagtactca tatagtttga
  1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
  1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
  1921 aaaaaaaaaa a
//
![Page 26: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/26.jpg)
FASTA Format
>gi|30256|emb|CAA42556.1| c-src-kinase [Homo sapiens]
MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREG
VKAGTKLSLMPWFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCDGKVEHYRIMYHASKLSI
DEEVYFENLMQLVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVM
LGDYRGNKVAVKCIKNDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRS
RGRSVLGGDCLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLTKEASSTQDTGKLPV
KWTAPEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMK
NCWHLDAAMRPSFLQLREQLEHIKTHELHL
Definition line fields, in order: ">", gi number, database identifier, Accession.Version, locus name, then a description with the organism in square brackets.
Database identifiers:
gb GenBank
emb EMBL
dbj DDBJ
ref RefSeq
sp SWISS-PROT
pdb Protein Databank
pir PIR
prf PRF
tpg TPA-GenBank
tpe TPA-EMBL
tpj TPA-DDBJ
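The definition line can be split into its labelled parts with plain string operations. This is a minimal sketch that assumes the exact `>gi|number|db|accession.version|locus description [organism]` layout of the example; other defline styles are not handled, and the function name `parse_defline` is hypothetical.

```python
def parse_defline(defline: str) -> dict:
    """Split an NCBI-style 'gi' FASTA definition line into named fields."""
    if not defline.startswith(">gi|"):
        raise ValueError("expected a definition line starting with '>gi|'")
    ids, _, rest = defline[1:].partition(" ")
    parts = ids.split("|")
    if len(parts) < 4:
        raise ValueError("too few '|'-separated identifier fields")
    accession, _, version = parts[3].partition(".")
    description, organism = rest.strip(), None
    if rest.endswith("]") and "[" in rest:
        # The organism is the trailing bracketed text.
        description, _, org = rest.rpartition("[")
        description, organism = description.strip(), org[:-1]
    return {
        "gi": parts[1],
        "database": parts[2],
        "accession": accession,
        "version": version,
        "locus": parts[4] if len(parts) > 4 else "",
        "description": description,
        "organism": organism,
    }


if __name__ == "__main__":
    print(parse_defline(
        ">gi|30256|emb|CAA42556.1| c-src-kinase [Homo sapiens]"))
```

In the example the locus name field is empty, which is why two `|` characters would sit next to the space before the description.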
![Page 27: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/27.jpg)
Abstract Syntax Notation: ASN.1
Seq-entry ::= set {
  class nuc-prot ,
  descr {
    title "Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds." ,
    source {
      org {
        taxname "Malus x domestica" ,
        common "cultivated apple" ,
        db {
          {
            db "taxon" ,
            tag id 3750 } } ,
        orgname {
          name binomial {
            genus "Malus" ,
            species "x domestica" } ,
          mod {
            {
              subtype cultivar ,
              subname "'Law Rome'" } ,
            {
              subtype old-name ,
              subname "Malus domestica" ,
              attrib "(10)cultivar='Law Rome'" } } ,
          lineage "Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus" ,
          gcode 1 ,,
Formats shown: FASTA Nucleotide, FASTA Protein, GenPept, GenBank, ASN.1
![Page 28: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/28.jpg)
Bulk Divisions
• Expressed Sequence Tag
  – 1st pass single read cDNA
• Genome Survey Sequence
  – 1st pass single read gDNA
• High Throughput Genomic
  – incomplete sequences of genomic clones
• Sequence Tagged Site
  – PCR-based mapping reagents
• Batch submission and htg (email and ftp)
• Inaccurate
• Poorly characterized
![Page 29: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/29.jpg)
EST Division: Expressed Sequence Tags
[Diagram: from genes to ESTs]
• Nucleus: 30,000 genes
• RNA gene products
• Make cDNA library: 80-100,000 unique cDNA clones in library
• Isolate unique clones; sequence once from each end (5’ and 3’)
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTATTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTCTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGGTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAATTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACTGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCAAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
gbdiv_est[Properties]
![Page 30: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/30.jpg)
ESTs in Entrez
Total 26 million records
Human 6.0 million
Mouse 4.3 million
Rat 0.7 million
Zebrafish 0.6 million
Wheat 0.6 million
Barley 0.3 million
Maize 0.4 million
![Page 31: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/31.jpg)
Genome Sequencing - HTG, GSS, (WGS)
[Diagram: shotgun sequencing workflow. Whole BAC insert (or genome) → shredding → cloning, isolating → sequencing → assembly.]
Destinations labelled in the diagram:
• GSS division or trace archive
• Draft Sequence (HTG division)
• Whole genome shotgun assemblies (traditional division)
![Page 32: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/32.jpg)
HTG Division: Rice Draft Sequences
• Unfinished sequences of BACs
• Gaps and unordered pieces
• Finished sequences move to traditional GenBank division
![Page 33: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/33.jpg)
Whole Genome Shotgun Projects
• Traditional GenBank Divisions
• 200+ projects
  – Virus
  – Bacteria
  – Environmental sequences
  – Archaea
  – 51 Eukaryotes featuring:
    • Cow, Chicken, Rat, Mouse, Dog, Chimpanzee, Human
    • Pufferfish (2)
    • Honeybee, Anopheles, Fruit Flies (3), Silkworm
    • Nematode (C. briggsae)
    • Yeasts (8), Aspergillus (2)
    • Rice
![Page 34: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/34.jpg)
Zebrafish: WGS
wgs_master[Properties]
![Page 35: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/35.jpg)
Derivative Databases
• UniGene
• RefSeq
• TPA
![Page 36: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/36.jpg)
Primary vs. Derivative Sequence Databases
[Diagram: sequence fragments flow from Sequencing Centers and Labs into GenBank; Algorithms produce UniGene, and Curators produce RefSeq and Genome Assembly.]
• GenBank: updated ONLY by submitters
• UniGene, RefSeq, Genome Assembly: updated continually by NCBI
![Page 37: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/37.jpg)
What is UniGene?
A gene-oriented view of sequence entries
• MegaBlast-based automated sequence clustering
• Now informed by genome hits (New!)
• Nonredundant set of gene-oriented clusters
• Each cluster a unique gene
• Information on tissue types and map locations
• Includes known genes and uncharacterized ESTs
• Useful for gene discovery and selection of mapping reagents
![Page 38: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/38.jpg)
EST hits: Human mRNA
Albumin mRNA
5’ EST hits
3’ EST hits
![Page 39: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/39.jpg)
UniGene: Expressed Sequences
![Page 40: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/40.jpg)
Expression Data
![Page 41: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/41.jpg)
RELEASE 11 (May 13, 2005) AVAILABLE ON THE FTP SITE!
• Forming the “best representative” sequence
• Standardizing nomenclature and record structure
• Adding annotation (references, sequence features)
• Stable reference for tasks such as gene identification, polymorphism discovery, and comparative analysis
• RefSeq Release 11 includes over 1,425,971 proteins and 2,928 organisms.
• The release is available by FTP at: ftp://ftp.ncbi.nih.gov/refseq/release/
• RefSeq number is still not fixed.
srcdb_refseq[Properties]
![Page 42: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/42.jpg)
Curated RefSeq Records
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from X66503.1. Summary: Adenylosuccinate synthetase catalyzes the first committed step in the conversion of IMP to AMP.
RefSeq Nucleotide
LOCUS ADSS 1368 bp mRNA linear PRI 27-AUG-2002
DEFINITION Homo sapiens adenylosuccinate synthase (ADSS), mRNA.
ACCESSION NM_001126
VERSION NM_001126.1 GI:4557270
RefSeq Protein
LOCUS ADSS 455 aa linear PRI 27-AUG-2002
DEFINITION adenylosuccinate synthase; Adenylosuccinate synthetase (Ade(-)H-complementing) Homo sapiens.
ACCESSION NP_001117
VERSION NP_001117.1 GI:4557271
DBSOURCE REFSEQ: accession NM_001126.1
X records: Genome Annotation & Inferred or Predicted
vs.
N records: Provisional, Reviewed or Validated
![Page 43: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/43.jpg)
RefSeq Accession Numbers
mRNAs and Proteins
NM_123456 Curated mRNA
NP_123456 Curated Protein
NR_123456 Curated non-coding RNA
XM_123456 Predicted mRNA
XP_123456 Predicted Protein
XR_123456 Predicted non-coding RNA
Gene Records
NG_123456 Reference Genomic Sequence
Chromosome
NC_123456 Microbial replicons, organelle genomes, human chromosomes
Assemblies
NT_123456 Contig
NW_123456 WGS Supercontig
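The accession prefixes listed on this slide lend themselves to a simple lookup. The sketch below is only an illustration: the function name `classify_refseq` and the dictionary are hypothetical helpers, not part of any NCBI tool, and the descriptions are copied from the slide.

```python
# Prefix table transcribed from the slide; names here are illustrative only.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference genomic sequence",
    "NC": "Chromosome (microbial replicons, organelle genomes, human chromosomes)",
    "NT": "Contig",
    "NW": "WGS supercontig",
}


def classify_refseq(accession):
    """Return the record type for a RefSeq-style accession, or None."""
    prefix, separator, _rest = accession.strip().upper().partition("_")
    if not separator:
        # No underscore: not a RefSeq-style accession (e.g. a GenBank one).
        return None
    return REFSEQ_PREFIXES.get(prefix)


print(classify_refseq("NM_001126.1"))  # accession from the ADSS example record
print(classify_refseq("X66503.1"))     # GenBank accession, so no RefSeq type
```

The underscore is the quick visual cue: the GenBank source record X66503.1 has none, while the RefSeq records derived from it (NM_001126, NP_001117) do.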
![Page 44: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/44.jpg)
RefSeq Curation Processes
Curated genomic DNA (NC, NT, NW)
Curated Model mRNA (XM) (XR)
Curated mRNA (NM) (NR)
Model protein (XP)
Protein (NP)
Scanning....
http://www.ncbi.nlm.nih.gov/RefSeq/key.html#accessions
![Page 45: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/45.jpg)
RefSeq: NCBI’s Derivative Sequence Database
• Curated transcripts and proteins
– reviewed
– human, mouse, rat, fruit fly, zebrafish, Arabidopsis, microbial genomes (proteins), and more
• Model transcripts and proteins
• Assembled Genomic Regions (contigs)
– human genome
– mouse genome
– rat genome
• Chromosome records
– Human genome
– microbial
– organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
srcdb_refseq[Properties]
![Page 46: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/46.jpg)
RefSeq Benefits
• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current sequence data and biology
• data validation
• format consistency
• distinct accession series
• stewardship by NCBI staff and collaborators
![Page 47: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/47.jpg)
Third Party Annotation (TPA) Database
• Annotations of existing GenBank sequences
• Allows for community annotation of genomes
• Direct submissions
– BankIt
– Sequin
tpa[Properties]
![Page 48: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/48.jpg)
TPA record: WGS Assembly
CDS Feature
TPA protein
![Page 49: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/49.jpg)
Human Nucleotide Sequences
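| Source | Sequences |
|---|---|
| ISDC (GenBank/EMBL/DDBJ) | 8,965,327 |
| PRI | 916,017 (WGS 601,855) |
| EST | 6,003,916 |
| GSS | 905,645 |
| HTG | 18,364 |
| HTC | 49,373 |
| STS | 117,870 |
| PAT | 953,269 |
| RefSeq | 35,934 |
| TPA | 893 |
| Total | 9,002,154 |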
![Page 50: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/50.jpg)
Other NCBI Databases
• dbSNP: nucleotide polymorphism
• GEO: Gene Expression Omnibus
– microarray and other expression data
• Gene: gene records
– Unifies LocusLink and Microbial Genomes
• Structure: imported structures (PDB)
– Cn3D viewer, NCBI curation
• CDD: conserved domain database
– Protein families (COGs)
– Single domains (PFAM, SMART, CD)
![Page 51: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/51.jpg)
NCBI’s SNP Database
• Primary Database and Derivative (RefSNP)
• Single Nucleotide Polymorphism
• Repeat polymorphisms
• Insertion-Deletion Polymorphisms
• 24 Species
• Over 15 million submissions
![Page 52: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/52.jpg)
Submitted SNP
Hemochromatosis SNP
![Page 53: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/53.jpg)
RefSNP
• Non-redundant
• Computational Analysis
• BLAST hits to genome, mRNA, protein and structure
![Page 54: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/54.jpg)
![Page 55: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/55.jpg)
Sequence Similarity Searching
Basic Local Alignment Search Tool
(BLAST)
![Page 56: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/56.jpg)
BLAST
VAST
Pubmed
Text
Sequence
Structure
![Page 57: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/57.jpg)
Pairwise Alignment Summary

| | Global | Local |
|---|---|---|
| Goal | Best score for aligning the full-length sequences | Best score for aligning part of the sequences |
| Method | Dynamic programming | Dynamic programming |
| Algorithm | Needleman-Wunsch | Smith-Waterman |
| Table cells | Allowed any score | Never score below zero |
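The two columns of this summary differ in only a few lines of dynamic programming. The following is a minimal sketch, assuming a simple linear gap penalty and match/mismatch scores chosen purely for illustration (the slide does not specify any); `align_score` is a hypothetical helper name.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Score of the best global (Needleman-Wunsch) or local (Smith-Waterman)
    alignment of a and b under a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are charged, so the borders go negative.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # Local: table cells never score below zero.
            table[i][j] = cell
            best = max(best, cell)
    # Local: best cell anywhere. Global: bottom-right cell (full length).
    return best if local else table[-1][-1]


print(align_score("XXACGTYY", "ZZACGTWW", local=True))   # 4
print(align_score("XXACGTYY", "ZZACGTWW", local=False))  # 0
```

In the example the shared core ACGT scores 4 locally, while the global score is dragged down to 0 by the unrelated flanks that must also be aligned.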
![Page 58: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/58.jpg)
Global vs Local Alignment
[Figure: Seq 1 and Seq 2 shown twice, once as a global alignment and once as a local alignment]
![Page 59: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/59.jpg)
Global Alignment
Human: 15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
+A + + + DL F K D+L I+ T+ W+ GR G IP+NYV + + +++ PW+
Worm: 63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125

H: 85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
GK+ R AE+ L E G FLVR+S + D +L V + V+HYRI + H I F L
Worm: 126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194

H: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
L+ HY +ADGLC L P Y W ++ + ++L++ IG G+FG+V G + N VA
Worm: 195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264

H: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
VK +K A FLAEA +M +LRH L+ L V ++ + IVTE M + +L+ +L+ RGR
Worm: 265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332

H: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
L++ S V M YLE NF+HRDLAARN+L++ K++DFGL KE TG + P+KWTA
Worm: 333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401

H: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
PEA +F+TKSDVWSFGILL EI +FGR+PYP + +V+ +V+ GY+M P GCP +Y++M+ CW
Worm: 402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471

H: 424 LDAAMRPSFLQLREQLEHI 443
D RP+F L+ +LE +
Worm: 472 SDPDKRPTFETLQWKLEDL 492
human M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
M S .. AA SG. . .A ... .
worm MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
1 20 40 60

440 450
human REQLEHI--------KTHELHL
. .:: . : ...
worm QWKLEDLFNLDSSEYKEASINF
500
Align program (Lipman and Pearson)
![Page 60: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/60.jpg)
Basic Local Alignment Search Tool
• Widely used similarity search tool
• Heuristic approach based on the Smith-Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• All combinations (DNA/Protein) of query and database
– DNA vs DNA
– DNA translation vs Protein
– Protein vs Protein
– Protein vs DNA translation
– DNA translation vs DNA translation
• www, standalone, and network clients
![Page 61: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/61.jpg)
What BLAST tells you
• BLAST reports surprising alignments
– Different than chance
• Assumptions
– Random sequences
– Constant composition
• Conclusions
– Surprising similarities imply evolutionary homology
Evolutionary Homology: descent from a common ancestor
Does not always imply similar function
![Page 62: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/62.jpg)
BLAST/FASTA variants for different searches
| Program | Query | Database | Comparison | Searching purpose |
|---|---|---|---|---|
| blastn/fasta | DNA | DNA | DNA level | homologous DNA |
| blastp/fasta | Protein | Protein | Protein level | homologous protein |
| blastx/fastx | DNA | Protein | Protein level | New genes from DNA |
| tblastn/tfasta | Protein | DNA | Protein level | New genes from peptide |
| tblastx/tfastx | DNA | DNA | Protein level | New genes from DNA |

BLAST Web site: http://www.ncbi.nlm.nih.gov/BLAST
FASTA Web sites: http://www2.ebi.ac.uk/fasta3/ or http://www.fasta.genome.ad.jp/
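The program table above can be read as a lookup from (query type, database type) to the BLAST programs that accept that combination. A small sketch of that reading follows; `BLAST_PROGRAMS` and `pick_programs` are hypothetical names for illustration and do not call any BLAST software.

```python
# (query type, database type, level at which sequences are compared),
# transcribed from the slide's table.
BLAST_PROGRAMS = {
    "blastn": ("DNA", "DNA", "DNA level"),
    "blastp": ("Protein", "Protein", "Protein level"),
    "blastx": ("DNA", "Protein", "Protein level"),
    "tblastn": ("Protein", "DNA", "Protein level"),
    "tblastx": ("DNA", "DNA", "Protein level"),
}


def pick_programs(query_type, database_type):
    """List the programs that accept this query/database combination."""
    return sorted(
        name
        for name, (query, database, _level) in BLAST_PROGRAMS.items()
        if query == query_type and database == database_type
    )


print(pick_programs("DNA", "DNA"))      # ['blastn', 'tblastx']
print(pick_programs("Protein", "DNA"))  # ['tblastn']
```

DNA against DNA is the only combination with two choices: blastn compares at the DNA level, while tblastx translates both sides and compares at the protein level.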
![Page 63: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/63.jpg)
BLASTN Databases
nr GenBank, EMBL, DDBJ, PDB and NCBI reference sequences (RefSeq)
htgs High-throughput genomic sequences (draft)
pat Patented nucleotide sequences
mito Mitochondrial sequences
vector Vector subset of GenBank
month GenBank, EMBL, DDBJ, PDB from the last 30 days
chrom Contigs and chromosomes from RefSeq
![Page 64: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/64.jpg)
BLASTP Databases
nr GenBank CDS translations, RefSeq, PDB, SWISS-PROT, PIR, PRF
swissprot SWISS-PROT
pat Patented protein sequences
pdb Protein Data Bank
month GenBank CDS translations, PDB, SWISS-PROT, PIR, PRF from the last 30 days
![Page 65: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/65.jpg)
GTACTGGACATGGACCCTACAGGAACGT
TGGACATGGACCCTACAGGAACGTATAC
CATGGACCCTACAGGAACGTATACGTAA . . .
Nucleotide Words
GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT . . .
Make a lookup table of words
Query: GTACTGGACATGGACCCTACAGGAACGTATACGTAAG
11-mer

| WORD SIZE | Def. | Min. |
|---|---|---|
| blastn | 11 | 7 |
| megablast | 28 | 12 |
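Breaking the query into overlapping words and indexing them can be sketched in a few lines. This is an illustration of the idea on the slide, not BLAST's actual implementation; `make_words` and `build_lookup` are hypothetical helper names, and the default word size of 11 is the blastn default given above.

```python
from collections import defaultdict

QUERY = "GTACTGGACATGGACCCTACAGGAACGTATACGTAAG"  # query from the slide


def make_words(sequence, word_size=11):
    """All overlapping words of length word_size, in order of position."""
    return [sequence[i:i + word_size] for i in range(len(sequence) - word_size + 1)]


def build_lookup(sequence, word_size=11):
    """Map each word to the list of positions where it starts."""
    table = defaultdict(list)
    for position, word in enumerate(make_words(sequence, word_size)):
        table[word].append(position)
    return table


words = make_words(QUERY)
print(words[:3])  # ['GTACTGGACAT', 'TACTGGACATG', 'ACTGGACATGG']
lookup = build_lookup(QUERY)
print(lookup["ACATGGACCCT"])  # [7]
```

A sequence of length L yields L - W + 1 words of size W, so a smaller word size produces more seeds and a more sensitive but slower search.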
![Page 66: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/66.jpg)
Protein Words
Query: GTQITVEDLFYNIATRRKALKN
Neighborhood Words
LTV, MTV, ISV, LSV, etc.
GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...
Make a lookup table of words
Word size = 3 (default); word size can only be 2 or 3
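To see why LTV, MTV, ISV and LSV count as neighbors of the query word ITV, each candidate can be scored against ITV with BLOSUM62 and compared with a threshold. The sketch below uses only the handful of BLOSUM62 entries needed for these words; the threshold `T = 11` is an assumption for illustration (the slide does not state one), and the helper names are hypothetical.

```python
QUERY = "GTQITVEDLFYNIATRRKALKN"  # protein query from the slide

# The few BLOSUM62 entries needed for the words on the slide.
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("I", "L"): 2, ("I", "M"): 1,
    ("T", "T"): 5, ("S", "T"): 1,
    ("V", "V"): 4,
}


def pair_score(a, b):
    """Symmetric lookup in the partial matrix (KeyError if a pair is missing)."""
    key = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
    return BLOSUM62_SUBSET[key]


def word_score(word, other):
    """Sum of substitution scores over aligned positions of two words."""
    return sum(pair_score(a, b) for a, b in zip(word, other))


query_words = [QUERY[i:i + 3] for i in range(len(QUERY) - 2)]
print(query_words[:8])  # ['GTQ', 'TQI', 'QIT', 'ITV', 'TVE', 'VED', 'EDL', 'DLF']

T = 11  # assumed neighborhood threshold, for illustration only
candidates = ["ITV", "LTV", "MTV", "ISV", "LSV"]
scores = {word: word_score("ITV", word) for word in candidates}
print(scores)  # {'ITV': 13, 'LTV': 11, 'MTV': 10, 'ISV': 9, 'LSV': 7}
print([word for word in candidates if scores[word] >= T])  # ['ITV', 'LTV']
```

Lowering the threshold admits more neighbors (MTV at 10, ISV at 9, LSV at 7), which increases sensitivity at the cost of more word hits to examine.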
![Page 67: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/67.jpg)
Minimum Requirements for a Hit
• Nucleotide BLAST requires one exact match
• Protein BLAST requires two neighboring matches within 40 aa
GTQITVEDLFYNI
SEI YYN
ATCGCCATGCTTAATTGGGCTT
CATGCTTAATT
neighborhood words
exact word match
one match
two matches
![Page 68: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/68.jpg)
BLAST Algorithm
(1) Break the query sequence into words of length W (W default = 11)
(2) Compare the word list to the database and identify exact matches
![Page 69: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/69.jpg)
(3) For each word match, extend alignment in both directions
(4) Compute E-value
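Steps (1) to (3) can be sketched as a seed-and-extend loop. This is a simplified illustration rather than BLAST's implementation: it extends without gaps, uses the +1/-3 nucleotide scores, and stops extending when the running score falls more than an assumed `xdrop` of 5 below the best seen. All function names are hypothetical, and overlapping seeds on the same diagonal are reported separately instead of being merged.

```python
def build_lookup(query, w):
    """Step 1: map every word of length w in the query to its start positions."""
    table = {}
    for i in range(len(query) - w + 1):
        table.setdefault(query[i:i + w], []).append(i)
    return table


def extend_one_way(query, subject, qi, si, step, match, mismatch, xdrop):
    """Step 3: walk away from the seed and return the best score gained."""
    score = best = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += match if query[qi] == subject[si] else mismatch
        best = max(best, score)
        if best - score > xdrop:  # dropped too far below the best: give up
            break
        qi += step
        si += step
    return best


def seed_and_extend(query, subject, w=11, match=1, mismatch=-3, xdrop=5):
    """Return (query start, subject start, ungapped score) for each word hit."""
    lookup = build_lookup(query, w)
    hits = []
    for s in range(len(subject) - w + 1):
        # Step 2: exact word matches between subject and query word list.
        for q in lookup.get(subject[s:s + w], []):
            score = w * match
            score += extend_one_way(query, subject, q + w, s + w, 1, match, mismatch, xdrop)
            score += extend_one_way(query, subject, q - 1, s - 1, -1, match, mismatch, xdrop)
            hits.append((q, s, score))
    return hits


SUBJECT = "ATCGCCATGCTTAATTGGGCTT"  # nucleotide example from the earlier slide
print(seed_and_extend("GGCATGCTTAATTCC", SUBJECT))  # [(2, 5, 11)]
```

Step (4), the E-value, then converts each score into the number of hits of that quality expected by chance.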
![Page 70: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/70.jpg)
An alignment that BLAST can’t find
1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
|| | || || || | || || || || | ||| |||||| | | || | ||| |
1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
| || || || ||| || | |||||| || | |||||| ||||| | |
61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
|||| || ||||| || || | | |||| || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
Here there are no exactly matching words longer than 6, but for nucleotides there must be an exact match of at least 7.
![Page 71: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/71.jpg)
An Alignment BLAST Can Make
Solution: compare protein sequences; BLASTX
BLAST 2 Sequences (blastx) output:
Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3
![Page 72: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/72.jpg)
Nucleotide vs. Protein BLAST
aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
H.sapiens:  N R V T V V L G A Q W G D E G
            + + V +   V L G   Q W G D E G
A.thaliana: S Q V S G V L G C Q W G D E G
agtcaagtatctggtgtactcggttgccaatggggagatgaaggt
Comparing ADSS from H. sapiens and A. thaliana
BLASTp finds three matching words
BLASTn finds no match, because there are no 7 bp words
Protein searches are generally more sensitive than nucleotide searches.
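The claim about the ADSS fragments can be checked directly by measuring the longest run of identical positions in each pair of sequences. The sketch below compares the sequences position by position with no gaps, which matches how they are laid out on the slide; `longest_identical_run` is a hypothetical helper name.

```python
def longest_identical_run(a, b):
    """Length of the longest stretch of consecutive identical positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


human_dna = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
plant_dna = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"
human_protein = "NRVTVVLGAQWGDEG"
plant_protein = "SQVSGVLGCQWGDEG"

print(longest_identical_run(human_dna, plant_dna))          # 6
print(longest_identical_run(human_protein, plant_protein))  # 6
```

The longest exact DNA run is 6 bases, one short of the 7-base minimum word, so blastn has no seed. The protein sequences share a run of 6 residues, which contains several exact 3-letter words for blastp to seed on.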
![Page 73: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/73.jpg)
The Flavors of BLAST
• Standard BLAST
– traditional “contiguous” word hit
– position independent scoring
– nucleotide, protein and translations (blastn, blastp, blastx, tblastn, tblastx)
• Megablast
– optimized for large batch searches
– can use discontiguous words
• PSI-BLAST
– constructs PSSMs automatically; uses as query
– very sensitive protein search
• RPS BLAST
– searches a database of PSSMs
– tool for conserved domain searches
![Page 74: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/74.jpg)
Megablast: NCBI’s Genome Annotator
• Long alignments for similar DNA sequences
• Concatenation of query sequences
• Faster than blastn
• Contiguous Megablast
– exact word match
– Word size 28
• Discontiguous Megablast
– initial word hit with mismatches
– cross-species comparison
![Page 75: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/75.jpg)
MegaBLAST
AI217550
AI251192
AI254381
BE645079
C:\seq\hs.4.fsa
> 1133045 gnl|UG|Hs#S1133045 qd43b11.x1 Homo sapiens cDNA, 3' end CATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTGGTGAGAAGTGCTCGATTAGTTCAGACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGC> 1141828 gnl|UG|Hs#S1141828 qv37f11.x1 Homo sapiens cDNA, 3' end GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGTGCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATACATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAAGTCGTATCGATGT> 1145899 gnl|UG|Hs#S1145899 qv33c06.x1 Homo sapiens cDNA, 3' endGAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGTGCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATACATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAAGTCGTATCGATGT> 2291670 gnl|UG|Hs#S2291670 7e65f04.x1 Homo sapiens cDNA, 3' end TTTCATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTGGTGAGAAGTGCTCGATTAGTTCAAACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGCCTCCCAACCGCATTCCTGCCTGTGTAGCAGGCGGTGAGCACCCAGAAGGGGCACATACCTCTCCAAGCCTTGAAAGCAAAGCATGGAGATCTACAAAAATAGGATTTCCACTTGGAGAAATGTCGCTGGGACAGT
![Page 76: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/76.jpg)
Templates for Discontiguous Words
W = 11, t = 16, coding: 1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding: 1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding: 101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding: 101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding: 100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding: 100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111
Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5
W = word size; # matches in template
t = template length (window size within which the word match is evaluated)
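A template is applied as a mask: positions marked 1 must match, positions marked 0 are ignored. The following sketch shows that reading for the W = 11, t = 16 coding template; `template_hit` is a hypothetical helper name, and the example sequences are made up for illustration.

```python
CODING_11_16 = "1101101101101101"  # W = 11, t = 16, coding template from the slide


def template_hit(window_a, window_b, template):
    """True if the two windows agree at every position marked '1'."""
    if not (len(window_a) == len(window_b) == len(template)):
        raise ValueError("windows and template must have the same length")
    return all(
        a == b
        for a, b, flag in zip(window_a, window_b, template)
        if flag == "1"
    )


print(CODING_11_16.count("1"), len(CODING_11_16))  # 11 16

a = "ACGTACGTACGTACGT"
# Change every position the template ignores (every third base).
b = "".join("N" if flag == "0" else base for base, flag in zip(a, CODING_11_16))
print(template_hit(a, b, CODING_11_16))  # True
```

The coding template skips every third position, which corresponds to the codon wobble base, so sequences that differ only there still produce an initial hit.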
![Page 77: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/77.jpg)
Scoring Systems - Nucleotides
Identity matrix

| | A | G | C | T |
|---|---|---|---|---|
| A | +1 | -3 | -3 | -3 |
| G | -3 | +1 | -3 | -3 |
| C | -3 | -3 | +1 | -3 |
| T | -3 | -3 | -3 | +1 |

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||
CACGTAGCAAGCTTG-GTGTCA
raw score = 19 - 9 = 10
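The raw score of 10 can be reproduced by walking down the alignment columns. In the sketch below the match and mismatch values come from the identity matrix; charging the single gap position -3 is an assumption made so that the penalties add up to the 9 shown on the slide (19 matches, 2 mismatches, 1 gap). `raw_score` is a hypothetical helper name.

```python
def raw_score(top, bottom, match=1, mismatch=-3, gap=-3):
    """Sum column scores of an already aligned pair of sequences."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap        # assumed per-position gap charge
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


top = "CAGGTAGCAAGCTTGCATGTCA"
bottom = "CACGTAGCAAGCTTG-GTGTCA"
print(raw_score(top, bottom))  # 10
```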
![Page 78: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/78.jpg)
Scoring Systems - Proteins
Position Independent Matrices
PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used
BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly conserved blocks
• Each matrix derived separately from blocks with a defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST
Position Specific Score Matrices (PSSMs)
PSI- and RPS-BLAST
![Page 79: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/79.jpg)
A 4
R -1 5
N -2 0 6
D -2 -2 1 6
C 0 -3 -3 -3 9
Q -1 1 0 0 -3 5
E -1 0 0 2 -4 2 5
G 0 -2 0 -1 -3 -2 -2 6
H -2 0 1 -1 -3 0 0 -2 8
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1
A R N D C Q E G H I L K M F P S T W Y V X
BLOSUM62
Common amino acids have low weights
Rare amino acids have high weights
Negative for less likely substitutions
Positive for more likely substitutions
![Page 80: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/80.jpg)
Gapped Alignments
• Gapping provides more biologically realistic alignments
• Statistical behavior is not completely understood for gapped alignments
• Gapped BLAST parameters must be found by simulations for each matrix
Gap costs: -(a+bk)
a = gap open penalty
b = gap extend penalty
k = number of residues
For example: A gap of 1 residue receives the score “-(a+b)”.
![Page 81: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/81.jpg)
Scores
V D S - C Y
V E T L C F
BLOSUM62 +4 +2 +1 -12 +9 +3 = 7
PAM30 +7 +2 0 -10 +10 +2 = 11
Simply add the scores for each pair of aligned residues and (as necessary) factor in the gaps!
Different matrices produce different scores!
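The two totals can be reproduced by summing the per-pair scores shown on the slide and charging each gap run -(a + bk). In this sketch the pair scores are copied from the slide; the gap parameters (a = 11, b = 1 for BLOSUM62 and a = 9, b = 1 for PAM30) are assumptions chosen to reproduce the slide's -12 and -10 for a one-residue gap. Function names are hypothetical.

```python
def gap_cost(a, b, k):
    """Score of a gap of k residues: -(a + b*k)."""
    return -(a + b * k)


def score_alignment(top, bottom, pair_scores, a, b):
    """Add pair scores over aligned residues and charge each gap run once.
    Simplification: adjacent gaps in either sequence count as one run."""
    total = 0
    run = 0  # length of the gap run currently being read
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            run += 1
            continue
        if run:
            total += gap_cost(a, b, run)
            run = 0
        total += pair_scores[(x, y)]
    if run:
        total += gap_cost(a, b, run)
    return total


# Per-pair scores as printed on the slide.
BLOSUM62_PAIRS = {("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1, ("C", "C"): 9, ("Y", "F"): 3}
PAM30_PAIRS = {("V", "V"): 7, ("D", "E"): 2, ("S", "T"): 0, ("C", "C"): 10, ("Y", "F"): 2}

print(score_alignment("VDS-CY", "VETLCF", BLOSUM62_PAIRS, a=11, b=1))  # 7
print(score_alignment("VDS-CY", "VETLCF", PAM30_PAIRS, a=9, b=1))      # 11
```

Because a gap pays the opening penalty once and the extension penalty per residue, one long gap is cheaper than several short ones of the same total length.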
![Page 82: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/82.jpg)
Matrix differences

| PAM | BLOSUM |
|---|---|
| Built from global alignments | Built from local alignments |
| Built from a small amount of data | Built from a vast amount of data |
| Based on minimum replacement or maximum parsimony | Based on groups of related sequences counted as one |
| Better for finding global alignments and remote homologs | Better for finding local alignments |
| Higher PAM series means more divergence | Lower BLOSUM series means more divergence |
![Page 83: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/83.jpg)
Matrices - Rules of thumb
Need different levels of sensitivity?
– Close relationships: low PAM number (e.g. PAM 1) or high BLOSUM number (e.g. BLOSUM 80)
– Distant relationships: high PAM number (e.g. PAM 250) or low BLOSUM number (e.g. BLOSUM 45)
![Page 84: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/84.jpg)
Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution
[Plot: number of alignments (y-axis) versus score (x-axis)]
(applies to ungapped alignments)
$E = K m n \, e^{-\lambda S}$
$E = m n \, 2^{-S'}$
K = scale for search space
λ = scale for scoring system
S' = bit score = $(\lambda S - \ln K) / \ln 2$
Expect Value: E = number of database hits you expect to find by chance
Labels on the formula: size of database (enters through mn), your score (S), expected number of random hits (E)
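The two forms of the E-value formula are the same quantity written in raw-score and bit-score units, which a short calculation makes visible. The sketch below uses the standard Karlin-Altschul form E = K·m·n·e^(-λS) with bit score S' = (λS - ln K)/ln 2. The values of λ, K, m and n are illustrative assumptions, not taken from the slide, and the function names are hypothetical.

```python
import math


def bit_score(raw, lam, k):
    """Convert a raw score S to a bit score S' = (lam*S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


def evalue_from_raw(raw, m, n, lam, k):
    """E = K * m * n * exp(-lam * S)."""
    return k * m * n * math.exp(-lam * raw)


def evalue_from_bits(bits, m, n):
    """E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative values only: lambda and K depend on the scoring system.
LAM, K = 0.3176, 0.134
m, n = 250, 5_000_000  # assumed query length and database size in residues

raw = 97
bits = bit_score(raw, LAM, K)
print(round(bits, 1))
print(evalue_from_raw(raw, m, n, LAM, K))
print(evalue_from_bits(bits, m, n))  # same value as the line above

# Doubling the search space doubles the expected number of chance hits.
print(evalue_from_bits(bits, m, 2 * n) / evalue_from_bits(bits, m, n))
```

Because the bit score already absorbs λ and K, bit scores from searches with different matrices can be compared directly, while raw scores cannot.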
![Page 85: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/85.jpg)
WWW BLAST
![Page 86: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/86.jpg)
The BLAST homepage
Standard databases
Specialized Databases
![Page 87: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/87.jpg)
BLAST Databases: Nucleic Acid
• nr (nt)
– Traditional GenBank
– NM_ and XM_ RefSeqs
• refseq_rna
• refseq_genomic
– NC_ RefSeqs
• dbest
– EST Division
• est_human, mouse, others
• htgs
– HTG division
• gss
– GSS division
• wgs
– whole genome shotgun
• env_nt
– environmental samples
![Page 88: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/88.jpg)
Options for Advanced Blasting: Nucleotide
Example Entrez Queries
nucleotide all[Filter] NOT mammalia[Organism]
green plants[Organism]
biomol mrna[Properties]
biomol genomic[Properties]
Other Advanced
-W 7 word size
-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments
![Page 89: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/89.jpg)
BLAST Databases: Non-redundant protein
nr (non-redundant protein sequences)
– GenBank CDS translations
– NP_ RefSeqs
– Outside Protein
• PIR, Swiss-Prot, PRF
• PDB (sequences from structures)
pat protein patents
env_nr environmental samples
![Page 90: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/90.jpg)
Advanced Options: Filter
all[Filter] NOT mammals[Organism]
gene_in_mitochondrion[Properties]
2003:2005 [Modification Date]
tpa[Filter]
Nucleotide
biomol_mrna[Properties]
biomol_genomic[Properties]
Default setting
Hides low complexity for initial word hits only
Masks regions of query in lower case (pre-masked)
![Page 91: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/91.jpg)
BLAST Formatting Page
![Page 92: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/92.jpg)
BLAST Output: Graphic
mouse over
Sort by taxonomy
![Page 93: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/93.jpg)
BLAST Output: Descriptions
link to entrez
Sorted by e values
3 × 10^-12
Default e value cutoff 10
Gene Linkout
![Page 94: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/94.jpg)
TaxBLAST: Taxonomy Reports
![Page 95: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/95.jpg)
BLAST Output: Alignments
>gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL
Length = 615
Score = 42.0 bits (97), Expect = 3e-04
Identities = 26/59 (44%), Positives = 33/59 (55%), Gaps = 9/59 (15%)
Query 9   LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL 58
          L  +  P   L LEI P  VDVNVHP KHEV F     +H+   + +L V QQ +E+ L
Sbjct 280 LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL 338
Identical match
positive score(conservative)
negative substitution
gap
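The Identities and Gaps figures in the MutL report can be recomputed from the two aligned strings. The sketch below does only that; Positives would additionally need the full BLOSUM62 matrix, so they are left out. `alignment_stats` is a hypothetical helper name.

```python
def alignment_stats(query, sbjct):
    """Return (identities, gaps, alignment length) for two aligned strings."""
    if len(query) != len(sbjct):
        raise ValueError("aligned strings must have equal length")
    identities = sum(q == s and q != "-" for q, s in zip(query, sbjct))
    gaps = sum(q == "-" or s == "-" for q, s in zip(query, sbjct))
    return identities, gaps, len(query)


query = "LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL"
sbjct = "LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL"

identities, gaps, length = alignment_stats(query, sbjct)
print(f"Identities = {identities}/{length} ({100 * identities // length}%)")  # 26/59 (44%)
print(f"Gaps = {gaps}/{length} ({100 * gaps // length}%)")                    # 9/59 (15%)
```

Percentages are taken over the alignment length, gap columns included, which is why 26 identities over 59 columns gives 44% rather than 26 over the 50 query residues.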
![Page 96: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/96.jpg)
BLAST Output: Alignments
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1 1)
Length = 756
Score = 233 bits (593), Expect = 8e-62
Identities = 117/131 (89%), Positives = 117/131 (89%)

Query: 1    IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60
            IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct: 276  IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335

Query: 61   GSNSSRMYFTQTLLPGLAGPSGEMVKXXXXXXXXXXXXXXDKVYAHQMVRTDSREQKLDA 120
            GSNSSRMYFTQTLLPGLAGPSGEMVK              DKVYAHQMVRTDSREQKLDA
Sbjct: 336  GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA 395

Query: 121  FLQPLSKPLSS 131
            FLQPLSKPLSS
Sbjct: 396  FLQPLSKPLSS 406
Low-complexity sequence filtered (masked as X in the query, so those 14 columns are not counted as identities)
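To give a feel for why a serine/threonine-rich stretch gets masked, here is a rough sketch based on Shannon entropy in a sliding window. It is a simplified stand-in, not the filter BLAST actually runs; the function names, the 12-residue window and the 2.2-bit cutoff are illustrative assumptions.

```python
from collections import Counter
from math import log2

def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every residue that lies in a low-entropy window with 'X'.

    Simplified illustration only: window and threshold are assumed values.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) <= threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)

# Region of the subject sequence around the masked stretch on the slide.
print(mask_low_complexity("GEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQ"))
```

Because this sketch masks whole windows, its boundaries will generally not coincide with the 14 masked residues shown on the slide.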
![Page 97: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/97.jpg)
Neighbors: Precomputed BLAST
Nucleotide
Protein
Entrez Related Sequences produces a list of sequences sorted by BLAST score, but with no alignment details.
![Page 98: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/98.jpg)
BLink – Protein BLAST Alignments
• Lists only 200 hits
• List is nonredundant
![Page 99: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/99.jpg)
PSI-BLAST
Position-Specific Iterated BLAST
• Mining for protein domains
• Confirming relationships among related proteins
![Page 100: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/100.jpg)
Position-Specific Scoring Matrix (PSSM)
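| Pos | Res | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 206 | D | 0 | -2 | 0 | 2 | -4 | 2 | 4 | -4 | -3 | -5 | -4 | 0 | -2 | -6 | 1 | 0 | -1 | -6 | -4 | -1 |
| 207 | G | -2 | -1 | 0 | -2 | -4 | -3 | -3 | 6 | -4 | -5 | -5 | 0 | -2 | -3 | -2 | -2 | -1 | 0 | -6 | -5 |
| 208 | V | -1 | 1 | -3 | -3 | -5 | -1 | -2 | 6 | -1 | -4 | -5 | 1 | -5 | -6 | -4 | 0 | -2 | -6 | -4 | -2 |
| 209 | I | -3 | 3 | -3 | -4 | -6 | 0 | -1 | -4 | -1 | 2 | -4 | 6 | -2 | -5 | -5 | -3 | 0 | -1 | -4 | 0 |
| 210 | S | -2 | -5 | 0 | 8 | -5 | -3 | -2 | -1 | -4 | -7 | -6 | -4 | -6 | -7 | -5 | 1 | -3 | -7 | -5 | -6 |
| 211 | S | 4 | -4 | -4 | -4 | -4 | -1 | -4 | -2 | -3 | -3 | -5 | -4 | -4 | -5 | -1 | 4 | 3 | -6 | -5 | -3 |
| 212 | C | -4 | -7 | -6 | -7 | 12 | -7 | -7 | -5 | -6 | -5 | -5 | -7 | -5 | 0 | -7 | -4 | -4 | -5 | 0 | -4 |
| 213 | N | -2 | 0 | 2 | -1 | -6 | 7 | 0 | -2 | 0 | -6 | -4 | 2 | 0 | -2 | -5 | -1 | -3 | -3 | -4 | -3 |
| 214 | G | -2 | -3 | -3 | -4 | -4 | -4 | -5 | 7 | -4 | -7 | -7 | -5 | -4 | -4 | -6 | -3 | -5 | -6 | -6 | -6 |
| 215 | D | -5 | -5 | -2 | 9 | -7 | -4 | -1 | -5 | -5 | -7 | -7 | -4 | -7 | -7 | -5 | -4 | -4 | -8 | -7 | -7 |
| 216 | S | -2 | -4 | -2 | -4 | -4 | -3 | -3 | -3 | -4 | -6 | -6 | -3 | -5 | -6 | -4 | 7 | -2 | -6 | -5 | -5 |
| 217 | G | -3 | -6 | -4 | -5 | -6 | -5 | -6 | 8 | -6 | -8 | -7 | -5 | -6 | -7 | -6 | -4 | -5 | -6 | -7 | -7 |
| 218 | G | -3 | -6 | -4 | -5 | -6 | -5 | -6 | 8 | -6 | -7 | -7 | -5 | -6 | -7 | -6 | -2 | -4 | -6 | -7 | -7 |
| 219 | P | -2 | -6 | -6 | -5 | -6 | -5 | -5 | -6 | -6 | -6 | -7 | -4 | -6 | -7 | 9 | -4 | -4 | -7 | -7 | -6 |
| 220 | L | -4 | -6 | -7 | -7 | -5 | -5 | -6 | -7 | 0 | -1 | 6 | -6 | 1 | 0 | -6 | -6 | -5 | -5 | -4 | 0 |
| 221 | N | -1 | -6 | 0 | -6 | -4 | -4 | -6 | -6 | -1 | 3 | 0 | -5 | 4 | -3 | -6 | -2 | -1 | -6 | -1 | 6 |
| 222 | C | 0 | -4 | -5 | -5 | 10 | -2 | -5 | -5 | 1 | -1 | -1 | -5 | 0 | -1 | -4 | -1 | 0 | -5 | 0 | 0 |
| 223 | Q | 0 | 1 | 4 | 2 | -5 | 2 | 0 | 0 | 0 | -4 | -2 | 1 | 0 | 0 | 0 | -1 | -1 | -3 | -3 | -4 |
| 224 | A | -1 | -1 | 1 | 3 | -4 | -1 | 1 | 4 | -3 | -4 | -3 | -1 | -2 | -2 | -3 | 0 | -2 | -2 | -2 | -3 |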
Serine is scored differently in these two positions.
Active site nucleophile
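A PSSM lookup depends on the position as well as the residue, which is why the same serine can score differently along the sequence. The sketch below copies a few rows (210, 211 and 214 to 219) from the matrix on this slide; the helper names `score_at` and `score_segment` are hypothetical and only illustrate the lookup.

```python
# Column order used by the matrix on the slide.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Selected rows copied from the slide's PSSM (position -> 20 scores).
PSSM_ROWS = {
    210: [-2, -5, 0, 8, -5, -3, -2, -1, -4, -7, -6, -4, -6, -7, -5, 1, -3, -7, -5, -6],
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    214: [-2, -3, -3, -4, -4, -4, -5, 7, -4, -7, -7, -5, -4, -4, -6, -3, -5, -6, -6, -6],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
    217: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -8, -7, -5, -6, -7, -6, -4, -5, -6, -7, -7],
    218: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -7, -7, -5, -6, -7, -6, -2, -4, -6, -7, -7],
    219: [-2, -6, -6, -5, -6, -5, -5, -6, -6, -6, -7, -4, -6, -7, 9, -4, -4, -7, -7, -6],
}

def score_at(position, residue):
    """Score of `residue` at one PSSM position."""
    return PSSM_ROWS[position][AMINO_ACIDS.index(residue)]

def score_segment(segment, start):
    """Sum of per-position scores for an ungapped segment starting at `start`."""
    return sum(score_at(start + i, res) for i, res in enumerate(segment))

for pos in (210, 211, 216):
    print(pos, "S ->", score_at(pos, "S"))
print("GDSGGP at 214 ->", score_segment("GDSGGP", 214))
```

Serine scores 1 at position 210, 4 at 211 and 7 at 216, whereas a general matrix such as BLOSUM62 would give a serine match the same score everywhere.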
![Page 101: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/101.jpg)
Position-Specific Iterated BLAST: PSI-BLAST
Create your own PSSM: finding protein families based on your own sequence.
Workflow: query + BLOSUM62 → alignment → PSSM → alignment
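The "alignment to PSSM" step can be sketched as a log-odds calculation over alignment columns. This is a deliberately simplified illustration: it assumes a uniform background frequency and a flat pseudocount, whereas PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts. The function name `build_pssm` and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def build_pssm(aligned_seqs, pseudocount=1.0, background=1 / 20):
    """Return one {residue: log-odds score} dict per alignment column.

    Simplifying assumptions: uniform background, flat pseudocount,
    no sequence weighting, gap characters ignored.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

# Toy alignment (hypothetical sequences, for illustration only).
toy_alignment = ["GDSGGP", "GDSGGP", "GDAGGP", "GNSGGP"]
pssm = build_pssm(toy_alignment)
print(round(pssm[0]["G"], 2), round(pssm[2]["S"], 2), round(pssm[2]["A"], 2))
```

A fully conserved column rewards its residue strongly and penalizes everything else, while a variable column spreads smaller positive scores over the residues actually observed.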
![Page 102: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/102.jpg)
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK
PSI-BLAST
E-value cutoff for inclusion in the PSSM
![Page 103: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/103.jpg)
Results: Initial BLASTP
Same results as protein-protein BLAST
![Page 104: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/104.jpg)
Results of First PSSM Search
Other purine nucleotide metabolizing enzymes not found by ordinary BLAST
![Page 105: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/105.jpg)
Third PSSM Search: Convergence
Just below threshold, another nucleotide metabolism enzyme
Check to add to PSSM
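The iteration logic behind these result pages can be sketched as a loop that stops when a round adds no new sequences above the inclusion threshold. Everything in this sketch is hypothetical: `toy_search` stands in for a real database search, the hit names and E-values are invented for illustration, and the 0.005 threshold is an assumed setting.

```python
def iterate_until_converged(search, inclusion_evalue=0.005, max_rounds=10):
    """Repeat the search until no new hit passes the inclusion threshold.

    `search` is any callable taking the set of already included sequence
    names and returning {name: evalue}. Returns (included, rounds_run).
    """
    included = set()
    for round_no in range(1, max_rounds + 1):
        hits = search(frozenset(included))
        passing = {name for name, e in hits.items() if e <= inclusion_evalue}
        new = passing - included
        if not new:                 # nothing new: the search has converged
            return included, round_no
        included |= new             # these sequences feed the next PSSM
    return included, max_rounds

def toy_search(included):
    """Invented stand-in for a PSSM search; hits depend on what is included."""
    hits = {"ADA_MOUSE": 1e-100, "ADA_HUMAN": 1e-90}
    if "ADA_HUMAN" in included:
        hits["AMP_deaminase"] = 1e-4        # only found once a PSSM exists
    if "AMP_deaminase" in included:
        hits["adenine_deaminase"] = 0.02    # worse than the threshold
    return hits

included, rounds = iterate_until_converged(toy_search)
print(sorted(included), rounds)
```

In the toy run the last hit stays out of the model because its E-value is worse than the threshold, which mirrors the case on the slide where such a hit has to be checked by hand to be added to the PSSM.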
![Page 106: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/106.jpg)
Reverse Position-Specific BLAST (a.k.a. RPS-BLAST or CDD Search)
A sequence search of the Conserved Domain Database (CDD) containing curated Position-Specific Scoring Matrices.

                      10        20        30        40        50        60
              ....*....|....*....|....*....|....*....|....*....|....*....|
consensus   1 KWEIPREDLTLGKKLGEGAFGEVYKGTLKGkgd---nkSIDVAVKTLKEDASEeqIKEFL 57
1FGI A      1 aWEIPRESLRLEVKLGQGCFGEVWMGTWNG--------TTRVAIKTLKPGTMS--PEAFL 311
1BYG A      1 RWELPRDRLVLgkPLGEGAFGQVYLAEAIglgkdkpnrvTKVAVKMLKSDAtedkLSLDI 74
gi 125135   1 GWALNMKELKLlqTIGKGEFGDVMLGDYRg---------NKVAVKCIKNDAt---AQAFL 62
gi 125702   1 KYEIPRTDLTLkhKLGGGQYGEVYEGVWKky-------sLTVAVKTLKEDTm--eVEEFL 284
gi 1174437  1 KWEIPRSELTIlrKLGRGNFGEVFYGKWRn--------sIDVAVKTLREGTm--sTAAFL 325
PSSM Sources

| Collection | Maintainer | PSSMs |
|---|---|---|
| Pfam | Sanger | 7255 |
| SMART | EMBL | 663 |
| COG | NCBI | 4873 |
| KOG | NCBI | 4825 |
| CD | NCBI | 645 |
![Page 107: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/107.jpg)
Reverse Position-Specific BLAST (a.k.a. RPS-BLAST or CD Search)
Query: sequence
Database: PSSMs
P03958
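The reversal of roles (one query sequence, many stored PSSMs) can be sketched as a loop over a small PSSM collection, sliding each matrix along the query and keeping its best ungapped window. This is only a toy illustration, not how RPS-BLAST is implemented: the two matrices, their names, the default score of -1 for unlisted residues and the query string are all assumptions made for the example.

```python
def best_window(query, pssm, default=-1):
    """Best ungapped placement of a PSSM on the query: (score, start) or None.

    `pssm` is a list of {residue: score} dicts, one per column; residues
    not listed in a column get `default` (an assumed penalty).
    """
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best

def search_domains(query, pssm_db):
    """Score the query against every PSSM and rank the models by score."""
    results = []
    for name, pssm in pssm_db.items():
        hit = best_window(query, pssm)
        if hit is not None:
            results.append((hit[0], name, hit[1]))
    return sorted(results, reverse=True)

# Toy "database" of two hypothetical models (scores chosen for illustration).
TOY_DB = {
    "toy_GDSGGP": [{"G": 7}, {"D": 9}, {"S": 7}, {"G": 8}, {"G": 8}, {"P": 9}],
    "toy_GxGxxG": [{"G": 6}, {}, {"G": 6}, {}, {}, {"G": 6}],
}

results = search_domains("AAGDSGGPLL", TOY_DB)
for score, name, start in results:
    print(name, "score", score, "at offset", start)
```

The output is a ranked list of models rather than a ranked list of sequences, which is the sense in which the search is "reverse".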
![Page 108: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/108.jpg)
![Page 109: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/109.jpg)
Result: TyrKc
![Page 110: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/110.jpg)
Questions:
• Search for p53 protein homologs and annotate them using CDD.
• Can you map the codon 72 SNP onto the 3D protein structure?
![Page 111: Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) jou@ibms.sinica.edu.tw Institute of Biomedical Sciences, Academia Sinica](https://reader034.vdocuments.site/reader034/viewer/2022050805/56649eda5503460f94be9850/html5/thumbnails/111.jpg)
Other Areas to Cover
• Genomic Data
• Annotation
• Common domain prediction tools on the web
• Other Useful Genome Browsers