Post on 24-Jul-2020

DATA ACQUISITION FROM BIO-DATABASES AND BLAST

Natapol Pornputtapong

18 January 2018

DATABASE

• Collections of data

• To share – multi-user interface

• To prevent data loss

• To make sure to get the right things

Bioinformatics for Phylogenetic Analysis Workshop 2

LIBRARY -> DIGITAL LIBRARY

DATABASE: A LIBRARY OF DATA

Database

• Files, tables, records

• Data structure

• Database management system

• Programming interface

• User interface

Library

• Books

• Building, shelves

• Librarian

• Protocols, SOPs

• Services

ADVANTAGE OF DATABASE

• Data integrity

• Smaller space

• Data availability

• Speed

DATABASE FOR USERS

[Diagram: users and the database, linked by search, download, and submission]

HOW TO CHOOSE DATABASE?

• The NAR online Molecular Biology Database Collection lists 1695 bio-databases in 15 categories

DATA CONTENT

• Literature

• DNA sequence

• Protein sequence

[Figure labels: GenBank, RefSeq, TrEMBL]

CONCEPTS OF DATABASE

[Diagram: several sources feed databases, each reached through a database interface]

• Primary database

• Secondary database

PRIMARY & SECONDARY DB

Primary database

• Synonyms: archival database

• Source of data: direct submission of experimentally-derived data from researchers

• Examples: ENA, GenBank and DDBJ (nucleotide sequence); ArrayExpress Archive and GEO (functional genomics data); Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures)

Secondary database

• Synonyms: curated database; knowledgebase

• Source of data: results of analysis, literature research and interpretation, often of data in primary databases

• Examples: InterPro (protein families, motifs and domains); UniProt Knowledgebase (sequence and functional information on proteins); Ensembl (variation, function, regulation and more layered onto whole genome sequences)

DATA COLLECTION CRITERIA

GenBank RefSeq

GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.

ACCESSIBILITY: TOOLS & INTERFACES

• NCBI Entrez

• RESTful interface to the ENA
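As a rough illustration of programmatic access to Entrez, the sketch below builds (but does not send) a search URL for the NCBI E-utilities `esearch` endpoint. The helper name `build_esearch_url` and the example search term are illustrative assumptions, not taken from the slides.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities esearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an esearch URL for one Entrez database and one query term.

    db     -- Entrez database name, e.g. "nucleotide" or "protein"
    term   -- the search text, as it would be typed in the search box
    retmax -- maximum number of record IDs to request
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical example query: nucleotide records mentioning "cytochrome b".
url = build_esearch_url("nucleotide", "cytochrome b", retmax=5)
print(url)
```

Only the URL is constructed here, so the sketch runs offline; fetching it (for example with `urllib.request.urlopen`) would return an XML list of matching record IDs.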

NCBI SEARCH TOOL

SIMPLE SEARCH

BOOLEAN OPERATOR
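A minimal sketch of how Boolean operators combine search terms into one Entrez query string. The helper names `tag` and `combine`, and the organism and gene names used, are illustrative assumptions; `[Organism]` and `[Gene Name]` are assumed here to be the relevant Entrez field tags.

```python
def tag(term, field):
    """Attach an Entrez field tag to a term, e.g. COX1[Gene Name]."""
    return f"{term}[{field}]"


def combine(operator, *clauses):
    """Join clauses with one Boolean operator and wrap them in parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    return "(" + f" {operator} ".join(clauses) + ")"


# Hypothetical search: human records for either of two genes.
genes = combine("OR", tag("COX1", "Gene Name"), tag("CYTB", "Gene Name"))
query = combine("AND", tag('"Homo sapiens"', "Organism"), genes)
print(query)
```

Parentheses make the grouping explicit, so the OR between the two gene names is evaluated before the AND with the organism.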

FILTER

• Limit with filter

• Advanced search builder

RESULTS

BLAST: BASIC LOCAL ALIGNMENT SEARCH TOOL

MAJOR BLAST PROGRAMS
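The slide's own list did not survive in this transcript. As a hedged sketch, the lookup below covers the five standard NCBI BLAST programs and picks one from the type of the query and of the database. The function name `choose_blast_program` and its `translate_both` flag are illustrative assumptions.

```python
# (query type, database type) -> program, for the non-tblastx cases.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",   # nucleotide query vs nucleotide database
    ("protein", "protein"): "blastp",         # protein query vs protein database
    ("nucleotide", "protein"): "blastx",      # translated nucleotide query vs protein database
    ("protein", "nucleotide"): "tblastn",     # protein query vs translated nucleotide database
}


def choose_blast_program(query_type, db_type, translate_both=False):
    """Return the BLAST program name matching the query and database types.

    translate_both=True selects tblastx, which translates both a nucleotide
    query and a nucleotide database before comparing them as proteins.
    """
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    try:
        return BLAST_PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unsupported combination: {query_type} vs {db_type}")


print(choose_blast_program("protein", "nucleotide"))
```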

BLAST SEARCH

OTHER BLAST PROGRAMS

WORLD OF FILES

• Text files

• Binary files
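One rough way to tell the two kinds of file apart is to check whether the bytes decode as text. The sketch below is a simple heuristic, not a definitive test; the function name `looks_like_text` and the sample byte strings are illustrative assumptions.

```python
def looks_like_text(data: bytes) -> bool:
    """Heuristically decide whether a complete byte string is plain text.

    A NUL byte is treated as a sign of binary content; otherwise the data
    counts as text if it decodes cleanly as UTF-8 (which includes ASCII).
    """
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A FASTA-style snippet is text; the first bytes of a PNG image are not.
print(looks_like_text(b">P01013 GENE X\nQIKDLLVSSS\n"))
print(looks_like_text(b"\x89PNG\r\n\x1a\n"))
```

The function expects the whole file content; a sample cut in the middle of a multi-byte character could be misjudged as binary.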

TEXT FILES: WORLD OF FORMATS

• MS Word: .doc, .docx, .rtf, .txt

• Sequence: FastA (.fasta), Genbank (.gbk)

• Protein structure: PDB (.pdb)

FASTA FORMAT

>P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMP
FHVTKQESKPVQMMCMNNSFNVATLPAEKMKILELPFASGDL
SMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVY
LPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKI
SQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
>…
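To show how the format is read, here is a minimal sketch of a FASTA parser: each record starts at a line beginning with ">", and the following lines are joined into one sequence. The function name `parse_fasta` is an illustrative assumption, and the example uses only the header and first two sequence lines of the record above.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence line of the current record
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Excerpt of the record shown on the slide (first two sequence lines only).
example = """>P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMP
FHVTKQESKPVQMMCMNNSFNVATLPAEKMKILELPFASGDL
"""

records = parse_fasta(example)
header, sequence = records[0]
print(header.split()[0], len(sequence))
```

For real work, an established library such as Biopython is usually a safer choice than a hand-written parser.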

GENBANK FORMAT

NEXUS FORMAT

#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=8 NCHAR=1202;
FORMAT MISSING=? DATATYPE=PROTEIN GAP=-;
OPTIONS GAPMODE=MISSING;
MATRIX
[ 10 20 ...]
[ ---------|---------|-...]
Homo_sapiens_4379045 TERLVLPPPDPLDLPLRAVEL...
Pan_troglodytes_114606536 TERLVLPPPDPLDLPLRAVEL...
Ailuropoda_melanoleuca_301788522 TERLVLPPPDPLDLPLRPVEL...
Mus_musculus_87252727 TERLVLPPLDPLNLPLRALEV...
Danio_rerio_113678409 MDKIDLPPVGPDDLPLSLLEM...
Xenopus_tropicalis_301627725 MNTLDLSNRDPLDLPLSVLEL...
Monodelphis_domestica_126309591 TERLVLPPRGPLDLPLCALEL...
Canis_familiaris_73972333 TERLALPPPDPLDLPLRPVEL...;
END;
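The structure of a DATA block like the one above can be read with a few lines of code. The sketch below is a simplified reader, assuming one `DIMENSIONS` line and a `MATRIX` whose rows are "name sequence" pairs; it is not a full NEXUS parser. The function name `parse_nexus_data` and the two-taxon toy alignment are hypothetical.

```python
import re


def parse_nexus_data(text):
    """Return (ntax, nchar, matrix) from a simple NEXUS DATA block.

    matrix maps each taxon name to its aligned sequence string.
    """
    dims = re.search(r"DIMENSIONS\s+NTAX=(\d+)\s+NCHAR=(\d+)", text, re.IGNORECASE)
    if dims is None:
        raise ValueError("no DIMENSIONS line found")
    ntax, nchar = int(dims.group(1)), int(dims.group(2))

    matrix = {}
    in_matrix = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and [bracketed comments]
        if line.upper() == "MATRIX":
            in_matrix = True
            continue
        if not in_matrix:
            continue
        if line == ";":
            break  # a lone semicolon ends the matrix
        ends_matrix = line.endswith(";")
        name, seq = line.rstrip(";").split(None, 1)
        # Appending lets a taxon's sequence continue across repeated rows.
        matrix[name] = matrix.get(name, "") + seq.replace(" ", "")
        if ends_matrix:
            break
    return ntax, nchar, matrix


# Hypothetical toy alignment, not the data from the slide.
toy = """#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=2 NCHAR=4;
FORMAT MISSING=? DATATYPE=PROTEIN GAP=-;
MATRIX
[ 1234 ]
Taxon_A ACDE
Taxon_B AC-E;
END;
"""

ntax, nchar, matrix = parse_nexus_data(toy)
print(ntax, nchar, sorted(matrix))
```

A useful sanity check on real files is that the number of parsed taxa equals NTAX and every sequence has length NCHAR.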

NEXT

Inputs → Analysis → Results

QUESTIONS?
