Protein sequence databases
Petri Törönen
Shamelessly copied from material done by Eija Korpelainen. This also includes old material from my thesis (www.hytti.uku.fi/~toronen/Gradu_verkkoon.zip) and from the CSC bio-opas (http://www.csc.fi/oppaat/bio/, http://www.csc.fi/oppaat/bio/bio-opas.pdf).
Why protein sequences?
• most (laboratory) analysis is done with nucleotide sequences
• therefore the analysis at the nucleotide level is natural
But there are drawbacks
- Codon degeneracy => same protein, different nucleotide sequence!
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/C/Codons.html
- Similarity between different amino acids
Therefore not all of the similarity is visible at the nucleotide level!
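The codon point above can be illustrated with a minimal Python sketch. The two DNA strings are made-up examples (not from any database), and the helper names `translate` and `identity` are only illustrative. The codon table is the standard genetic code written in the usual TCAG order.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in reading frame 1, ignoring a trailing partial codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identity(a, b):
    """Fraction of identical positions between two equally long strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two made-up sequences that use different codons for the same amino acids.
seq_a = "CTTTCTCGTGCTGGT"
seq_b = "TTAAGCAGAGCCGGC"

print(translate(seq_a), translate(seq_b))  # both give LSRAG
print(identity(seq_a, seq_b))              # nucleotide identity is only 0.4
```

The proteins are identical although fewer than half of the nucleotides match, which is why a search at the protein level can find relatives that a nucleotide search misses.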
…more…
Protein databases also often include more detailed information.
The protein (not the RNA) is often the actual functional unit that has a biological function.
- Note the exceptions, such as structural RNAs.
Protein databases
• SwissProt
• TrEMBL
• PIR-PSD
SwissProt and TrEMBL (Translated EMBL) have been unified into UniProt. THIS INFO IS IN PART ERRONEOUS! SwissProt is still also available as a separate entity.
Differences between databases
• Some include all the available information (more or less reliable information)
– large coverage, everything is stored in the database
– low reliability, information has not been confirmed
– computer annotation => fast updating
• Some cover only the reliable information
– small coverage
– information is reliable
– expert curation => slow updating
• SwissProt – TREMBL – RemTREMBL
Why is SwissProt nice?
• Sequences are manually annotated and checked
• No multiple entries for the same sequence
• Annotations include protein function, modifications after translation, active sites etc.
• Linked to many other databases
So how do we search for protein sequences in the available databases?
• Search with a protein name
• Search with a protein’s function/descriptive words
• Search with a protein/RNA sequence
The next slides cover the first two options…
Ways to access Swiss/UniProt
http://au.expasy.org/sprot/
ExPASy server for UniProt. Note that the page includes links to ’full text search’ and to ’advanced search’.
http://www.ebi.uniprot.org/uniprot-srv/uniProtPowerSearch.do
Power Search of the UniProt database
http://srs.csc.fi/
One of the SRS servers available on the WWW
http://srs.ebi.ac.uk
http://srs.embl-heidelberg.de:8000/srs5/
SRS
• Sequence Retrieval System
• Allows searching several databases
• not limited to SwissProt!
• AND, OR, BUTNOT type Boolean operations can be used in the search (useful with keywords)
=> Works with sequence names and with complex keyword queries.
• Obtained results can be processed further:
– linking to a new set of databases
– includes sequence analysis, sequence alignment
Select ’start a temporary project’
Select the database(s). Here I select SwissProt.
Note that other databases can also be searched with SRS! Available databases vary between the different SRS servers.
Insert the query for finding the sequence. Here I search with the sequence name (csk_mouse). The search goes through all the text fields (AllText) in the SwissProt files.
These are the available fields that can be searched with the search term.
Obtained result
Available information on the sequence.
More information from here
• The obtained result demonstrates the detailed information available from SwissProt
• Note that the stored information includes
– information on the organism
– gene name, gene description
– links to the articles discussing the sequence
– the comments part has a detailed description of
  • function
  • tissue localization
– the features part has a detailed description of
  • domains
  • various functional components
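The kinds of information listed above are stored in a SwissProt flat file as lines that start with a two-letter line code (for example ID, OS, GN, CC, FT), and an entry ends with `//`. Below is a rough Python sketch of grouping such lines by their code. The entry text is entirely made up for illustration (it is not a real database record), the layout is simplified, and `parse_entry` is a hypothetical helper rather than part of any library.

```python
# A made-up, simplified entry in the style of a SwissProt flat file.
# None of the values below come from a real record.
TOY_ENTRY = """\
ID   TOY_MOUSE               Reviewed;         100 AA.
DE   RecName: Full=Toy example kinase;
GN   Name=Toy1;
OS   Mus musculus (Mouse).
CC   -!- FUNCTION: Invented function text for illustration.
CC   -!- TISSUE SPECIFICITY: Invented tissue text for illustration.
FT   DOMAIN          10..60
FT                   /note="SH3"
//
"""


def parse_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("//"):      # end-of-entry marker
            break
        code, value = line[:2], line[5:].strip()
        fields.setdefault(code, []).append(value)
    return fields


fields = parse_entry(TOY_ENTRY)
print(fields["OS"])   # organism line(s)
print(fields["GN"])   # gene name line(s)
print(fields["CC"])   # comment lines: function, tissue specificity
print(fields["FT"])   # feature lines: domains etc.
```

For real work a dedicated parser is a better choice, since real entries have continuation lines and many more line codes than this sketch handles.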
SRS Search with boolean operators (AND, OR, BUTNOT)
Queries can be combined with & (= AND), | (= OR), ! (= BUTNOT).
Different rows are also combined (by default) with AND.
The example looks for proteins whose organism name is either mouse OR rat. In addition, the description field must include the words receptor AND kinase BUTNOT tyrosine.
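The logic of this example query can be written out as a small Python sketch. This is not how SRS is implemented and it does not talk to any SRS server. It only applies the same Boolean condition to a handful of invented toy records, and the field names `organism` and `description` are assumptions made for the illustration.

```python
# Invented toy records; not real database entries.
records = [
    {"id": "toy1", "organism": "mouse", "description": "receptor serine/threonine kinase"},
    {"id": "toy2", "organism": "rat", "description": "receptor tyrosine kinase"},
    {"id": "toy3", "organism": "human", "description": "receptor kinase"},
    {"id": "toy4", "organism": "rat", "description": "growth factor receptor kinase"},
]


def matches(record):
    """(organism is mouse OR rat) AND receptor AND kinase BUTNOT tyrosine."""
    words = set(record["description"].lower().split())
    organism_ok = record["organism"] in {"mouse", "rat"}      # mouse | rat
    description_ok = "receptor" in words and "kinase" in words  # receptor & kinase
    excluded = "tyrosine" in words                              # ! tyrosine
    return organism_ok and description_ok and not excluded


hits = [r["id"] for r in records if matches(r)]
print(hits)  # ['toy1', 'toy4']
```

The record toy2 is dropped by the BUTNOT term and toy3 by the organism row, which shows how the rows are combined with AND.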
Further linking to other databases
We can link the obtained results to other databases by following this link.
Go to the results of the previous search...
Selection of sequences that have a known 3D structure
1. The subfolder with protein databases is opened by selecting ’protein function structure and interactions databases’
2. The box next to the PDB database is selected with the mouse
3. Here we filter the obtained results down to the ones that have a link to a 3D structure
Summary
• Protein databases show detailed information on protein sequences
• UniProt/SwissProt is the recommended protein database
- manually curated
- non-overlapping
• SRS is a tool for searching information from selected databases with search terms
• Word of warning: sometimes SRS does not work as nicely as hoped!
Search of the protein databases with sequences
So what can be done if we have a sequence that we know nothing about?
We can look for similar known proteins in the databases.
This can be done directly with protein sequences.
(Database searching will probably be covered in more detail later. Sorry for the wrong order!)
Nucleotide to amino acids
If you have produced a nucleotide sequence in the laboratory, you might still want to compare it to protein sequences for the reasons given earlier (slide 3). You have two options:
1. Use tools (like BLASTX, FastX) that automatically compare the nucleotide sequence to amino acid databases.
These can find sequence similarities that continue from one reading frame to another. => Simple: you don’t have to worry about translating the sequence (see below).
BLASTX and FastX are explained in more detail later.
2. Translate the sequence using available tools (for example http://www.ebi.ac.uk/emboss/transeq/ )
- required with tools that accept only protein sequences
- remember that you do not know the reading frame!
The correct reading frame can shift from one frame to another (sequencing errors such as insertion or deletion of nucleotides)!!
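The reading-frame warning can be demonstrated with a short Python sketch. The DNA string is a made-up example, the single inserted "A" stands in for a sequencing error, and `translate` is an illustrative helper using the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate starting at offset `frame` (0, 1 or 2); a trailing partial codon is ignored."""
    dna = dna.upper()[frame:]
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


original = "ATGGCTGGTCTTTCTCGTGCT"             # made-up coding sequence
with_error = original[:9] + "A" + original[9:]  # one extra nucleotide after codon 3

print(translate(original))        # MAGLSRA
print(translate(with_error, 0))   # MAGTFSC  (correct start, wrong end)
print(translate(with_error, 1))   # WLVLSRA  (wrong start, correct end)
```

After the insertion, the beginning of the protein is found in one frame and the end in another, so no single translated frame contains the whole protein.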
Automatic tools comparing nucl. seq. with protein database
• BLASTX
- Looks for the protein sequences most similar to your nucleotide sequence by comparing all possible reading frames.
- Member of the BLAST program family
http://www.ncbi.nlm.nih.gov/BLAST/
For nucleotide sequences, BLASTX can be found here
If you do a query with a protein sequence, then use this
SEQUENCE: >embl|AB029485|AB029485 Mus musculus ARIP1 mRNA for activin receptor interacting protein
The protein database (SwissProt) can be selected here
You can find the sequence by searching Google for AB029485
The next window is opened here
The web page that is shown while waiting for the results.
The colour figure shows where in our query sequence the match to the database was.
The colour indicates how good the score is.
The E value tells how many equally good results can be expected by chance
The alignment can be viewed from this link
The alignment enables manual evaluation of the result
This is the link to the database that we searched, giving the full information on the sequence
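To give a feeling for the E value mentioned in the captions above, here is a small Python sketch based on the commonly quoted relation E = m · n · 2^(−S'), where S' is the bit score, m the query length and n the database length. The numbers are made up, and the function name `expected_hits` is only illustrative. BLAST itself uses corrected ("effective") lengths, so this will not reproduce its reported values exactly.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E value: expected number of chance hits scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Made-up numbers: a 100-residue query against a database of one million residues.
for score in (30, 40, 50):
    print(score, expected_hits(score, 100, 1_000_000))
```

Each extra bit of score halves the E value, and a larger database raises it, so the same alignment is less significant when more sequences are searched.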
Changing the nucleotides to amino acids
Transeq requires you to paste the nucleotide sequence, to select the reading frame (1, 2 or 3) and to select the forward or reverse direction.
http://www.ebi.ac.uk/emboss/transeq/
An example sequence obtained with randomly typed g, a, c, t: DQLTCQSTVSAGLAWLAGMA
The sequences obtained from the different reading frames can be used to search protein databases...
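What a translation tool does with the frame and direction choices can be sketched in a few lines of Python. This is not Transeq's code: the frame labels (+1 to +3, -1 to -3) are this sketch's own convention and may be numbered differently by Transeq, and the input sequence is made up.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset):
    """Translate from the given offset; a trailing partial codon is ignored."""
    dna = dna[offset:]
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate all three forward and all three reverse frames."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(reverse, offset)
    return frames


frames = six_frames("ATGGCTGGTCTTTCTCGTGCT")  # made-up example sequence
for label, protein in sorted(frames.items()):
    print(label, protein)
```

Each of the six translations can then be used as a separate protein query, with stop codons shown as `*`.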
Motif databases
• Motifs are conserved regions in functionally similar proteins
• These are crucial parts for protein function
– the protein cannot change them without changing its function
• Analysis of sequences with motifs can be more efficient when no close sequence relatives are found
– recommended when a normal sequence search gives no results
What is a motif?
Modified from Terri Attwood, 2002; modified from Eija Korpelainen...
Regions with strong conservation between aligned sequences
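The idea of conserved regions in an alignment can be sketched in Python as follows. The three aligned sequences are invented for the illustration, and the rule used here (a column is conserved only if every sequence has the same residue and no gap) is a deliberately strict simplification. Motif databases use more tolerant descriptions such as patterns and profiles.

```python
# Invented toy alignment; all rows have the same length, '-' would mark a gap.
alignment = [
    "MKVLAGHW",
    "MRVLSGHW",
    "MKILAGHF",
]


def conserved_columns(rows):
    """Return the 0-based column positions where all rows share one residue."""
    positions = []
    for index, column in enumerate(zip(*rows)):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            positions.append(index)
    return positions


columns = conserved_columns(alignment)
print(columns)                                    # [0, 3, 5, 6]
print("".join(alignment[0][i] for i in columns))  # MLGH
```

Runs of neighbouring conserved columns are candidates for a motif. Here columns 5 and 6 ("GH") form the only such run.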
Motif databases
BLOCKS
http://blocks.fhcrc.org/
PROSITE
http://au.expasy.org/prosite/
...and more...
http://au.expasy.org/tools/
The subgroup ’Pattern and profile searches’ shows the list of protein motif analysis tools
INTERPRO
http://www.ebi.ac.uk/InterProScan/
Combines many motif databases in one search
Can take a DNA or protein sequence.
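As an illustration of what a pattern search does, the Python sketch below converts a simple PROSITE-style pattern into a regular expression and scans a sequence with it. The converter `prosite_to_regex` is a hypothetical helper that handles only the basic syntax (single residues, `x`, `[...]`, `{...}` and repeat counts), not the anchors `<` and `>`. The pattern used is the PROSITE N-glycosylation site pattern N-{P}-[ST]-{P}, and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{"):           # any residue except the listed ones
            core = "[^" + core[1:-1] + "]"
        # single letters and [ABC] groups are already valid regex
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)  # N[^P][ST][^P]

sequence = "MKNGSAANPSW"  # made-up protein sequence
hits = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(hits)   # [(2, 'NGSA')]
```

The second N in the sequence is followed by P and is therefore rejected, which is what the `{P}` element of the pattern asks for.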
Fragment of the BLASTX test sequence
Kinase associated motifs
PDZ domains: important for protein interactions
WW domains: important for binding proteins