Aidan Budd, EMBL Heidelberg
Introduction to Bioinformatics
Monday 26th September 2011
EMBO Practical Course: Protein Bioinformatics Tools
September 25th - 30th 2011
EMBL Heidelberg
Aidan Budd, EMBL Heidelberg, Germany
Niall Haslam, University College Dublin, Conway Institute, Ireland
Introduction to the Introduction....
Why include such a session in this course?
Haven't we all had "Introduction to Bioinformatics" courses in our studies, or have quite some experience of the topic already?
Hands up who feels that this describes their situation...?
Introduction to the Introduction....
Why include such a session in this course?
1. Because of the diversity of your backgrounds and experience
• learning occurs in the context of our own specific set of previous experiences
• different people have different understandings of the same terms etc.
Exploring your (and our) understanding of some key/basic bioinformatics ideas helps:
• identify and address possible misconceptions that might hinder learning more sophisticated ideas/content
• focus your learning on the most important topics and issues for you (rather than us just guessing what you might need help with)
• some of you will need this less than others: those with more experience, please help those around you
Introduction to the Introduction....
Why include such a session in this course?
2. To demonstrate general principles of how bioinformaticians address problems:
• show how we link tools together within the context of larger analyses
• highlighting the kinds of patterns/information that we tend to focus on
• experts in a field are better at noticing important information/patterns in the data they work with
• highlighting the patterns we notice when working with tools may help you to start spotting similar patterns, i.e. becoming more expert in the topic
Exploring Your Experience With Bioinformatics
3 questions on the next slides aim to help you (and us) explore your current ideas, level of confidence, and understanding of bioinformatics.
For each question:
1. I'll present an example answer
2. You'll spend a few minutes writing (laptop, paper, desktop computer) your own answers to these questions
3. You'll discuss these answers with your neighbours - explaining them to each other, and identifying shared understanding (and problems with understanding)
4. We'll solicit and discuss answers from the class, focusing on answers/problems shared by several trainees
Question 1: Useful Bioinformatics Resources
Which bioinformatics resource(s) have been most useful to you in your work so far?
Why are they so important (think about what would be more difficult/impossible if these tools did not exist)?
Example:
BLAST
Without BLAST (or similar pairwise alignment/sequence similarity search tools) it would be difficult to:
• identify records within a database corresponding to my protein molecule/sequence of interest
  • relying instead on text-based searches, which can be problematic
• obtain suggestions of specific hypotheses for the function of novel sequences
  • I'd have to test in the lab for 1000s of different possible functions...
Question 1: Useful Bioinformatics Resources
Which bioinformatics resource(s) have been most useful to you in your work so far?
Why are they so important (think about what would be more difficult/impossible if these tools did not exist)?
Other possible tools:
• UniProt
• ENSEMBL
• PFAM
• etc.
Question 2: Common Problems
Are there any common problems you have encountered while using bioinformatics tools?
How have you tried to deal with these problems?
Example:
Data records/resources changing with time/disappearing, meaning I can't reproduce my earlier results
One way I try to deal with this problem is to keep copies of the original files I downloaded - in particular the sequence (not just the identifiers) of any proteins/DNA regions of interest
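One possible way to keep such copies is sketched below. It is only an illustration: the function name `archive_record`, the file layout, and the example accession and sequence are all assumptions made for this sketch, not part of any particular resource or tool.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def archive_record(accession, sequence, source, out_dir, retrieved=None):
    """Save a sequence as FASTA plus a small JSON note on where/when it came from."""
    retrieved = retrieved or date.today().isoformat()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Keep the sequence itself, not just the identifier.
    fasta_path = out_dir / f"{accession}.fasta"
    fasta_path.write_text(
        f">{accession} source={source} retrieved={retrieved}\n{sequence}\n"
    )

    # A checksum makes it easy to test later whether a freshly downloaded
    # record still has the same sequence as the one originally analysed.
    checksum = hashlib.sha256(sequence.encode("ascii")).hexdigest()
    meta = {
        "accession": accession,
        "source": source,
        "retrieved": retrieved,
        "sha256": checksum,
    }
    (out_dir / f"{accession}.json").write_text(json.dumps(meta, indent=2))
    return fasta_path, checksum
```

Calling `archive_record("HYPOTHETICAL_ID", "MKTAYIAK", "example-db", "archive")` would write `archive/HYPOTHETICAL_ID.fasta` and a matching `.json` file; comparing checksums later shows whether the record has changed.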
Question 3: Key Knowledge/Experience/Tricks
What bioinformatics knowledge/experience/tricks have you learnt that you wish you had been taught at the start of your research career?
How have these ideas been useful for you?
Example:
Many different bioinformatics resources, no time to learn about them all!
Realising that almost all bioinformatics tools and resources aim to address either one or both of two key questions has often helped me in my work, i.e.:
• what experimental data has been reported concerning my entity (protein) of interest [e.g. much of the data in UniProt]
• what predictions can I make about the structure/function of my entity (protein) of interest [e.g. BLAST, IUPRED]
Knowing this helps me identify the questions a tool aims to address, firstly in general terms, and then more specifically (which data, which entities etc.).
This makes me more efficient at choosing appropriate tools for a job.
Question 3: Key Knowledge/Experience/Tricks
What bioinformatics knowledge/experience/tricks have you learnt that you wish you had been taught at the start of your research career?
How have these ideas been useful for you?
Example:
Realising that almost all bioinformatics tools and resources aim to address either one or both of two key questions has often helped me in my work, i.e.:
• what experimental data has been reported concerning my entity (protein) of interest [e.g. much of the data in UniProt]
• what predictions can I make about the structure/function of my entity (protein) of interest [e.g. BLAST, IUPRED]
It also helps me because, when I use bioinformatics in my analysis, I know it helps to frame the questions I ask in terms of these two kinds of question
Question 3: Key Knowledge/Experience/Tricks
What bioinformatics knowledge/experience/tricks have you learnt that you wish you had been taught at the start of your research career?
How have these ideas been useful for you?
Example:
Using an accession number (a unique identifier of a record within a database) allows me to unambiguously identify the record I want from a data resource.
Searches with non-unique identifiers can return several very different entities, several of which do not correspond to the entity I want to identify - using unique identifiers avoids this problem.
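The difference between a text search and an accession lookup can be illustrated with a toy in-memory "database". Everything here is made up for the sketch: the accessions, names, and the two helper functions are hypothetical and do not correspond to any real resource.

```python
# Toy records: accessions and names are invented for illustration only.
RECORDS = {
    "ACC0001": {"name": "kinase A", "organism": "Danio rerio"},
    "ACC0002": {"name": "kinase A", "organism": "Homo sapiens"},
    "ACC0003": {"name": "kinase A-like", "organism": "Danio rerio"},
}


def search_by_text(term):
    """Return every accession whose name contains the term (may be many)."""
    term = term.lower()
    return sorted(acc for acc, rec in RECORDS.items() if term in rec["name"].lower())


def get_by_accession(accession):
    """Return exactly one record, or raise KeyError if it is absent."""
    return RECORDS[accession]
```

Here `search_by_text("kinase A")` returns three accessions, only one of which may be the entity I want, while `get_by_accession("ACC0002")` returns a single record.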
Knowledge/Experience/Tricks
Knowledge/experience/tricks on doing successful bioinformatic analyses are some of the more useful things you could take from a course like this.
Thus, we'll now present some that we've found useful in our own work.
After they've been presented, we will ask you to read through them and (quickly) discuss what you understand by them with your neighbours, to try to highlight any major misunderstandings.
Then we will demonstrate an example bioinformatic analysis that illustrates many of these points, and how they are built into a "complete" analysis.
If you notice some of these tricks etc. being used in the analysis but not commented on by us, please note them and we'll discuss them at the end of the demonstration.
Diversity of Bioinformatics Resources
There are many, many different bioinformatics resources available, and they change with time, sometimes dramatically...
• too many for me to know them all
• for those I know, I usually don't have time to spend understanding everything about how they work, what can be done with them, all their features, etc.
Thus, becoming better at the following tasks helps make me a more efficient and confident bioinformatician:
• identifying/searching for/finding those resources that can help my research
• quickly judging whether or not a tool is likely to be useful for my research
• spotting when I've learnt enough about a tool, so that I can use it reasonably effectively
• knowing that not understanding (all about) how a tool works is not a failure - it's normal - what's important is deciding whether you need to learn more
The Two Key Bioinformatics Questions
We already discussed this as an example answer.
To remind you, I think the key questions are:
• what experimental data has been reported concerning my entity (protein) of interest [e.g. much of the data in UniProt]
• what predictions can I make about the structure/function of my entity (protein) of interest [e.g. BLAST, IUPRED]
Incomplete Overlap of Resources
Many data resources contain some of the same/similar data as each other, i.e. have partial but not complete overlap of their content.
For example, the SwissProt databases searched by BLAST at NCBI and at EBI on any one day might contain different sets of sequences.
Being aware of this, I know that:
• if I'm looking for something (e.g. a protein sequence) in one resource, and can't find it there, then I may find it if I look elsewhere
• these differences exist because of:
  • different aims of the developers of different resources
  • different update schedules
  • different amounts of resources available to maintain and update resources
• these differences are inevitable - knowing this helps prevent me from getting (too) frustrated when tools don't contain what I think they should
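If I have lists of identifiers from two resources, checking how much they overlap is a simple set comparison. The sketch below assumes the identifiers have already been extracted into plain lists; the function name and the example IDs are hypothetical.

```python
def compare_resources(ids_a, ids_b):
    """Report which identifiers are shared and which appear in only one resource."""
    a, b = set(ids_a), set(ids_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Made-up identifier lists standing in for two copies of "the same" database.
site_one = ["ID1", "ID2", "ID3"]
site_two = ["ID2", "ID3", "ID4"]
overlap = compare_resources(site_one, site_two)
```

An identifier listed under `only_b` is exactly the case where a search at the first site fails but looking elsewhere succeeds.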
Different Features of Different Tools
Related/similar/part of the previous point.
Different implementations of the same tool may have different search features, different ways of presenting the output etc., even if the content is the same.
For example - the web interfaces to BLAST at EBI and NCBI are rather different and offer different features - some things are easy to do on one site and almost impossible on the other.
So, if the implementation you're working with doesn't do what you want, you may be able to find one that does somewhere else.
The Importance of Knowing Which Question You Want to Answer
The "right way" to use a tool depends on the question you want to address with it
How should I use UniProt?
It depends on what you want to use it for
How should I change parameters to improve my BLAST search?
It depends on what you want to use it for
Which MSA tools should I use to align my sequences?
It depends on what you want to use it for
etc.
Thus, a clear understanding of precisely which question you want to address helps us use tools more effectively
Importance of Accession Numbers When Using a Text Search of a Database
We covered this already...
Using an accession number (a unique identifier of a record within a database) allows me to unambiguously identify the record I want from a data resource.
Searches with non-unique identifiers can return several very different entities, several of which do not correspond to the entity I want to identify - using unique identifiers avoids this problem.
The More You Know the Easier it Gets
Just having experience recognising where identifiers are likely to come from, and knowing about the structure of important databases and the common errors found in them, makes it easier to spot important patterns.
For example, I recognise ENSDARG00000046048 as an Ensembl identifier immediately, so would know where to begin etc.
Thus, spending some time working with and exploring key resources can be a big help when running a range of different bioinformatics tasks.
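This kind of recognition can be partly captured with pattern matching. The sketch below uses deliberately simplified patterns that I am assuming are close enough for illustration (Ensembl stable IDs as "ENS", an optional species prefix, a feature letter and 11 digits; the UniProt accession pattern; four-character PDB IDs). Real resources document their own formats, and these patterns will miss or misclassify some identifiers.

```python
import re

# Simplified, assumed patterns; checked in order, first match wins.
PATTERNS = [
    ("Ensembl stable ID", re.compile(r"ENS[A-Z]{0,4}[GTPE]\d{11}")),
    (
        "UniProt accession",
        re.compile(
            r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
            r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
        ),
    ),
    ("PDB ID", re.compile(r"[0-9][A-Za-z0-9]{3}")),
]


def classify_identifier(identifier):
    """Guess which kind of resource an identifier is likely to come from."""
    identifier = identifier.strip()
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"
```

With these patterns, `classify_identifier("ENSDARG00000046048")` returns `"Ensembl stable ID"`, which tells me where to begin looking.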
Example Bioinformatics Analysis
Demonstrating the use of tools to address problems/questions must be done in the context of a particular problem/question - because, as already pointed out, the way to use a tool effectively always depends on the question it is being used to address.
Scenario:
A friend working in a zebrafish lab has done a forward genetic analysis, using a phenotype, and has identified the mutated gene
They want to try and understand how/why the gene contributes to the phenotype, in particular by identifying or predicting proteins that physically interact with the gene product
Perhaps knocking these out/silencing them will give a similar phenotype?
They tell us it's called ENSDARG00000046048
Example Bioinformatics Analysis
Would you like to:
1. Try this yourselves with no more information/ideas from me?
2. Try this yourselves with some hints on resources you might like to try?
3. I demo it to you first, then you have a go yourself? - in which case you'll get a short written description of how I did it to try and follow yourselves
If you try first, then do it in pairs, keep going until you get stuck - then get help! - we'll try and notice when several people are stuck, and then we'll move on. Think about, in this case, what contributed to you getting stuck.
Example Bioinformatics Analysis
Hints on how I would/did do it... ENSDARG00000046048
• Find protein sequence of the ENSEMBL record
• Get the UniProt record - two ways of doing it, database cross-linking and BLAST (note that I first try SwissProt and it's not there, but it is in UniProt)
• Read about the gene
• Look for interaction partners described in the record - via STRING maybe, but not very strong evidence
• Look for related PDB structures in complex? Yes, we find one by BLAST at NCBI
• Get the structure from PDB and look at it in PyMOL
• Look for a protein related to the interacting protein in zebrafish (ZF)
• Get this info from PDBsum - I find it easier to get the info on which sequences are in there compared to PDB
• Read about the interaction - is there a model for describing the interaction modules in the two proteins? Yes, it's the FFAT motif described by the ELM resource
• Is the pattern conserved in the interacting protein? Yes.
• Other proteins possessing this module in ZF might also interact with the query protein
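The last hint (looking for other proteins with the same module) can be sketched as a simple pattern scan over a set of sequences. The pattern below is a simplified stand-in, loosely based on the commonly cited EFFDAxE consensus for FFAT motifs; it is an assumption for illustration and not the regular expression used by ELM. The sequences and names are made up.

```python
import re

# Assumed, simplified motif pattern ("." matches any residue); NOT the ELM regex.
FFAT_LIKE = re.compile(r"EFFDA.E")


def find_motif(sequence, pattern=FFAT_LIKE):
    """Return (1-based start position, matched residues) for each motif match."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(sequence)]


def proteins_with_motif(sequences, pattern=FFAT_LIKE):
    """Return the names of sequences containing at least one match, sorted."""
    return sorted(name for name, seq in sequences.items() if pattern.search(seq))


# Hypothetical toy sequences, not real zebrafish proteins.
toy_proteome = {
    "protA": "MKTEFFDAQEGG",
    "protB": "MSSLLKRAAG",
}
candidates = proteins_with_motif(toy_proteome)
```

In a real analysis I would use the pattern and annotation provided by ELM, and treat any hit as a hypothesis to check rather than as evidence of an interaction.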
Another example
Scenario:
A friend is working on the parasite Giardia. They want to study the role of nucleoporin proteins in the biology of the parasite, expecting this might be important for understanding gene regulation there etc.
They want to find the sequences of these proteins to help with their cloning etc. They ask you for help.
Discuss with your neighbours some of the ways you could begin to try and find the sequences of these proteins
Another example
Scenario:
A friend is working on the parasite Giardia. They want to study the role of nucleoporin proteins in the biology of the parasite, expecting this might be important for understanding gene regulation there etc.
They want to find the sequences of these proteins to help with their cloning etc. They ask you for help.
• Search for a Giardia genome resource - try a text search for nucleoporins
• Check which ones this is matching using BLAST
• Google for "nucleoporins"
• Choose one of them
• Try a BLAST at the NCBI
• If it doesn't work, what modules do SMART/PFAM predict in the sequence?
• Try using these tools to identify similar proteins in the organism
Another example
Scenario:
A friend is working on the parasite Giardia. They want to study the role of nucleoporin proteins in the biology of the parasite, expecting this might be important for understanding gene regulation there etc.
They want to find the sequences of these proteins to help with their cloning etc. They ask you for help.
Try, together with your neighbour, to find some other Giardia nucleoporin sequences