aidan budd, embl heidelberg introduction to bioinformatics monday 26th september 2011 embo practical...

Aidan Budd, EMBL Heidelberg

Introduction to Bioinformatics

Monday 26th September 2011

EMBO Practical Course: Protein Bioinformatics Tools, September 25th - 30th 2011

EMBL Heidelberg

Aidan Budd, EMBL Heidelberg, Germany

Niall Haslam, University College Dublin, Conway Institute, Ireland


Introduction to the Introduction....

Why include such a session in this course?

Haven't we all had "Introduction to Bioinformatics" courses in our studies, or have quite some experience of the topic already?

Hands up who feels that this describes their situation...?




1. Because of the diversity of your backgrounds and experience

• learning occurs in the context of our own specific set of previous experiences

• different people have different understandings of the same terms etc.

• exploring your (and our) understanding of some key/basic bioinformatics ideas helps:

  • identify and address possible misconceptions that might hinder learning more sophisticated ideas/content

  • focus your learning on the most important topics and issues for you (rather than us just guessing what you might need help with)

• some of you will need this less than others: those with more experience, please help those around you




2. To demonstrate general principles of how bioinformaticians address problems:

• show how we link tools together within the context of larger analyses

• highlighting the kinds of patterns/information that we tend to focus on

• experts in a field are better at noticing important information/patterns in the data they work with

• highlighting the patterns we notice when working with tools may help you to start spotting similar patterns, i.e. becoming more expert in the topic


Exploring Your Experience With Bioinformatics



Three questions on the next slides aim to help you (and us) explore your current ideas, level of confidence, and understanding of bioinformatics. For each question:

1. I'll present an example answer

2. You'll spend a few minutes writing (laptop, paper, desktop computer) your own answers to these questions

3. You'll discuss these answers with your neighbours - explaining them to each other, and identifying shared understanding (and problems with understanding)

4. We'll solicit and discuss answers from the class, focusing on answers/problems shared by several trainees


Question 1: Useful Bioinformatics Resources

Which bioinformatics resource(s) have been most useful to you in your work so far?

Why are they so important (think about what would be more difficult/impossible if these tools did not exist)?

Example:

BLAST

Without BLAST (or similar pairwise alignment/sequence similarity search tools) it would be difficult to:

• identify records within a database corresponding to my protein molecule/sequence of interest

  • I'd have to rely instead on text-based searches, which can be problematic

• obtain suggestions of specific hypotheses for the function of novel sequences

  • I'd have to test in the lab for 1000s of different possible functions...


Other possible tools:

• UniProt

• Ensembl

• Pfam

• etc.


Question 2: Common Problems

Are there any common problems you have encountered while using bioinformatics tools?

How have you tried to deal with these problems?

Example:

Data records/resources change with time or disappear, meaning I can't reproduce my earlier results.

One way I try to deal with this problem is to keep copies of the original files I downloaded - in particular the sequence (not just the identifiers) of any proteins/DNA regions of interest.
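One way to make the habit of keeping copies routine is sketched below: store the sequence itself next to a small metadata file recording where and when it was obtained, plus a checksum. The function name `archive_sequence`, the file layout and the toy record in the usage note are illustrative assumptions, not part of any particular tool.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def archive_sequence(identifier, sequence, source, out_dir):
    """Save a sequence as FASTA plus a JSON sidecar describing its origin."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Keep the sequence itself, not just the identifier.
    fasta_path = out_dir / f"{identifier}.fasta"
    fasta_path.write_text(f">{identifier}\n{sequence}\n")

    # Record where and when the sequence was obtained, plus a checksum
    # that lets us detect later whether a re-downloaded record has changed.
    metadata = {
        "identifier": identifier,
        "source": source,
        "retrieved": date.today().isoformat(),
        "sha256": hashlib.sha256(sequence.encode("ascii")).hexdigest(),
    }
    meta_path = out_dir / f"{identifier}.json"
    meta_path.write_text(json.dumps(metadata, indent=2))
    return fasta_path, meta_path
```

For example, `archive_sequence("toy_protein_1", "MKTAYIAK", "hypothetical download", "archive")` writes `archive/toy_protein_1.fasta` and `archive/toy_protein_1.json`. Comparing the stored checksum with that of a fresh download shows whether the record has changed since the original analysis.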


Question 3: Key Knowledge/Experience/Tricks

What bioinformatics knowledge/experience/tricks have you learnt that you wish you had been taught at the start of your research career?

How have these ideas been useful for you?

Example:

Many different bioinformatics resources, no time to learn about them all!

Realising that almost all bioinformatics tools and resources aim to address one or both of two key questions has often helped me in my work, i.e.:

• what experimental data has been reported concerning my entity (protein) of interest [e.g. much of the data in UniProt]

• what predictions can I make about the structure/function of my entity (protein) of interest [e.g. BLAST, IUPred]

Knowing this helps me identify the questions a tool aims to address, firstly in general terms, and then more specifically (which data, which entities etc.).

This makes me more efficient at choosing appropriate tools for a job.



It also helps to know that, when I use bioinformatics in an analysis, I should frame the questions I ask in terms of these two kinds of question.





Using an accession number (a unique identifier of a record within a database) allows me to unambiguously identify the record I want from a data resource.

Searches with non-unique identifiers can return several very different entities, many of which do not correspond to the entity I want to identify - using unique identifiers avoids this problem.
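The difference between a text search and an accession lookup can be sketched with a toy in-memory "database". All accessions, names and organisms below are made up for illustration, and the two function names are hypothetical.

```python
# Toy records keyed by a made-up accession; real databases hold far more fields.
RECORDS = {
    "ACC0001": {"name": "kinase A", "organism": "zebrafish"},
    "ACC0002": {"name": "kinase A", "organism": "human"},
    "ACC0003": {"name": "kinase A-binding protein", "organism": "human"},
}


def search_by_text(query):
    """Return the accessions of all records whose name contains the query text."""
    q = query.lower()
    return sorted(acc for acc, rec in RECORDS.items() if q in rec["name"].lower())


def fetch_by_accession(accession):
    """Return the single record stored under an accession, or None if absent."""
    return RECORDS.get(accession)
```

Here a text search for "kinase A" matches all three toy records, including a different protein and the same protein name in another organism, whereas `fetch_by_accession("ACC0002")` can only ever return one record.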




Knowledge/Experience/Tricks

Knowledge/experience/tricks on doing successful bioinformatic analyses are some of the more useful things you could take from a course like this.

Thus, we'll now present some that we've found useful in our own work.

After they've been presented, we will ask you to read through them and (quickly) discuss what you understand by them with your neighbours, to try and highlight any major misunderstandings.

Then we will demonstrate an example bioinformatic analysis that illustrates many of these points, and shows how they are built into a "complete" analysis.

If you notice some of these tricks etc. being used in the analysis but not commented on by us, please note them and we'll discuss them at the end of the demonstration.


Diversity of Bioinformatics Resources

There are many, many different bioinformatics resources available, and they change with time, sometimes dramatically...

• too many for me to know them all

• for those I know, I usually don't have time to spend understanding everything about how they work, what can be done with them, all their features, etc.

Thus, becoming better at the following tasks helps make me a more efficient and confident bioinformatician:

• identifying/searching for/finding those resources that can help my research

• quickly judging whether or not a tool is likely to be useful for my research

• spotting when I've learnt enough about a tool, so that I can use it reasonably effectively

• knowing that not understanding (all about) how a tool works is not a failure - it's normal - what's important is deciding whether you need to learn more


The Two Key Bioinformatics Questions

We already discussed this as an example answer.

To remind you, I think the key questions are:

• what experimental data has been reported concerning my entity (protein) of interest [e.g. much of the data in UniProt]

• what predictions can I make about the structure/function of my entity (protein) of interest [e.g. BLAST, IUPred]


Incomplete Overlap of Resources

Many data resources contain some of the same/similar data as each other, i.e. their content overlaps partially but not completely. For example, the Swiss-Prot databases searched by NCBI BLAST at NCBI and at EBI on any one day might contain different sets of sequences. Being aware of this, I know that:

• if I'm looking for something (e.g. a protein sequence) in one resource, and can't find it there, then I may find it if I look elsewhere

• these differences exist because of:

  • different aims of the developers of different resources

  • different update schedules

  • different amounts of resources available to maintain and update resources

• these differences are inevitable - knowing this helps prevent me from getting (too) frustrated when tools don't contain what I think they should
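Partial overlap is easy to check once you have the list of identifiers held by each resource. The sketch below uses Python sets. The identifier lists are made-up placeholders, and `compare_resources` is a hypothetical helper name.

```python
def compare_resources(ids_a, ids_b):
    """Summarise which identifiers two resources share and which are unique to each."""
    a, b = set(ids_a), set(ids_b)
    return {
        "shared": a & b,   # present in both resources
        "only_a": a - b,   # present in the first resource only
        "only_b": b - a,   # present in the second resource only
    }


# Made-up identifier lists standing in for two copies of "the same" database.
site_one = ["ACC0001", "ACC0002", "ACC0003"]
site_two = ["ACC0002", "ACC0003", "ACC0004"]
overlap = compare_resources(site_one, site_two)
```

With these toy lists, `overlap["only_b"]` contains `ACC0004`: a record missing from the first site but available at the second, which is exactly the situation where looking elsewhere pays off.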


Different Features of Different Tools

This is related to the previous point: different implementations of the same tool may have different search features, different ways of presenting the output etc., even if the content is the same.

For example, the web interfaces to BLAST at EBI and NCBI are rather different and offer different features - some things are easy to do on one site and almost impossible on the other.

So, if the implementation you're working with doesn't do what you want, you may be able to find one that does somewhere else.


The Importance of Knowing Which Question You Want to Answer

The "right way" to use a tool depends on the question you want to address with it.

How should I use UniProt?

It depends on what you want to use it for.

How should I change parameters to improve my BLAST search?

It depends on what you want to use it for.

Which MSA tools should I use to align my sequences?

It depends on what you want to use it for.

etc.

Thus, a clear understanding of precisely which question you want to address helps you use tools more effectively.


Importance of Accession Numbers When Using a Text Search of a Database

Using an accession number (a unique identifier of a record within a database) allows me to unambiguously identify the record I want from a data resource.

Searches with non-unique identifiers can return several very different entities, many of which do not correspond to the entity I want to identify - using unique identifiers avoids this problem.

We covered this already...


The More You Know the Easier it Gets

Experience of recognising where identifiers are likely to come from, and knowing about the structure of important databases and the common errors found in them, makes it easier to spot important patterns.

For example, I recognise ENSDARG00000046048 as an Ensembl identifier immediately, so would know where to begin etc.

Thus, spending some time working with and exploring key resources can be a big help when running a range of different bioinformatics tasks.
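This kind of recognition can be partly written down as patterns. The sketch below guesses the source of an identifier from its shape. The Ensembl pattern is a simplified assumption (the prefix "ENS", an optional species code, a feature letter, then 11 digits) and the UniProt pattern follows the published accession format. Real identifier schemes have more variants than are covered here, and `guess_source` is a hypothetical helper name.

```python
import re

# Simplified Ensembl stable ID: ENS + optional species code + feature letter + 11 digits.
ENSEMBL = re.compile(r"^ENS([A-Z]{0,4})(G|T|P|E)\d{11}$")
ENSEMBL_FEATURES = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}

# UniProt accession format (6 or 10 characters).
UNIPROT = re.compile(
    r"^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def guess_source(identifier):
    """Guess which resource an identifier comes from, based only on its shape."""
    match = ENSEMBL.match(identifier)
    if match:
        return f"Ensembl {ENSEMBL_FEATURES[match.group(2)]}"
    if UNIPROT.match(identifier):
        return "UniProt accession"
    return "unknown"
```

Applied to the identifier from the slides, `guess_source("ENSDARG00000046048")` returns `"Ensembl gene"`. A shape match only suggests where to look; it does not prove that the record exists.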


Example Bioinformatics Analysis

Demonstrating the use of tools to address problems/questions must be done in the context of a particular problem/question - because, as already pointed out, the way to use a tool effectively always depends on the question it is being used to address.

Scenario:

A friend working in a zebrafish lab has done a forward genetic analysis, starting from a phenotype, and has identified the mutated gene.

They want to try and understand how/why the gene contributes to the phenotype, in particular by identifying or predicting proteins that physically interact with the protein encoded by the gene.

Perhaps knocking these out/silencing them will produce a similar phenotype?

They tell us it's called ENSDARG00000046048


Example Bioinformatics Analysis

Would you like to:

1. Try this yourselves with no more information/ideas from me?

2. Try this yourselves with some hints on resources you might like to try?

3. Have me demo it to you first, then have a go yourselves? - in which case you'll get a short written description of how I did it to try and follow yourselves

If you try first, then do it in pairs, and keep going until you get stuck - then get help! We'll try and notice when several people are stuck, and then we'll move on. Think about what, in this case, contributed to you getting stuck.


Example Bioinformatics Analysis

Hints on how I would/did do it... ENSDARG00000046048

• Find the protein sequence of the Ensembl record

• Get the UniProt record - two ways of doing it: database cross-links and BLAST (note that I first try Swiss-Prot and it's not there, but it is in UniProt)

• Read about the gene

• Look for interaction partners described in the record - via STRING maybe, but not very strong evidence

• Look for related PDB structures in complex? Yes, we find one by BLAST at NCBI

• Get the structure from PDB and look at it in PyMOL

• Look for a protein related to the interacting protein in ZF

• Get this info from PDBsum - I find it easier to get the info on which sequences are in there compared to PDB

• Read about the interaction - is there a model for describing the interaction modules in the two proteins? Yes, it's the FFAT motif described by the ELM resource

• Is the pattern conserved in the interacting protein? Yes.

• Other proteins possessing this module in ZF might also interact with the query protein
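The last hint, looking for other proteins that carry the same module, amounts to scanning a set of sequences for a motif pattern. A minimal sketch follows. The pattern `EFFDA.E` is a simplified stand-in based on the commonly cited FFAT consensus, not the pattern used by the ELM resource, and the sequences are made up. `find_motif` is a hypothetical helper name.

```python
import re


def find_motif(sequences, pattern):
    """Return {name: [(start, matched_text), ...]} for sequences containing the motif.

    Start positions are 1-based, as is usual for protein sequences.
    Only non-overlapping matches are reported.
    """
    regex = re.compile(pattern)
    hits = {}
    for name, seq in sequences.items():
        found = [(m.start() + 1, m.group()) for m in regex.finditer(seq)]
        if found:
            hits[name] = found
    return hits


# Made-up sequences; in practice these would come from a proteome FASTA file.
toy_proteome = {
    "toy_1": "MSTEFFDAPEKL",
    "toy_2": "MKLVVAAGG",
}
hits = find_motif(toy_proteome, r"EFFDA.E")
```

A hit from a short pattern like this is only a candidate: short motifs match by chance in many unrelated proteins, so each one needs further evidence before it is treated as a likely interaction partner.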


Another example

Scenario:

A friend is working on the parasite Giardia. They want to study the role of nucleoporin proteins in the biology of the parasite, expecting this might be important for understanding gene regulation there etc.

They want to find the sequences of these proteins to help with their cloning etc. They ask you for help.

Discuss with your neighbours some of the ways you could begin to try and find the sequences of these proteins


Another example

• Search for a Giardia genome resource - try a text search for nucleoporins

• Check which ones this is matching using BLAST

• Google for "nucleoporins"

• Choose one of them

• Try a BLAST at the NCBI

• If it doesn't work, what modules do SMART/PFAM predict in the sequence?

• Try using these tools to identify similar proteins in the organism
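The first step, a text search of a genome resource, can also be run locally on a downloaded FASTA file of predicted proteins. The sketch below parses FASTA text and filters records by a keyword in the header line. The FASTA content is a made-up placeholder, and `parse_fasta` and `search_headers` are hypothetical helper names.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so join them back together.
            records[header] += line
    return records


def search_headers(records, keyword):
    """Return the records whose header contains the keyword (case-insensitive)."""
    k = keyword.lower()
    return {h: s for h, s in records.items() if k in h.lower()}


# Made-up FASTA text standing in for a downloaded proteome file.
toy_fasta = """>toy_1 nucleoporin-like protein
MKT
AAA
>toy_2 hypothetical protein
MGG
"""
matches = search_headers(parse_fasta(toy_fasta), "nucleoporin")
```

Note that `toy_2` would be missed even if it were a nucleoporin, because its annotation does not say so. This is why the later steps fall back on sequence-based searches (BLAST, SMART/PFAM) rather than relying on text alone.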





Another example

Try, together with your neighbour, to find some other Giardia nucleoporin sequences



