how to be a bioinformatician
DESCRIPTION
Geared towards bioinformatics students and taking a somewhat humorous point of view, this presentation explains what bioinformaticians are and what they do.
TRANSCRIPT
How to be a bioinformatician
Christian Frech, PhD
St. Anna Children’s Cancer Research Institute, Vienna, Austria
Talk at University of Applied Sciences, Hagenberg, Austria
April 23rd, 2014
What is a bioinformatician?
[Figure: Venn diagram relating Informatician, Statistician, and Biologist, with overlap regions labeled Bioinformatician, Bio-statistician, and Data scientist]
Modified from http://blog.fejes.ca/?p=2418
Bioinformatician vs. computational biologist
Bioinformatician
  IT savvy
  Builds & maintains biological databases & Web sites
  Designs & implements clever algorithms
  C/C++, Java, Python
  Grasp of computational subjects: more
  Grasp of biological subjects: less
Computational biologist
  Asks biological questions
  Analyzes & interprets biological data
  Runs existing programs
  Ad hoc scripting
  Perl, R, Python
  Grasp of computational subjects: less
  Grasp of biological subjects: more
or vice versa
Why do we need bioinformaticians?
Amount of generated biological data requires sophisticated computing for data management and analysis
  Latest Illumina sequencer shipped last week (HiSeq v4 reagent kit) outputs 1 terabase (Tb) of data in 6 days! [1]
Programmers lack biological knowledge
Biologists don’t program
The two don’t understand each other
  Biologist talks to statistician: http://www.youtube.com/watch?v=Hz1fyhVOjr4
[1] http://www.illumina.com/products/hiseq-sbs-kit-v4.ilmn
What are bioinformaticians doing?
Word cloud from manuscript titles published in Bioinformatics from Jan 2013 to April 2014
Challenges as a bioinformatician
Biology is complex, not black and white
  As many exceptions as rules (e.g.: define “gene”)
  No single optimal solution to a problem
  Results interpretable in many ways (storytelling, cherry-picking)
  Understanding the biological question
Field is moving incredibly fast
  Lack of standards, immature/abandoned software
  Standard of today obsolete tomorrow
  Much time spent on collecting/cleaning up data, troubleshooting errors
  Stay flexible, don’t overinvest in a single platform/technology
Hundreds of software tools and databases out there
  Easy to get lost
  Important to understand their strengths and weaknesses
Which tools should I use?
179 tools
Heard of: 65%
Used: 30%
http://omictools.com/
Things to have in your bioinformatics toolbox
Must haves
  Linux command line
  Scripting language with associated Bio* library (BioPerl, BioPython, R/Bioconductor, …)
  Basic statistical tests, regression, p-values, maximum likelihood, multiple testing correction
  Sequence alignment (FASTA & BLAST)
  Biological databases
  Regular expressions
  Sequencing technologies
  Web technologies (HTML, XML, …)
Highly recommended
  Advanced R skills
  Parallel/distributed computing
  DBMS, SQL
  (Semi-)compiled language (C/C++, Java)
  Dimensionality reduction (e.g. PCA)
  Cluster analysis
  Support Vector Machines
  Hidden Markov models
  Web framework (e.g. Django)
  Version control system (e.g. Git)
  Advanced text editor (Emacs, vim)
  IDE (e.g. Eclipse, NetBeans)
What programming language should I learn?
Requirement Recommended Language
  Speed matters, low-level programming
  Rich-client enterprise application development
  Text file processing (regex)
  Statistical analysis, fancy plots
  Rapid prototyping, readable & maintainable scripts
  Workflow automation
Be a jack of all trades, master of ONE!
Perl on decline, R and Python gaining popularity
http://computationalproteomic.blogspot.co.uk/2013/10/which-are-best-programming-languages.html
http://openwetware.org/wiki/Image:Most_Popular_Bioinformatics_Programming_Languages.png
Perl most popular bioinformatics programming language in 2008
R and Python take the lead in 2014
Top 10 most common and/or annoying mistakes in bioinformatics
Inspired by “What Are The Most Common Stupid Mistakes In Bioinformatics?” (https://www.biostars.org/p/7126/)
#10 - Using genome coordinates with the wrong genome version
(for example, using gene coordinates from human genome version hg18 but reference sequence from version hg19)
#9 - Forgetting to process the second strand of a DNA sequence
#8 - Processing the second strand of a DNA sequence, but taking the reverse instead of the reverse complement sequence
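The difference between the reverse and the reverse complement can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a Bio* library; the function name reverse_complement and the example sequence are made up for this sketch, and only A, C, G, T and N are handled.

```python
# Complement table for upper- and lower-case bases; N stays N.
# Assumption: other IUPAC ambiguity codes are not handled in this sketch.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    seq = "ATGCCN"                  # hypothetical example sequence
    print(seq[::-1])                # plain reverse (the mistake): NCCGTA
    print(reverse_complement(seq))  # reverse complement: NGGCAT
```

In real work, a proven implementation from BioPerl, BioPython or Bioconductor is the safer choice.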
#7 - Not accounting for different human chromosome names between UCSC and Ensembl
Example:
UCSC: “chr1”
Ensembl: “1”
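One way to guard against this is to normalize chromosome names before joining data from both sources. The sketch below is an assumption-laden illustration: the function names to_ucsc and to_ensembl are hypothetical, and the mitochondrial naming (UCSC “chrM” vs. Ensembl “MT”) goes beyond the example on the slide. Unplaced scaffolds and alternative haplotypes are not handled.

```python
def to_ucsc(chrom: str) -> str:
    """Convert an Ensembl-style name ("1", "X", "MT") to UCSC style ("chr1", "chrX", "chrM")."""
    if chrom.startswith("chr"):
        return chrom  # already UCSC style
    return "chrM" if chrom == "MT" else "chr" + chrom


def to_ensembl(chrom: str) -> str:
    """Convert a UCSC-style name ("chr1", "chrM") to Ensembl style ("1", "MT")."""
    if not chrom.startswith("chr"):
        return chrom  # already Ensembl style
    name = chrom[3:]
    return "MT" if name == "M" else name


if __name__ == "__main__":
    print(to_ucsc("1"))        # chr1
    print(to_ensembl("chr1"))  # 1
```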
#6 - Assuming the alphabetical order of chromosome names is “chr1”, “chr2”, “chr3”, … when in fact it is “chr1”, “chr10”, “chr11”, …
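If numeric chromosome order is what you want, a custom sort key helps. The following is a small sketch; chrom_sort_key is a hypothetical helper, and placing non-numeric names (X, Y, M) alphabetically after the numbered chromosomes is a choice made for this example.

```python
def chrom_sort_key(name: str):
    """Sort key: numbered chromosomes in numeric order, all others afterwards."""
    core = name[3:] if name.startswith("chr") else name
    if core.isdigit():
        return (0, int(core), "")
    return (1, 0, core)


if __name__ == "__main__":
    chroms = ["chr10", "chr2", "chrX", "chr1", "chr11", "chrM"]
    # Plain alphabetical sort: chr1, chr10, chr11, chr2, chrM, chrX
    print(sorted(chroms))
    # Sort with the key: chr1, chr2, chr10, chr11, chrM, chrX
    print(sorted(chroms, key=chrom_sort_key))
```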
#5 - Assuming ‘tab’ field separator when in fact it is ‘blank’ (or vice versa)
(the two look almost identical in a text editor)
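Since the two separators are hard to tell apart by eye, it can help to make whitespace visible before parsing. The helpers below (guess_separator, show_whitespace) are hypothetical names for a rough sketch; real files may mix both characters, which this simple check does not resolve.

```python
def guess_separator(line: str) -> str:
    """Guess whether the fields of a line are separated by tabs or by spaces."""
    if "\t" in line:
        return "tab"
    if " " in line:
        return "space"
    return "unknown"


def show_whitespace(line: str) -> str:
    """Make tabs and spaces visible for inspection."""
    return line.replace("\t", "<TAB>").replace(" ", "<SP>")


if __name__ == "__main__":
    line = "chr1 100 200"              # space-separated, hypothetical record
    print(guess_separator(line))       # space
    print(show_whitespace(line))       # chr1<SP>100<SP>200
    # Splitting on the wrong separator silently yields a single field:
    print(len(line.split("\t")))       # 1
```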
#4 - Assuming a DNA sequence consists of only four letters (A, T, C, G) while in fact there is a fifth: ‘N’ for missing base
(‘X’ for missing amino acid)
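A quick check for unexpected letters, and a statistic that treats ‘N’ explicitly, might look like the sketch below. The function names are hypothetical, and excluding N from the GC-content denominator is one possible convention chosen for this example, not the only one.

```python
from collections import Counter


def unexpected_bases(seq: str, allowed: str = "ACGT") -> set:
    """Return the set of letters in seq that are not in the allowed alphabet."""
    return set(seq.upper()) - set(allowed)


def gc_content(seq: str) -> float:
    """GC fraction over called bases only (N and other letters are excluded)."""
    counts = Counter(seq.upper())
    called = sum(counts[base] for base in "ACGT")
    return (counts["G"] + counts["C"]) / called if called else 0.0


if __name__ == "__main__":
    seq = "ATGCNNNN"               # hypothetical sequence with missing bases
    print(unexpected_bases(seq))   # {'N'}
    print(gc_content(seq))         # 0.5 (dividing by len(seq) would give 0.25)
```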
#3 - Forgetting to use dos2unix on a Windows text file before processing it under Linux
plus spending 1 hour to debug the problem
plus being tricked by this multiple times
Text file line breaks differ between platforms: Linux (LF); Windows (CR+LF); classic Mac (CR).
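The line-break conventions listed above can be detected and normalized programmatically. This is a minimal sketch of what dos2unix does conceptually, not its actual implementation; detect_line_ending and to_unix are hypothetical helper names.

```python
def detect_line_ending(data: bytes) -> str:
    """Report the line-ending convention found in raw file content."""
    if b"\r\n" in data:
        return "CRLF"  # Windows
    if b"\r" in data:
        return "CR"    # classic Mac
    if b"\n" in data:
        return "LF"    # Linux
    return "none"


def to_unix(data: bytes) -> bytes:
    """Convert CR+LF and lone CR line breaks to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    windows_text = b"gene1\r\ngene2\r\n"     # hypothetical file content
    print(detect_line_ending(windows_text))  # CRLF
    # Typical symptom: stripping only "\n" leaves an invisible "\r" behind.
    print(windows_text.split(b"\n")[0])      # b'gene1\r'
    print(to_unix(windows_text))             # b'gene1\ngene2\n'
```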
#2 - When importing data into MS Excel, letting it auto-convert HUGO gene names into dates and forgetting about it
(e.g., tumor suppressor gene “DEC1” will be converted to “1-DEC” on import)
~30 genes in total
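A gene list that has passed through a spreadsheet can be screened for values that look like converted dates. The sketch below assumes the “day-month abbreviation” form shown in the example above (“1-DEC”); the exact form Excel produces may differ with locale and settings, and find_date_like is a hypothetical helper name.

```python
import re

# Matches values such as "1-DEC" or "2-Sep": day number, hyphen, English month abbreviation.
# Assumption: this is the only date format of interest in this sketch.
_DATE_LIKE = re.compile(
    r"^\d{1,2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$",
    re.IGNORECASE,
)


def find_date_like(symbols):
    """Return the entries that look like dates rather than gene symbols."""
    return [s for s in symbols if _DATE_LIKE.match(s.strip())]


if __name__ == "__main__":
    column = ["TP53", "1-DEC", "BRCA1", "2-Sep"]  # hypothetical gene column
    print(find_date_like(column))                 # ['1-DEC', '2-Sep']
```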
#1 - Off-by-one error
There are only two common problems in bioinformatics: (1) lack of standards, (2) ID conversion, and (3) off-by-one errors
http://en.wikipedia.org/wiki/Off-by-one_error
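A frequent source of off-by-one errors is mixing coordinate conventions: some formats (e.g. BED) use 0-based, half-open intervals, while others (e.g. GFF, VCF) use 1-based, inclusive positions. These formats are not named on the slide; they serve only as an illustration, and the function names below are hypothetical.

```python
def one_based_to_zero_based(start: int, end: int) -> tuple:
    """Convert a 1-based inclusive interval to a 0-based half-open interval."""
    return start - 1, end


def zero_based_to_one_based(start: int, end: int) -> tuple:
    """Convert a 0-based half-open interval to a 1-based inclusive interval."""
    return start + 1, end


def length_inclusive(start: int, end: int) -> int:
    """Length of a 1-based inclusive interval."""
    return end - start + 1


def length_half_open(start: int, end: int) -> int:
    """Length of a 0-based half-open interval."""
    return end - start


if __name__ == "__main__":
    seq = "ACGTACGT"                             # hypothetical sequence
    start, end = one_based_to_zero_based(3, 5)   # bases 3..5, counted from 1
    print(seq[start:end])                        # GTA
```

Python slices are themselves 0-based and half-open, so a converted interval can be used directly as a slice.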
Ten personal recommendations for your future work as a bioinformatician
#1 - Learn Linux!
Most bioinformatics tools not available on Windows
Linux file systems better for many and/or very large files
Command line interface (CLI) has advantages over graphical user interface (GUI)
  Recorded command history (reproducibility)
  Key stroke to re-run analysis, instead of repeating 100 mouse clicks
Linux CLI (Shell) much more powerful than Windows CLI
#2 - Embrace the “Unix tools philosophy”
Small programs (“tools”) instead of monolithic applications
  Designed for simple, specific tasks that are performed well (awk, cat, grep, wc, etc.)
  Many and well documented parameters
Combined with Unix pipes (read from STDIN, write to STDOUT)
  cut -f 3 myfile.txt | sort | uniq
Advantages
  Great flexibility, easy re-use of existing tools
  Intermediate output can be stored and inspected for troubleshooting
  Complex tasks can be performed quickly with shell ‘one-liners’
This paradigm fits bioinformatics well, where often many heterogeneous data files need to be processed in many different ways
http://www.linuxdevcenter.com/lpt/a/302
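The same idea of small, composable steps can be mimicked in a scripting language. The sketch below mirrors the cut | sort | uniq pipeline with Python generators; cut_field and uniq are hypothetical helpers written for this illustration, and skipping lines with too few fields is a simplification that differs from the behavior of the real cut.

```python
from itertools import groupby


def cut_field(lines, field, sep="\t"):
    """Yield the given 1-based field of each line (lines with too few fields are skipped)."""
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if field <= len(fields):
            yield fields[field - 1]


def uniq(values):
    """Collapse runs of adjacent identical values, like the uniq tool."""
    for value, _run in groupby(values):
        yield value


if __name__ == "__main__":
    # Hypothetical tab-separated records standing in for myfile.txt
    lines = ["g1\tchr1\texon\n", "g2\tchr1\tintron\n", "g3\tchr2\texon\n"]
    print(list(uniq(sorted(cut_field(lines, 3)))))  # ['exon', 'intron']
```

Each stage can be run and inspected on its own, which is the same troubleshooting advantage the pipe-based approach offers.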
Example NGS use case demonstrating the power of the Unix tools philosophy
samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
Explanation
  ‘samtools mpileup’ piles up short reads from the input BAM file for each position in the reference genome
  ‘bcftools view’ calls the variants
  ‘vcfutils.pl vcf2fq’ computes the consensus sequence
  The resulting FASTQ sequence is redirected to the output file cns.fq
By knowing available tools and their parameters, bioinformatics ‘wizards’ can get complex stuff done in almost no time
http://samtools.sourceforge.net/mpileup.shtml
#3 - Don’t reinvent the wheel
Coding is fun, but look around before you hack into your keyboard
Don’t write the 29th FASTA file parser if proven solutions are available
  BioPerl
  BioPython
  Bioconductor
#4 - If you happen to invent a wheel, …
Document source and parameters well
Use version control system (git, svn)
Deposit code in public repository
  sourceforge.net
  github.com
Write test cases
#5 - Automate pipelines with GNU Make
Developed in 1970s to build executables from source files
Incredibly useful for data-driven workflows as well
  Automatic error checking
  Parallelization (utilize multiple cores)
  Incremental builds (re-start your pipeline from point of failure)
  Bug-free
Get started at http://www.bioinformaticszen.com/post/decomplected-workflows-makefiles/
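The core idea behind incremental builds is a timestamp comparison: a target is rebuilt only if it is missing or older than one of its inputs. The Python sketch below illustrates that rule only; it is not how Make is implemented, needs_rebuild is a hypothetical helper, and the file names are made up.

```python
import os


def needs_rebuild(target: str, dependencies: list) -> bool:
    """Return True if the target is missing or older than any of its dependencies."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


if __name__ == "__main__":
    # Hypothetical pipeline step: aln.bam is produced from reads.fastq and ref.fa
    inputs = ["reads.fastq", "ref.fa"]
    if all(os.path.exists(path) for path in inputs):
        print(needs_rebuild("aln.bam", inputs))
```

Because finished targets are skipped, a pipeline that failed halfway resumes at the first step whose output is missing or out of date.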
#6 - Value your time
Architecture vs. accomplishment
  “Perfect is the enemy of the good” -- Voltaire
  OO design and normalized databases are nice, but can be overkill if requirements change from analysis to analysis
Automate what can be automated
  Reproducibility
  Easy to repeat analysis with slightly changed parameters
  BUT: Don’t spend two days automating a one-time analysis that can be done manually in 10 minutes
#7 - Make use of free online resources to learn about specialized topics
www.coursera.org
  Bioinformatics Algorithms (https://www.coursera.org/course/bioinformatics)
  Computing for Data Analysis (https://www.coursera.org/course/compdata)
  R Programming (https://www.coursera.org/course/rprog)
https://www.edx.org/
  Data Analysis for Genomics (https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401#.U1TUbXV52R8)
  Introduction to Biology (https://www.edx.org/course/mitx/mitx-7-00x-introduction-biology-secret-1768#.U1TVL3V52R8)
http://rosalind.info/problems/locations/
#8 - Become an expert
Identify an area of interest and get really good at it
Work at places where you can learn from the best
Spend time abroad
  Great experience
  Labs/companies will not only hire you for what you know, but who you know
#9 - Decide early on if you want to stay in academia or go into industry
Academia
• PhD highly recommended
• Take your time to find compatible supervisor
+ Freedom to pursue own ideas
+ Very flexible working hours
+ Work independently
- Steep & competitive career ladder (postdoc >> PI/prof)
- Lower pay
- Publish or perish
Industry
• PhD beneficial (to get in), but not necessarily required for daily work (e.g. build/maintain databases)
+ More frequent (positive) feedback
+ Higher pay
+ Job security
- More (external) deadlines
- Higher pressure to get things done
See also “Ten Simple Rules for Choosing between Industry and Academia” (David B. Searls, 2009)
#10 - Stay informed & get connected
Follow literature and blogs
  http://en.wikipedia.org/wiki/List_of_bioinformatics_journals
  http://www.homolog.us/blogs/blog/2012/07/27/how-to-stay-current-in-bioinformaticsgenomics/
Subscribe via RSS feeds
  http://feedly.com or others
  Platform independent (e.g. read on your phone)
Bioinformatics Q&A forums
  http://www.biostars.org (highly recommended)
  http://seqanswers.com/ (focus on NGS)
  http://www.reddit.com/r/bioinformatics/ (student-oriented)
Other
  http://bioinformatics.org: fosters collaboration in bioinformatics
  http://www.researchgate.net: “Facebook” for researchers
  German bioinformatics group on XING (https://www.xing.com/net/pri485482x/bin)
Conclusion
As a bioinformatician, you will be at the forefront of one of the greatest scientific enterprises of our time
  Biologists overwhelmed with massive data sets
  YOU will get to see exciting results first
Requires integration of knowledge from many domains
  IT, biology, medicine, statistics, math, …
Knowing your informatics toolbox AND understanding the biological question is what makes you very valuable
Further Reading
“So you want to be a computational biologist?”
http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html
“What It Takes to Be a Bioinformatician”
http://nav4bioinfo.wordpress.com/2013/03/19/what-it-takes-to-be-a-bioinformatician/
“The alternative ‘what it takes to be a bioinformatician’”
https://biomickwatson.wordpress.com/2013/03/18/the-alternative-what-it-takes-to-be-a-bioinformatician/
“So You Want To Be a Computational Biologist, Or A Bioinformatician?”
http://www.checkmatescientist.net/2013/11/so-you-want-to-be-computational.html
“Being a bioinformatician is hard”
http://www.bioinformaticszen.com/post/being-a-bioinformatician-is-hard/
“How not to be a bioinformatician”
http://www.scfbm.org/content/7/1/3
“Ten Simple Rules for Reproducible Computational Research”
http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285
“Ten Simple Rules for Getting Ahead as a Computational Biologist in Academia”
http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002001;jsessionid=6D5D844E0E2E21C9E565378C7F714D76
“A Quick Guide for Developing Effective Bioinformatics Programming Skills”
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000589
“What Is Really the Salary of a Bioinformatician/Computational Biologist?”
http://www.homolog.us/blogs/blog/2014/04/02/what-is-really-the-salary-of-a-bioinformaticiancomputational-biologist/