how to be a bioinformatician

How to be a bioinformatician. Christian Frech, PhD, St. Anna Children’s Cancer Research Institute, Vienna, Austria. Talk at University of Applied Sciences, Hagenberg, Austria, April 23rd, 2014

Upload: christian-frech

Post on 23-Aug-2014


DESCRIPTION

Geared towards bioinformatics students and taking a somewhat humorous point of view, this presentation explains what bioinformaticians are and what they do.

TRANSCRIPT

Page 1: How to be a bioinformatician

How to be a bioinformatician

Christian Frech, PhD
St. Anna Children’s Cancer Research Institute, Vienna, Austria

Talk at University of Applied Sciences, Hagenberg, Austria
April 23rd, 2014

Page 2: How to be a bioinformatician

What is a bioinformatician?

[Figure: Venn diagram of Informatician, Statistician, and Biologist, with the overlapping regions labeled Bioinformatician, Biostatistician, and Data scientist]

Modified from http://blog.fejes.ca/?p=2418

Page 3: How to be a bioinformatician

Bioinformatician vs. computational biologist

Bioinformatician
IT savvy
Builds & maintains biological databases & Web sites
Designs & implements clever algorithms
C/C++, Java, Python
Grasp of computational subjects: more
Grasp of biological subjects: less

Computational biologist
Asks biological questions
Analyzes & interprets biological data
Runs existing programs
Ad hoc scripting
Perl, R, Python
Grasp of computational subjects: less
Grasp of biological subjects: more

or vice versa

Page 4: How to be a bioinformatician

Why do we need bioinformaticians?

The amount of biological data being generated requires sophisticated computing for data management and analysis

Latest Illumina sequencer shipped last week (HiSeq v4 reagent kit) outputs 1 terabase (Tb) of data in 6 days [1]!

Programmers lack biological knowledge
Biologists don’t program
The two don’t understand each other

Biologist talks to statistician: http://www.youtube.com/watch?v=Hz1fyhVOjr4

[1] http://www.illumina.com/products/hiseq-sbs-kit-v4.ilmn

Page 5: How to be a bioinformatician

What are bioinformaticians doing?


Page 6: How to be a bioinformatician


What are bioinformaticians doing?

Word cloud from manuscript titles published in Bioinformatics from Jan 2013 to April 2014

Page 7: How to be a bioinformatician

Challenges as a bioinformatician

Biology is complex, not black and white
As many exceptions as rules (e.g.: define “gene”)
No single optimal solution to a problem
Results interpretable in many ways (story telling, cherry picking)
Understanding the biological question

Field is moving incredibly fast
Lack of standards, immature/abandoned software
Standard of today obsolete tomorrow
Much time spent on collecting/cleaning-up data, troubleshooting errors
Stay flexible, don’t overinvest in single platform/technology

Hundreds of software tools and databases out there
Easy to get lost
Important to understand their strengths and weaknesses

Page 8: How to be a bioinformatician

Which tools should I use?

179 tools

Heard of: 65%
Used: 30%

Page 9: How to be a bioinformatician

http://omictools.com/

Page 10: How to be a bioinformatician

Things to have in your bioinformatics toolbox

Must haves
Linux command line
Scripting language with associated Bio* library (BioPerl, BioPython, R/Bioconductor, …)
Basic statistical tests, regression, p-values, maximum likelihood, multiple testing correction
Sequence alignment (FASTA & BLAST)
Biological databases
Regular expressions
Sequencing technologies
Web technologies (HTML, XML, …)

Highly recommended
Advanced R skills
Parallel/distributed computing
DBMS, SQL
(Semi-)compiled language (C/C++, Java)
Dimensionality reduction (e.g. PCA)
Cluster analysis
Support Vector Machines
Hidden Markov models
Web framework (e.g. Django)
Version control system (e.g. Git)
Advanced text editor (Emacs, vim)
IDE (e.g. Eclipse, NetBeans)

Page 11: How to be a bioinformatician

What programming language should I learn?

[Table: Requirement vs. Recommended Language; only the Requirement column survives in this transcript]

Speed matters, low-level programming
Rich-client enterprise application development
Text file processing (regex)
Statistical analysis, fancy plots
Rapid prototyping, readable & maintainable scripts
Workflow automation

Be a jack of all trades, master of ONE!

Page 12: How to be a bioinformatician

Perl on decline, R and Python gaining popularity


http://computationalproteomic.blogspot.co.uk/2013/10/which-are-best-programming-languages.html

http://openwetware.org/wiki/Image:Most_Popular_Bioinformatics_Programming_Languages.png

Perl most popular bioinformatics programming language in 2008

R and Python take the lead in 2014

Page 13: How to be a bioinformatician

Top 10 most common and/or annoying mistakes in bioinformatics

Inspired by “What Are The Most Common Stupid Mistakes In Bioinformatics?” (https://www.biostars.org/p/7126/)

Page 14: How to be a bioinformatician

Top-10 most common/annoying mistakes in bioinformatics

#10 - Using genome coordinates with the wrong genome version

(for example, using gene coordinates from human genome version hg18 but reference sequence from version hg19)

Page 15: How to be a bioinformatician

Top-10 most common/annoying mistakes in bioinformatics

#9 - Forgetting to process the second strand of the DNA sequence

Page 16: How to be a bioinformatician

Top-10 most common/annoying mistakes in bioinformatics

#8 - Processing the second strand of the DNA sequence, but taking the reverse instead of the reverse complement sequence
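A minimal sketch of the difference behind mistakes #9 and #8, assuming plain Python without a Bio* library. The function name `reverse_complement` is an illustrative choice, not something from the talk; in real work a proven library routine is preferable.

```python
# Complement table for the four standard bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


forward = "ATGCCN"
print(forward[::-1])                # reverse only (mistake #8): NCCGTA
print(reverse_complement(forward))  # reverse complement: NGGCAT
```

Applying the function twice should give back the original sequence, which makes for a cheap sanity check.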

Page 17: How to be a bioinformatician

Top-10 most common/annoying mistakes in bioinformatics

#7 - Not accounting for different human chromosome names between UCSC and Ensembl

Example:

UCSC: “chr1”
Ensembl: “1”
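One way to guard against this is to normalize names before comparing files. The sketch below is an assumption-laden illustration: the helper names `to_ucsc` and `to_ensembl` are hypothetical, and only the primary chromosomes and the mitochondrial genome ("MT" vs. "chrM") are handled; unplaced scaffolds and alternate contigs follow different rules and are not covered.

```python
def to_ucsc(name: str) -> str:
    """Convert an Ensembl-style name ("1", "X", "MT") to UCSC style ("chr1", "chrX", "chrM")."""
    if name.startswith("chr"):
        return name  # already UCSC style
    return "chrM" if name == "MT" else "chr" + name


def to_ensembl(name: str) -> str:
    """Convert a UCSC-style name ("chr1", "chrM") to Ensembl style ("1", "MT")."""
    if not name.startswith("chr"):
        return name  # already Ensembl style
    return "MT" if name == "chrM" else name[len("chr"):]


print(to_ucsc("1"), to_ensembl("chr1"))   # chr1 1
print(to_ucsc("MT"), to_ensembl("chrM"))  # chrM MT
```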

Page 18: How to be a bioinformatician

Top-10 most common/annoying mistakes in bioinformatics

#6 - Assuming the alphabetical order of chromosome names is

“chr1”, “chr2”, “chr3”, …

when in fact it is

“chr1”, “chr10”, “chr11”, …
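The two orderings can be reproduced in a few lines. This is a sketch only: `chrom_sort_key` is a hypothetical helper that sorts numbered chromosomes numerically and puts all other names (X, Y, M, …) after them in alphabetical order, which is an assumption about the desired order rather than a standard.

```python
def chrom_sort_key(name: str):
    """Sort key: numbered chromosomes numerically, then remaining names alphabetically."""
    suffix = name[3:] if name.startswith("chr") else name
    if suffix.isdigit():
        return (0, int(suffix), "")
    return (1, 0, suffix)


names = ["chr10", "chr2", "chrX", "chr1", "chr11"]
print(sorted(names))                      # ['chr1', 'chr10', 'chr11', 'chr2', 'chrX']
print(sorted(names, key=chrom_sort_key))  # ['chr1', 'chr2', 'chr10', 'chr11', 'chrX']
```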

Page 19: How to be a bioinformatician

Top-10 most common/annoying mistakes in bioinformatics

#5 - Assuming ‘tab’ field separator when in fact it is ‘blank’ (or vice versa)

(the two look almost identical in a text editor)
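A short sketch of how the two separators can be told apart and why the choice matters, assuming a tab-separated line with a blank inside one field. `split_fields` is an illustrative name; the point is to split on an explicit separator instead of on "any whitespace".

```python
def split_fields(line: str, sep: str = "\t") -> list:
    """Split a line on one explicit separator, never on arbitrary whitespace."""
    return line.rstrip("\n").split(sep)


line = "chr1\t100\t200\tmy gene\n"
print(repr(line))          # repr() makes tabs visible as \t
print(split_fields(line))  # ['chr1', '100', '200', 'my gene']
print(line.split())        # ['chr1', '100', '200', 'my', 'gene']
```

Checking that every line yields the expected number of fields catches most separator mix-ups early.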

Page 20: How to be a bioinformatician

Top-10 most common/annoying mistakes in bioinformatics

#4 - Assuming a DNA sequence consists of only four letters (A, T, C, G) while in fact there is a fifth

‘N’ for a missing base (‘X’ for a missing amino acid)
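As an illustration of where the fifth letter bites, here is a sketch of a GC-content calculation that excludes ‘N’ from the denominator. The function name `gc_content` is hypothetical, and the sketch deliberately ignores other ambiguity codes that can also occur in sequence files.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """GC fraction over called bases only; N is excluded from the denominator."""
    counts = Counter(seq.upper())
    called = counts["A"] + counts["C"] + counts["G"] + counts["T"]
    if called == 0:
        raise ValueError("sequence contains no called bases")
    return (counts["G"] + counts["C"]) / called


seq = "GCNNAT"
print(gc_content(seq))                             # 0.5
print((seq.count("G") + seq.count("C")) / len(seq))  # naive version: 0.333...
```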

Page 21: How to be a bioinformatician

Top-10 most common/annoying mistakes in bioinformatics

#3 - Forgetting to use dos2unix on a Windows text file before processing it under Linux

plus spending 1 hour to debug the problem
plus being tricked by this multiple times

Text file line breaks differ between platforms: Linux (LF); Windows (CR+LF); classic Mac (CR).
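The effect of a stray carriage return is easy to demonstrate. The sketch below is not dos2unix itself; `normalize_newlines` is a hypothetical helper that mimics the conversion on a bytes object.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CR+LF) and classic Mac (CR) line breaks to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


raw = b"gene\tscore\r\nTP53\t1.5\r\n"
print(raw.split(b"\n"))                      # last field of each line ends in \r
print(normalize_newlines(raw).split(b"\n"))  # clean fields
```

The trailing `\r` is invisible in most editors, yet it makes the last column of every line fail string comparisons and numeric parsing in some tools.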

Page 22: How to be a bioinformatician

Top-10 most common/annoying mistakes in bioinformatics

#2 - When importing data into MS Excel, letting it auto-convert HUGO gene names into dates and forgetting about it

(e.g., tumor suppressor gene “DEC1” will be converted to “1-DEC” on import)

~30 genes in total
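A rough way to spot symbols at risk before they go anywhere near a spreadsheet is a pattern check. This is a heuristic sketch under stated assumptions: the regular expression and the name `looks_like_date` are illustrative, and which symbols actually get converted depends on the spreadsheet version and locale.

```python
import re

# Month name or abbreviation followed by one or two digits, e.g. DEC1 or SEPT2.
_DATE_LIKE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)-?\d{1,2}$",
    re.IGNORECASE,
)


def looks_like_date(symbol: str) -> bool:
    """Return True if a gene symbol resembles a month-plus-number date."""
    return bool(_DATE_LIKE.match(symbol.strip()))


symbols = ["DEC1", "TP53", "SEPT2", "MARCH1", "BRCA1"]
print([s for s in symbols if looks_like_date(s)])  # ['DEC1', 'SEPT2', 'MARCH1']
```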

Page 23: How to be a bioinformatician

Top-10 most common/annoying mistakes in bioinformatics

#1 - Off-by-one error

There are only two common problems in bioinformatics:

(1) lack of standards, (2) ID conversion, and (3) off-by-one errors

http://en.wikipedia.org/wiki/Off-by-one_error
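A frequent source of off-by-one errors is mixing coordinate conventions: BED files use 0-based, half-open intervals, while formats such as GFF use 1-based, inclusive coordinates. The sketch below shows the conversion and the matching length formulas; the function names are hypothetical.

```python
def bed_to_one_based(start: int, end: int) -> tuple:
    """0-based, half-open [start, end) -> 1-based, inclusive [start, end]."""
    return start + 1, end


def one_based_to_bed(start: int, end: int) -> tuple:
    """1-based, inclusive [start, end] -> 0-based, half-open [start, end)."""
    return start - 1, end


def bed_length(start: int, end: int) -> int:
    """Interval length in 0-based, half-open coordinates."""
    return end - start


def one_based_length(start: int, end: int) -> int:
    """Interval length in 1-based, inclusive coordinates."""
    return end - start + 1


# The first 100 bases of a chromosome in both conventions:
print(bed_to_one_based(0, 100))   # (1, 100)
print(one_based_to_bed(1, 100))   # (0, 100)
print(bed_length(0, 100), one_based_length(1, 100))  # 100 100
```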

Page 24: How to be a bioinformatician

Ten personal recommendations for your future work as a bioinformatician

Page 25: How to be a bioinformatician

#1 - Learn Linux!

Most bioinformatics tools not available on Windows
Linux file systems better for many and/or very large files
Command line interface (CLI) has advantages over graphical user interface (GUI)
Recorded command history (reproducibility)
Key stroke to re-run analysis, instead of repeating 100 mouse clicks
Linux CLI (Shell) much more powerful than Windows CLI

Page 26: How to be a bioinformatician

#2 - Embrace the “Unix tools philosophy”

Small programs (“tools”) instead of monolithic applications
Designed for simple, specific tasks that are performed well (awk, cat, grep, wc, etc.)
Many and well documented parameters
Combined with Unix pipes (read from STDIN, write to STDOUT)

cut -f 3 myfile.txt | sort | uniq

Advantages
Great flexibility, easy re-use of existing tools
Intermediate output can be stored and inspected for troubleshooting
Complex tasks can be performed quickly with shell ‘one-liners’

This paradigm fits bioinformatics well, where often many heterogeneous data files need to be processed in many different ways

http://www.linuxdevcenter.com/lpt/a/302
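To make explicit what the one-liner above computes, here is a rough Python counterpart. It is a sketch, not a replacement: `unique_column_values` is a hypothetical helper, it skips lines with too few fields (which `cut` does not), and it needs more code than the pipe, which is rather the point of the slide.

```python
def unique_column_values(lines, column=3, sep="\t"):
    """Return the sorted distinct values of one 1-based column."""
    values = set()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) >= column:  # skip lines that are too short
            values.add(fields[column - 1])
    return sorted(values)


rows = ["a\t1\tgeneB\n", "b\t2\tgeneA\n", "c\t3\tgeneB\n"]
print(unique_column_values(rows))  # ['geneA', 'geneB']
```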

Page 27: How to be a bioinformatician

Example NGS use case demonstrating the power of the Unix tools philosophy

samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

Explanation
‘samtools mpileup’ piles up short reads from the input BAM file for each position in the reference genome
‘bcftools view’ calls the variants
‘vcfutils.pl vcf2fq’ computes the consensus sequence
The resulting FASTQ sequence is redirected to the output file cns.fq

By knowing available tools and their parameters, bioinformatics ‘wizards’ can get complex stuff done in almost no time

http://samtools.sourceforge.net/mpileup.shtml

Page 28: How to be a bioinformatician

#3 - Don’t reinvent the wheel

Coding is fun, but look around before you hack into your keyboard

Don’t write the 29th FASTA file parser if proven solutions are available
BioPerl
BioPython
Bioconductor

Page 29: How to be a bioinformatician

#4 - If you happen to invent a wheel, …

Document source and parameters well
Use version control system (git, svn)
Deposit code in public repository
sourceforge.net
github.com
Write test cases

Page 30: How to be a bioinformatician

#5 - Automate pipelines with GNU Make

Developed in the 1970s to build executables from source files

Incredibly useful for data-driven workflows as well
Automatic error checking
Parallelization (utilize multiple cores)
Incremental builds (re-start your pipeline from point of failure)
Bug-free

Get started at http://www.bioinformaticszen.com/post/decomplected-workflows-makefiles/

Page 31: How to be a bioinformatician

#6 - Value your time

Architecture vs. accomplishment
“Perfect is the enemy of the good” -- Voltaire
OO design and normalized databases are nice, but can be overkill if requirements change from analysis to analysis

Automate what can be automated
Reproducibility
Easy to repeat analysis with slightly changed parameters

BUT: Don’t spend two days automating a one-time analysis that can be done manually in 10 minutes

Page 32: How to be a bioinformatician

#7 - Make use of free online resources to learn about specialized topics

www.coursera.org
Bioinformatics Algorithms (https://www.coursera.org/course/bioinformatics)
Computing for Data Analysis (https://www.coursera.org/course/compdata)
R Programming (https://www.coursera.org/course/rprog)

https://www.edx.org/
Data Analysis for Genomics (https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401#.U1TUbXV52R8)
Introduction to Biology (https://www.edx.org/course/mitx/mitx-7-00x-introduction-biology-secret-1768#.U1TVL3V52R8)

http://rosalind.info/problems/locations/

Page 33: How to be a bioinformatician

# 8 - Become an expert

Identify an area of interest and get really good at it

Work at places where you can learn from the best

Spend time abroad
Great experience
Labs/companies will hire you not only for what you know, but also for who you know

Page 34: How to be a bioinformatician

#9 - Decide early on if you want to stay in academia or go into industry

Academia
• PhD highly recommended
• Take your time to find compatible supervisor
+ Freedom to pursue own ideas
+ Very flexible working hours
+ Work independently
- Steep & competitive career ladder (postdoc >> PI/prof)
- Lower pay
- Publish or perish

Industry
• PhD beneficial (to get in), but not necessarily required for daily work (e.g. build/maintain databases)
+ More frequent (positive) feedback
+ Higher pay
+ Job security
- More (external) deadlines
- Higher pressure to get things done

See also “Ten Simple Rules for Choosing between Industry and Academia” (David B. Searls, 2009)

Page 35: How to be a bioinformatician

#10 - Stay informed & get connected

Follow literature and blogs
http://en.wikipedia.org/wiki/List_of_bioinformatics_journals
http://www.homolog.us/blogs/blog/2012/07/27/how-to-stay-current-in-bioinformaticsgenomics/

Subscribe via RSS feeds
http://feedly.com or others
Platform independent (e.g. read on your phone)

Bioinformatics Q&A forums
http://www.biostars.org (highly recommended)
http://seqanswers.com/ (focus on NGS)
http://www.reddit.com/r/bioinformatics/ (student-oriented)

Other
http://bioinformatics.org – fosters collaboration in bioinformatics
http://www.researchgate.net – “Facebook” for researchers
German bioinformatics group on XING (https://www.xing.com/net/pri485482x/bin)

Page 36: How to be a bioinformatician

Conclusion

As a bioinformatician, you will be at the forefront of one of the greatest scientific enterprises of our time
Biologists overwhelmed with massive data sets
YOU will get to see exciting results first

Requires integration of knowledge from many domains
IT, biology, medicine, statistics, math, …

Knowing your informatics toolbox AND understanding the biological question is what makes you very valuable

Page 37: How to be a bioinformatician

Thank you!

Christian Frech

[email protected]

Page 38: How to be a bioinformatician

Further Reading

“So you want to be a computational biologist?”
http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html

“What It Takes to Be a Bioinformatician”
http://nav4bioinfo.wordpress.com/2013/03/19/what-it-takes-to-be-a-bioinformatician/

“The alternative ‘what it takes to be a bioinformatician’”
https://biomickwatson.wordpress.com/2013/03/18/the-alternative-what-it-takes-to-be-a-bioinformatician/

“So You Want To Be a Computational Biologist, Or A Bioinformatician?”
http://www.checkmatescientist.net/2013/11/so-you-want-to-be-computational.html

“Being a bioinformatician is hard”
http://www.bioinformaticszen.com/post/being-a-bioinformatician-is-hard/

“How not to be a bioinformatician”
http://www.scfbm.org/content/7/1/3

“Ten Simple Rules for Reproducible Computational Research”
http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285

“Ten Simple Rules for Getting Ahead as a Computational Biologist in Academia”
http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002001;jsessionid=6D5D844E0E2E21C9E565378C7F714D76

“A Quick Guide for Developing Effective Bioinformatics Programming Skills”
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000589

“What Is Really the Salary of a Bioinformatician/Computational Biologist?”
http://www.homolog.us/blogs/blog/2014/04/02/what-is-really-the-salary-of-a-bioinformaticiancomputational-biologist/