setting up a bioinformatics qc pipeline · 3 baker, m. (2016). 1,500 scientists lift the lid on...

39
SETTING UP A BIOINFORMATICS QC PIPELINE BRIAN MCCONEGHY BIOINFORMATICS SPECIALIST SEQUENCING AND BIOINFORMATICS CONSORTIUM, UBC OFFICE OF THE VICE-PRESIDENT, RESEARCH & INNOVATION WESTGRID WEBINAR 2019-11-13 1

Upload: others

Post on 03-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

SETTING UP A BIOINFORMATICS QC PIPELINEBRIAN MCCONEGHY

BIOINFORMATICS SPECIALIST

SEQUENCING AND BIOINFORMATICS CONSORTIUM, UBC

OFFICE OF THE VICE-PRESIDENT, RESEARCH & INNOVATION

WESTGRID WEBINAR 2019-11-13

1

Page 2: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

2

“More than 70% of researchers have tried and

failed to reproduce another scientist's experiments,

and more than half have failed to reproduce their

own experiments.”

Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604),

452-454. doi:10.1038/533452a

Page 3: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

3

Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454.

doi:10.1038/533452a

Page 4: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

Background

Pipelining Tools

Writing the Pipeline

Metrics to Track

Implementation

Conclusions

4

Page 5: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

Background

Pipelining Tools

Writing the Pipeline

Metrics to Track

Implementation

Conclusions

5

Page 6: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

WHAT IS BIOINFORMATICS?

6

Page 7: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

NEXT GENERATION SEQUENCING

7

Sample Preparation SequencingSample (input) QC Sample (library) QC Data!

https://www.makeuseof.com/tag/best-linux-server-operating-systems/www.Illumina.com

https://www.nextadvance.com/ https://pdb101.rcsb.org/motm/84www.thermofisher.com

Page 8: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

NEXT GENERATION SEQUENCING

8

Sample Preparation SequencingSample (input) QC Sample (library) QC Data!

https://www.makeuseof.com/tag/best-linux-server-operating-systems/www.Illumina.com

https://www.nextadvance.com/ https://pdb101.rcsb.org/motm/84www.thermofisher.com

Page 9: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

WHAT IS NEXT GEN SEQUENCING DATA, EXACTLY?

• High-throughput sequencing technology

• Generates millions of ‘reads’

• Reads are just strings of G’s, A’s, T’s, and C’s (with associated quality values)

9

@NB999999:999:ZZZZZZZZ9:1:11101:16570:1094 1:N:0:1AAAGCNGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAAACACT+AAAAA#A6EA66EEEAEEE/E//EAE/E//EEEEEEEEAAEEEAEEEEAEE

phiX 174 control DNA

Page 10: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

10

https://bloggenohub.files.wordpress.com/2015/01/slide1.jpg

Page 11: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

Background

Pipelining Tools

Writing the Pipeline

Metrics to Track

Implementation

Conclusions

11

Page 12: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

WHAT AND WHY

• Workflow management system

• Reproducible and scalable data analysis

• Rules, inputs, and outputs

12

Page 13: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

NEEDS

• Reproducible

• Scalable

• Efficient (Parallelizable)

• Portable

• Automated

13

https://slides.com/johanneskoester/snakemake-short#/3

Page 14: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

WANTS

• Ease of development

• Unix-compatible

• FREE

14

https://slides.com/johanneskoester/snakemake-short#/3

Page 15: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

COMPARISON

• Galaxy

• Ruffus

• Snakemake

15

Page 16: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

COMPARISON

• Galaxy

• Ruffus

•Snakemake

16

Page 17: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

Background

Pipelining Tools

Writing the Pipeline

Metrics to Track

Implementation

Conclusions

17

Page 18: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

RESOURCES - HARDWARE

• Cedar - Compute Canada

• 58,416 Cores

• 306,306 GB of RAM

• 10TB scratch space

18

https://medium.com/monplan/how-we-automated-deployments-

and-testing-with-bitbucket-pipelines-bb478c12c55f

Page 19: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

RESOURCES - PEOPLE

• Advanced Research Computing (ARC)

• Jamie Rosner

• Venkat Mahadevan

• VP Research & Innovation (VPRI)

• Dr. Helen Burt

19

https://medium.com/monplan/how-we-automated-deployments-

and-testing-with-bitbucket-pipelines-bb478c12c55f

Page 20: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

SNAKEMAKE

• Decompose workflow into rules

• Rules define how to obtain output files from input files

• Snakemake infers dependencies and execution order

20

Page 21: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

21

https://xkcd.com/1343/

SNAKEMAKE

Page 22: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

22

TECHNICAL

Page 23: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

SIMPLICITY!

rule sort:

input:

"path/to/dataset.txt"

output:

"dataset.sorted.txt"

shell:

"sort {input} > {output}"

23

Page 24: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

GENERALIZE RULES WITH NAMED WILDCARDS

24

rule sort:input:

"path/to/{dataset}.txt"output:

"{dataset}.sorted.txt"shell:

"sort {input} > {output}"

Page 25: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

SPECIFY MULTIPLE INPUTS (AND OUTPUTS)REFER BY INDEX

25

rule sort_and_annotate:input:

"path/to/{dataset}.txt","path/to/annotation.txt"

output:"{dataset}.sorted.txt"

shell:"paste <(sort {input[0]}) {input[1]} > {output}"

Page 26: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

CAN SPECIFY MULTIPLE INPUTS (AND OUTPUTS), AND REFER BY NAME

26

rule sort_and_annotate:input:

a="path/to/{dataset}.txt",b="path/to/annotation.txt"

output:"{dataset}.sorted.txt"

shell:"paste <(sort {input.a}) {input.b} > {output}"

Page 27: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

USE PYTHONWITHIN RULES

27

rule sort:input:

a="path/to/{dataset}.txt"output:

b="{dataset}.sorted.txt"run:

with open(output.b, "w") as out:for l in sorted(open(input.a)):

print(l, file=out)

Page 28: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

A REAL RULE

28

• Used in DNA QC pipeline

rule bwa_mem_map_reads:input:

get_trimmed_readsoutput:

temp('mapped/{sample}-{unit}.sorted.bam')log:

'logs/bwa_mem/{sample}-{unit}.log'params:

index = get_genome_index,rg = get_read_group_bwa

threads: 46shell:

'(bwa mem -t {threads} {params.rg} {params.index} {input} | ''samtools sort -T $SLURM_TMPDIR/ -o {output} -) 2> {log}'

Page 29: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

JOB EXECUTION

29

• A job only executes if:

1. output file is the target requested and does not exist

2. output file needed by another executed job (i.e. is an

input to another job) and does not exist

3. input file is newer than the output file

4. input file will be updated by other job

5. execution is forced

Page 30: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

CLUSTER EXECUTION

30

• Can set up pipeline profiles

• Execute DAG by way of cluster job submission

• Configuration file

• Max jobs at a time

• CPUs

• MEM

• General (per profile) or granular (per rule)

Page 31: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

Background

Pipelining Tools

Metrics to Track

Writing the Pipeline

Implementation

Conclusions

31

Page 32: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

METRICS TO TRACK

• Adapter trimming

• Duplicate Rate

• % Aligned (for genomes we can map to)

• Insert size

• Coverage

• Error rate

• GC Content

• For RNA, specifically:

• Strand specificity (% correct strand)

• 5’-3’ bias

• % rRNA

• Intron-exon ratio

32

Page 33: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

Background

Pipelining Tools

Writing the Pipeline

Metrics to Track

Implementation

Conclusions

33

Page 34: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

IMPLEMENTATION

• Conda environment

• Version controlled - GitHub

• SBC has 3 QC pipelines (combined into 1, dynamically determined)

• Paired-end DNA QC

• Single-end DNA QC

• Paired-end RNA QC

34

Page 35: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

DNA QC WORKFLOW – 2 SAMPLES

35

Page 36: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

WORKFLOWS CAN BE COMPLEX

36

Leipzig, J. (2016). A review of bioinformatic pipeline frameworks. Briefings in Bioinformatics. doi:10.1093/bib/bbw020

Page 37: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

Background

Pipelining Tools

Writing the Pipeline

Metrics to Track

Implementation

Conclusions

37

Page 38: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

CONCLUSIONS

• Snakemake satisfied all needs of the SBC and is simple to work with

• The complex metrics the SBC is most interested in are being tracked, in an

automated fashion

• Implementation allows for reproducible, scalable, flexible, trackable QC

38

Page 39: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a

THANK YOU

39

Link to GitHub with workshop instructions:

https://bit.ly/2Xf4HN6