mcbl documentation

59
MCBL Documentation Release 1.0 Saranga Jul 07, 2020

Upload: others

Post on 16-Jan-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

MCBL DocumentationRelease 10

Saranga

Jul 07 2020

About The MCBL

1 MCIC Computational Biology Lab (MCBL) 3

2 MCBL membership 5

3 MCBL servers and computing resources 7

4 Download from BaseSpace 11

5 Share runs and projects in BaseSpace 17

6 Using BaseMount 23

7 Download from hudsonalphaorg 31

8 Filter a CASAVA-generated fastq file 33

9 Fastq adapter removal and QC 35

10 DESeq2 with phyloseq 39

11 GBS 43

12 A basic microbiome analysis 47

13 Indices and tables 51

Python Module Index 53

Index 55

i

ii

MCBL Documentation Release 10

About The MCBL 1

MCBL Documentation Release 10

2 About The MCBL

CHAPTER 1

MCIC Computational Biology Lab (MCBL)

11 Our goal

Our mission is to build core support and intellectual leadership in the area of bioinformatics to support research atthe OARDC by providing an engaging work environment space infrastructure and training for performing researchinvolving biological data analysis

We aspire for the MCBL to become the place to be for learning and performing bioinformatics research at the OARDCthe place where ideas are discussed and exchanged students and users learn from each other and get help and supportfrom our experienced staff when needed and we as a community move our bioinformatics knowledge forward

We specialize in working with High-throughput Sequencing (HTS) data and providing training in reproducible scienceusing modern tools

We are happy to help you to carry out your own analysis This will include helping with experimental design dis-cussing the most effective way to carry out your data analysis providing the computational infrastructure (softwarescripts and computers) interpreting results and answering questions you might come across along the way We canalso process and analyze HTS data for you using either standardized or custom pipelines

12 MCBL services

Data analysis Contacts

bull Processing analyzing and interpreting HTS databull WGS RADseqGBS RNAseq metagenomics microbiomics bull Quality controlbull Genotypingbull Genome assembly and annotationbull Differential expression analysesbull Population genetic and genomic analysesbull GWAS (e)QTL analyses

Jelmer Poelstra

3

MCBL Documentation Release 10

Workshops amp training Contacts

bull UNIX command line and bash scriptingbull Running analyses at the Ohio Supercomputer Center (OSC)bull Running analyses in the cloudbull R - general bioinformatics R markdown dashboardingbull Python - general bioinformaticsbull Reproducible compute environments with containers and condabull Reproducible analysis pipelines with Snakemakebull Version control with git and Githubbull Data management and deposition in public repositories

Jelmer Poelstra

High-throughput Sequencing services Contacts

bull Illumina Miseq bull Tea Meulia (di-rector)

bull Fiorella Cis-neros Carter

See also

Main MCIC page for more details on sequencing services

4 Chapter 1 MCIC Computational Biology Lab (MCBL)

CHAPTER 2

MCBL membership

21 MCBL membership benefits

bull Access to the MCBL and MCBL computers 247

bull Free access to all MCBL activities workshops user group meeting etc

bull Access to 1 TB data storage space for the duration of the membership

bull Assistance with experimental design bioinformatic analyses and interpretation of your data

22 How to apply for an MCBL membership

Step 1 Fill out and submit the MCBL application form MCBL Application

Step 2 Submit your membership fee to MCIC

Step 3 Contact Jelmer Poelstra for login credentials

Note Access to MCBL resources will be granted till we receive the payment Once the form is completed andsubmitted a notification e-mail will be sent to the membership applicant and the PI

23 MCBL membership duration

Membership is offered for a 6 month or 1 year period at a time

5

MCBL Documentation Release 10

24 MCBL membership termination

MCBL membership will be terminated after the membership period ends or upon a written request from user or PI toterminate the membership

25 Contacts

Person InformationTea Meulia (direc-tor)

Questions regarding membership

Jelmer Poelstra MCBL server access and remote accessJody Whittier MCBL payments

6 Chapter 2 MCBL membership

CHAPTER 3

MCBL servers and computing resources

31 Servers

Server Processors Cores MemoryLocalDisk

mcic-ender-svr four 240GHz ten-core Intelreg Xeon processors E7-4870 40 10TB

16TB

mcic-ender-svr2 four 200GHz ten-core Intelreg Xeon processors E7-4850 40 12TB

10TB

mcic-ent-srvr two 267GHz six-core Intelreg Xeon processors X7542 12 250GB 20TB

32 Workstations

Workstation Processors Cores MemoryLocalDisk

mcic-galaxy-srvr two 347GHz six-core Intelreg Xeon processors X5690 12 94GB

26TB

mcic-mac-srvr two 293GHz six-core Intelreg Xeon processors X5670 12 64GB

40TB

7

MCBL Documentation Release 10

33 Desktops

Desktop Processors Cores MemoryLocalDisk

mcic-sel019-d1 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d2 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d3 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d4 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d5 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d6 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d7 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

34 Software

The following bioinformatics software are availabe through the MCBL Please contact Jelmer Poelstra for details onavailability of the software

341 Commercial Software

Application Version DescriptionCLCBio Work-bench

851 A comprehensive and user-friendly analysis package for analyzing com-paring and visualizing next generation sequencing data

Blast2GO Pro Pro A complete framework for functional annotation and analysis

8 Chapter 3 MCBL servers and computing resources

MCBL Documentation Release 10

342 Open Source Software

Application Version Description ModuleName

Bowtie 1102-223 An ultrafast and memory-efficient tool for aligning sequencingreads to long reference sequences

Bowtie-ltversiongt

Cd-hit 461 A very widely used program for clustering and comparing pro-tein or nucleotide sequences

cd-hit-vltversiongt

Cutadapt 142181 Removes adapter sequences from high-throughput sequencingdata

Cutadaptltversiongt

Exonerate 220 A generic tool for pairwise sequence comparison ExonerateltversiongtExpress 151 eXpress is a streaming tool for quantifying the abundances of

a set of target sequences from sampled subsequencesexpress-ltversiongt

Fastqc 151 Aims to provide a simple way to do some quality controlchecks on raw sequence data coming from high throughputsequencing pipelines

Fastqc-ltversiongt

GenomeAnalysisTK32-2 GATK tools for error modeling data compression and variantcalling

GenomeAnalysisTK-ltversiongt

Maker 2318 MAKER is a portable and easily configurable genome annota-tion pipeline

Makerltversiongt

Mothur 133135 Tool for analyzing 16S rRNA gene sequences Mothur-ltversiongt

Mummer 323 A system for rapidly aligning entire genomes whether in com-plete or draft form

Mummerltversiongt

Pandaseq 28 A program to align Illumina reads optionally with PCRprimers embedded in the sequence and reconstruct an over-lapping sequence

Pandaseqltversiongt

Qiime 18 A program for comparison and analysis of microbial commu-nities primarily based on high-throughput amplicon sequenc-ing data

Qiime-ltversiongt

Rsem 1216 A software package for estimating gene and isoform expres-sion levels from RNA-Seq data

rsem-ltversiongt

Samtools 0119 Provides various utilities for manipulating alignments in theSAM format including sorting merging indexing and generat-ing alignments in a per-position format

Samtools-ltversiongt

SNAP 0119 A new sequence aligner that is 3-20x faster and just as accurateas existing tools like BWA-mem Bowtie2 and Novoalign

SNAPltversiongt

Trim-fastq 122 A Fastq quality trimmer Trim-fastq-ltversiongt

Trinity r20140717 a novel method for the efficient and robust de novo reconstruc-tion of transcriptomes from RNA-seq data

Trinity

34 Software 9

MCBL Documentation Release 10

10 Chapter 3 MCBL servers and computing resources

CHAPTER 4

Download from BaseSpace

Note

Purpose This short tutorial will show you how to download MiSeq sequencing data from Illumina BaseS-pace

Author Wirat Pipatpongpinyo

Date July 6 2020

41 Step-by-step guide to download your sequence data from BaseS-pace

1 You will receive two invitation emails frombasespace-noreplyilluminacom for the trans-fer of ownership to you one for the sequencing runand another for the project To be able to fullyaccess the data you must accept both invitations

11

MCBL Documentation Release 10

2 Once you click the link (Click here to acceptthis transfer of ownership) in one of the invi-tation emails you will be taken to the Illumina BaseSpacewebsite where you can log in to your account

3 After you have logged in BaseSpace will bring you tothe DASHBOARD tab Click Accept in both boxes to takethe ownership of both the run and project

4 Once transfer is acceptedclick on the name of the run

12 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

About The MCBL

1 MCIC Computational Biology Lab (MCBL) 3

2 MCBL membership 5

3 MCBL servers and computing resources 7

4 Download from BaseSpace 11

5 Share runs and projects in BaseSpace 17

6 Using BaseMount 23

7 Download from hudsonalphaorg 31

8 Filter a CASAVA-generated fastq file 33

9 Fastq adapter removal and QC 35

10 DESeq2 with phyloseq 39

11 GBS 43

12 A basic microbiome analysis 47

13 Indices and tables 51

Python Module Index 53

Index 55

i

ii

MCBL Documentation Release 10

About The MCBL 1

MCBL Documentation Release 10

2 About The MCBL

CHAPTER 1

MCIC Computational Biology Lab (MCBL)

11 Our goal

Our mission is to build core support and intellectual leadership in the area of bioinformatics to support research atthe OARDC by providing an engaging work environment space infrastructure and training for performing researchinvolving biological data analysis

We aspire for the MCBL to become the place to be for learning and performing bioinformatics research at the OARDCthe place where ideas are discussed and exchanged students and users learn from each other and get help and supportfrom our experienced staff when needed and we as a community move our bioinformatics knowledge forward

We specialize in working with High-throughput Sequencing (HTS) data and providing training in reproducible scienceusing modern tools

We are happy to help you to carry out your own analysis This will include helping with experimental design dis-cussing the most effective way to carry out your data analysis providing the computational infrastructure (softwarescripts and computers) interpreting results and answering questions you might come across along the way We canalso process and analyze HTS data for you using either standardized or custom pipelines

12 MCBL services

Data analysis Contacts

bull Processing analyzing and interpreting HTS databull WGS RADseqGBS RNAseq metagenomics microbiomics bull Quality controlbull Genotypingbull Genome assembly and annotationbull Differential expression analysesbull Population genetic and genomic analysesbull GWAS (e)QTL analyses

Jelmer Poelstra

3

MCBL Documentation Release 10

Workshops amp training Contacts

bull UNIX command line and bash scriptingbull Running analyses at the Ohio Supercomputer Center (OSC)bull Running analyses in the cloudbull R - general bioinformatics R markdown dashboardingbull Python - general bioinformaticsbull Reproducible compute environments with containers and condabull Reproducible analysis pipelines with Snakemakebull Version control with git and Githubbull Data management and deposition in public repositories

Jelmer Poelstra

High-throughput Sequencing services Contacts

bull Illumina Miseq bull Tea Meulia (di-rector)

bull Fiorella Cis-neros Carter

See also

Main MCIC page for more details on sequencing services

4 Chapter 1 MCIC Computational Biology Lab (MCBL)

CHAPTER 2

MCBL membership

21 MCBL membership benefits

bull Access to the MCBL and MCBL computers 247

bull Free access to all MCBL activities workshops user group meeting etc

bull Access to 1 TB data storage space for the duration of the membership

bull Assistance with experimental design bioinformatic analyses and interpretation of your data

22 How to apply for an MCBL membership

Step 1 Fill out and submit the MCBL application form MCBL Application

Step 2 Submit your membership fee to MCIC

Step 3 Contact Jelmer Poelstra for login credentials

Note Access to MCBL resources will be granted till we receive the payment Once the form is completed andsubmitted a notification e-mail will be sent to the membership applicant and the PI

23 MCBL membership duration

Membership is offered for a 6 month or 1 year period at a time

5

MCBL Documentation Release 10

24 MCBL membership termination

MCBL membership will be terminated after the membership period ends or upon a written request from user or PI toterminate the membership

25 Contacts

Person InformationTea Meulia (direc-tor)

Questions regarding membership

Jelmer Poelstra MCBL server access and remote accessJody Whittier MCBL payments

6 Chapter 2 MCBL membership

CHAPTER 3

MCBL servers and computing resources

31 Servers

Server Processors Cores MemoryLocalDisk

mcic-ender-svr four 240GHz ten-core Intelreg Xeon processors E7-4870 40 10TB

16TB

mcic-ender-svr2 four 200GHz ten-core Intelreg Xeon processors E7-4850 40 12TB

10TB

mcic-ent-srvr two 267GHz six-core Intelreg Xeon processors X7542 12 250GB 20TB

32 Workstations

Workstation Processors Cores MemoryLocalDisk

mcic-galaxy-srvr two 347GHz six-core Intelreg Xeon processors X5690 12 94GB

26TB

mcic-mac-srvr two 293GHz six-core Intelreg Xeon processors X5670 12 64GB

40TB

7

MCBL Documentation Release 10

33 Desktops

Desktop Processors Cores MemoryLocalDisk

mcic-sel019-d1 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d2 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d3 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d4 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d5 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d6 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d7 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

34 Software

The following bioinformatics software are availabe through the MCBL Please contact Jelmer Poelstra for details onavailability of the software

341 Commercial Software

Application Version DescriptionCLCBio Work-bench

851 A comprehensive and user-friendly analysis package for analyzing com-paring and visualizing next generation sequencing data

Blast2GO Pro Pro A complete framework for functional annotation and analysis

8 Chapter 3 MCBL servers and computing resources

MCBL Documentation Release 10

342 Open Source Software

Application Version Description ModuleName

Bowtie 1102-223 An ultrafast and memory-efficient tool for aligning sequencingreads to long reference sequences

Bowtie-ltversiongt

Cd-hit 461 A very widely used program for clustering and comparing pro-tein or nucleotide sequences

cd-hit-vltversiongt

Cutadapt 142181 Removes adapter sequences from high-throughput sequencingdata

Cutadaptltversiongt

Exonerate 220 A generic tool for pairwise sequence comparison ExonerateltversiongtExpress 151 eXpress is a streaming tool for quantifying the abundances of

a set of target sequences from sampled subsequencesexpress-ltversiongt

Fastqc 151 Aims to provide a simple way to do some quality controlchecks on raw sequence data coming from high throughputsequencing pipelines

Fastqc-ltversiongt

GenomeAnalysisTK32-2 GATK tools for error modeling data compression and variantcalling

GenomeAnalysisTK-ltversiongt

Maker 2318 MAKER is a portable and easily configurable genome annota-tion pipeline

Makerltversiongt

Mothur 133135 Tool for analyzing 16S rRNA gene sequences Mothur-ltversiongt

Mummer 323 A system for rapidly aligning entire genomes whether in com-plete or draft form

Mummerltversiongt

Pandaseq 28 A program to align Illumina reads optionally with PCRprimers embedded in the sequence and reconstruct an over-lapping sequence

Pandaseqltversiongt

Qiime 18 A program for comparison and analysis of microbial commu-nities primarily based on high-throughput amplicon sequenc-ing data

Qiime-ltversiongt

Rsem 1216 A software package for estimating gene and isoform expres-sion levels from RNA-Seq data

rsem-ltversiongt

Samtools 0119 Provides various utilities for manipulating alignments in theSAM format including sorting merging indexing and generat-ing alignments in a per-position format

Samtools-ltversiongt

SNAP 0119 A new sequence aligner that is 3-20x faster and just as accurateas existing tools like BWA-mem Bowtie2 and Novoalign

SNAPltversiongt

Trim-fastq 122 A Fastq quality trimmer Trim-fastq-ltversiongt

Trinity r20140717 a novel method for the efficient and robust de novo reconstruc-tion of transcriptomes from RNA-seq data

Trinity

34 Software 9

MCBL Documentation Release 10

10 Chapter 3 MCBL servers and computing resources

CHAPTER 4

Download from BaseSpace

Note

Purpose This short tutorial will show you how to download MiSeq sequencing data from Illumina BaseS-pace

Author Wirat Pipatpongpinyo

Date July 6 2020

41 Step-by-step guide to download your sequence data from BaseS-pace

1 You will receive two invitation emails frombasespace-noreplyilluminacom for the trans-fer of ownership to you one for the sequencing runand another for the project To be able to fullyaccess the data you must accept both invitations

11

MCBL Documentation Release 10

2 Once you click the link (Click here to acceptthis transfer of ownership) in one of the invi-tation emails you will be taken to the Illumina BaseSpacewebsite where you can log in to your account

3 After you have logged in BaseSpace will bring you tothe DASHBOARD tab Click Accept in both boxes to takethe ownership of both the run and project

4 Once transfer is acceptedclick on the name of the run

12 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

ii

MCBL Documentation Release 10

About The MCBL 1

MCBL Documentation Release 10

2 About The MCBL

CHAPTER 1

MCIC Computational Biology Lab (MCBL)

11 Our goal

Our mission is to build core support and intellectual leadership in the area of bioinformatics to support research atthe OARDC by providing an engaging work environment space infrastructure and training for performing researchinvolving biological data analysis

We aspire for the MCBL to become the place to be for learning and performing bioinformatics research at the OARDCthe place where ideas are discussed and exchanged students and users learn from each other and get help and supportfrom our experienced staff when needed and we as a community move our bioinformatics knowledge forward

We specialize in working with High-throughput Sequencing (HTS) data and providing training in reproducible scienceusing modern tools

We are happy to help you to carry out your own analysis This will include helping with experimental design dis-cussing the most effective way to carry out your data analysis providing the computational infrastructure (softwarescripts and computers) interpreting results and answering questions you might come across along the way We canalso process and analyze HTS data for you using either standardized or custom pipelines

12 MCBL services

Data analysis Contacts

bull Processing analyzing and interpreting HTS databull WGS RADseqGBS RNAseq metagenomics microbiomics bull Quality controlbull Genotypingbull Genome assembly and annotationbull Differential expression analysesbull Population genetic and genomic analysesbull GWAS (e)QTL analyses

Jelmer Poelstra

3

MCBL Documentation Release 10

Workshops amp training Contacts

bull UNIX command line and bash scriptingbull Running analyses at the Ohio Supercomputer Center (OSC)bull Running analyses in the cloudbull R - general bioinformatics R markdown dashboardingbull Python - general bioinformaticsbull Reproducible compute environments with containers and condabull Reproducible analysis pipelines with Snakemakebull Version control with git and Githubbull Data management and deposition in public repositories

Jelmer Poelstra

High-throughput Sequencing services Contacts

bull Illumina Miseq bull Tea Meulia (di-rector)

bull Fiorella Cis-neros Carter

See also

Main MCIC page for more details on sequencing services

4 Chapter 1 MCIC Computational Biology Lab (MCBL)

CHAPTER 2

MCBL membership

21 MCBL membership benefits

bull Access to the MCBL and MCBL computers 247

bull Free access to all MCBL activities workshops user group meeting etc

bull Access to 1 TB data storage space for the duration of the membership

bull Assistance with experimental design bioinformatic analyses and interpretation of your data

22 How to apply for an MCBL membership

Step 1 Fill out and submit the MCBL application form MCBL Application

Step 2 Submit your membership fee to MCIC

Step 3 Contact Jelmer Poelstra for login credentials

Note Access to MCBL resources will be granted till we receive the payment Once the form is completed andsubmitted a notification e-mail will be sent to the membership applicant and the PI

23 MCBL membership duration

Membership is offered for a 6 month or 1 year period at a time

5

MCBL Documentation Release 10

24 MCBL membership termination

MCBL membership will be terminated after the membership period ends or upon a written request from user or PI toterminate the membership

25 Contacts

Person InformationTea Meulia (direc-tor)

Questions regarding membership

Jelmer Poelstra MCBL server access and remote accessJody Whittier MCBL payments

6 Chapter 2 MCBL membership

CHAPTER 3

MCBL servers and computing resources

31 Servers

Server Processors Cores MemoryLocalDisk

mcic-ender-svr four 240GHz ten-core Intelreg Xeon processors E7-4870 40 10TB

16TB

mcic-ender-svr2 four 200GHz ten-core Intelreg Xeon processors E7-4850 40 12TB

10TB

mcic-ent-srvr two 267GHz six-core Intelreg Xeon processors X7542 12 250GB 20TB

32 Workstations

Workstation Processors Cores MemoryLocalDisk

mcic-galaxy-srvr two 347GHz six-core Intelreg Xeon processors X5690 12 94GB

26TB

mcic-mac-srvr two 293GHz six-core Intelreg Xeon processors X5670 12 64GB

40TB

7

MCBL Documentation Release 10

33 Desktops

Desktop Processors Cores MemoryLocalDisk

mcic-sel019-d1 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d2 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d3 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d4 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d5 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d6 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d7 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

34 Software

The following bioinformatics software are availabe through the MCBL Please contact Jelmer Poelstra for details onavailability of the software

341 Commercial Software

Application Version DescriptionCLCBio Work-bench

851 A comprehensive and user-friendly analysis package for analyzing com-paring and visualizing next generation sequencing data

Blast2GO Pro Pro A complete framework for functional annotation and analysis

8 Chapter 3 MCBL servers and computing resources

MCBL Documentation Release 10

342 Open Source Software

Application Version Description ModuleName

Bowtie 1102-223 An ultrafast and memory-efficient tool for aligning sequencingreads to long reference sequences

Bowtie-ltversiongt

Cd-hit 461 A very widely used program for clustering and comparing pro-tein or nucleotide sequences

cd-hit-vltversiongt

Cutadapt 142181 Removes adapter sequences from high-throughput sequencingdata

Cutadaptltversiongt

Exonerate 220 A generic tool for pairwise sequence comparison ExonerateltversiongtExpress 151 eXpress is a streaming tool for quantifying the abundances of

a set of target sequences from sampled subsequencesexpress-ltversiongt

Fastqc 151 Aims to provide a simple way to do some quality controlchecks on raw sequence data coming from high throughputsequencing pipelines

Fastqc-ltversiongt

GenomeAnalysisTK32-2 GATK tools for error modeling data compression and variantcalling

GenomeAnalysisTK-ltversiongt

Maker 2318 MAKER is a portable and easily configurable genome annota-tion pipeline

Makerltversiongt

Mothur 133135 Tool for analyzing 16S rRNA gene sequences Mothur-ltversiongt

Mummer 323 A system for rapidly aligning entire genomes whether in com-plete or draft form

Mummerltversiongt

Pandaseq 28 A program to align Illumina reads optionally with PCRprimers embedded in the sequence and reconstruct an over-lapping sequence

Pandaseqltversiongt

Qiime 18 A program for comparison and analysis of microbial commu-nities primarily based on high-throughput amplicon sequenc-ing data

Qiime-ltversiongt

Rsem 1216 A software package for estimating gene and isoform expres-sion levels from RNA-Seq data

rsem-ltversiongt

Samtools 0119 Provides various utilities for manipulating alignments in theSAM format including sorting merging indexing and generat-ing alignments in a per-position format

Samtools-ltversiongt

SNAP 0119 A new sequence aligner that is 3-20x faster and just as accurateas existing tools like BWA-mem Bowtie2 and Novoalign

SNAPltversiongt

Trim-fastq 122 A Fastq quality trimmer Trim-fastq-ltversiongt

Trinity r20140717 a novel method for the efficient and robust de novo reconstruc-tion of transcriptomes from RNA-seq data

Trinity

34 Software 9

MCBL Documentation Release 10

10 Chapter 3 MCBL servers and computing resources

CHAPTER 4

Download from BaseSpace

Note

Purpose This short tutorial will show you how to download MiSeq sequencing data from Illumina BaseS-pace

Author Wirat Pipatpongpinyo

Date July 6 2020

41 Step-by-step guide to download your sequence data from BaseS-pace

1 You will receive two invitation emails frombasespace-noreplyilluminacom for the trans-fer of ownership to you one for the sequencing runand another for the project To be able to fullyaccess the data you must accept both invitations

11

MCBL Documentation Release 10

2 Once you click the link (Click here to acceptthis transfer of ownership) in one of the invi-tation emails you will be taken to the Illumina BaseSpacewebsite where you can log in to your account

3 After you have logged in BaseSpace will bring you tothe DASHBOARD tab Click Accept in both boxes to takethe ownership of both the run and project

4 Once transfer is acceptedclick on the name of the run

12 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

About The MCBL 1

MCBL Documentation Release 10

2 About The MCBL

CHAPTER 1

MCIC Computational Biology Lab (MCBL)

11 Our goal

Our mission is to build core support and intellectual leadership in the area of bioinformatics to support research atthe OARDC by providing an engaging work environment space infrastructure and training for performing researchinvolving biological data analysis

We aspire for the MCBL to become the place to be for learning and performing bioinformatics research at the OARDCthe place where ideas are discussed and exchanged students and users learn from each other and get help and supportfrom our experienced staff when needed and we as a community move our bioinformatics knowledge forward

We specialize in working with High-throughput Sequencing (HTS) data and providing training in reproducible scienceusing modern tools

We are happy to help you to carry out your own analysis This will include helping with experimental design dis-cussing the most effective way to carry out your data analysis providing the computational infrastructure (softwarescripts and computers) interpreting results and answering questions you might come across along the way We canalso process and analyze HTS data for you using either standardized or custom pipelines

12 MCBL services

Data analysis Contacts

bull Processing analyzing and interpreting HTS databull WGS RADseqGBS RNAseq metagenomics microbiomics bull Quality controlbull Genotypingbull Genome assembly and annotationbull Differential expression analysesbull Population genetic and genomic analysesbull GWAS (e)QTL analyses

Jelmer Poelstra

3

MCBL Documentation Release 10

Workshops amp training Contacts

bull UNIX command line and bash scriptingbull Running analyses at the Ohio Supercomputer Center (OSC)bull Running analyses in the cloudbull R - general bioinformatics R markdown dashboardingbull Python - general bioinformaticsbull Reproducible compute environments with containers and condabull Reproducible analysis pipelines with Snakemakebull Version control with git and Githubbull Data management and deposition in public repositories

Jelmer Poelstra

High-throughput Sequencing services Contacts

bull Illumina Miseq bull Tea Meulia (di-rector)

bull Fiorella Cis-neros Carter

See also

Main MCIC page for more details on sequencing services

4 Chapter 1 MCIC Computational Biology Lab (MCBL)

CHAPTER 2

MCBL membership

21 MCBL membership benefits

bull Access to the MCBL and MCBL computers 247

bull Free access to all MCBL activities workshops user group meeting etc

bull Access to 1 TB data storage space for the duration of the membership

bull Assistance with experimental design bioinformatic analyses and interpretation of your data

22 How to apply for an MCBL membership

Step 1 Fill out and submit the MCBL application form MCBL Application

Step 2 Submit your membership fee to MCIC

Step 3 Contact Jelmer Poelstra for login credentials

Note Access to MCBL resources will be granted till we receive the payment Once the form is completed andsubmitted a notification e-mail will be sent to the membership applicant and the PI

23 MCBL membership duration

Membership is offered for a 6 month or 1 year period at a time

5

MCBL Documentation Release 10

24 MCBL membership termination

MCBL membership will be terminated after the membership period ends or upon a written request from user or PI toterminate the membership

25 Contacts

Person InformationTea Meulia (direc-tor)

Questions regarding membership

Jelmer Poelstra MCBL server access and remote accessJody Whittier MCBL payments

6 Chapter 2 MCBL membership

CHAPTER 3

MCBL servers and computing resources

31 Servers

Server Processors Cores MemoryLocalDisk

mcic-ender-svr four 240GHz ten-core Intelreg Xeon processors E7-4870 40 10TB

16TB

mcic-ender-svr2 four 200GHz ten-core Intelreg Xeon processors E7-4850 40 12TB

10TB

mcic-ent-srvr two 267GHz six-core Intelreg Xeon processors X7542 12 250GB 20TB

32 Workstations

Workstation Processors Cores MemoryLocalDisk

mcic-galaxy-srvr two 347GHz six-core Intelreg Xeon processors X5690 12 94GB

26TB

mcic-mac-srvr two 293GHz six-core Intelreg Xeon processors X5670 12 64GB

40TB

7

MCBL Documentation Release 10

33 Desktops

Desktop Processors Cores MemoryLocalDisk

mcic-sel019-d1 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d2 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d3 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d4 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d5 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d6 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d7 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

34 Software

The following bioinformatics software are availabe through the MCBL Please contact Jelmer Poelstra for details onavailability of the software

341 Commercial Software

Application Version DescriptionCLCBio Work-bench

851 A comprehensive and user-friendly analysis package for analyzing com-paring and visualizing next generation sequencing data

Blast2GO Pro Pro A complete framework for functional annotation and analysis

8 Chapter 3 MCBL servers and computing resources

MCBL Documentation Release 10

342 Open Source Software

Application Version Description ModuleName

Bowtie 1102-223 An ultrafast and memory-efficient tool for aligning sequencingreads to long reference sequences

Bowtie-ltversiongt

Cd-hit 461 A very widely used program for clustering and comparing pro-tein or nucleotide sequences

cd-hit-vltversiongt

Cutadapt 142181 Removes adapter sequences from high-throughput sequencingdata

Cutadaptltversiongt

Exonerate 220 A generic tool for pairwise sequence comparison ExonerateltversiongtExpress 151 eXpress is a streaming tool for quantifying the abundances of

a set of target sequences from sampled subsequencesexpress-ltversiongt

Fastqc 151 Aims to provide a simple way to do some quality controlchecks on raw sequence data coming from high throughputsequencing pipelines

Fastqc-ltversiongt

GenomeAnalysisTK32-2 GATK tools for error modeling data compression and variantcalling

GenomeAnalysisTK-ltversiongt

Maker 2318 MAKER is a portable and easily configurable genome annota-tion pipeline

Makerltversiongt

Mothur 133135 Tool for analyzing 16S rRNA gene sequences Mothur-ltversiongt

Mummer 323 A system for rapidly aligning entire genomes whether in com-plete or draft form

Mummerltversiongt

Pandaseq 28 A program to align Illumina reads optionally with PCRprimers embedded in the sequence and reconstruct an over-lapping sequence

Pandaseqltversiongt

Qiime 18 A program for comparison and analysis of microbial commu-nities primarily based on high-throughput amplicon sequenc-ing data

Qiime-ltversiongt

Rsem 1216 A software package for estimating gene and isoform expres-sion levels from RNA-Seq data

rsem-ltversiongt

Samtools 0119 Provides various utilities for manipulating alignments in theSAM format including sorting merging indexing and generat-ing alignments in a per-position format

Samtools-ltversiongt

SNAP 0119 A new sequence aligner that is 3-20x faster and just as accurateas existing tools like BWA-mem Bowtie2 and Novoalign

SNAPltversiongt

Trim-fastq 122 A Fastq quality trimmer Trim-fastq-ltversiongt

Trinity r20140717 a novel method for the efficient and robust de novo reconstruc-tion of transcriptomes from RNA-seq data

Trinity

34 Software 9

MCBL Documentation Release 10

10 Chapter 3 MCBL servers and computing resources

CHAPTER 4

Download from BaseSpace

Note

Purpose This short tutorial will show you how to download MiSeq sequencing data from Illumina BaseS-pace

Author Wirat Pipatpongpinyo

Date July 6 2020

41 Step-by-step guide to download your sequence data from BaseS-pace

1 You will receive two invitation emails frombasespace-noreplyilluminacom for the trans-fer of ownership to you one for the sequencing runand another for the project To be able to fullyaccess the data you must accept both invitations

11

MCBL Documentation Release 10

2 Once you click the link (Click here to acceptthis transfer of ownership) in one of the invi-tation emails you will be taken to the Illumina BaseSpacewebsite where you can log in to your account

3 After you have logged in BaseSpace will bring you tothe DASHBOARD tab Click Accept in both boxes to takethe ownership of both the run and project

4 Once transfer is acceptedclick on the name of the run

12 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

2 About The MCBL

CHAPTER 1

MCIC Computational Biology Lab (MCBL)

11 Our goal

Our mission is to build core support and intellectual leadership in the area of bioinformatics to support research atthe OARDC by providing an engaging work environment space infrastructure and training for performing researchinvolving biological data analysis

We aspire for the MCBL to become the place to be for learning and performing bioinformatics research at the OARDCthe place where ideas are discussed and exchanged students and users learn from each other and get help and supportfrom our experienced staff when needed and we as a community move our bioinformatics knowledge forward

We specialize in working with High-throughput Sequencing (HTS) data and providing training in reproducible scienceusing modern tools

We are happy to help you to carry out your own analysis This will include helping with experimental design dis-cussing the most effective way to carry out your data analysis providing the computational infrastructure (softwarescripts and computers) interpreting results and answering questions you might come across along the way We canalso process and analyze HTS data for you using either standardized or custom pipelines

12 MCBL services

Data analysis Contacts

bull Processing analyzing and interpreting HTS databull WGS RADseqGBS RNAseq metagenomics microbiomics bull Quality controlbull Genotypingbull Genome assembly and annotationbull Differential expression analysesbull Population genetic and genomic analysesbull GWAS (e)QTL analyses

Jelmer Poelstra

3

MCBL Documentation Release 10

Workshops amp training Contacts

bull UNIX command line and bash scriptingbull Running analyses at the Ohio Supercomputer Center (OSC)bull Running analyses in the cloudbull R - general bioinformatics R markdown dashboardingbull Python - general bioinformaticsbull Reproducible compute environments with containers and condabull Reproducible analysis pipelines with Snakemakebull Version control with git and Githubbull Data management and deposition in public repositories

Jelmer Poelstra

High-throughput Sequencing services Contacts

bull Illumina Miseq bull Tea Meulia (di-rector)

bull Fiorella Cis-neros Carter

See also

Main MCIC page for more details on sequencing services

4 Chapter 1 MCIC Computational Biology Lab (MCBL)

CHAPTER 2

MCBL membership

21 MCBL membership benefits

bull Access to the MCBL and MCBL computers 247

bull Free access to all MCBL activities workshops user group meeting etc

bull Access to 1 TB data storage space for the duration of the membership

bull Assistance with experimental design bioinformatic analyses and interpretation of your data

22 How to apply for an MCBL membership

Step 1 Fill out and submit the MCBL application form MCBL Application

Step 2 Submit your membership fee to MCIC

Step 3 Contact Jelmer Poelstra for login credentials

Note Access to MCBL resources will be granted till we receive the payment Once the form is completed andsubmitted a notification e-mail will be sent to the membership applicant and the PI

23 MCBL membership duration

Membership is offered for a 6 month or 1 year period at a time

5

MCBL Documentation Release 10

24 MCBL membership termination

MCBL membership will be terminated after the membership period ends or upon a written request from user or PI toterminate the membership

25 Contacts

Person InformationTea Meulia (direc-tor)

Questions regarding membership

Jelmer Poelstra MCBL server access and remote accessJody Whittier MCBL payments

6 Chapter 2 MCBL membership

CHAPTER 3

MCBL servers and computing resources

31 Servers

Server Processors Cores MemoryLocalDisk

mcic-ender-svr four 240GHz ten-core Intelreg Xeon processors E7-4870 40 10TB

16TB

mcic-ender-svr2 four 200GHz ten-core Intelreg Xeon processors E7-4850 40 12TB

10TB

mcic-ent-srvr two 267GHz six-core Intelreg Xeon processors X7542 12 250GB 20TB

32 Workstations

Workstation Processors Cores MemoryLocalDisk

mcic-galaxy-srvr two 347GHz six-core Intelreg Xeon processors X5690 12 94GB

26TB

mcic-mac-srvr two 293GHz six-core Intelreg Xeon processors X5670 12 64GB

40TB

7

MCBL Documentation Release 10

33 Desktops

Desktop Processors Cores MemoryLocalDisk

mcic-sel019-d1 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d2 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d3 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d4 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d5 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d6 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d7 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

34 Software

The following bioinformatics software are availabe through the MCBL Please contact Jelmer Poelstra for details onavailability of the software

341 Commercial Software

Application Version DescriptionCLCBio Work-bench

851 A comprehensive and user-friendly analysis package for analyzing com-paring and visualizing next generation sequencing data

Blast2GO Pro Pro A complete framework for functional annotation and analysis

8 Chapter 3 MCBL servers and computing resources

MCBL Documentation Release 10

342 Open Source Software

Application Version Description ModuleName

Bowtie 1102-223 An ultrafast and memory-efficient tool for aligning sequencingreads to long reference sequences

Bowtie-ltversiongt

Cd-hit 461 A very widely used program for clustering and comparing pro-tein or nucleotide sequences

cd-hit-vltversiongt

Cutadapt 142181 Removes adapter sequences from high-throughput sequencingdata

Cutadaptltversiongt

Exonerate 220 A generic tool for pairwise sequence comparison ExonerateltversiongtExpress 151 eXpress is a streaming tool for quantifying the abundances of

a set of target sequences from sampled subsequencesexpress-ltversiongt

Fastqc 151 Aims to provide a simple way to do some quality controlchecks on raw sequence data coming from high throughputsequencing pipelines

Fastqc-ltversiongt

GenomeAnalysisTK32-2 GATK tools for error modeling data compression and variantcalling

GenomeAnalysisTK-ltversiongt

Maker 2318 MAKER is a portable and easily configurable genome annota-tion pipeline

Makerltversiongt

Mothur 133135 Tool for analyzing 16S rRNA gene sequences Mothur-ltversiongt

Mummer 323 A system for rapidly aligning entire genomes whether in com-plete or draft form

Mummerltversiongt

Pandaseq 28 A program to align Illumina reads optionally with PCRprimers embedded in the sequence and reconstruct an over-lapping sequence

Pandaseqltversiongt

Qiime 18 A program for comparison and analysis of microbial commu-nities primarily based on high-throughput amplicon sequenc-ing data

Qiime-ltversiongt

Rsem 1216 A software package for estimating gene and isoform expres-sion levels from RNA-Seq data

rsem-ltversiongt

Samtools 0119 Provides various utilities for manipulating alignments in theSAM format including sorting merging indexing and generat-ing alignments in a per-position format

Samtools-ltversiongt

SNAP 0119 A new sequence aligner that is 3-20x faster and just as accurateas existing tools like BWA-mem Bowtie2 and Novoalign

SNAPltversiongt

Trim-fastq 122 A Fastq quality trimmer Trim-fastq-ltversiongt

Trinity r20140717 a novel method for the efficient and robust de novo reconstruc-tion of transcriptomes from RNA-seq data

Trinity

34 Software 9

MCBL Documentation Release 10

10 Chapter 3 MCBL servers and computing resources

CHAPTER 4

Download from BaseSpace

Note

Purpose This short tutorial will show you how to download MiSeq sequencing data from Illumina BaseS-pace

Author Wirat Pipatpongpinyo

Date July 6 2020

41 Step-by-step guide to download your sequence data from BaseS-pace

1 You will receive two invitation emails frombasespace-noreplyilluminacom for the trans-fer of ownership to you one for the sequencing runand another for the project To be able to fullyaccess the data you must accept both invitations

11

MCBL Documentation Release 10

2 Once you click the link (Click here to acceptthis transfer of ownership) in one of the invi-tation emails you will be taken to the Illumina BaseSpacewebsite where you can log in to your account

3 After you have logged in BaseSpace will bring you tothe DASHBOARD tab Click Accept in both boxes to takethe ownership of both the run and project

4 Once transfer is acceptedclick on the name of the run

12 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

CHAPTER 1

MCIC Computational Biology Lab (MCBL)

11 Our goal

Our mission is to build core support and intellectual leadership in the area of bioinformatics to support research atthe OARDC by providing an engaging work environment space infrastructure and training for performing researchinvolving biological data analysis

We aspire for the MCBL to become the place to be for learning and performing bioinformatics research at the OARDCthe place where ideas are discussed and exchanged students and users learn from each other and get help and supportfrom our experienced staff when needed and we as a community move our bioinformatics knowledge forward

We specialize in working with High-throughput Sequencing (HTS) data and providing training in reproducible scienceusing modern tools

We are happy to help you to carry out your own analysis This will include helping with experimental design dis-cussing the most effective way to carry out your data analysis providing the computational infrastructure (softwarescripts and computers) interpreting results and answering questions you might come across along the way We canalso process and analyze HTS data for you using either standardized or custom pipelines

12 MCBL services

Data analysis Contacts

bull Processing analyzing and interpreting HTS databull WGS RADseqGBS RNAseq metagenomics microbiomics bull Quality controlbull Genotypingbull Genome assembly and annotationbull Differential expression analysesbull Population genetic and genomic analysesbull GWAS (e)QTL analyses

Jelmer Poelstra

3

MCBL Documentation Release 10

Workshops amp training Contacts

bull UNIX command line and bash scriptingbull Running analyses at the Ohio Supercomputer Center (OSC)bull Running analyses in the cloudbull R - general bioinformatics R markdown dashboardingbull Python - general bioinformaticsbull Reproducible compute environments with containers and condabull Reproducible analysis pipelines with Snakemakebull Version control with git and Githubbull Data management and deposition in public repositories

Jelmer Poelstra

High-throughput Sequencing services Contacts

bull Illumina Miseq bull Tea Meulia (di-rector)

bull Fiorella Cis-neros Carter

See also

Main MCIC page for more details on sequencing services

4 Chapter 1 MCIC Computational Biology Lab (MCBL)

CHAPTER 2

MCBL membership

21 MCBL membership benefits

bull Access to the MCBL and MCBL computers 247

bull Free access to all MCBL activities workshops user group meeting etc

bull Access to 1 TB data storage space for the duration of the membership

bull Assistance with experimental design bioinformatic analyses and interpretation of your data

22 How to apply for an MCBL membership

Step 1 Fill out and submit the MCBL application form MCBL Application

Step 2 Submit your membership fee to MCIC

Step 3 Contact Jelmer Poelstra for login credentials

Note Access to MCBL resources will be granted till we receive the payment Once the form is completed andsubmitted a notification e-mail will be sent to the membership applicant and the PI

23 MCBL membership duration

Membership is offered for a 6 month or 1 year period at a time

5

MCBL Documentation Release 10

24 MCBL membership termination

MCBL membership will be terminated after the membership period ends or upon a written request from user or PI toterminate the membership

25 Contacts

Person InformationTea Meulia (direc-tor)

Questions regarding membership

Jelmer Poelstra MCBL server access and remote accessJody Whittier MCBL payments

6 Chapter 2 MCBL membership

CHAPTER 3

MCBL servers and computing resources

31 Servers

Server Processors Cores MemoryLocalDisk

mcic-ender-svr four 240GHz ten-core Intelreg Xeon processors E7-4870 40 10TB

16TB

mcic-ender-svr2 four 200GHz ten-core Intelreg Xeon processors E7-4850 40 12TB

10TB

mcic-ent-srvr two 267GHz six-core Intelreg Xeon processors X7542 12 250GB 20TB

32 Workstations

Workstation Processors Cores MemoryLocalDisk

mcic-galaxy-srvr two 347GHz six-core Intelreg Xeon processors X5690 12 94GB

26TB

mcic-mac-srvr two 293GHz six-core Intelreg Xeon processors X5670 12 64GB

40TB

7

MCBL Documentation Release 10

33 Desktops

Desktop Processors Cores MemoryLocalDisk

mcic-sel019-d1 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d2 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d3 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d4 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d5 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d6 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d7 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

34 Software

The following bioinformatics software are availabe through the MCBL Please contact Jelmer Poelstra for details onavailability of the software

341 Commercial Software

Application Version DescriptionCLCBio Work-bench

851 A comprehensive and user-friendly analysis package for analyzing com-paring and visualizing next generation sequencing data

Blast2GO Pro Pro A complete framework for functional annotation and analysis

8 Chapter 3 MCBL servers and computing resources

MCBL Documentation Release 10

342 Open Source Software

Application Version Description ModuleName

Bowtie 1102-223 An ultrafast and memory-efficient tool for aligning sequencingreads to long reference sequences

Bowtie-ltversiongt

Cd-hit 461 A very widely used program for clustering and comparing pro-tein or nucleotide sequences

cd-hit-vltversiongt

Cutadapt 142181 Removes adapter sequences from high-throughput sequencingdata

Cutadaptltversiongt

Exonerate 220 A generic tool for pairwise sequence comparison ExonerateltversiongtExpress 151 eXpress is a streaming tool for quantifying the abundances of

a set of target sequences from sampled subsequencesexpress-ltversiongt

Fastqc 151 Aims to provide a simple way to do some quality controlchecks on raw sequence data coming from high throughputsequencing pipelines

Fastqc-ltversiongt

GenomeAnalysisTK32-2 GATK tools for error modeling data compression and variantcalling

GenomeAnalysisTK-ltversiongt

Maker 2318 MAKER is a portable and easily configurable genome annota-tion pipeline

Makerltversiongt

Mothur 133135 Tool for analyzing 16S rRNA gene sequences Mothur-ltversiongt

Mummer 323 A system for rapidly aligning entire genomes whether in com-plete or draft form

Mummerltversiongt

Pandaseq 28 A program to align Illumina reads optionally with PCRprimers embedded in the sequence and reconstruct an over-lapping sequence

Pandaseqltversiongt

Qiime 18 A program for comparison and analysis of microbial commu-nities primarily based on high-throughput amplicon sequenc-ing data

Qiime-ltversiongt

Rsem 1216 A software package for estimating gene and isoform expres-sion levels from RNA-Seq data

rsem-ltversiongt

Samtools 0119 Provides various utilities for manipulating alignments in theSAM format including sorting merging indexing and generat-ing alignments in a per-position format

Samtools-ltversiongt

SNAP 0119 A new sequence aligner that is 3-20x faster and just as accurateas existing tools like BWA-mem Bowtie2 and Novoalign

SNAPltversiongt

Trim-fastq 122 A Fastq quality trimmer Trim-fastq-ltversiongt

Trinity r20140717 a novel method for the efficient and robust de novo reconstruc-tion of transcriptomes from RNA-seq data

Trinity

34 Software 9

MCBL Documentation Release 10

10 Chapter 3 MCBL servers and computing resources

CHAPTER 4

Download from BaseSpace

Note

Purpose This short tutorial will show you how to download MiSeq sequencing data from Illumina BaseS-pace

Author Wirat Pipatpongpinyo

Date July 6 2020

41 Step-by-step guide to download your sequence data from BaseS-pace

1 You will receive two invitation emails frombasespace-noreplyilluminacom for the trans-fer of ownership to you one for the sequencing runand another for the project To be able to fullyaccess the data you must accept both invitations

11

MCBL Documentation Release 10

2 Once you click the link (Click here to acceptthis transfer of ownership) in one of the invi-tation emails you will be taken to the Illumina BaseSpacewebsite where you can log in to your account

3 After you have logged in BaseSpace will bring you tothe DASHBOARD tab Click Accept in both boxes to takethe ownership of both the run and project

4 Once transfer is acceptedclick on the name of the run

12 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

Workshops amp training Contacts

bull UNIX command line and bash scriptingbull Running analyses at the Ohio Supercomputer Center (OSC)bull Running analyses in the cloudbull R - general bioinformatics R markdown dashboardingbull Python - general bioinformaticsbull Reproducible compute environments with containers and condabull Reproducible analysis pipelines with Snakemakebull Version control with git and Githubbull Data management and deposition in public repositories

Jelmer Poelstra

High-throughput Sequencing services Contacts

bull Illumina Miseq bull Tea Meulia (di-rector)

bull Fiorella Cis-neros Carter

See also

Main MCIC page for more details on sequencing services

4 Chapter 1 MCIC Computational Biology Lab (MCBL)

CHAPTER 2

MCBL membership

21 MCBL membership benefits

bull Access to the MCBL and MCBL computers 247

bull Free access to all MCBL activities workshops user group meeting etc

bull Access to 1 TB data storage space for the duration of the membership

bull Assistance with experimental design bioinformatic analyses and interpretation of your data

22 How to apply for an MCBL membership

Step 1 Fill out and submit the MCBL application form MCBL Application

Step 2 Submit your membership fee to MCIC

Step 3 Contact Jelmer Poelstra for login credentials

Note Access to MCBL resources will be granted till we receive the payment Once the form is completed andsubmitted a notification e-mail will be sent to the membership applicant and the PI

23 MCBL membership duration

Membership is offered for a 6 month or 1 year period at a time

5

MCBL Documentation Release 10

24 MCBL membership termination

MCBL membership will be terminated after the membership period ends or upon a written request from user or PI toterminate the membership

25 Contacts

Person InformationTea Meulia (direc-tor)

Questions regarding membership

Jelmer Poelstra MCBL server access and remote accessJody Whittier MCBL payments

6 Chapter 2 MCBL membership

CHAPTER 3

MCBL servers and computing resources

31 Servers

Server Processors Cores MemoryLocalDisk

mcic-ender-svr four 240GHz ten-core Intelreg Xeon processors E7-4870 40 10TB

16TB

mcic-ender-svr2 four 200GHz ten-core Intelreg Xeon processors E7-4850 40 12TB

10TB

mcic-ent-srvr two 267GHz six-core Intelreg Xeon processors X7542 12 250GB 20TB

32 Workstations

Workstation Processors Cores MemoryLocalDisk

mcic-galaxy-srvr two 347GHz six-core Intelreg Xeon processors X5690 12 94GB

26TB

mcic-mac-srvr two 293GHz six-core Intelreg Xeon processors X5670 12 64GB

40TB

7

MCBL Documentation Release 10

33 Desktops

Desktop Processors Cores MemoryLocalDisk

mcic-sel019-d1 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d2 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d3 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d4 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d5 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d6 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d7 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

34 Software

The following bioinformatics software are availabe through the MCBL Please contact Jelmer Poelstra for details onavailability of the software

341 Commercial Software

Application Version DescriptionCLCBio Work-bench

851 A comprehensive and user-friendly analysis package for analyzing com-paring and visualizing next generation sequencing data

Blast2GO Pro Pro A complete framework for functional annotation and analysis

8 Chapter 3 MCBL servers and computing resources

MCBL Documentation Release 10

342 Open Source Software

Application Version Description ModuleName

Bowtie 1102-223 An ultrafast and memory-efficient tool for aligning sequencingreads to long reference sequences

Bowtie-ltversiongt

Cd-hit 461 A very widely used program for clustering and comparing pro-tein or nucleotide sequences

cd-hit-vltversiongt

Cutadapt 142181 Removes adapter sequences from high-throughput sequencingdata

Cutadaptltversiongt

Exonerate 220 A generic tool for pairwise sequence comparison ExonerateltversiongtExpress 151 eXpress is a streaming tool for quantifying the abundances of

a set of target sequences from sampled subsequencesexpress-ltversiongt

Fastqc 151 Aims to provide a simple way to do some quality controlchecks on raw sequence data coming from high throughputsequencing pipelines

Fastqc-ltversiongt

GenomeAnalysisTK32-2 GATK tools for error modeling data compression and variantcalling

GenomeAnalysisTK-ltversiongt

Maker 2318 MAKER is a portable and easily configurable genome annota-tion pipeline

Makerltversiongt

Mothur 133135 Tool for analyzing 16S rRNA gene sequences Mothur-ltversiongt

Mummer 323 A system for rapidly aligning entire genomes whether in com-plete or draft form

Mummerltversiongt

Pandaseq 28 A program to align Illumina reads optionally with PCRprimers embedded in the sequence and reconstruct an over-lapping sequence

Pandaseqltversiongt

Qiime 18 A program for comparison and analysis of microbial commu-nities primarily based on high-throughput amplicon sequenc-ing data

Qiime-ltversiongt

Rsem 1216 A software package for estimating gene and isoform expres-sion levels from RNA-Seq data

rsem-ltversiongt

Samtools 0119 Provides various utilities for manipulating alignments in theSAM format including sorting merging indexing and generat-ing alignments in a per-position format

Samtools-ltversiongt

SNAP 0119 A new sequence aligner that is 3-20x faster and just as accurateas existing tools like BWA-mem Bowtie2 and Novoalign

SNAPltversiongt

Trim-fastq 122 A Fastq quality trimmer Trim-fastq-ltversiongt

Trinity r20140717 a novel method for the efficient and robust de novo reconstruc-tion of transcriptomes from RNA-seq data

Trinity

34 Software 9

MCBL Documentation Release 10

10 Chapter 3 MCBL servers and computing resources

CHAPTER 4

Download from BaseSpace

Note

Purpose This short tutorial will show you how to download MiSeq sequencing data from Illumina BaseS-pace

Author Wirat Pipatpongpinyo

Date July 6 2020

41 Step-by-step guide to download your sequence data from BaseS-pace

1 You will receive two invitation emails frombasespace-noreplyilluminacom for the trans-fer of ownership to you one for the sequencing runand another for the project To be able to fullyaccess the data you must accept both invitations

11

MCBL Documentation Release 10

2 Once you click the link (Click here to acceptthis transfer of ownership) in one of the invi-tation emails you will be taken to the Illumina BaseSpacewebsite where you can log in to your account

3 After you have logged in BaseSpace will bring you tothe DASHBOARD tab Click Accept in both boxes to takethe ownership of both the run and project

4 Once transfer is acceptedclick on the name of the run

12 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

CHAPTER 2

MCBL membership

21 MCBL membership benefits

bull Access to the MCBL and MCBL computers 247

bull Free access to all MCBL activities workshops user group meeting etc

bull Access to 1 TB data storage space for the duration of the membership

bull Assistance with experimental design bioinformatic analyses and interpretation of your data

22 How to apply for an MCBL membership

Step 1 Fill out and submit the MCBL application form MCBL Application

Step 2 Submit your membership fee to MCIC

Step 3 Contact Jelmer Poelstra for login credentials

Note Access to MCBL resources will be granted till we receive the payment Once the form is completed andsubmitted a notification e-mail will be sent to the membership applicant and the PI

23 MCBL membership duration

Membership is offered for a 6 month or 1 year period at a time

5

MCBL Documentation Release 10

24 MCBL membership termination

MCBL membership will be terminated after the membership period ends or upon a written request from user or PI toterminate the membership

25 Contacts

Person InformationTea Meulia (direc-tor)

Questions regarding membership

Jelmer Poelstra MCBL server access and remote accessJody Whittier MCBL payments

6 Chapter 2 MCBL membership

CHAPTER 3

MCBL servers and computing resources

31 Servers

Server Processors Cores MemoryLocalDisk

mcic-ender-svr four 240GHz ten-core Intelreg Xeon processors E7-4870 40 10TB

16TB

mcic-ender-svr2 four 200GHz ten-core Intelreg Xeon processors E7-4850 40 12TB

10TB

mcic-ent-srvr two 267GHz six-core Intelreg Xeon processors X7542 12 250GB 20TB

32 Workstations

Workstation Processors Cores MemoryLocalDisk

mcic-galaxy-srvr two 347GHz six-core Intelreg Xeon processors X5690 12 94GB

26TB

mcic-mac-srvr two 293GHz six-core Intelreg Xeon processors X5670 12 64GB

40TB

7

MCBL Documentation Release 10

33 Desktops

Desktop Processors Cores MemoryLocalDisk

mcic-sel019-d1 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d2 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d3 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d4 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d5 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d6 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d7 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

34 Software

The following bioinformatics software are availabe through the MCBL Please contact Jelmer Poelstra for details onavailability of the software

341 Commercial Software

Application Version DescriptionCLCBio Work-bench

851 A comprehensive and user-friendly analysis package for analyzing com-paring and visualizing next generation sequencing data

Blast2GO Pro Pro A complete framework for functional annotation and analysis

8 Chapter 3 MCBL servers and computing resources

MCBL Documentation Release 10

342 Open Source Software

Application Version Description ModuleName

Bowtie 1102-223 An ultrafast and memory-efficient tool for aligning sequencingreads to long reference sequences

Bowtie-ltversiongt

Cd-hit 461 A very widely used program for clustering and comparing pro-tein or nucleotide sequences

cd-hit-vltversiongt

Cutadapt 142181 Removes adapter sequences from high-throughput sequencingdata

Cutadaptltversiongt

Exonerate 220 A generic tool for pairwise sequence comparison ExonerateltversiongtExpress 151 eXpress is a streaming tool for quantifying the abundances of

a set of target sequences from sampled subsequencesexpress-ltversiongt

Fastqc 151 Aims to provide a simple way to do some quality controlchecks on raw sequence data coming from high throughputsequencing pipelines

Fastqc-ltversiongt

GenomeAnalysisTK32-2 GATK tools for error modeling data compression and variantcalling

GenomeAnalysisTK-ltversiongt

Maker 2318 MAKER is a portable and easily configurable genome annota-tion pipeline

Makerltversiongt

Mothur 133135 Tool for analyzing 16S rRNA gene sequences Mothur-ltversiongt

Mummer 323 A system for rapidly aligning entire genomes whether in com-plete or draft form

Mummerltversiongt

Pandaseq 28 A program to align Illumina reads optionally with PCRprimers embedded in the sequence and reconstruct an over-lapping sequence

Pandaseqltversiongt

Qiime 18 A program for comparison and analysis of microbial commu-nities primarily based on high-throughput amplicon sequenc-ing data

Qiime-ltversiongt

Rsem 1216 A software package for estimating gene and isoform expres-sion levels from RNA-Seq data

rsem-ltversiongt

Samtools 0119 Provides various utilities for manipulating alignments in theSAM format including sorting merging indexing and generat-ing alignments in a per-position format

Samtools-ltversiongt

SNAP 0119 A new sequence aligner that is 3-20x faster and just as accurateas existing tools like BWA-mem Bowtie2 and Novoalign

SNAPltversiongt

Trim-fastq 122 A Fastq quality trimmer Trim-fastq-ltversiongt

Trinity r20140717 a novel method for the efficient and robust de novo reconstruc-tion of transcriptomes from RNA-seq data

Trinity

34 Software 9

MCBL Documentation Release 10

10 Chapter 3 MCBL servers and computing resources

CHAPTER 4

Download from BaseSpace

Note

Purpose This short tutorial will show you how to download MiSeq sequencing data from Illumina BaseS-pace

Author Wirat Pipatpongpinyo

Date July 6 2020

41 Step-by-step guide to download your sequence data from BaseS-pace

1 You will receive two invitation emails frombasespace-noreplyilluminacom for the trans-fer of ownership to you one for the sequencing runand another for the project To be able to fullyaccess the data you must accept both invitations

11

MCBL Documentation Release 10

2 Once you click the link (Click here to acceptthis transfer of ownership) in one of the invi-tation emails you will be taken to the Illumina BaseSpacewebsite where you can log in to your account

3 After you have logged in BaseSpace will bring you tothe DASHBOARD tab Click Accept in both boxes to takethe ownership of both the run and project

4 Once transfer is acceptedclick on the name of the run

12 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

24 MCBL membership termination

MCBL membership will be terminated after the membership period ends or upon a written request from user or PI toterminate the membership

25 Contacts

Person InformationTea Meulia (direc-tor)

Questions regarding membership

Jelmer Poelstra MCBL server access and remote accessJody Whittier MCBL payments

6 Chapter 2 MCBL membership

CHAPTER 3

MCBL servers and computing resources

31 Servers

Server Processors Cores MemoryLocalDisk

mcic-ender-svr four 240GHz ten-core Intelreg Xeon processors E7-4870 40 10TB

16TB

mcic-ender-svr2 four 200GHz ten-core Intelreg Xeon processors E7-4850 40 12TB

10TB

mcic-ent-srvr two 267GHz six-core Intelreg Xeon processors X7542 12 250GB 20TB

32 Workstations

Workstation Processors Cores MemoryLocalDisk

mcic-galaxy-srvr two 347GHz six-core Intelreg Xeon processors X5690 12 94GB

26TB

mcic-mac-srvr two 293GHz six-core Intelreg Xeon processors X5670 12 64GB

40TB

7

MCBL Documentation Release 10

33 Desktops

Desktop Processors Cores MemoryLocalDisk

mcic-sel019-d1 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d2 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d3 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d4 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d5 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d6 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d7 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

34 Software

The following bioinformatics software are availabe through the MCBL Please contact Jelmer Poelstra for details onavailability of the software

341 Commercial Software

Application Version DescriptionCLCBio Work-bench

851 A comprehensive and user-friendly analysis package for analyzing com-paring and visualizing next generation sequencing data

Blast2GO Pro Pro A complete framework for functional annotation and analysis

8 Chapter 3 MCBL servers and computing resources

MCBL Documentation Release 10

342 Open Source Software

Application Version Description ModuleName

Bowtie 1102-223 An ultrafast and memory-efficient tool for aligning sequencingreads to long reference sequences

Bowtie-ltversiongt

Cd-hit 461 A very widely used program for clustering and comparing pro-tein or nucleotide sequences

cd-hit-vltversiongt

Cutadapt 142181 Removes adapter sequences from high-throughput sequencingdata

Cutadaptltversiongt

Exonerate 220 A generic tool for pairwise sequence comparison ExonerateltversiongtExpress 151 eXpress is a streaming tool for quantifying the abundances of

a set of target sequences from sampled subsequencesexpress-ltversiongt

Fastqc 151 Aims to provide a simple way to do some quality controlchecks on raw sequence data coming from high throughputsequencing pipelines

Fastqc-ltversiongt

GenomeAnalysisTK32-2 GATK tools for error modeling data compression and variantcalling

GenomeAnalysisTK-ltversiongt

Maker 2318 MAKER is a portable and easily configurable genome annota-tion pipeline

Makerltversiongt

Mothur 133135 Tool for analyzing 16S rRNA gene sequences Mothur-ltversiongt

Mummer 323 A system for rapidly aligning entire genomes whether in com-plete or draft form

Mummerltversiongt

Pandaseq 28 A program to align Illumina reads optionally with PCRprimers embedded in the sequence and reconstruct an over-lapping sequence

Pandaseqltversiongt

Qiime 18 A program for comparison and analysis of microbial commu-nities primarily based on high-throughput amplicon sequenc-ing data

Qiime-ltversiongt

Rsem 1216 A software package for estimating gene and isoform expres-sion levels from RNA-Seq data

rsem-ltversiongt

Samtools 0119 Provides various utilities for manipulating alignments in theSAM format including sorting merging indexing and generat-ing alignments in a per-position format

Samtools-ltversiongt

SNAP 0119 A new sequence aligner that is 3-20x faster and just as accurateas existing tools like BWA-mem Bowtie2 and Novoalign

SNAPltversiongt

Trim-fastq 122 A Fastq quality trimmer Trim-fastq-ltversiongt

Trinity r20140717 a novel method for the efficient and robust de novo reconstruc-tion of transcriptomes from RNA-seq data

Trinity

34 Software 9

MCBL Documentation Release 10

10 Chapter 3 MCBL servers and computing resources

CHAPTER 4

Download from BaseSpace

Note

Purpose This short tutorial will show you how to download MiSeq sequencing data from Illumina BaseS-pace

Author Wirat Pipatpongpinyo

Date July 6 2020

41 Step-by-step guide to download your sequence data from BaseS-pace

1 You will receive two invitation emails frombasespace-noreplyilluminacom for the trans-fer of ownership to you one for the sequencing runand another for the project To be able to fullyaccess the data you must accept both invitations

11

MCBL Documentation Release 10

2 Once you click the link (Click here to acceptthis transfer of ownership) in one of the invi-tation emails you will be taken to the Illumina BaseSpacewebsite where you can log in to your account

3 After you have logged in BaseSpace will bring you tothe DASHBOARD tab Click Accept in both boxes to takethe ownership of both the run and project

4 Once transfer is acceptedclick on the name of the run

12 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

CHAPTER 3

MCBL servers and computing resources

31 Servers

Server Processors Cores MemoryLocalDisk

mcic-ender-svr four 240GHz ten-core Intelreg Xeon processors E7-4870 40 10TB

16TB

mcic-ender-svr2 four 200GHz ten-core Intelreg Xeon processors E7-4850 40 12TB

10TB

mcic-ent-srvr two 267GHz six-core Intelreg Xeon processors X7542 12 250GB 20TB

32 Workstations

Workstation Processors Cores MemoryLocalDisk

mcic-galaxy-srvr two 347GHz six-core Intelreg Xeon processors X5690 12 94GB

26TB

mcic-mac-srvr two 293GHz six-core Intelreg Xeon processors X5670 12 64GB

40TB

7

MCBL Documentation Release 10

33 Desktops

Desktop Processors Cores MemoryLocalDisk

mcic-sel019-d1 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d2 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d3 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d4 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d5 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d6 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d7 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

34 Software

The following bioinformatics software are availabe through the MCBL Please contact Jelmer Poelstra for details onavailability of the software

341 Commercial Software

Application Version DescriptionCLCBio Work-bench

851 A comprehensive and user-friendly analysis package for analyzing com-paring and visualizing next generation sequencing data

Blast2GO Pro Pro A complete framework for functional annotation and analysis

8 Chapter 3 MCBL servers and computing resources

MCBL Documentation Release 10

342 Open Source Software

Application Version Description ModuleName

Bowtie 1102-223 An ultrafast and memory-efficient tool for aligning sequencingreads to long reference sequences

Bowtie-ltversiongt

Cd-hit 461 A very widely used program for clustering and comparing pro-tein or nucleotide sequences

cd-hit-vltversiongt

Cutadapt 142181 Removes adapter sequences from high-throughput sequencingdata

Cutadaptltversiongt

Exonerate 220 A generic tool for pairwise sequence comparison ExonerateltversiongtExpress 151 eXpress is a streaming tool for quantifying the abundances of

a set of target sequences from sampled subsequencesexpress-ltversiongt

Fastqc 151 Aims to provide a simple way to do some quality controlchecks on raw sequence data coming from high throughputsequencing pipelines

Fastqc-ltversiongt

GenomeAnalysisTK32-2 GATK tools for error modeling data compression and variantcalling

GenomeAnalysisTK-ltversiongt

Maker 2318 MAKER is a portable and easily configurable genome annota-tion pipeline

Makerltversiongt

Mothur 133135 Tool for analyzing 16S rRNA gene sequences Mothur-ltversiongt

Mummer 323 A system for rapidly aligning entire genomes whether in com-plete or draft form

Mummerltversiongt

Pandaseq 28 A program to align Illumina reads optionally with PCRprimers embedded in the sequence and reconstruct an over-lapping sequence

Pandaseqltversiongt

Qiime 18 A program for comparison and analysis of microbial commu-nities primarily based on high-throughput amplicon sequenc-ing data

Qiime-ltversiongt

Rsem 1216 A software package for estimating gene and isoform expres-sion levels from RNA-Seq data

rsem-ltversiongt

Samtools 0119 Provides various utilities for manipulating alignments in theSAM format including sorting merging indexing and generat-ing alignments in a per-position format

Samtools-ltversiongt

SNAP 0119 A new sequence aligner that is 3-20x faster and just as accurateas existing tools like BWA-mem Bowtie2 and Novoalign

SNAPltversiongt

Trim-fastq 122 A Fastq quality trimmer Trim-fastq-ltversiongt

Trinity r20140717 a novel method for the efficient and robust de novo reconstruc-tion of transcriptomes from RNA-seq data

Trinity

34 Software 9

MCBL Documentation Release 10

10 Chapter 3 MCBL servers and computing resources

CHAPTER 4

Download from BaseSpace

Note

Purpose This short tutorial will show you how to download MiSeq sequencing data from Illumina BaseS-pace

Author Wirat Pipatpongpinyo

Date July 6 2020

41 Step-by-step guide to download your sequence data from BaseS-pace

1 You will receive two invitation emails frombasespace-noreplyilluminacom for the trans-fer of ownership to you one for the sequencing runand another for the project To be able to fullyaccess the data you must accept both invitations

11

MCBL Documentation Release 10

2 Once you click the link (Click here to acceptthis transfer of ownership) in one of the invi-tation emails you will be taken to the Illumina BaseSpacewebsite where you can log in to your account

3 After you have logged in BaseSpace will bring you tothe DASHBOARD tab Click Accept in both boxes to takethe ownership of both the run and project

4 Once transfer is acceptedclick on the name of the run

12 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

33 Desktops

Desktop Processors Cores MemoryLocalDisk

mcic-sel019-d1 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d2 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d3 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d4 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d5 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d6 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

mcic-sel019-d7 one 300GHz four-core Intelreg Xeon processors i7-4578U 7 32GB

10TB

34 Software

The following bioinformatics software are availabe through the MCBL Please contact Jelmer Poelstra for details onavailability of the software

341 Commercial Software

Application Version DescriptionCLCBio Work-bench

851 A comprehensive and user-friendly analysis package for analyzing com-paring and visualizing next generation sequencing data

Blast2GO Pro Pro A complete framework for functional annotation and analysis

8 Chapter 3 MCBL servers and computing resources

MCBL Documentation Release 10

342 Open Source Software

Application Version Description ModuleName

Bowtie 1102-223 An ultrafast and memory-efficient tool for aligning sequencingreads to long reference sequences

Bowtie-ltversiongt

Cd-hit 461 A very widely used program for clustering and comparing pro-tein or nucleotide sequences

cd-hit-vltversiongt

Cutadapt 142181 Removes adapter sequences from high-throughput sequencingdata

Cutadaptltversiongt

Exonerate 220 A generic tool for pairwise sequence comparison ExonerateltversiongtExpress 151 eXpress is a streaming tool for quantifying the abundances of

a set of target sequences from sampled subsequencesexpress-ltversiongt

Fastqc 151 Aims to provide a simple way to do some quality controlchecks on raw sequence data coming from high throughputsequencing pipelines

Fastqc-ltversiongt

GenomeAnalysisTK32-2 GATK tools for error modeling data compression and variantcalling

GenomeAnalysisTK-ltversiongt

Maker 2318 MAKER is a portable and easily configurable genome annota-tion pipeline

Makerltversiongt

Mothur 133135 Tool for analyzing 16S rRNA gene sequences Mothur-ltversiongt

Mummer 323 A system for rapidly aligning entire genomes whether in com-plete or draft form

Mummerltversiongt

Pandaseq 28 A program to align Illumina reads optionally with PCRprimers embedded in the sequence and reconstruct an over-lapping sequence

Pandaseqltversiongt

Qiime 18 A program for comparison and analysis of microbial commu-nities primarily based on high-throughput amplicon sequenc-ing data

Qiime-ltversiongt

Rsem 1216 A software package for estimating gene and isoform expres-sion levels from RNA-Seq data

rsem-ltversiongt

Samtools 0119 Provides various utilities for manipulating alignments in theSAM format including sorting merging indexing and generat-ing alignments in a per-position format

Samtools-ltversiongt

SNAP 0119 A new sequence aligner that is 3-20x faster and just as accurateas existing tools like BWA-mem Bowtie2 and Novoalign

SNAPltversiongt

Trim-fastq 122 A Fastq quality trimmer Trim-fastq-ltversiongt

Trinity r20140717 a novel method for the efficient and robust de novo reconstruc-tion of transcriptomes from RNA-seq data

Trinity

34 Software 9

MCBL Documentation Release 10

10 Chapter 3 MCBL servers and computing resources

CHAPTER 4

Download from BaseSpace

Note

Purpose This short tutorial will show you how to download MiSeq sequencing data from Illumina BaseS-pace

Author Wirat Pipatpongpinyo

Date July 6 2020

41 Step-by-step guide to download your sequence data from BaseS-pace

1 You will receive two invitation emails frombasespace-noreplyilluminacom for the trans-fer of ownership to you one for the sequencing runand another for the project To be able to fullyaccess the data you must accept both invitations

11

MCBL Documentation Release 10

2 Once you click the link (Click here to acceptthis transfer of ownership) in one of the invi-tation emails you will be taken to the Illumina BaseSpacewebsite where you can log in to your account

3 After you have logged in BaseSpace will bring you tothe DASHBOARD tab Click Accept in both boxes to takethe ownership of both the run and project

4 Once transfer is acceptedclick on the name of the run

12 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

342 Open Source Software

Application Version Description ModuleName

Bowtie 1102-223 An ultrafast and memory-efficient tool for aligning sequencingreads to long reference sequences

Bowtie-ltversiongt

Cd-hit 461 A very widely used program for clustering and comparing pro-tein or nucleotide sequences

cd-hit-vltversiongt

Cutadapt 142181 Removes adapter sequences from high-throughput sequencingdata

Cutadaptltversiongt

Exonerate 220 A generic tool for pairwise sequence comparison ExonerateltversiongtExpress 151 eXpress is a streaming tool for quantifying the abundances of

a set of target sequences from sampled subsequencesexpress-ltversiongt

Fastqc 151 Aims to provide a simple way to do some quality controlchecks on raw sequence data coming from high throughputsequencing pipelines

Fastqc-ltversiongt

GenomeAnalysisTK32-2 GATK tools for error modeling data compression and variantcalling

GenomeAnalysisTK-ltversiongt

Maker 2318 MAKER is a portable and easily configurable genome annota-tion pipeline

Makerltversiongt

Mothur 133135 Tool for analyzing 16S rRNA gene sequences Mothur-ltversiongt

Mummer 323 A system for rapidly aligning entire genomes whether in com-plete or draft form

Mummerltversiongt

Pandaseq 28 A program to align Illumina reads optionally with PCRprimers embedded in the sequence and reconstruct an over-lapping sequence

Pandaseqltversiongt

Qiime 18 A program for comparison and analysis of microbial commu-nities primarily based on high-throughput amplicon sequenc-ing data

Qiime-ltversiongt

Rsem 1216 A software package for estimating gene and isoform expres-sion levels from RNA-Seq data

rsem-ltversiongt

Samtools 0119 Provides various utilities for manipulating alignments in theSAM format including sorting merging indexing and generat-ing alignments in a per-position format

Samtools-ltversiongt

SNAP 0119 A new sequence aligner that is 3-20x faster and just as accurateas existing tools like BWA-mem Bowtie2 and Novoalign

SNAPltversiongt

Trim-fastq 122 A Fastq quality trimmer Trim-fastq-ltversiongt

Trinity r20140717 a novel method for the efficient and robust de novo reconstruc-tion of transcriptomes from RNA-seq data

Trinity

34 Software 9

MCBL Documentation Release 10

10 Chapter 3 MCBL servers and computing resources

CHAPTER 4

Download from BaseSpace

Note

Purpose This short tutorial will show you how to download MiSeq sequencing data from Illumina BaseS-pace

Author Wirat Pipatpongpinyo

Date July 6 2020

41 Step-by-step guide to download your sequence data from BaseS-pace

1 You will receive two invitation emails frombasespace-noreplyilluminacom for the trans-fer of ownership to you one for the sequencing runand another for the project To be able to fullyaccess the data you must accept both invitations

11

MCBL Documentation Release 10

2 Once you click the link (Click here to acceptthis transfer of ownership) in one of the invi-tation emails you will be taken to the Illumina BaseSpacewebsite where you can log in to your account

3 After you have logged in BaseSpace will bring you tothe DASHBOARD tab Click Accept in both boxes to takethe ownership of both the run and project

4 Once transfer is acceptedclick on the name of the run

12 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

10 Chapter 3 MCBL servers and computing resources

CHAPTER 4

Download from BaseSpace

Note

Purpose This short tutorial will show you how to download MiSeq sequencing data from Illumina BaseS-pace

Author Wirat Pipatpongpinyo

Date July 6 2020

41 Step-by-step guide to download your sequence data from BaseS-pace

1 You will receive two invitation emails frombasespace-noreplyilluminacom for the trans-fer of ownership to you one for the sequencing runand another for the project To be able to fullyaccess the data you must accept both invitations

11

MCBL Documentation Release 10

2 Once you click the link (Click here to acceptthis transfer of ownership) in one of the invi-tation emails you will be taken to the Illumina BaseSpacewebsite where you can log in to your account

3 After you have logged in BaseSpace will bring you tothe DASHBOARD tab Click Accept in both boxes to takethe ownership of both the run and project

4 Once transfer is acceptedclick on the name of the run

12 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

CHAPTER 4

Download from BaseSpace

Note

Purpose This short tutorial will show you how to download MiSeq sequencing data from Illumina BaseS-pace

Author Wirat Pipatpongpinyo

Date July 6 2020

41 Step-by-step guide to download your sequence data from BaseS-pace

1 You will receive two invitation emails frombasespace-noreplyilluminacom for the trans-fer of ownership to you one for the sequencing runand another for the project To be able to fullyaccess the data you must accept both invitations

11

MCBL Documentation Release 10

2 Once you click the link (Click here to acceptthis transfer of ownership) in one of the invi-tation emails you will be taken to the Illumina BaseSpacewebsite where you can log in to your account

3 After you have logged in BaseSpace will bring you tothe DASHBOARD tab Click Accept in both boxes to takethe ownership of both the run and project

4 Once transfer is acceptedclick on the name of the run

12 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

2 Once you click the link (Click here to acceptthis transfer of ownership) in one of the invi-tation emails you will be taken to the Illumina BaseSpacewebsite where you can log in to your account

3 After you have logged in BaseSpace will bring you tothe DASHBOARD tab Click Accept in both boxes to takethe ownership of both the run and project

4 Once transfer is acceptedclick on the name of the run

12 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

5 Under the SUMMARY tab click Download

6 The Download Run screen will pop upIf thisis the first time you download from BaseSpace you willneed to install the downloader software click Installthe BaseSpace Sequence Hub Downloader

41 Step-by-step guide to download your sequence data from BaseSpace 13

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

7 After the BaseSpace downloader has beeninstalled select FASTQ as the file typeand click Download

14 Chapter 4 Download from BaseSpace

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

8 The Confirm Download screen will pop up whereyou can select a directory to download the filesinto In this case the files will be stored atCBaseSpace Click START DOWNLOAD

41 Step-by-step guide to download your sequence data from BaseSpace 15

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

16 Chapter 4 Download from BaseSpace

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

CHAPTER 5

Share runs and projects in BaseSpace

Note

Purpose These two short tutorials will show you how to share sequencing runs and projects respectivelyon Illumina BaseSpace

Author Wirat Pipatpongpinyo

Date July 6 2020

bull Sharing runs

bull Sharing projects

51 Sharing runs

17

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

1 Go to the Illumina BaseSpace website and sign in

2 Click on the RUNS tab

3 Select the name of the runyou want to share

18 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

4 Go to the SUMMARYtab

5 Click the SHAREbutton

6 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator

51 Sharing runs 19

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

8 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects

1 Got to the Illumina BaseSpace website and sign in

2 Click on the PROJECTS tab

3 Select the Project that youwould like to share

20 Chapter 5 Share runs and projects in BaseSpace

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

4 Click the Share Project button

5 Enter the email address (linked to aBaseSpace account) of yourcollaborator

6 Click Add Collaborator(Repeat Steps 6 and 7 if you want toshare with multiple collaborators)

7 Click Save Settings Yourcollaborator will receive an email frombasespace-noreplyilluminacom

52 Sharing projects 21

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

22 Chapter 5 Share runs and projects in BaseSpace

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

CHAPTER 6

Using BaseMount

bull Introduction to BaseMount

bull Getting Started with BaseMount

ndash Example

bull Basic analysis of fastq files

bull Basic analysis on Alignment files (BAM)

bull Metadata

bull Limitations of BaseMount

61 Introduction to BaseMount

What is BaseMount

BaseMount is Illumina software that enables access to your BaseSpace storage as a Linux file system on the commandline

Advantages of BaseMount

bull Have access to your Projects Runs and App results as your local files

bull Can run local apps on basespace data without downloading data to your local computer

bull Can save your local storage space

ldquoAlthough BaseMount does facilitate file download we would recommend that since BaseMountallows convenient fast cached access to your BaseSpace metadata and files you may find that many

23

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

operations can be carried out without the need to download locally During our testing we have usedBaseMount to grep through fastq files extract blocks of reads from bam files and even use IGV on thebam files directly all without downloading files locally This can be more convenient than includinga download step and saves on the overheads of local storagerdquo -Illumina

Official page

62 Getting Started with BaseMount

Quick Install

1 sudo bash -c $(curl -L httpsbasemountbasespaceilluminacominstall)

Manual install

Supported Operating Systems Ubuntu CentOS

Ubuntu 14 amp 15

1 wget httpsbintraycomartifactdownloadbasespaceBaseSpaceFS-DEBbsfs_11631-1_rarr˓amd64deb

2 wget httpsbintraycomartifactdownloadbasespaceBaseMount-DEBbasemount_012rarr˓463-20150714_amd64deb

3 sudo dpkg -i --force-confmiss bsfs_11631-1_amd64deb4 sudo dpkg -i basemount_012463-20150714_amd64deb

Mounting Your BaseSpace Account

1 basemount --config config_file_prefix mount-point folder2 basemount --config user1 ~BaseSpace_Mount

621 Example

1 mkdir exportNFSSarangaBaseSpace2 mkdir exportNFSMariaBaseSpace3 basemount --config Maria exportNFSMariaBaseSpace

Thenopenyourin-ter-netbrowser

AfteryouclickldquoAc-ceptrdquo

24 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

Toac-cessthefoldertype

1 cdrarr˓rarr˓export

rarr˓NFSrarr˓Mariararr˓BaseSpacerarr˓

2 ls

Projectsrarr˓

rarr˓Runs

Toseethe

contents in the Projects

1 ls -lsh exportNFSSarangaBaseSpaceProjectsHiSeq 2500 - v4 reagentsrarr˓TruSeq PCR Free (4 replicates of NA12877)SamplesNA12877_Files

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain10Filestotal 85M85M -r--r--r-- 1 root root 85M Oct 5 1409 Brain10_S22_L001_R1_001fastqgz

exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNA Expression (NFkBrarr˓amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain11Filestotal 62M62M -r--r--r-- 1 root root 62M Oct 5 1409 Brain11_S23_L001_R1_001fastqgz

63 Basic analysis of fastq files

You can get basic information about your fastq files without having to download them For instance

bull View sequences inside Fastq files

bull Get the number of reads for each fastq file

bull Get basic statistics and read length distribution

Example View your data

63 Basic analysis of fastq files 25

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | head -n 4

M0343848000000000-AGGNU11101117921006 1N013NTCAATCCCCAGCAGTGGAATAAGGCCTGTTGTTCCTTGCAGTGGATCCTG+88ABFFGCFEEGltFFltFDFFEGGFGGFCFGFFGGEGGGGGGFGGFGGGGG

Example Count the number of sequences using native Linux commands

1 zcat exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)SamplesBrain1rarr˓FilesBrain1_S13_L001_R1_001fastqgz | grep -c M03438

838876

FILE SIZE 85M

TIME TAKEN 1327s

Example Count the number of sequneces using using fastqutils

1 fastqutils names exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz | wc -l

838876

FILE SIZE 85M

TIME TAKEN 2300s

Example Get the Length distribution and other statistics

1 fastqutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓SamplesBrain1FilesBrain1_S13_L001_R1_001fastqgz

Space basespacePairing FragmentedQuality scale IlluminaNumber of reads 838876

Length distributionMean 510StdDev 00Min 5125 percentile 51Median 5175 percentile 51Max 51

(continues on next page)

26 Chapter 6 Using BaseMount

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

(continued from previous page)

Total 838876

Quality distributionpos mean stdev min 25pct 50pct 75pct max count1 335948650337 215033983948 2 34 34 34 34 8388762 336285434319 201137931152 12 34 34 34 34 8388763 336910246568 181306178651 11 34 34 34 34 8388764 336861812711 184720681841 11 34 34 34 34 8388765 336998292954 173881480815 12 34 34 34 34 8388766 372788564698 249830369426 2 38 38 38 38 8388767 373219820331 239576158974 2 38 38 38 38 8388768 371795640834 272499009837 2 38 38 38 38 8388769 37158373824 276769991544 10 38 38 38 38 83887610 370991517221 293041515545 10 38 38 38 38 83887611 37080357526 300286915879 10 38 38 38 38 83887612 370752506926 293530617165 10 38 38 38 38 83887613 371372705859 285104291311 10 38 38 38 38 83887614 370559641711 300820427859 10 38 38 38 38 83887615 370753877808 299577677236 10 38 38 38 38 83887616 370950426523 292821324956 10 38 38 38 38 83887617 37204973083 264727500649 10 38 38 38 38 83887618 371618308308 275627448135 10 38 38 38 38 83887619 370932426247 292692822638 10 38 38 38 38 83887620 371103548081 289001817524 10 38 38 38 38 83887621 37117659821 290476888945 10 38 38 38 38 83887622 371280034236 280218109494 10 38 38 38 38 83887623 369527546383 318334971719 9 37 38 38 38 83887624 371825096915 270092644993 10 38 38 38 38 83887625 372478948021 258021368225 10 38 38 38 38 83887626 371056830807 304029966339 10 38 38 38 38 83887627 370417129588 321755695865 9 38 38 38 38 83887628 369245287742 346922882082 9 38 38 38 38 83887629 370233538687 322132395112 9 38 38 38 38 83887630 369768440151 336846192959 9 38 38 38 38 83887631 369019378311 351733823479 10 38 38 38 38 83887632 370442532627 315497307231 10 38 38 38 38 83887633 370763462061 305842992907 10 38 38 38 38 83887634 369112836701 351048896466 10 38 38 38 38 83887635 36871104907 350954115324 10 37 38 38 38 83887636 369835911386 329022069717 9 38 38 38 38 83887637 369526103977 340894509478 9 38 38 38 38 83887638 369730198504 335198104312 10 38 38 38 38 83887639 36925962836 342925499742 9 38 38 38 38 83887640 369641508399 33749237703 10 38 38 38 38 83887641 369701290775 337108719426 9 38 38 38 38 83887642 36925310773 346541098262 9 38 38 38 38 83887643 364323821399 437591782904 9 37 38 38 38 83887644 366849510536 384804558398 10 37 38 38 38 83887645 368000574578 371757395575 10 37 38 38 38 83887646 36743696327 389392229051 9 37 38 38 38 83887647 366721195981 406597933483 10 37 38 38 38 83887648 367998965282 373484635736 9 37 38 38 38 83887649 369068074423 346537970313 10 38 38 38 38 83887650 36828634983 367970695764 10 37 38 38 38 83887651 367344208202 375309659394 9 37 38 38 38 838876

Average quality string(continues on next page)

63 Basic analysis of fastq files 27

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

(continued from previous page)

BBBBBFFFFFFFFFFFFFFFFFEFFFFEFEEFFEEEEEEEEEEEEEEEEEE

FILE SIZE 85M

TIME TAKEN 110m

64 Basic analysis on Alignment files (BAM)

Example Check bam headers

1 samtools view -H exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeqrarr˓Targeted RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)rarr˓AppSessionsTopHat Alignment No cSNP or RNA Editing found (36 Samples)rarr˓AppResults26970091Brain1FilesalignmentsBrain1alignmentsbam

HD VN10 SOcoordinateRG ID19 SMBrain1SQ SNchrM LN16571SQ SNchr1 LN249250621SQ SNchr2 LN243199373SQ SNchr3 LN198022430SQ SNchr4 LN191154276SQ SNchr5 LN180915260SQ SNchr6 LN171115067SQ SNchr7 LN159138663SQ SNchr8 LN146364022SQ SNchr9 LN141213431SQ SNchr10 LN135534747SQ SNchr11 LN135006516SQ SNchr12 LN133851895SQ SNchr13 LN115169878SQ SNchr14 LN107349540SQ SNchr15 LN102531392SQ SNchr16 LN90354753SQ SNchr17 LN81195210SQ SNchr18 LN78077248SQ SNchr19 LN59128983SQ SNchr20 LN63025520SQ SNchr21 LN48129895SQ SNchr22 LN51304566SQ SNchrX LN155270560SQ SNchrY LN59373566PG IDTopHat VN207 CLilluminadevelopmentIsisRNA24195IsisRNA_Toolsrarr˓bintophat --bowtie1 --read-realign-edit-dist 1 --segment-length 24 -o datainputrarr˓AlignmentsamplesBrain1replicatesBrain1tophat_main -p 1 --GTF illuminararr˓developmentGenomesHomo_sapiensUCSChg19AnnotationGenesgenesgtf --rg-id 19 --rarr˓rg-sample Brain1 --library-type fr-firststrand --no-coverage-search --keep-fasta-rarr˓order illuminadevelopmentGenomesHomo_sapiensUCSChg19SequenceBowtieIndexrarr˓genome datainputAlignmentsamplesBrain1replicatesBrain1filteredBrain1_S20_rarr˓L001_R1_001fastqgz

FILE SIZE 17M

TIME TAKEN 0006s

28 Chapter 6 Using BaseMount

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

Example Check basic statistics on a Bam file

1 bamutils stats exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targetedrarr˓RNA Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)AppSessionsrarr˓TopHat Alignment No cSNP or RNA Editing found (36 Samples)AppResultsrarr˓26970091Brain1FilesalignmentsBrain1alignmentsbam

Reads 766531Mapped 766531Unmapped 0

Flag distribution[0x010] Reverse complimented 383242 5000[0x100] Secondary alignment 47 001

Reference counts countchr1 32252chr10 6829chr11 45127chr12 74524chr13 24336chr14 81664chr15 254chr16 5704chr17 46662chr18 9922chr19 25644chr2 32527chr20 10708chr21 22chr22 7805chr3 83585chr4 42487chr5 47865chr6 106512chr7 18024chr8 6921chr9 25334chrM 0chrX 31823chrY 0

65 Metadata

In each directory BaseMount provides a number of hidden files with extra BaseSpace metadata These are hiddenfiles and their names start with a ldquordquo

1 cd exportNFSSarangaBaseSpaceProjectsMiSeq v3 TruSeq Targeted RNArarr˓Expression (NFkB amp Cell Cycle Human Brain-Liver-UHRR)Samples

2 ls -l List only the hidden files

-r-------- 1 swijeratne swijeratne 149 Nov 16 1129 curl

(continues on next page)

65 Metadata 29

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

(continued from previous page)

-r--r--r-- 1 swijeratne swijeratne 40674 Nov 16 1129 json

Display content of the json file

1 cat json

Query through a json file qith ldquojqrdquo

1 cat json | jq ResponseItems[] | select(NumReadsPF ) | Name Name PF rarr˓NumReadsPF

2 cat json | jq ResponseItems[] | select(NumReadsPF ) | ( Name)t(NumReadsPF)rarr˓

3 cat json | jq ResponseItems[] | select(NumReadsPF gt 747912) | ( Name)t(rarr˓NumReadsPF)

66 Limitations of BaseMount

According to Illumina

Every new directory access made by BaseMount relies on FUSE the BaseSpace API and the userrsquos credentials Thismechanism means that as currently available BaseMount does not support the following types of access

bull Cluster access where many compute nodes can access the files FUSE mounted file systems are per-host andcannot be accessed from many hosts without additional infrastructure

bull BaseMount also doesnrsquot refresh files or directories In order to reflect changes done via the Web GUI in yourcommand line tree you currently need to unmount (basemount ndashunmount ) and restart BaseMount

bull The Runs Files directory is not mounted automatically for you as there can be 100k + files available in thatmount which can take a couple minutes to load for really large runs You can still mount this directory manuallyif needed

bull In general lots of concurrent requests can cause stability issues on a resource constrained system Keep in mindthis is an early release and stability will increase

30 Chapter 6 Using BaseMount

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

CHAPTER 7

Download from hudsonalphaorg

Note

Required OS OS x or Linux Windows users please contact Maria Elena Hernandez-Gonzalez

Software wget curl

Terminal emulator

bull Terminal (OS x)

bull Genome Terminal or Other Emulator (Linux)

Author This document was created by Saranga Wijeratne

71 Download

1 Create a Samplestxt file with your sample links (the links are provided in the Excelsheet) as follows

Content of the Samplestxthttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzhttpmysampledownloadorgdld4Meuliamyprojectnumberdata_150522C6V7FANXX_rarr˓s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Use the Terminal and navigate to the location where Samplestxt is saved

1 If your Samplestxt is saved under ~Downloads2 $ cd ~Downloads

31

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

3 On OS x issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do curl --progress-bar -O $f done

4 On Linux issue the following command to download your files

1 $ for f in $(cat Samplestxt ) do wget -v $f done

72 Check checksum

To detect errors which may have been introduced during the downloading you have to run checksum on your down-loaded files

1 Navigate to the location where you have downloaded your files

1 If your files are saved under ~Downloads2 $ cd ~Downloads

2 Then if yoursquore on OS x Terminal type in the following command

1 $ md5

MD5 (C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgz) =rarr˓d41d8cd428f00b204e9800998ecf8427eMD5 (C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgz) =rarr˓d49d8cdf00j204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz) =rarr˓d47d8cd98dfds0b204e9800998ecf8427eMD5 (C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgz) =rarr˓d42d8cd98f00bdfse9800998ecf8427e

If yoursquore on Linux terminal type in the following commmand

1 $ md5sum

d41d8cd428f00b204e9800998ecf8427e C6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_rarr˓D501_SL104549fastqgzd49d8cdf00j204e9800998ecf8427ed56 C6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_rarr˓D506_SL104602fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_rarr˓D501_SL104565fastqgzd47d8cd98dfds0b204e9800998ecf8427e C6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_rarr˓D508_SL104628fastqgz

Tip Match these checksum values with the values provided in the Excelsheet For any samples with mismatchingchecksum you have to re-download the samples

32 Chapter 7 Download from hudsonalphaorg

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

CHAPTER 8

Filter a CASAVA-generated fastq file

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Illumina CASAVA-18 FASTQ Filter

Purpose This document provides instructions about how to remove Passing Filter (PF) failed reads froma Fastq file

More Read more about PF here and here

Author This document was created by Saranga Wijeratne

81 Software installation

Note If you are runing this on MCBL mcic-ender-svr please skip the installation Following command will load thesoftware module to your environment

1 $ module load fasq_filter01

On your own server

Warning If you donrsquot have administrator privileges on the machine you wouldnrsquot be able run sudo (lastcommand in the following code block) commands

33

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

1 $ wget httpcancancshledulabmembersgordonfastq_illumina_filterfastq_illumina_rarr˓filter-01targz

2 $ tar -xzf fastq_illumina_filter-01targz3 $ cd fastq_illumina_filter-014 $ make5 $ sudo cp fastq_illumina_filter usrlocalbin

Tip Put your executables in ~bin or full-path to executables in $PATH in the absence of sudo privilages

82 Filter a fastq file

Input File C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz

Output File C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

1 $ zcat C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz | fastq_illumina_filter -rarr˓vvN | gzip gt C8EC8ANXX_s2_1_illumina12index_1_SL143785filteredfastqgz

83 Filter multiple fastq files

Input File Fastq_filenamestxt

Output Files Individual Fastq files

1 Create a Fastq_filenamestxt file with your Fastq filenames in seperate lines as follows

Content of SamplestxtC6V7FANXX_s8_0_TruseqHTDual_D712-TruseqHTDual_D508_SL104628fastqgzC6V7FANXX_s3_0_TruseqHTDual_D703-TruseqHTDual_D501_SL104549fastqgzC6V7FANXX_s5_0_TruseqHTDual_D709-TruseqHTDual_D506_SL104602fastqgzC6V7FANXX_s8_0_TruseqHTDual_D705-TruseqHTDual_D501_SL104565fastqgz

2 Save the above file in the same folder with your Fastq files

3 Use the Terminal and navigate to the location where Fastq_filenamestxt is saved

1 If your Fastq_filenamestxt is saved under ~Downloads2 $ cd ~Downloads

4 Type in the following command to filter Fastqs in the Fastq_filenamestxt

1 $ for f in $(cat Fastq_filenamestxt) do zcat $f | fastq_illumina_filter -vvN |rarr˓gzip gt $ffastqgzfilteredfastqgzdone

34 Chapter 8 Filter a CASAVA-generated fastq file

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

CHAPTER 9

Fastq adapter removal and QC

Note

Required OS OS x or Linux Windows users please contact Saranga Wijeratne

Software Trimmomatic

Purpose This document provides instructions about how to remove adapters and filter low quality basesfrom a Fastq file

More Read more about Read trimming adapter removing here

Author This document is created by Saranga Wijeratne

91 Load the software

Note If you are runing this on MCBL mcic-ender-svr following command will load the software module to yourenvironment

1 $ module load Trimmomatic322

then you can get the help how to run Trimmomatic

1 $ java -jar $TRIMHOMEtrimmomatic-033jar

92 Files needed

Input Files In put files should be in fastq formatcompressed fastq( fq fastq fqgzfastqgz) Read Introduction eg C8EC8ANXX_s2_1_illumina12index_1_SL143785fastq

35

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

C8EC8ANXX_s2_1_illumina12index_1_SL143785fastqgz s_1_1_sequencetxtgzlane1_forwardfqgz

Adapter File Currently following Adapter sequence files are hosted in MCBL server

bull TruSeq2-PEfa

bull TruSeq2-SEfa

bull TruSeq3-PEfa

bull TruSeq3-SEfa

bull NexteraPE-PEfa

Warning If you want to make your own adapter sequence file please read the The Adapter Fasta section andMaking cutome clipping files here before you make your Adapter sequence file

93 Code examples

Single-end fastq files

1 $java -jar $TRIMHOMEtrimmomatic-033jar SE -threads 12 s_1_1_sequencetxtgz lane1_rarr˓forwardfqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-SEfa23010 LEADING3rarr˓TRAILING3 SLIDINGWINDOW415 MINLEN36

Paired End Fastq Files

1 $java -jar $TRIMHOMEtrimmomatic-033jar PE -threads 12 C8EC8ANXX_s2_1_rarr˓illumina12index_1_SL143785fastqgz C8EC8ANXX_s2_2_illumina12index_1_SL143785fastqrarr˓gz C8EC8ANXX_s2_1_Trimmed_1Pfastqgz C8EC8ANXX_s2_1_Trimmed_1Ufastqgz C8EC8ANXX_rarr˓s2_2_Trimmed_1Pfastqgz C8EC8ANXX_s2_2_Trimmed_1Ufastqgz ILLUMINACLIP$TRIMHOMErarr˓adaptersTruSeq3-PEfa23010 LEADING3 TRAILING3 SLIDINGWINDOW415 MINLEN36

Multiple fastq files

Tip Assumption has been made that your data in ldquoRaw_Datardquo folder

Input Files C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzC6EF7ANXX_s3_1_illumina12index_19_SL100997fastqgz C6EF7ANXX_s3_1_illumina12index_22_SL100998fastqgzC6EF7ANXX_s3_1_illumina12index_25_SL100999fastqgz C6EF7ANXX_s3_1_illumina12index_27_SL101000fastqgzC6EF7ANXX_s3_1_illumina12index_3_SL100994fastqgz C6EF7ANXX_s3_1_illumina12index_5_SL100995fastqgzC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz C6EF7ANXX_s3_2_illumina12index_19_SL100997fastqgzC6EF7ANXX_s3_2_illumina12index_22_SL100998fastqgz C6EF7ANXX_s3_2_illumina12index_25_SL100999fastqgzC6EF7ANXX_s3_2_illumina12index_27_SL101000fastqgz C6EF7ANXX_s3_2_illumina12index_3_SL100994fastqgzC6EF7ANXX_s3_2_illumina12index_5_SL100995fastqgz

These are paired-end fastq files eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgz andC6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz belongs to single sample

Adapter File $TRIMHOMEadaptersTruSeq3-PEfa (Make sure you change this accordingly)

Output Files Each paired-end read (eg C6EF7ANXX_s3_1_illumina12index_10_SL100996fastqgzand C6EF7ANXX_s3_2_illumina12index_10_SL100996fastqgz) will give 4 outputs

36 Chapter 9 Fastq adapter removal and QC

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Pfastqgz - for paired for-wad reads

bull Q_trimmed_6EF7ANXX_s3_1_illumina12index_10_SL100996_1Ufastqgz - for unpairedforward reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Pfastqgz - for paired re-verse reads

bull Q_trimmed_6EF7ANXX_s3_2_illumina12index_10_SL100996_1Ufastqgz - for unpaired re-verse reads

1 $cd Raw_Data make sure you change the folder name accordingly2 $mkdir Trimmed_Data Output will be saved here3 $files_1=(_s3_1_fastqgz)files_2=(_s3_2_fastqgz)sorted_files_1=($(printf

rarr˓sn $files_1[] | sort -u))sorted_files_2=($(printf sn $files_2[] |rarr˓sort -u))for ((i=0 ilt$sorted_files_1[] i+=1))do java -jar $TRIMHOMErarr˓trimmomatic-033jar PE -threads 12 -trimlog Trimmed_Datalog-j3stat -phred33 $rarr˓sorted_files_1[i] $sorted_files_2[i] Trimmed_DataQ_trimmed_$sorted_files_1[i]rarr˓_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_files_1[i]-Ufastqgz Trimmed_rarr˓DataQ_trimmed_$sorted_files_2[i]_1Pfastqgz Trimmed_DataQ_trimmed_$sorted_rarr˓files_1[i]-Ufastqgz ILLUMINACLIP$TRIMHOMEadaptersTruSeq3-PE23010rarr˓LEADING3 TRAILING3 SLIDINGWINDOW420 MINLEN40 ampgtTrimmed_Datastattxt done

93 Code examples 37

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

38 Chapter 9 Fastq adapter removal and QC

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

CHAPTER 10

DESeq2 with phyloseq

Note

Required OS OS x or Linux

Software R phyloseq R library

Purpose This document provides instructions about how to find differentially abundant OTUs for micro-biome data

More Read more about phyloseq DEseq2 here and here

Author This document was created by Saranga Wijeratne

101 Software installation

Note If you are runing this on MCBL mcic-sel019-d please skip the installation The following command inR-Studio will load the software module to your environment

1 gt library(phyloseq) packageVersion(phyloseq)2 [1] lsquo1162rsquo version

Install the phyloseq package as follows

1 gt source(httpbioconductororgbiocLiteR)2 gt biocLite(phyloseq)

39

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

102 Import data with phyloseq

For this step you need Biom and mapping files generated by the Qiime pipeline

Input biom file otu_table_mc10_w_taxbiom

Qiime mapping file mappingtxt

Output file DESeq2_Out

Copy all the input files to your ldquoWorking Directoryrdquo before you execute the following commands

1 Filenames2 biom_file lt- otu_table_mc10_w_taxbiom3 mapping_file lt- mappingtxt4

5 Import the biom file with phyloseq6 biom_otu_tax lt- import_biom(biom_file)7

8 Import the mapping file with phyloseq9 mapping_file lt- import_qiime_sample_data(mapping_file)

10

11 Merge biom and mapping files with phyloseq12 merged_mapping_biom lt- merge_phyloseq(biom_otu_taxmapping_file)13

14 Set column names in the taxa table15 colnames(tax_table(merged_mapping_biom)) lt- c(kingdom Phylum Class Order

rarr˓Family Genus species)

Now your merged mapping and Biom output should look as follows

1 merged_mapping_biom2

3 phyloseq-class experiment-level object4 otu_table() OTU Table [ 315 taxa and 9 samples ]5 sample_data() Sample Data [ 9 samples by 8 sample variables ]6 tax_table() Taxonomy Table [ 315 taxa by 7 taxonomic ranks ]

The mapping file should look like this

1 head(mapping_file)2

3 Sample Data [40 samples by 7 sample variables]4 XSampleID BarcodeSequence LinkerPrimerSequence InputFileName IncubationDate

rarr˓Treatment Description5 S1 S1 NA NA S1fasta 0

rarr˓ CO CO16 S2 S2 NA NA S2fasta 0

rarr˓ CO CO27 S3 S3 NA NA S3fasta 0

rarr˓ CO CO38 S4 S4 NA NA S4fasta 15

rarr˓ CO CO49 S5 S5 NA NA S5fasta 15

rarr˓ CO CO5

To remove taxonomy level tags assigned to each level (k__ p__ etc) issue the following commands

40 Chapter 10 DESeq2 with phyloseq

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

1 tax_table(merged_mapping_biom) lt- gsub(k__([[alpha]]) 1 tax_table( merged_rarr˓mapping_biom))

2 tax_table(merged_mapping_biom) lt- gsub(p__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

3 tax_table(merged_mapping_biom) lt- gsub(c__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

4 tax_table(merged_mapping_biom) lt- gsub(o__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

5 tax_table(merged_mapping_biom) lt- gsub(f__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

6 tax_table(merged_mapping_biom) lt- gsub(g__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

7 tax_table(merged_mapping_biom) lt- gsub(s__([[alpha]]) 1 tax_table(merged_rarr˓mapping_biom))

8 tax_table(merged_mapping_biom) lt- gsub(p__([)1 tax_table(merged_mapping_rarr˓biom))

9 tax_table(merged_mapping_biom) lt- gsub(c__([)1 tax_table(merged_mapping_rarr˓biom))

10 tax_table(merged_mapping_biom) lt- gsub(o__([)1 tax_table(merged_mapping_rarr˓biom))

11 tax_table(merged_mapping_biom) lt- gsub(f__([)1 tax_table(merged_mapping_rarr˓biom))

12 tax_table(merged_mapping_biom) lt- gsub(g__([)1 tax_table(merged_mapping_rarr˓biom))

13 tax_table(merged_mapping_biom) lt- gsub(s__([)1 tax_table(merged_mapping_rarr˓biom))

103 Testing for differential abundance among OTUs

Input file merged_mapping_biom

Output files DESeq2_Outtxt

1 Load the DESeq2 package into your R environment

1 library(DESeq2)2 packageVersion(DESeq2)3 [1] lsquo1124rsquo

2 Assign DESeq2 output name and padj-cutoff

1 filename_out lt- DESeq2_Outtxt2 alpha lt- 001

3 Convert to DESeqDataSet format The phyloseq_to_deseq2() function converts the phyloseq-format mi-crobiome data (ie merged_mapping_biom) to a DESeqDataSet with dispersion estimated using the experi-mental design formula (ie ~ Treatment)

1 diagdds lt- phyloseq_to_deseq2(merged_mapping_biom ~ Treatment)

4 Run DESeq

1 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)2

3 estimating size factors

(continues on next page)

103 Testing for differential abundance among OTUs 41

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

(continued from previous page)

4 estimating dispersions5 gene-wise dispersion estimates6 mean-dispersion relationship7 final dispersion estimates8 fitting model and testing

Warning If you are getting the following error

Error in estimateSizeFactorsForMatrix(counts(object) locfunc geoMeans =rarr˓geoMeans) every gene contains at least one zero cannot compute lograrr˓geometric meansCalls estimateSizeFactors estimateSizeFactors -gt local -gtrarr˓estimateSizeFactorsForMatrix

Then please execute the following code (see here for more info)

1 gm_mean lt- function(x narm=TRUE) exp(sum(log(x[x gt 0]) narm=narm) rarr˓length(x))

2 geoMeans lt- apply(counts(diagdds) 1 gm_mean)3 diagdds lt- estimateSizeFactors(diagdds geoMeans = geoMeans)4 diagdds lt- DESeq(diagdds test=Wald fitType=parametric)

5 Process the results

The results function creates a table of results Then the res table is filtered by padj lt alpha

1 res lt- results(diagdds cooksCutoff = FALSE)2 sigtab lt- res[which(res$padj lt alpha) ]3 sigtab lt- cbind(as(sigtab dataframe) as(tax_table(merged_mapping_

rarr˓biom)[rownames(sigtab) ] matrix)) Bind taxonomic info to final resultsrarr˓table

4 writecsv(sigtab ascharacter(filename_out)) Write `sigtab` to file

42 Chapter 10 DESeq2 with phyloseq

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

CHAPTER 11

GBS

This documents a pipeline for the analysis of GBS (Genotyping-By-Sequencing) data

Note

Required OS OS x or Linux

Software Tassel 5

Documentation Tassel 50 Wiki

Author This document is created by Saranga Wijeratne

111 File formats

1 File formats that will be using in this analysis

bull HDF5

bull VCF

bull Hapmap

bull Plink

bull Projection Alignment

bull Phylip

bull FASTA more

bull Fastq

bull Numerical Data

ndash Phenotype Format

43

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

ndash Trait Format

ndash Covariate Format

ndash Marker Values as Numerical Co-variates

bull Square Numerical Matrix

bull Table Report

bull TOPM (Tags on Physical Map)

112 Files you need to have

The following files need to be present before you start the pipeline

1 Sequencing data files (fastq or fastqgz)

Note Fastq files should follow this naming convention (more on page 7 here) - FLOWCELL_LANE_fastqgz(eg AL2P1XXX_2_fastqgz) - FLOWCELL_s_LANE_fastqgz (eg AL2P1XXX_s_2_fastqgz) -code_FLOWCELL_s_LANE_fastqgz (eg 00000000_AL2P1XXX_s_2_fastqgz)

1 To rename a fastqgz file2 $ mv AE_S1_L001_R1_001fastqgz AL2P1XXX_1_fastqgz

2 GBSv2 key file (example key file more information)

3 A reference genome

113 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

114 GBSv2 pipeline

1 Load Tassel 50 module

44 Chapter 11 GBS

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

1 $ module load Tassel50

2 Useful commands

To check all the plugins available type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for given Plugin Ex GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options in here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Execute the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

114 GBSv2 pipeline 45

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

46 Chapter 11 GBS

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

CHAPTER 12

A basic microbiome analysis

121 QIIME

Note

Required OS OS x or Linux

Software Qiime 19

Documentation Qiime tutorial

Author This document was created by Saranga Wijeratne

1211 File formats

This section includes descrption of varies file formats including Qiime scripts and parameters files Read more here

Qiime Script index Index of all the scripts used in Qiime

Metadata mapping files Metadata mapping files provide per-sample metadata

Tip A metadata mapping file example is given here Read the section carefully If you are planning to create themapping file by hand read this section

Biom file OTU observation file Read more here

1212 Download the files

For this tutorial we will use Mothur tutorial data from Schloss Wiki These data are 16s rRRNA Amplicons sequencedwith Illumina MiSeq

47

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

1 Create folders

Make a new directory MCICQiime and then cd to move into the dirctory

1 $ mkdir MCICQiime2 $ cd MCICQiime

2 Download data

Download data from Schloss Wiki

For this tutorial download only dataset shown in the image below (ie Example data from Scholoss lab)

Inside the MCICQiime dir issue the following command to get the data The data has been zipped and we will useunzip -j to extract all the files to same directory we are in right now

1 $ wget httpwwwmothurorgwimagesdd6MiSeqSOPDatazip2 $ unzip -j MiSeqSOPDatazip

Rename the files for downstream analyses

1 $ for f in fastq do mv $f $f_Lfastqdone

And explanation of the preceding commands

bull for f in fastq reads any file that ends with fastq one at a time

bull do starts the body of the for loop

bull mv $f do mv $f $f_Lfastq rename $f (ie F3D0_S188_L001_R1_001fastq) to$f_Lfastq (ie F3D0_S188fastq)

bull done finishes the loop

3 An explanation of the data

The files and experiment are described in the Schloss Wiki

Because of the large size of the original dataset (39 GB) we are giving you 21 of the 362 pairsof fastq files For example you will see two files F3D0_S188_L001_R1_001fastq andF3D0_S188_L001_R2_001fastq These two files correspond to Female 3 on Day 0 (ie theday of weaning) The first and all those with R1 correspond to read 1 while the second and all those withR2 correspond to the second or reverse read These sequences are 250 bp and overlap in the V4 region of

48 Chapter 12 A basic microbiome analysis

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

the 16S rRNA gene this region is about 253 bp long So looking at the files in the MiSeq_SOP folderthat yoursquove downloaded you will see 22 fastq files representing 10 time points from Female 3 and 1 mockcommunity You will also see HMP_MOCKv35fasta which contains the sequences used in the mockcommunity that we sequenced in fasta format

1213 GBSv2 pipeline plugins

Plugin DescriptionGBSSeqToTagDBPluginExecuted to pull distinct tags from the database and export them in the fastq format MoreTagExportToFastqPluginRetrieves distinct tags stored in the database and reformats them to a FASTQ file MoreSAMToGBSdbPlugin Used to identify SNPs from aligned tags using the GBS DB MoreDiscoverySNPCallerPluginV2Takes a GBSv2 database file as input and identifies SNPs from the aligned tags MoreSNPQualityProfilerPluginScores all discovered SNPs for various coverage depth and genotypic statistics for a given

set of taxa MoreUpdateSNPPositionQualityPluginReads a quality score file to obtain quality score data for positions stored in the snpposition

table MoreSNPCutPosTagVerificationPluginAllows a user to specify a Cut or SNP position for which they would like data printed MoreGetTagSequenceFromDBPluginTakes an existing GBSv2 SQLite database file as input and returns a tab-delimited file

containing a list of Tag Sequences stored in the specified database file MoreProductionSNPCallerPluginV2Converts data from fastq and keyfile to genotypes then adds these to a genotype file in VCF

or HDF5 format More

1214 GBSv2 pipeline

1 Load the Tassel 50 module

1 $ module load Tassel50

2 Useful commands

To check all the available plugins type

1 $ run_pipelinepl -Xmx200g -ListPlugins

To check all the parameters for a given plugin eg GBSSeqToTagDBPlugin type

1 $ run_pipelinepl -fork1 -GBSSeqToTagDBPlugin -endPlugin -runfork1

Tip Users are recommended to read more about GBS command line options here Page 1-2

3 File preparation

Create necessary folders and copy your raw data (fastqs) reference file and key file to appropriate folder

1 $ mkdir fastq ref key db tagsForAlign hd5

4 Commands for the pipeline

1 $ run_pipelinepl -Xmx200g -fork1 -GBSSeqToTagDBPlugin -i fastq -k keyTomato_keyrarr˓txt -e ApeKI -db dbTomatodb -kmerLength 85 -mnQS 20 -endPlugin -runfork1

2 $ run_pipelinepl -fork1 -TagExportToFastqPlugin -db dbTomatodb -o tagsForAlignrarr˓tagsForAlignfagz -c 5 -endPlugin -runfork1

(continues on next page)

121 QIIME 49

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

(continued from previous page)

3 $ cd ref4 $ bwa index -a is S_lycopersicum_chromosomes250fa5 $ cd 6 $ bwa samse refS_lycopersicum_chromosomes250fa tagsForAligntagsForAlignsai

rarr˓tagsForAligntagsForAlignfagz gt tagsForAligntagsForAlignsam7 $ run_pipelinepl -fork1 -SAMToGBSdbPlugin -i tagsForAligntagsForAlignsam -db db

rarr˓Tomatodb -aProp 00 -aLen 0 -endPlugin -runfork18 $ run_pipelinepl -fork1 -DiscoverySNPCallerPluginV2 -db dbTomatodb -sC chr00 -

rarr˓eC chr12 -mnLCov 01 -mnMAF 001 -endPlugin -runfork19 $ run_pipelinepl -fork1 -ProductionSNPCallerPluginV2 -db dbTomatodb -e ApeKI -i

rarr˓fastq -k keyTomato_key2txt -kmerLength 85 -mnQS 20 -o hd5HapMap_tomatoh5 -rarr˓endPlugin -runfork1

50 Chapter 12 A basic microbiome analysis

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

CHAPTER 13

Indices and tables

bull genindex

bull modindex

bull search

51

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

52 Chapter 13 Indices and tables

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

Python Module Index

aAbout 1

bbasemount 21basespace 9basespace-share 15

cComputingResources 6

dDataDownloading 30DESeq2-phyloseq 37

ffastq_qc 34FilterFastq 32

gGBS 42

hHome 1

mMembership 4Microbiome 45

53

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

MCBL Documentation Release 10

54 Python Module Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index

Index

AAbout (module) 1

Bbasemount (module) 21basespace (module) 9basespace-share (module) 15

CComputingResources (module) 6

DDataDownloading (module) 30DESeq2-phyloseq (module) 37

Ffastq_qc (module) 34FilterFastq (module) 32

GGBS (module) 42

HHome (module) 1

MMembership (module) 4Microbiome (module) 45

55

  • MCIC Computational Biology Lab (MCBL)
  • MCBL membership
  • MCBL servers and computing resources
  • Download from BaseSpace
  • Share runs and projects in BaseSpace
  • Using BaseMount
  • Download from hudsonalphaorg
  • Filter a CASAVA-generated fastq file
  • Fastq adapter removal and QC
  • DESeq2 with phyloseq
  • GBS
  • A basic microbiome analysis
  • Indices and tables
  • Python Module Index
  • Index