RNA-Based Drug Response Biomarker Discovery and Profiling Section 5: Data Analysis Pipeline Review

Post on 13-Feb-2021

  • Application solutions: RNA Drug Response Biomarker Discovery and Screening

    RNA sequencing (RNA-Seq) is increasingly used to discover and profile RNA-based drug response biomarkers, with the aim of improving the efficiency and success rate of the drug development process. While a number of technologies have been used for this application, the capabilities of RNA sequencing promise to be of particular benefit 1,3,4. Consequently, there is a growing need to make RNA sequencing-based workflow solutions for this application accessible to a broader range of potential users, including those without prior experience with next-generation sequencing (NGS).

    Towards that end, this document is designed to serve as a comprehensive resource for prospective users of any level of NGS experience who are considering adopting this application. It contains information that we have found to be particularly helpful to users across multiple stages of the process, from understanding the steps of an RNA sequencing workflow, to matching configuration options to specific program requirements, to preparing a plan for rapid navigation through the implementation process.

    Application Overview: An introduction to RNA-Seq drug response biomarker discovery and profiling

    Workflow Introduction: Key considerations, requirements and recommended components for multiple application use-cases

    Best Practices: "How-to" guidance to facilitate workflow implementation

    Analysis Pipeline Review: A screenshot-based walk-through from raw data through outputs needed to inform candidate assessment and prioritization

    Start-up Advice: Tips from fellow application users and Illumina experts on how to get up and running quickly and smoothly

  • Pipeline overview

    [Figure: pipeline overview, from sequencer output and gene panels to selected biomarkers]

    Stages: Feature Discovery, Candidate Identification, Filtering / prioritization (I – J: investigate and rank)

    Tools: RNA-Seq Alignment, Cohort Analyzer, Correlation Engine, Clarity LIMS

    Workstreams:

    A. Gene fusion calling

    B. Single nucleotide variant (SNV) calling

    C. Gene expression

    D. Fusion enrichments

    E. SNV enrichments

    F. Differential gene / transcript expression

    G. Identify outlier samples

    H. Integrate w/ previous datasets (arrays)

    I. Pathway enrichment?

    J. Correlation with compound, disease, KO, tissue / cell line data?

    K. Sample tracking

    ● This document describes the data analysis portion of the Illumina-recommended workflow for RNA-Seq-based drug response biomarker discovery

    ● The pipeline is divided into multiple workstreams; recommended tools for each as well as overviews of how they are applied are provided in the sections that follow

    ● Additional support resources for this pipeline and the complete application workflow are provided elsewhere

    For Research Use Only. Not for use in diagnostic procedures.

  • Contents

    BaseSpace® Suite analysis tools: What are they and how are they used for this application?

    Workstreams A - B: Gene fusion and SNV calling

    Workstream C: Gene and transcript expression quantification

    Workstream D: Identify gene fusion-based biomarker candidates

    Workstreams E - F: Identify single nucleotide variant, gene expression-based biomarker candidates

    Workstream G: Identify cohort outlier samples

    Workstream H: Integrate data with existing datasets (arrays or other)

    Workstream I: Functional pathway enrichment

    Workstream J: Correlation profiling

    Workstream K: Sample tracking

  • BaseSpace Suite Tools: Workstreams A - D

    The following workstreams are run within the BaseSpace Suite. Each addresses different requirements of RNA drug response biomarker discovery.

    Feature discovery: RNA-Seq Alignment

    - Aligns raw RNA-Seq data to the reference genome

    - Identifies known and novel transcriptome features for preliminary evaluation as biomarker candidates, including gene fusions, SNVs and indels

    - Quantifies gene and transcript abundance for individual samples

  • BaseSpace Suite Tools: Workstreams E - J

    Biomarker identification: Cohort Analyzer

    - Screens for associations between RNA-Seq data and response attributes within a study cohort

    - Provides tools to stratify cohorts, analyze multiple data types, access subject disease and treatment history, identify outlier samples and perform other functions that facilitate data interpretation and cohort optimization

    Assessment and prioritization: Correlation Engine

    - Provides tools to screen candidate biomarkers or complete molecular datasets for correlation with archived data repositories

    - These repositories include compound response data, disease associations, gene knockout studies and functional pathway data

  • Workstreams A – B: Gene fusion and SNV calling

    The identification of gene fusions and SNV-based biomarker candidates from RNA sequencing data occurs while raw RNA sequencing reads are aligned to the reference genome. This function is performed by the RNA-Seq Alignment application.

    1. Under the Apps menu of BaseSpace Suite, select RNA-Seq Alignment.

    2. Click Launch.

  • Workstreams A – B: Gene fusion and SNV calling

    3. Using the pull-down windows provided, choose:

    a) The name of your run

    b) The location where you will save the data

    c) The samples from your cohort to be included in the analysis

    d) The reference genome to which sequences should be aligned

    Note that the Select Sample(s) menu provides filters, including project name, that can be used to locate the desired samples based on the names created at the time of run set-up in PrepTab.

  • Workstreams A – B: Gene fusion and SNV calling

    4. Select your sequence aligner. This is the tool that reads the raw sequence file and determines the location of each read in the transcriptome. The available options include:

    a) STAR

    b) TopHat (Bowtie 1)

    c) TopHat (Bowtie 2)

    Each of these options has been widely applied by RNA-Seq users, but they are based on different algorithms. Note that Bowtie 2 does not enable gene fusion detection, so either STAR or Bowtie 1 must be selected to perform this function. For guidance regarding which aligner to select, please contact an Illumina FAS Bioinformatics Specialist.

  • Workstreams A – B: Gene fusion and SNV calling

    5. There are two additional analysis settings of high importance to this application, particularly for cancer therapeutics research:

    a) Novel Transcript Assembly: Selecting this option ensures that novel sequence variant-based biomarker candidates (for example, indels) are captured

    b) Call Fusions: Must be selected in order for gene fusions (novel and known) to be detected
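The aligner choice and the two analysis settings described above interact: fusion detection needs both a fusion-capable aligner and the Call Fusions option. The following is a minimal Python sketch of a pre-launch check for a planned run. The option labels (`star`, `tophat_bowtie1`, `tophat_bowtie2`) and the function `check_settings` are hypothetical names chosen for illustration; they are not BaseSpace identifiers or part of any BaseSpace API.

```python
# Hypothetical labels for the three aligner options; not BaseSpace identifiers.
ALIGNERS = {"star", "tophat_bowtie1", "tophat_bowtie2"}
# Per the text above, TopHat with Bowtie 2 does not enable gene fusion detection.
FUSION_CAPABLE = {"star", "tophat_bowtie1"}


def check_settings(aligner, call_fusions, novel_transcript_assembly):
    """Return a list of warnings for a planned run; an empty list means none found."""
    if aligner not in ALIGNERS:
        raise ValueError(f"unknown aligner: {aligner!r}")
    warnings = []
    if not call_fusions:
        warnings.append("Call Fusions is off: gene fusions will not be detected")
    elif aligner not in FUSION_CAPABLE:
        warnings.append("selected aligner does not enable gene fusion detection")
    if not novel_transcript_assembly:
        warnings.append("Novel Transcript Assembly is off: novel variants may be missed")
    return warnings


for message in check_settings("tophat_bowtie2", call_fusions=True,
                              novel_transcript_assembly=True):
    print("WARNING:", message)
```

A check like this would only be a convenience for recording planned settings; the actual selections are made in the RNA-Seq Alignment interface.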

  • Workstreams A – B: Gene fusion and SNV calling

    6. Users have the option to customize a number of additional settings through the Advanced Options menu. No specific changes to these settings are critical for this application. For further information regarding which parameters may be modified and the potential impact on the results, please contact an Illumina FAS Bioinformatics Specialist.

  • Workstreams A – B: Gene fusion and SNV calling

    7. Once all desired parameter settings are selected, start the analysis by clicking the Continue icon on the top right of the interface.

    8. When the analysis is complete, an email notification will be sent with a direct link to the results. Results can also be found under the project to which the run was saved (see Project list under Menu).

  • Workstreams A – B: Gene fusion and SNV calling

    9. The app result screen provides a summary of key alignment statistics for each sample. Further information on these parameters as they apply to biomarker discovery can be found in the Best Practices section.

    - Read length: The average read length by sample for each run

    - Number of reads: The total number of reads that were generated for each sample

    - Total aligned: The percentage of reads that aligned to the reference sequence, in which candidates may be captured

    - Abundant: The portion of total reads that map to abundant sequence, such as ribosomal RNA, in which candidates are unlikely to be present

    - Unaligned: The percentage of total reads that failed to align to the reference genome

    - Median CV coverage uniformity: The evenness of the distribution of read coverage across transcripts

    - Stranded: The percentage of reads for which high-confidence strand information is provided

  • Workstreams A – B: Gene fusion and SNV calling

    9 (cont). The summary also provides graphical outputs displaying results for particular performance metrics, including:

    a) Insert length distribution: The insert length refers to the size of the RNA template that is sequenced. For the library preparation kits recommended for the current biomarker discovery workflows, the default median insert size is approximately [...] bp (see the respective product User Guides).

    b) Alignment distribution: Displays the portion of reads aligning to each of four categories of genomic sequence. Outlier samples may indicate a technical issue.

    c) Transcript coverage: Provides a visual representation of the distribution of read coverage across transcripts. A shift in distribution may indicate an issue with a particular sample, such as the quality of the total RNA sample.

  • Workstreams A – B: Gene fusion and SNV calling

    10. Alignment details may also be viewed on a per-sample basis by clicking on individual sample icons on the left side of the summary page. A PDF report can be pulled by clicking the icon at the top of the page.

    A broad range of alignment details (right) is provided in both table and graphical form. These data may be useful in confirming run performance upstream of the candidate evaluation and prioritization process. References for further information on each metric can be found in the BaseSpace Core Apps User Guide.

  • Workstreams A – B: Gene fusion and SNV calling

    10 (cont). Additional alignment details are reported to inform run performance assessment. For further information on each metric, please contact an Illumina FAS Bioinformatics Specialist.

  • Workstream C: Gene and transcript expression quantification

    Cohort Analyzer is recommended for identifying genes or transcripts whose abundance is associated with a drug response attribute of interest. The gene and transcript abundance quantification data required by Cohort Analyzer may be found in the Reference FPKM values file produced by RNA-Seq Alignment (see RNA-Seq Alignment results page below). This file may be downloaded and processed for upload (ref).
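As one illustration of the "download and process" step, the sketch below combines per-sample abundance files into a single gene-by-sample matrix. It is written under stated assumptions: the file layout (tab-delimited, with a header row and columns named `gene` and `FPKM`) is hypothetical and should be checked against the actual file, and the upload format expected by Cohort Analyzer is not covered here.

```python
import csv
import io


def read_fpkm(handle, gene_col="gene", value_col="FPKM"):
    """Read one sample's abundance table into {gene: value}.

    The column names are assumptions; adjust them to the downloaded file.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[gene_col]: float(row[value_col]) for row in reader}


def build_matrix(per_sample):
    """Combine {sample: {gene: value}} into sorted gene and sample lists plus rows.

    Genes missing from a sample are recorded as None rather than assumed to be 0.
    """
    samples = sorted(per_sample)
    genes = sorted(set().union(*(values.keys() for values in per_sample.values())))
    rows = [[per_sample[s].get(g) for s in samples] for g in genes]
    return genes, samples, rows


# Small in-memory stand-ins for two downloaded files (hypothetical values).
sample_1 = read_fpkm(io.StringIO("gene\tFPKM\nTP53\t12.5\nEGFR\t3.0\n"))
sample_2 = read_fpkm(io.StringIO("gene\tFPKM\nTP53\t1.0\n"))
genes, samples, rows = build_matrix({"S1": sample_1, "S2": sample_2})
for gene, row in zip(genes, rows):
    print(gene, row)
```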

  • Workstream D: Identify gene fusion-based biomarker candidates

    The identification of candidate gene fusion-based biomarkers involves three main steps:

    1. Extract fusion call reports from the desired cohort samples

    2. Aggregate and pair with the response attributes reported for each sample

    3. Perform statistical analysis to identify enrichments within response groups

    A list of identified gene fusions may be pulled from RNA-Seq Alignment in multiple ways, and individual users may select whichever best suits their needs.

    1a. First, one may click the Summary report shown in the list under the Analysis Reports menu (see also the Workstreams A – B slides).

  • Workstream D: Identify gene fusion-based biomarker candidates

    1a (cont). The display to the left will appear. Scrolling to the bottom, a table labelled Fusion Calls is displayed. This is a summary display for quick reference that includes an Export option to generate an Excel spreadsheet with the following information:

    Gene1      Chr1   Pos1         Str1  Gene2      Chr2   Pos2         Str2  Paired Read  Split Read
    ANKRD37    chr4   186,320,689  -     UFSP2      chr4   186,324,629  -     3            4
    HIST2H2AC  chr1   149,858,845  -     HIST2H2AB  chr1   149,859,103  -     22           2
    DDB2       chr11  47,260,351   -     ACP2       chr11  47,261,470   -     6            2

    - Name, chromosome of origin and position of each of the two fused genes

    - Counts of the paired reads (read pairs in which each end aligns to a different one of the two fused genes) and split reads (single contiguous reads that contain sequence from both fused genes) that supported the fusion call

    This is a simple way to view identified fusions and may be preferred if Excel will be utilized downstream.
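For users who prefer scripting to Excel, rows with the column layout shown in the Fusion Calls table can be parsed and screened by read support. This is a minimal sketch: the `FusionCall` class, the helper names and the support thresholds are hypothetical, and the thresholds are illustrative values, not Illumina recommendations. The example rows are the three shown in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FusionCall:
    gene1: str
    chr1: str
    pos1: int
    str1: str
    gene2: str
    chr2: str
    pos2: int
    str2: str
    paired_reads: int
    split_reads: int

    @property
    def name(self):
        return f"{self.gene1}-{self.gene2}"


def parse_fusion_row(line):
    """Parse one whitespace-delimited row in the Fusion Calls column order."""
    f = line.split()
    if len(f) != 10:
        raise ValueError(f"expected 10 fields, got {len(f)}: {line!r}")
    to_int = lambda text: int(text.replace(",", ""))  # positions use thousands separators
    return FusionCall(f[0], f[1], to_int(f[2]), f[3],
                      f[4], f[5], to_int(f[6]), f[7],
                      int(f[8]), int(f[9]))


def well_supported(calls, min_paired=3, min_split=3):
    """Keep calls meeting both (illustrative) read-support thresholds."""
    return [c for c in calls
            if c.paired_reads >= min_paired and c.split_reads >= min_split]


rows = [
    "ANKRD37 chr4 186,320,689 - UFSP2 chr4 186,324,629 - 3 4",
    "HIST2H2AC chr1 149,858,845 - HIST2H2AB chr1 149,859,103 - 22 2",
    "DDB2 chr11 47,260,351 - ACP2 chr11 47,261,470 - 6 2",
]
calls = [parse_fusion_row(r) for r in rows]
print([c.name for c in well_supported(calls)])
```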

  • Workstream D: Identify gene fusion-based biomarker candidates

    1b. Scrolling further down the page, you will find a section labelled Important Files for Download. The last file listed includes fusion information.

    If TopHat Fusions was used as the fusion detection algorithm, the file will be labelled TopHat fusion output ( detected fusions).

    If STAR Aligner / the MANTA algorithm was used, the label will read MANTA Fusion output ( detected fusions).

    The output file is an Excel spreadsheet, as shown below, which includes more detailed information about the fusion calls listed in the summary. This includes the sequence contig driving the result as well as the score of each call.

  • Workstream D: Identify gene fusion-based biomarker candidates

    1c. A third option is to click the Output Files link on the top left menu. Clicking on the name of a sample output file will display a menu as shown on the bottom right.

    In this case, the TopHat aligner was selected, so the available output files for fusion data are labelled fusions and TopHat fusion.

    If the STAR aligner was selected, the labels will be MantaFusions and fusions.

    These links provide access to the raw output files, which may be used as inputs for downstream software.

  • Workstream D: Identify gene fusion-based biomarker candidates

    2. Aggregate, pair with response attributes reported for each sample

    Once fusion data have been extracted, one may compile an aggregated list that pairs the distribution of fusions present with the response attribute(s) of interest for each subject in the cohort.

    Sample ID   Detected fusions (X = detected; columns A - H not recoverable here)   Response outcome
    A1          X X X                                                                  Full
    A2          X X X                                                                  Full
    A3          X                                                                      Full
    A4          X X                                                                    Full
    A5          X                                                                      Full
    A6          X X                                                                    Full
    A7          X X X                                                                  Full
    A8                                                                                 Partial
    A9          X                                                                      Partial
    A10         X                                                                      Partial
    A11                                                                                Partial
    A12         X X                                                                    Partial
    A13                                                                                Partial
    A14         X X                                                                    None
    A15                                                                                None
    A16         X                                                                      None
    A17         X                                                                      None
    A18         X                                                                      None
    A19         X                                                                      None
    A20                                                                                None

    3. Run a statistical analysis to determine whether an association with a particular outcome exists

    Example references

    Proc Natl Acad Sci U S A. 2 May ; (22) 2–

    European Urology (2 ) – 2

    Am J Surg Pathol. 2 Oct; ( ) -2
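The aggregation and statistical steps can be sketched in a few lines of Python. The source does not name a specific test, so Fisher's exact test on a 2x2 table is used here purely as one common choice for small cohorts; it is an assumption, not the documented method. The cohort below is hypothetical (it is not the table above), as are the function names and the fusion label `FUS-A`.

```python
from math import comb


def contingency(samples, fusion, responder_label="Full"):
    """Count a 2x2 table (a, b, c, d) for one fusion versus one response group.

    a: fusion present, responder      b: fusion present, other outcome
    c: fusion absent, responder       d: fusion absent, other outcome
    """
    a = b = c = d = 0
    for fusions, outcome in samples.values():
        present = fusion in fusions
        responder = outcome == responder_label
        if present and responder:
            a += 1
        elif present:
            b += 1
        elif responder:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value: sum of all tables no more likely than observed."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # The small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_observed * (1 + 1e-9))


# Hypothetical cohort: sample ID -> (set of detected fusions, response outcome).
cohort = {f"S{i:02d}": ({"FUS-A"}, "Full") for i in range(1, 7)}
cohort["S07"] = (set(), "Full")
cohort["S08"] = ({"FUS-A"}, "None")
cohort.update({f"S{i:02d}": (set(), "None") for i in range(9, 15)})

table = contingency(cohort, "FUS-A")
print(table, round(fisher_exact_two_sided(*table), 4))
```

When many fusions are screened against an outcome, the resulting p-values would normally need a multiple-testing correction before candidates are ranked.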

  • Workstreams E&F – Identify SNV, GEX-based biomarker candidates

    The identification of SNVs and gene / transcript abundance results that are associated with compound response attributes within a drug trial cohort may be performed using Cohort Analyzer. Familiarity with the Cohort Analyzer terms below will be useful in performing this step.

    Key terms and definitions

    ● Feature: A basic entity, such as probeset, gene, protein, SNP, metabolite, etc.

    ● Bioset: A list of features and associated statistical information (e.g., p-values). A bioset typically consists of a set of features with a certain property in a given experiment (e.g., genes that respond to a compound treatment in rat liver). It may also simply be a collection of genes of interest to a researcher.

    ● Biogroup: A set of genes associated by a known functional property. Biogroups may contain pathways, gene ontology, protease families, sets of genes containing common cis-regulatory elements and any other sets of features associated functionally.

    ● Study: A collection of biosets related to a given experiment, associated information (e.g., notes, reports, references, papers) and tags (e.g., disease or tissue studied).

    ● Project: A collection of studies.

    ● Library: High-level organization of projects/studies.

  • Workstreams E&F – Identify SNV, GEX-based biomarker candidates

    1. Upon logging into the BaseSpace Suite, the first screen that you will see is shown below (left). For this workstream, select BaseSpace Cohort Analyzer.

    2. The first screen that will appear is shown below. Here, you will begin to select a set of subjects in which you wish to screen for biomarker candidates. By default, a group of subjects will be selected at start-up. You may begin your selection by clicking on Edit Subject Filters.

  • Workstreams E&F – Identify SNV, GEX-based biomarker candidates

    3. The menu to the right will appear. Sample data for analysis may be selected by applying cohort selection filters based on a number of criteria, including:

    Condition: Select from a list by disease category. The number of available samples is shown next to each selection box.

    Project: Lists projects that include samples meeting the criteria for any selected filters.

    Phenotype: Provides a list of phenotypic parameters for which filters may be applied, including:

    - Cohort demographics: Includes age, gender, race, ethnicity

    - Physical attributes: Characteristics pertinent to disease (e.g. menopause status for breast cancer)

    - Pathology of specimen: e.g. cancer type, grade, stage, etc.

    - Subject phenotype: cancer stage

    - Subject history

    - Tumor molecular characteristics: e.g. HER2 status in breast cancer

    - Molecular datatypes: Datatypes available for selected samples

    Molecular: Allows datasets to be filtered based on molecular results. For the menu of available molecular datatypes to appear, a gene must first be entered into the query window.

  • Workstreams E&F – Identify SNV, GEX-based biomarker candidates

    3 (cont). Once a gene is selected, a menu is displayed listing the molecular data categories and the number of subjects for whom each data type is available. The data categories include:

    - Somatic mutation

    - RNA expression

    - Copy number variation

    - DNA methylation

    - Protein expression

    - mRNA expression

    Here, RNA expression data is selected. If desired, you can filter for genes that meet particular results criteria, such as set levels of upregulation or downregulation relative to normal control.

  • Workstreams E&F – Identify SNV, GEX-based biomarker candidates

    4. Once all of the desired filters have been selected, click the Apply Filters icon.

    5. The main menu will re-appear, now with an updated display at the top outlining the filters that have been selected and the number of subjects meeting these criteria.
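Conceptually, applying several filters keeps only the subjects that satisfy every selected criterion, and the displayed count is the size of that intersection. The sketch below illustrates that logic on a small in-memory list. The field names, values and the `apply_filters` helper are hypothetical; this is not how Cohort Analyzer stores or queries its data.

```python
def apply_filters(subjects, **criteria):
    """Keep subjects whose value for every filtered field is among the allowed values."""
    return [
        subject for subject in subjects
        if all(subject.get(field) in allowed for field, allowed in criteria.items())
    ]


# Hypothetical subject records for illustration only.
subjects = [
    {"id": "S1", "condition": "breast cancer", "project": "P1", "her2_status": "positive"},
    {"id": "S2", "condition": "breast cancer", "project": "P2", "her2_status": "negative"},
    {"id": "S3", "condition": "lung cancer", "project": "P1", "her2_status": "positive"},
]

selected = apply_filters(
    subjects,
    condition={"breast cancer"},
    her2_status={"positive", "negative"},
)
print(len(selected), [s["id"] for s in selected])
```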

  • Workstreams E&F – Identify SNV, GEX-based biomarker candidates

    6. The Phenotype Summary menu also provides a matrixed view reflecting the combined impact of the selected filters on the final subject cohort.

    For example, the table shown displays:

    - the number of subjects falling within the selected age filters

    - the number of these subjects falling within the applied filter

    This display also allows the selected filters to be modified, or additional filters to be added, prior to running your analysis.

  • Workstreams E&F – Identify SNV, GEX-based biomarker candidates

    7. At this step, you also have the option of viewing additional information on individual selected subjects. Subject information can be accessed by clicking the box to the left of the subject ID on the right side of the screen, as shown here.

  • Workstreams E&F – Identify SNV, GEX-based biomarker candidates

    8. Once a subject ID is selected, you can click on the yellow Subject Data tab at the top of the page and access multiple types of subject data, including:

    - Subject data: Includes a range of demographic information, physical attributes, and subject history including procedures and therapies administered, etc.

    - Molecular data: Provides a list of, and access to, subject data for RNA expression, CNV, DNA methylation and somatic mutations generated across a variety of platforms. Here, a list of genes for which RNA expression data is available is shown, with the abundance (RPKM) value of each relative to a normal reference control.

    - Timeline: Provides timeline information on a number of variables, including when samples taken for molecular analysis were collected, treatments administered and histological outcomes. Here, therapeutic history is shown.

  • Workstream E – Identify SNV-based biomarker candidates

    9. Once a cohort of interest has been selected, a number of tools are available to facilitate the identification of somatic variant biomarker candidates.

    The first two options are available through the Genome-Wide Analysis menu.

    Selecting the Marker Frequency option, then the Somatic mutation tab, will display genes in which variants in categories with high risk of biological impact have been reported within the selected subset of subjects, in order of combined frequency. These categories include:

    - Major deletion / rearrangement

    - Stop codon gain or loss

    - Splice site

    - Indels

    - Missense or likely missense
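The ranking described above can be approximated outside the tool as the fraction of selected subjects carrying at least one high-impact variant in each gene. The sketch below is an illustration only: the category labels, record layout and `marker_frequency` function are hypothetical, and how Cohort Analyzer actually computes "combined frequency" is not stated in the source.

```python
from collections import defaultdict

# Hypothetical labels for the high-impact categories listed above.
HIGH_IMPACT = {
    "major_deletion_or_rearrangement",
    "stop_gain_or_loss",
    "splice_site",
    "indel",
    "missense",
}


def marker_frequency(variants, subjects):
    """Rank genes by the fraction of selected subjects with a high-impact variant.

    variants: iterable of (subject_id, gene, category) records.
    subjects: set of subject IDs in the selected cohort subset.
    """
    carriers = defaultdict(set)
    for subject, gene, category in variants:
        if subject in subjects and category in HIGH_IMPACT:
            carriers[gene].add(subject)  # each subject counts once per gene
    n = len(subjects)
    frequency = {gene: len(ids) / n for gene, ids in carriers.items()}
    return sorted(frequency.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical variant records.
variants = [
    ("S1", "TP53", "missense"),
    ("S2", "TP53", "stop_gain_or_loss"),
    ("S2", "PIK3CA", "missense"),
    ("S3", "TP53", "synonymous"),   # not a high-impact category, ignored
    ("S9", "KRAS", "missense"),     # subject outside the selected subset, ignored
]
ranking = marker_frequency(variants, {"S1", "S2", "S3", "S4"})
for gene, fraction in ranking:
    print(gene, fraction)
```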

  • Workstream E – Identify SNV-based biomarker candidates

    9 (cont). Clicking the Graph icon on the far left of any row of the report (see below) will generate the display shown on the right, which includes:

    - The distribution of variants, by category, reported in the identified gene within the selected subset of subjects

    - For reference, the distributions reported across other conditions (e.g. in other forms of cancer)

  • Workstream E – Identify SNV-based biomarker candidates

    9 (cont). Alternatively, one may screen for enrichment of particular SNPs within the selected subject subset by clicking the SNP icon under the Somatic mutation tab.

    A similar report is generated, sorted in rank order of mutation frequency within the selected subject subset.

    Clicking on the Graph icon will generate an allele distribution report for the variant vs the reference allele.

  • Workstream E – Identify SNV-based biomarker candidates

    9 (cont). In this case, the top enriched SNPs show a darkened (selectable) icon for Known biomarker. This result indicates that the particular SNP has been previously reported as a candidate biomarker for at least one outcome.

    As shown to the right, the report provides a list of conditions / indications for which an association with this SNP has been reported. For example, this particular SNP has been reported to be associated with a number of outcomes related to breast cancer.

  • Workstream E – Identify SNV-based biomarker candidates

    9 (cont). A third option is to screen by Codon, that is, for variants that encode a change that will impact the process of translation.

    A similar Graph result, this time reporting aggregated variants impacting the codon of interest, may also be generated (below).

  • Workstream E – Identify SNV-based biomarker candidates

    9 (cont). Additional reports may be generated using the Plot menu. A Marker Frequency Plot displaying the frequency of variants across the selected subject subset is shown below.

    A Co-occurrence Plot may also be generated, displaying the distribution of identified variants across the selected subject subset, including cases in which multiple variants occur in a single individual.
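The idea behind a co-occurrence view can be illustrated by counting, for every pair of variants, how many subjects carry both. The following is a minimal sketch with hypothetical data and a hypothetical `co_occurrence` helper; it shows the counting logic only and does not reproduce the plot or the tool's own calculation.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(subject_variants):
    """Count subjects carrying each pair of variants.

    subject_variants: {subject_id: iterable of variant or gene labels}.
    Returns a Counter keyed by alphabetically ordered label pairs.
    """
    counts = Counter()
    for labels in subject_variants.values():
        # set() removes repeats so a subject contributes once per pair
        for pair in combinations(sorted(set(labels)), 2):
            counts[pair] += 1
    return counts


# Hypothetical per-subject variant lists.
pairs = co_occurrence({
    "S1": ["TP53", "PIK3CA"],
    "S2": ["TP53", "PIK3CA", "KRAS"],
    "S3": ["TP53"],
})
for pair, n in pairs.most_common():
    print(pair, n)
```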

  • Workstream F – Identify GEX-based biomarker candidates

    10. Analyses similar to those described in Step 9 can be applied to the identification of candidate RNA biomarkers.

    Selecting the Marker Frequency option, then the RNA Expression tab, will prompt a screen for genes for which a difference in measured abundance relative to normal control is enriched within the selected subset of subjects. A filter for the degree of this difference can be selected under Advanced Options.

    The resulting list of genes is displayed. For each gene, the portion of subjects across the cohort for which the abundance filtering criteria above were met is indicated.
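The "portion of subjects meeting the abundance criteria" can be illustrated with a simple fold-change threshold. In this sketch each subject has a fold-change ratio for one gene relative to normal control (1.0 means no change), and a 2x threshold classifies it as up, down or unchanged. The function names, the input layout and the example values are hypothetical; the threshold is configurable in the same spirit as the Advanced Options filter.

```python
from collections import Counter


def regulation_call(fold_change, threshold=2.0):
    """Classify a positive fold-change ratio relative to control."""
    if fold_change <= 0:
        raise ValueError("fold change must be a positive ratio")
    if fold_change >= threshold:
        return "up"
    if fold_change <= 1.0 / threshold:
        return "down"
    return "unchanged"


def portion_by_call(fold_changes, threshold=2.0):
    """Fraction of subjects in each class for one gene.

    fold_changes: {subject_id: fold-change ratio vs normal control}.
    """
    calls = Counter(regulation_call(fc, threshold) for fc in fold_changes.values())
    n = len(fold_changes)
    return {label: calls[label] / n for label in ("up", "down", "unchanged")}


# Hypothetical fold changes for one gene across four subjects.
portions = portion_by_call({"S1": 0.4, "S2": 0.5, "S3": 1.1, "S4": 2.5})
print(portions)
```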

  • Workstream F – Identify GEX-based biomarker candidates

    10 (cont). From the same menu, one may select the Graph icon to generate the output shown here.

    The top graph displays the distribution of expression values across the selected cohort. As indicated in the prior slide, this particular gene is downregulated by the set 2x threshold in all subjects within the selected subset.

  • Workstream F – Identify GEX-based biomarker candidates

    10 (cont). The third graph displays the relative expression results for the gene of interest across archived samples for other conditions. Shown here is the distribution of observed upregulation, downregulation and lack of change measured for each.

  • Workstream F – Identify GEX-based biomarker candidates

    10 (cont). Additional data outputs may be generated from the analysis as well. Selecting the Co-occurrence Plot from the menu on the top left generates the report shown here.

    Displayed are sets of genes that are found to be similarly differentially regulated across the selected subset of subjects. Note that genes that are downregulated (blue) and genes that are upregulated (red) relative to control are both included.

    The display also provides the portion of the selected subset for which the co-occurrence result holds. For the genes displayed here, the result applies across the complete subset of subjects.

  • Workstream F – Identify GEX-based biomarker candidates

    10 (cont). As an extension of this analysis, one may also generate a Multi Co-occurrence Plot, which overlays additional forms of molecular data (somatic variants, CNVs, methylation) that are available from the same subjects for the identified genes.

    In the example here, a subset of the selected subjects shows both decreased expression and increased methylation in the gene SDPR.

  • A

    . In addition to the options under the Genome- ide Analysis menu, the olecular Comparison tool under Group Comparison can be leveraged for RNA biomarker identification

    The first step is to select the Group Comparison Setup icon.

    Workstreams E&F – Identify SNV, GEX-based biomarker candidates


  • A (cont). The top left tab enables comparisons between groups fitting particular outcome criteria, for example, subjects showing complete vs partial response to a candidate compound. Once the attributes of interest are selected, click Next.

    On the next screen, the selected attributes will appear in the left window and may be dragged and dropped into the boxes for Group A and Group B. As filters are added, a preview summary shows the number of subjects meeting the selection criteria for each group. Once the set-up is complete, click Apply.
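    The logic behind the Group A / Group B set-up and its preview summary can be sketched as attribute filters applied to a subject list. This is only an illustration: the attribute names (`response`, `therapy`) and values are hypothetical and do not reflect the tool's data model.

```python
from typing import Callable, Dict, List

Subject = Dict[str, str]
Filter = Callable[[Subject], bool]


def select_group(subjects: List[Subject], filters: List[Filter]) -> List[Subject]:
    """Return the subjects that satisfy every filter."""
    return [s for s in subjects if all(f(s) for f in filters)]


# Hypothetical subject attributes; field names are illustrative only.
subjects = [
    {"id": "S1", "response": "complete", "therapy": "compound_x"},
    {"id": "S2", "response": "partial", "therapy": "compound_x"},
    {"id": "S3", "response": "complete", "therapy": "compound_x"},
    {"id": "S4", "response": "partial", "therapy": "placebo"},
]


def on_compound(s: Subject) -> bool:
    return s["therapy"] == "compound_x"


group_a = select_group(subjects, [on_compound, lambda s: s["response"] == "complete"])
group_b = select_group(subjects, [on_compound, lambda s: s["response"] == "partial"])

# Preview summary: number of subjects meeting the criteria for each group.
preview = {"Group A": len(group_a), "Group B": len(group_b)}
print(preview)
```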


  • A (cont). The output of this comparison is a list of genes for which expression level differs between the queried subsets of subjects. The degree of the difference and associated significance values are provided for each. If the gene has been previously reported as a biomarker, this will be indicated as well.

    Clicking the graph icon generates the output below. Here, the distribution of fold change values vs control across each of the two queried subject groups is shown.

    You may also click on Summary to view additional details about the gene, as well as a link to further query this gene in Correlation Engine.
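    To make "degree of the difference" and "significance" concrete, the sketch below ranks genes by the difference in mean log2 fold change between two subject groups, using Welch's t statistic as a stand-in measure. The choice of statistic is an assumption for illustration; the method used by the Molecular Comparison tool is not stated here. Gene names and values are invented, and converting t to a p-value (which requires a t distribution from a statistics library) is omitted.

```python
import math
from statistics import mean, variance
from typing import Dict, List, Tuple


def welch_t(a: List[float], b: List[float]) -> float:
    """Welch's t statistic for two samples with unequal variances.

    Assumes each sample has at least two values and that the combined
    standard error is non-zero.
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def compare_groups(
    group_a: Dict[str, List[float]], group_b: Dict[str, List[float]]
) -> List[Tuple[str, float, float]]:
    """Return (gene, mean difference, Welch t), sorted by |t|, largest first."""
    rows = []
    for gene in sorted(group_a.keys() & group_b.keys()):
        a, b = group_a[gene], group_b[gene]
        rows.append((gene, mean(a) - mean(b), welch_t(a, b)))
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)


# Hypothetical log2 fold changes vs control, one value per subject.
complete = {"GENE_A": [2.0, 2.2, 1.8], "GENE_B": [0.1, -0.1, 0.0]}
partial = {"GENE_A": [0.2, 0.0, -0.2], "GENE_B": [0.0, 0.2, -0.2]}

results = compare_groups(complete, partial)
for gene, diff, t in results:
    print(f"{gene}: difference={diff:+.2f}, t={t:.2f}")
```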


  • Workstream G – Identify outlier samples

    Cohort Analyzer may be used to screen subject cohorts for genes that are expressed at markedly higher or lower levels within individual subjects and/or subsets of subjects within the selected population. This information may be useful in informing data interpretation, as well as stratification of the drug trial cohort.

    1. Select the subject cohort in which outliers are to be identified. The process of identifying and filtering biosets for analysis is similar to that described for Workstreams E&F (Identifying SNV and GEX-based biomarker candidates). Here, the same filters applied for that workstream example have been used. Subject data from imported in-house projects or archived data may be mined, and both outcome- and therapy-based filters applied.

  • 2. Once the cohort to be analyzed is selected, click the Outlier Analysis tab under Genome-wide Analysis. The RNA expression tab below will automatically be selected, and the analysis will initiate.

    Once completed, the output shown will be displayed. The data provided include:

    - Meta outlier score: Provides a graphical representation of the degree of the detected expression difference in outlier samples vs the complete cohort

    - FDR: False discovery rate calculated for the outlier value within the cohort

    - Product Rank P-value: Level of statistical significance of the difference between the outlier value and the cohort

    - Median COPA: Provides an outlier score based on COPA (Cancer Outlier Profile Analysis) criteria
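    For readers unfamiliar with COPA, the sketch below shows the transformation as it is commonly described in the literature: each gene's expression values are median-centered, scaled by the median absolute deviation (MAD), and summarized by an upper percentile. The 90th percentile and the example values are assumptions chosen for illustration; how Cohort Analyzer derives its Median COPA column is not specified here.

```python
from statistics import median
from typing import List


def copa_transform(values: List[float]) -> List[float]:
    """Median-center one gene's expression values and scale by the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the gene shows no variation")
    return [(v - med) / mad for v in values]


def percentile(values: List[float], q: float) -> float:
    """Percentile (0 <= q <= 100) by linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def copa_score(values: List[float], q: float = 90) -> float:
    """Upper-percentile COPA score; large values suggest an outlier subset."""
    return percentile(copa_transform(values), q)


# Hypothetical expression values for ten subjects.
outlier_gene = [10, 11, 9, 10, 12, 8, 10, 60, 70, 11]  # two subjects stand out
typical_gene = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]

print(copa_score(outlier_gene), copa_score(typical_gene))
```

    Because the median and MAD are insensitive to a small number of extreme values, a gene that is strongly overexpressed in only a few subjects receives a high score, whereas a mean-based statistic would dilute the signal.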


  • 2 (cont). Clicking the icon on the far left will produce a graphical result output. There are two graphical options available:

    - Waterfall graph: For each bioset analyzed, the relative expression within the subject subset that drove the outlier result vs the complete cohort is shown.

    - Summary graph: Provides name and chromosomal position information on the identified gene, as well as a link for additional gene information via Correlation Engine.

  • 2

    Another available graphical result output is the Meta Analysis Heatmap (below). For each of the top identified outlier genes, the graph displays:

    - The degree of differential expression vs the cohort

    - The normalized rank score of the expression of that gene across each of the biosets included in the analysis

    The interface also allows for results to be filtered based on the percentile score of the COPA values.

  • Workstream H – Data integration

    Once RNA sequencing data have been generated for a cohort, it may be useful to integrate these data with existing datasets for one or more reasons:

    - If your program is transitioning from legacy platforms to NGS, it may be necessary to combine data generated using RNA sequencing and GEX arrays, qPCR, or other platforms into a single biomarker candidate screen.

    - Alternatively, molecular data signatures observed within a cohort may be represented in other datasets, either from similar study designs (e.g. a trial for a similar compound targeting the same lead indication) or from an unrelated study for which a link could provide new insights to inform the candidate evaluation process.

    The following steps outline how functionality within Correlation Engine addresses each of these needs by enabling meta-analysis of datasets generated 1) using either the same or multiple platforms and 2) as part of any internally generated or archived study.
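    One generic way to place results from different platforms on a common scale is to replace platform-specific values with within-dataset ranks. The sketch below illustrates that idea only; it is an assumption for illustration, not a description of how Correlation Engine integrates data, and the gene names and values are invented.

```python
from typing import Dict


def normalized_ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Map each gene to a rank in (0, 1]; 1.0 is the largest absolute change.

    Tied genes receive the average of the ranks they span.
    """
    ordered = sorted(scores, key=lambda g: abs(scores[g]))
    n = len(ordered)
    ranks: Dict[str, float] = {}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(scores[ordered[j + 1]]) == abs(scores[ordered[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for gene in ordered[i:j + 1]:
            ranks[gene] = avg_rank / n
        i = j + 1
    return ranks


# Hypothetical fold changes for the same genes measured on two platforms.
rnaseq = {"GENE_A": 3.2, "GENE_B": -1.1, "GENE_C": 0.4}
array = {"GENE_A": 1.9, "GENE_B": -0.3, "GENE_C": 0.6}

rank_rnaseq = normalized_ranks(rnaseq)
rank_array = normalized_ranks(array)

# Average the platform-independent ranks for genes present in both datasets.
combined = {
    gene: (rank_rnaseq[gene] + rank_array[gene]) / 2
    for gene in rank_rnaseq.keys() & rank_array.keys()
}
print(sorted(combined.items(), key=lambda item: item[1], reverse=True))
```

    Ranks discard the absolute scale of each platform, which is the property that makes sequencing, array, and qPCR results comparable in a single screen; the cost is that the magnitude of individual changes is no longer visible.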


  • 1. Select projects from which you wish to include biosets in your meta-analysis.

    Biosets can be located through a search query. When a project category is selected, a list of individual projects, and the number of available biosets under each, is displayed.

  • 2. Clicking on the number of biosets within a given project will display the bioset details.

    3. Any bioset of interest (irrespective of the platform on which it was generated) may be added to a meta-analysis for comparison by clicking the Add to meta-analysis icon to the left.

    Clicking on the Meta-analysis icon on the top right will then display an aggregate list of all selected biosets.

  • The selected biosets may then be compared using multiple query terms, including:

    a) Gene results: Identifies genes for which expression results correlate across query biosets

    b) Biogroup results: Identifies functional pathways for which enrichment among differentially regulated genes correlates across query biosets

    c) Bioset results: Identifies additional curated biosets for which GEX correlates with the query biosets

    a. Shown below are the results of a Gene Results query, which identifies genes for which expression correlates across the selected biosets. Included, from left:

    a) A visualization of the expression of each identified gene across the query biosets

    b) Score and specificity of the result (i.e. the fraction of selected biosets for which the correlation holds)

    c) Checkbox to generate a graph to view / export

  • b. Clicking on the blue arrow next to each result will prompt the display below, which provides expression and correlation metrics for the identified gene across each of the individual biosets.

    c. Clicking the blue triangle to the left of each bioset result will then display the RNA expression details for each.

  • d. Clicking on a red square next to a result, then NextBio Summary, will display additional detected correlations between the query bioset and archived biosets across categories within Correlation Engine, including:

    - Body Atlas: Genes that are enriched or expressed in specific tissues, cell types, cell lines, and stem cells

    - Pharmaco Atlas: Genes, sequence regions, biogroups, and biosets affected by particular compounds and treatments

    - Pathway Enrichment: Functional pathways for which genes in the query bioset, phenotype, or compound are highly enriched

    - Disease Atlas: Genes, sequence regions, SNPs, biogroups, or biosets associated with diseases, traits, conditions, and surrogate endpoints

    - Knockdown Atlas: Genes, sequence regions, biogroups, or biosets affected by knockdown, knockout, or overexpression experiments

  • Alternatively, Biogroups results provide:

    a) The expression profile of genes driving the functional pathway enrichment detected across the query biosets

    b) The score and specificity of the correlation

    c) Checkbox to generate a graph to view / export

    a. Clicking on the blue arrow to the left of each result displays results data for the individual bioset. This includes:

    a) A visualization of the expression of each identified gene across the query biosets

    b) Scoring for strength and specificity of correlation

    c) Checkbox to generate a graph to view / export

  • b. Clicking on the blue triangle next to each bioset result will prompt a visual display showing the differentially expressed genes driving the result and details regarding the corresponding functional pathway (biogroup) enrichment:

    - Total number of genes represented in each, and distribution of directional regulation

    - Visual summary of gene overlap between biosets and statistics on the bioset-to-bioset correlation

    - A gene-by-gene breakdown of fold change across biosets
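    The three elements listed above (gene counts with direction of regulation, gene overlap, and a gene-by-gene fold change breakdown) can be sketched for two biosets as follows. The data layout (gene name mapped to fold change, with sign indicating direction) and all names are assumptions for illustration; the statistics Correlation Engine reports on the overlap are not reproduced here.

```python
from typing import Dict, List, Tuple


def direction_counts(bioset: Dict[str, float]) -> Tuple[int, int]:
    """Return (number upregulated, number downregulated) for one bioset."""
    up = sum(1 for fc in bioset.values() if fc > 0)
    down = sum(1 for fc in bioset.values() if fc < 0)
    return up, down


def compare_biosets(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, object]:
    """Summarize gene overlap and directional agreement between two biosets."""
    shared: List[str] = sorted(a.keys() & b.keys())
    return {
        "counts_a": direction_counts(a),
        "counts_b": direction_counts(b),
        "shared": shared,
        # Same sign of fold change in both biosets.
        "same_direction": [g for g in shared if a[g] * b[g] > 0],
        # Opposite sign of fold change.
        "opposite_direction": [g for g in shared if a[g] * b[g] < 0],
        # Gene-by-gene breakdown of fold change across the two biosets.
        "breakdown": [(g, a[g], b[g]) for g in shared],
    }


# Hypothetical signed fold changes.
query = {"GENE_A": 2.5, "GENE_B": -1.8, "GENE_C": 1.2, "GENE_D": -0.7}
archived = {"GENE_A": 1.9, "GENE_B": -2.2, "GENE_C": -0.4, "GENE_E": 3.0}

summary = compare_biosets(query, archived)
for gene, fc_query, fc_archived in summary["breakdown"]:
    print(f"{gene}: query={fc_query:+.1f}, archived={fc_archived:+.1f}")
```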


  • Biosets Results provide similar output:

    a) Expression profile of genes driving the detected correlation between the query and the identified bioset

    b) The score and specificity of the correlation

    c) Checkbox to generate a graph to view / export

    a. Clicking on the blue arrow to the left of each result displays results data for the individual bioset. This includes:

    a) A visualization of the expression result of the identified gene across each query bioset

    b) Scoring for strength and specificity of correlation

    c) Checkbox to generate a graph to view / export

  • 2

    b. A comparison summary of the selected query bioset vs correlated biosets is displayed. This includes:

    - Total number of genes represented in each, and distribution of directional regulation

    - Visual summary of gene overlap between biosets and statistics on the bioset-to-bioset correlation

    - A gene-by-gene breakdown of fold change across biosets

  • Workstream I – Functional pathway enrichment

    The biomarker candidate assessment / prioritization process may benefit from visibility into cases in which induction or downregulation of a functional pathway, rather than just a single gene, is predictive of a response attribute. Such a finding may affect how, or whether, a result or set of results is pursued.

    This step may be performed as part of a meta-analysis (using multiple biosets as a single query) as highlighted in Workstream H, Data Integration (slides - 2). It may also be run on an individual cohort, as reviewed here.

  • 1. Select a bioset for which functional pathway enrichment will be queried.

    It may be an internal bioset (e.g. a drug trial cohort) or an archived bioset. For example, one might query for a result of interest, select a project in which the result was observed, then select a bioset from that project.

  • 2. Once a bioset of interest is selected, click the Pathway Enrichment icon on the top menu.

    The resulting output will display the following:

    - Name of the functional pathway in which enrichment was identified

    - The score, p value, and number of common genes driving the result

    - The direction of pathway regulation

  • Clicking the blue arrow to the left of any result will prompt an enrichment summary to be displayed. This includes:

    - Total number of genes in the query bioset that overlap with genes included in the identified functional pathway

    - Visual summary and p value of the enrichment

    - A list of individual overlapping genes with rank of correlation and fold change
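    As background for the overlap count and p value listed above, a one-sided hypergeometric test is a widely used way to ask whether a query gene list overlaps a pathway more than expected by chance. The sketch below is a generic illustration under that assumption; the scoring method actually used by the Pathway Enrichment tool is not described here, and the gene sets and background size are invented.

```python
from math import comb
from typing import Set


def hypergeom_enrichment_p(population: int, pathway: int, query: int, overlap: int) -> float:
    """P(X >= overlap) when `query` genes are drawn without replacement
    from `population` genes, of which `pathway` belong to the pathway."""
    total = comb(population, query)
    tail = sum(
        comb(pathway, k) * comb(population - pathway, query - k)
        for k in range(overlap, min(pathway, query) + 1)
    )
    return tail / total


# Hypothetical gene sets and background size.
background_size = 20
pathway_genes: Set[str] = {"G1", "G2", "G3", "G4", "G5"}
query_genes: Set[str] = {"G1", "G2", "G3", "G9"}

overlapping = sorted(query_genes & pathway_genes)
p_value = hypergeom_enrichment_p(
    population=background_size,
    pathway=len(pathway_genes),
    query=len(query_genes),
    overlap=len(overlapping),
)
print(overlapping, round(p_value, 4))
```

    The result depends strongly on the choice of background population, so the background should reflect the genes that could have been detected on the platform in question.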


  • Clicking the red box to the right of the result will display a NextBio Summary, which lists data categories in which study results show enrichment for the same functional pathway.

    Clicking on an individual result will prompt the display below, with details on the strength of the enrichment, supporting molecular data types, number of supporting studies, and direction of regulation.

  • Workstream J – Correlation profiling

    In vetting molecular results for consideration as biomarker candidates, it may be informative to learn whether similar results have been observed in previous trials or other studies. For example, was a gene or set of genes also upregulated in response to a previously tested candidate compound, in subjects presenting with another indication, or under experimental conditions for which similarity in results may not have been expected?

    The following section outlines how to query an individual result or bioset against data archived within the Correlation Engine. Correlation screening steps are also described in Workstream H, Data Integration, which outlines how this analysis can be performed as part of a meta-analysis.

  • 1. Select a bioset to query. One can submit internal data (e.g. results from a trial) or any archived dataset of interest.

    Curated Studies:

    Query or browse all studies curated by Illumina. You can query by gene, SNP, sequence region, biogroup, bioset, phenotype, compound, tissue, or keyword, or browse using filters and text-based search.

    Body Atlas:

    View the tissues, cell types, cell lines, and stem cells in which a queried gene, bioset, or biogroup is significantly enriched or expressed, or view genes that are enriched or expressed in specific tissues and biosources.

    Disease Atlas:

    Pharmaco Atlas:

    Discover which compounds and treatments affect a queried gene, sequence region, biogroup, or bioset.

    Knockdown Atlas:

    Perform a knockdown, knockout, or overexpression experiment in reverse: see which genetic perturbations affect a queried gene, sequence region, biogroup, or bioset.

    Genetic Markers:

    Locate genes and SNPs that are significantly linked to a queried phenotype or compound.

    Pathway Enrichment:

    Find biogroups for which your queried bioset, phenotype, or compound is highly enriched.

    Identify Bioset correlations


  • 2. Once a bioset is selected, click the category of result for which you wish to identify a correlation. The following steps highlight examples of output from some of these query categories.

  • 2

    Body Atlas query: This query will identify cell / tissue types, including cell lines, for which a correlation was identified with the molecular results of the query bioset. Provided are the score (strength of correlation), supporting data types, and the P-value of the correlation.

    Clicking the blue arrow at the left of each result will prompt the visual correlation summary shown here.

  • Disease Atlas query: Will display a list of studies, by disease category, for which a molecular data result was found to be correlated with the query bioset. The result output includes information similar to that provided for the Body Atlas query.

  • Pharmaco Atlas query: Will display drug study categories for which response results correlated with the query bioset.

    Selecting a result will prompt a display, as shown below, providing details and links to the studies in which correlations were identified.

  • Knockdown Atlas query: Will display gene knockdown experiments for which molecular results were found to be correlated with the query bioset. The result output is a gene list with the corresponding strength of correlation, supporting data types, number of supporting studies, and direction (positive or negative) of the detected correlation.

  • Identify candidate biomarker correlations

    Clicking on an individual study of interest will display a list of the biosets within it in which differential expression across subjects was identified. For each, multiple values are provided:

    - Score and p value of change

    - Result type

    - Result value

    This function allows any initial correlations identified for the candidate to be further vetted and explored, providing contextual data on which to base an evaluation.

  • Workstream K – Sample tracking

    When managing an application workflow within the context of a drug development program, it is essential that the history of each sample (including its source, how DNA / RNA was extracted, and the projects for which it was used) is reliably tracked, and that the data associated with each sample are properly catalogued. For many programs, this need may be best addressed through the use of Laboratory Information Management Systems (LIMS).

    The following workstream illustrates several such capabilities of LIMS.

  • At the beginning of the project, each sample is cataloged in LIMS such that every attribute of the sample itself and its history is recorded.

    A range of attributes is recorded, including details related to the study design as well as information on the equipment and consumable reagents that were used to process the sample.

  • At the beginning of each experiment, the user selects which of a predefined set of workflows will be run for the project and associates that information with the scoped samples.

    Multiple workflows may be stored within a given method category, such as the subtypes of RNA sequencing workflows shown here.

  • For the selected workflow, a quality control (QC) assay may be selected. Here, inputs for multiple available QC assays are displayed, and those for which data are available for a given project are populated.

  • Once the workflow is started, LIMS manages the automated steps for which it was programmed and tracks the sample set through each step of the library preparation process. Shown here is an example of real-time tracking of the library preparation process for an RNA sequencing workflow.

  • 2

    LIMS will track the history of each sample through the library preparation process, including lot information for the reagents used to prepare it, barcodes for the plates in which it was included, and experimental parameters, such as fragmentation time, that were applied to it.

    The progress and history of each sample are continually tracked throughout the process. Prompts to populate such details as the barcodes of the plates in which a sample was processed will be displayed as shown here.
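    The kind of per-sample history described in this workstream can be pictured as an append-only event log attached to each sample record. The sketch below is a generic data-structure illustration, not the schema of any particular LIMS product; every class, field, and identifier in it is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class Event:
    """One recorded step in a sample's history (hypothetical structure)."""
    step: str
    details: Dict[str, str]
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class SampleRecord:
    """A sample and its append-only processing history."""
    sample_id: str
    source: str
    history: List[Event] = field(default_factory=list)

    def log(self, step: str, **details: str) -> None:
        """Append a step with its reagent lots, barcodes, or parameters."""
        self.history.append(Event(step, dict(details)))

    def steps(self) -> List[str]:
        """Names of the recorded steps, in the order they were logged."""
        return [event.step for event in self.history]


sample = SampleRecord(sample_id="S-0001", source="tumor biopsy")
sample.log("rna_extraction", kit_lot="LOT-A1")
sample.log("fragmentation", time_min="8", plate_barcode="PLATE-0042")
sample.log("sequencing", run_id="RUN-7")
print(sample.steps())
```

    Recording each step as a separate timestamped event, rather than overwriting fields on the sample, is what allows the full history to be reviewed later.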


  • After library preparation, LIMS will track each sample through the sequencing process, down to the details of individual steps in the workflow, as shown here. As with the library preparation step, a comprehensive history of each sample in the cohort is captured and recorded for subsequent review.

  • Illumina • 1.800.809.4566 toll-free (US) • +1.858.202.4566 tel • www.illumina.com

    © 2017 Illumina, Inc. All rights reserved.

    References

    1. Zhao S, Fung-Leung W-P, Bittner A, Ngo K, Liu X. Comparison of RNA-seq and microarray in transcriptome profiling of activated T cells. PLoS ONE. 2014;9(1):e78644. doi:10.1371/journal.pone.0078644.

    2. Wang ZL, Zhang CB, Cai JQ, Li QB, Wang Z, Jiang T. Integrated analysis of genome-wide DNA methylation, gene expression and protein expression profiles in molecular subtypes of WHO II-IV gliomas. J Exp Clin Cancer Res. 2015;34:127. doi: 10.1186/s13046-015-0249-z.

    3. Atak ZK, Gianfelici V, Hulselmans G, et al. Comprehensive analysis of transcriptome variation uncovers known and novel driver events in T-cell acute lymphoblastic leukemia. PLoS Genet. 2013;9(12):e1003997.

    4. Kumar-Sinha C, Kalyana-Sundaram S, Chinnaiyan AM. Landscape of gene fusions in epithelial cancers: seq and ye shall find. Genome Med. 2015;7:129.

    5. Ishikawa R, Amano Y, Kawakami M, et al. The chimeric transcript RUNX1–GLRX5: a biomarker for good postoperative prognosis in Stage IA non-small-cell lung cancer. Jpn J Clin Oncol. 2016;46(2):185-189.

    6. Lu L, Zhang H, Pang J, Hou G, Lu M, Gao X. ERG rearrangement as a novel marker for predicting the extra-prostatic extension of clinically localised prostate cancer. Oncol Lett. 2016;11(4):2532-2538.

    7. Perez-Gracia JL, Sanmamed MF, Bosch A, et al. Strategies to design clinical studies to identify predictive biomarkers in cancer research. Cancer Treat Rev. 2017;53:79-97.

    8. Kantae V, Krekels EHJ, Esdonk MJV, et al. Integration of pharmacometabolomics with pharmacokinetics and pharmacodynamics: towards personalized drug therapy. Metabolomics. 2017;13(1):9.

    9. Fang B, Mehran RJ, Heymach JV, Swisher SG. Predictive biomarkers in precision medicine and drug development against lung cancer. Chin J Cancer. 2015;34(7):295-309.

    10. Zhao X, Modur V, Carayannopoulos LN, Laterza OF. Biomarkers in Pharmaceutical Research. Clin Chem. 2015;61(11):1343-1353.

    11. Mishra PJ. Non-coding RNAs as clinical biomarkers for cancer diagnosis and prognosis. Expert Rev Mol Diagn. 2014;14(8):917-919.

    12. Costa C, Giménez-Capitán A, Karachaliou N, Rosell R. Comprehensive molecular screening: from the RT-PCR to the RNA-seq. Transl Lung Cancer Res. 2013;2(2):87-91.

    13. Perkins JR, Antunes-Martins A, Calvo M, et al. A comparison of RNA-seq and exon arrays for whole genome transcription profiling of the L5 spinal nerve transection model of neuropathic pain in the rat. Molecular Pain. 2014;10:7.

    14. Brewer CT, Chen T. PXR variants: the impact on drug metabolism and therapeutic responses. Acta Pharmaceutica Sinica B. 2016.

    15. Bracco L, Kearsey J. The relevance of alternative RNA splicing to pharmacogenomics. Trends Biotechnol. 2003;21(8):346-353.

    16. Barrie ES, Smith RM, Sanford JC, Sadee W. mRNA Transcript diversity creates new opportunities for pharmacological intervention. Mol Pharmacol. 2012;81(5):620-630.

    17. Ling H, Fabbri M, Calin GA. MicroRNAs and other non-coding RNAs as targets for anticancer drug development. Nat Rev Drug Discov. 2013;12(11):847-865.

    18. Rönnau CG, Verhaegh GW, Luna-Velez MV, Schalken JA. Noncoding RNAs as novel biomarkers in prostate cancer. Biomed Res Int. 2014;2014:591703.

    19. Moorman AV. New and emerging prognostic and predictive genetic biomarkers in B-cell precursor acute lymphoblastic leukemia. Haematologica. 2016;101:407-416.

    20. Nalejska E, Mączyńska E, Lewandowska MA. Prognostic and predictive biomarkers: tools in personalized oncology. Mol Diagn Ther. 2014;18(3):273-284.

    21. Shang C, Guo Y, Zhang H, Xue YX. Long noncoding RNA HOTAIR is a prognostic biomarker and inhibits chemosensitivity to doxorubicin in bladder transitional cell carcinoma. Cancer Chemother Pharmacol. 2016;77(3):507-513.

    22. McCleland ML, Mesh K, Lorenzana E, et al. CCAT1 is an enhancer-templated RNA that predicts BET sensitivity in colorectal cancer. J Clin Invest. 2016;126(2):639-652. doi:10.1172/JCI83265.

    23. Zhou M, Ye Z, Gu Y, Tian B, Wu B, Li J. Genomic analysis of drug resistant pancreatic cancer cell line by combining long non-coding RNA and mRNA expression profiling. Int J Clin Exp Pathol. 2015;8(1):38-52.

    24. Zhao W, He X, Hoadley KA, Parker JS, Hayes DN, Perou CM. Comparison of RNA-seq by poly (A) capture, ribosomal RNA depletion, and DNA microarray for expression profiling. BMC Genomics. 2014;15:419.

    25. SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol. 2014;32(9):903-914.