Bacterial Gene Neighborhood Investigation Environment: A Scalable Genome Visualization for Big Displays
Jillian Aurisano Master of Science Defense
April 16, 2014
Science has historically looked like this:
Up until very recently
“Observations!”
Expertise
explore, make observations
Collect samples
“No one looks under a microscope anymore. It's all DNA.”
How do scientists make discoveries?
How do we bring experts into the loop?
• From direct collection of data, direct observation of results, direct interpretation and analysis
• To automated data collection, automated filtering and automated analysis
• Need visualization to bring experts into the loop
• But how do we handle big data?
• What’s our Big Data microscope?
“Picard: Computer; scan everything, run diagnostics, and tell us the answer.”
“Computer: Results are inconclusive”
Can Big Displays help?
• Evidence suggests that these environments can have a positive impact on perception and cognition
• But how do we use them to effectively address big data problems?
• Can existing visualizations simply be ‘scaled-up’ to fit, or are new approaches needed?
In this thesis I will…
• Examine a specific big data visualization problem: comparative gene neighborhood analysis in bacterial genomics
• I worked closely over several years with a team of computational biologists
• This work has led to the design and implementation of a new visualization approach designed to scale to big data and big displays
BactoGeNIE (‘Bact(o)erial Gene Neighborhood Investigation Environment’)
Outline
1) Describe comparative bacterial gene neighborhood analysis to understand how to bring experts into the loop
2) Examine potential impact of Big Displays on Big Data visualization
3) Evaluate scalability in existing comparative genomics visualizations
My work: BactoGeNIE
4/5/6) Describe my design, implementation, results
7) Think about the future
In the process, learn something about scaling up visual approaches to big data and big displays
Warning: Biology is used in this thesis!
Genome sequencing boom
• Sequencing costs decreasing faster than Moore’s Law
• So, we are able to produce massive volumes of sequence data
• Bacterial genomes are small, so we are generating thousands of complete bacterial genome sequences
Wetterstrand K.A., DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program, 2012 <www.genome.gov/sequencingcosts>
What is a genome? What is a gene?
• Genomes consist of one or more long molecules of ‘DNA’
• DNA consists of chained nucleotide molecules (A, C, T, G), also called ‘bases’; genome length is counted in ‘base pairs’
• All the genes in an organism are in its ‘genome’
• Genes determine traits in an organism
• Genes ‘code’ for proteins, and proteins do the work to make traits happen
How are genomes sequenced?
• Sequencing
• Assembly
• Annotation
• Output:
– Genome feature files
– Raw sequence files
Michael Schatz Cold Spring Harbor
Lots of genome sequences -> opportunity
Big challenge: Hard to figure out what a novel gene does
• Traditionally: do wet-lab research to figure it out
– but expensive, time-consuming
• Sequence the gene, and use computational methods to predict the function of the protein
– If the gene is novel, this may not provide an answer
• Can complete genome sequences help?
• Comparative gene neighborhood analysis
From genome structure to gene-product function
• In bacteria, genes whose products are involved in similar functions are often placed close to each other in the genome.
• Research suggests that it is possible to predict gene-product function in bacteria based on commonly recurring gene neighbors
• But we need to examine lots of genomes for statistical significance
[Figure: neighboring genes gene1, gene2, gene3 and gene4 linked to a biological process, marked “?”]
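The idea of recurring gene neighbors can be made concrete with a small counting sketch. This is only an illustration under stated assumptions, not the method used by any tool in this talk: each genome is reduced to an ordered list of ortholog family labels, and the names `neighbor_counts`, `window` and the toy strains are hypothetical.

```python
from collections import Counter


def neighbor_counts(genomes, target, window=2):
    """Count, per ortholog family, in how many genomes it occurs within
    `window` positions of `target`. Each genome counts at most once."""
    counts = Counter()
    genomes_with_target = 0
    for order in genomes.values():
        if target not in order:
            continue
        genomes_with_target += 1
        seen = set()
        for i, family in enumerate(order):
            if family == target:
                # Slice clamps at the contig start; the end is clamped by Python.
                seen.update(order[max(0, i - window): i + window + 1])
        seen.discard(target)
        counts.update(seen)
    return counts, genomes_with_target


# Toy data (hypothetical): gene order per strain, as family labels.
genomes = {
    "strain1": ["A", "B", "C", "D", "E"],
    "strain2": ["X", "A", "B", "C", "Y"],
    "strain3": ["B", "C", "D", "Z"],
    "strain4": ["Q", "R", "S"],
}
counts, n = neighbor_counts(genomes, "B", window=1)
for family, c in counts.most_common():
    print(f"{family}: next to B in {c} of {n} genomes")
```

With only a handful of genomes such counts mean little, which is why the slide asks for many genomes before drawing conclusions.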
Comparing gene neighborhoods across different genomes
• Genes with similar sequences likely produce proteins with similar functions
• Orthologs: similar genes from different genomes
• Algorithms to compare genes between different genomes
DeMeo et al. BMC Molecular Biology 2008 9:2 doi:10.1186/1471-2199-9-2
Role for visualization in this problem
• Why not use automated methods to find common sets of genes around gene targets?
• Why visualization?
• 3 E’s: Exploration, Expertise, Errors
Exploration
• Patterns and anomalies without knowing in advance what you are looking for
Automated methods: Target: gene B
Common subsequences: Strains 1, 2, 3: {A, B, C, D}
[Figure: four panels (Duplication, Truncation, Deletion, Inversion), each showing genes A, B, C and D in Strains 1, 2 and 3]
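To make the anomaly types named above concrete, here is a minimal sketch of how three of them look as data. It assumes each neighborhood is a list of gene names, with a leading "-" marking a gene on the opposite strand; the function name `flag_differences` and this encoding are hypothetical. Truncation is not covered, because it needs gene lengths that this toy encoding does not carry. Note that the check only finds what it was told to look for, which is the limitation the Exploration slide points at.

```python
from collections import Counter


def flag_differences(reference, observed):
    """Label how `observed` differs from `reference` (lists of gene names,
    '-' prefix = opposite strand). Returns a sorted list of labels."""
    ref_names = [g.lstrip("-") for g in reference]
    copies = Counter(g.lstrip("-") for g in observed)
    labels = set()
    if any(copies[name] > 1 for name in ref_names):
        labels.add("duplication")   # a reference gene appears more than once
    if any(copies[name] == 0 for name in ref_names):
        labels.add("deletion")      # a reference gene is missing
    ref_strand = {g.lstrip("-"): g.startswith("-") for g in reference}
    for g in observed:
        expected = ref_strand.get(g.lstrip("-"))
        if expected is not None and expected != g.startswith("-"):
            labels.add("inversion")  # same gene, flipped strand
    return sorted(labels)


reference = ["A", "B", "C", "D"]
print(flag_differences(reference, ["A", "B", "B", "C", "D"]))
print(flag_differences(reference, ["A", "C", "D"]))
print(flag_differences(reference, ["A", "-C", "-B", "D"]))
```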
Expertise
• Experts make connections that will be missed by automated methods
– Not just the anomaly, but the significance of the anomaly
– Knowledge about strains and protein families is involved in finding significant anomalies
[Figure: StrainA, StrainB and StrainC, with a “!” marker]
Errors
• Verify automated methods
• Uncertainty and errors in data generation
[Figure: two examples comparing the data with the ground truth for Strains 1, 2 and 3 (genes A, B, C, D). Automated methods report “Common subsequences: Strains 1 and 3: {A, B, C, D}; Strain 2: {A, D}” in one example and “Common subsequences: Strains 1 and 3: {A, B, C, D}; Strain 2: {A, B}” in the other. Panel labels: Breaks in assembly; Missed gene boundaries]
To address this problem:
• Visualization must help bring experts into the data mining loop
1) Help experts identify sources of error
2) Allow experts to explore the data
3) Enable researchers to integrate expertise in data analysis
So: overview visualization is not enough. Need gene-neighborhood details
• Visualization must scale to enable comparisons between hundreds to thousands of genomes
Big displays: Opportunity for big data?
• The question is: can these environments be used to visualize big data sets better?
• Evidence suggests yes:
– Physical navigation over virtual navigation
  • Reduced need to pan and zoom
  • Reduced need for context switching
  • Utilize embodied cognition
  • Multiple levels of detail accessible through physical movement
– Externalize more information that can be accessed simultaneously
Lance Long
Porting from small to big displays
• Maybe porting genome visualizations to these environments is sufficient?
• Ruddle2013:
– Export high-resolution graphical output from existing genomics visualizations
– Display these large images on big display
– Evidence that this had a positive impact on researcher reasoning
• However, effective visualization on big displays involves more than simply scaling up the representation
Pixel-Density Scalability
• As pixel density increases, does a visual approach take advantage of increased pixels per inch to show more entities and relationships, or to show data at higher detail?
Evaluation:
• High-Density Representation?
  • Use increased pixels per inch to show more entities and relationships at higher detail?
• Simultaneous detail and overview?
  • With increased pixel density, representation shows details and overviews at the same time, without relying on Focus+Context
Display-Size Scalability
• As display size increases, does a visual approach take advantage of the increased space to depict more entities or relationships?
Evaluation
• Encode big data spatially
• Cluster related elements:
  • spatial memory
  • direct, visual comparisons
• Physical navigation over virtual navigation:
  • Overviews at a distance, details up-close
Perceptual and Analytic Task Scalability
• Does a visual approach scale up to enable the performance of an analytic task across more data, more space, more pixels?
• Does perception suffer if you scale the approach up?
• Analytic tasks performed pre-attentively
• Analytic tasks aided by visual queries
• Aids to visual search for performing analytic tasks
Examining current genomic data visualizations
• Does it address this problem?
  • Show gene neighborhoods
  • Comparative
• Does this visualization allow comparison between more than a few gene neighborhoods?
• If you scale the visual approach up, does it:
  • Allow more comparisons of gene neighborhoods (Analytic Task Scalability)
  • Take advantage of big displays in size and pixel-density (Display Resolution Scalability and Display Size Scalability)
  • In the process, remain sensible to a human viewer (Perceptual scalability)
Line-based comparative approaches
• On load, align 1-2 genes to a chosen gene in a reference genome
• Draw a line or a band to connect orthologs
• In many cases, repurpose genome browsers to be comparative by adding a comparative track
• Tools: PSAT, GBrowse_syn, SynView, ACT, CGAT, Combo, MizBee, Mauve
Pan, X. et al. (2005). SynBrowse: a synteny browser for comparative sequence analysis. Bioinformatics (Oxford, England).
McKay et al. Using the Generic Synteny Browser (GBrowse_syn). Current Protocols in Bioinformatics. Hoboken, NJ, USA: John Wiley & Sons
Line-based approaches expanded: Mauve
• Like parallel coordinates
• Draw lines between orthologs
• Color genes by their block within that genome (not colored by orthology)
• Example shows 9 genomes
Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome Research 14.7 (2004): 1394-140
Line-based approaches: Critique
• Pixel-density scalable?
– Not a high-density representation
– Need space for the ‘comparative track’
• Display size scalable?
– Hard to follow lines across a display
– Hard to compare similar neighborhoods across the display
– No overview from a distance, details up close
• Perceptual scalability for comparing gene neighborhoods?
– Lots of visual clutter
– Comparisons not pre-attentive
– No aid to visual search
• Number of genomes
– Published up to 9
– Private groups have adapted frameworks for 10-50 genomes on big display
PSAT: Color and alignment
• PSAT
– Orthologs encoded using color
– Strand on which gene is positioned is encoded by orientation to the center line
– Text is given by default
Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC Bioinformatics 9.1 (2008): 170.
PSAT: Critique
• Pixel-Density Scalability
– Not a high-density representation because of text labels
• Perceptual scalability for comparing gene neighborhoods?
– Can’t scale to a large number of genes: not enough colors
GeneRiViT: Alignment and color
• GeneRiViT
– Align against arbitrary gene
– Color by presence/absence
– Examples show 4 genomes
– Critique:
  • No discussion of scalability
  • Overview visualization
  • Doesn’t address our problem
Price, A. et al. "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.
Dot plots
• Coordinates of genes in two genomes are used as x and y axes
• Orthologous genes in other genomes are plotted
• Each genome given a unique color
• Critique:
– Doesn’t provide ‘gene-neighborhood’ view
– Overview tool
– Hard to follow beyond a few genomes
Price, A. et al. "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.
Overview Visualization: Sequence Surveyor
• Not this domain problem, but interesting approach
• Each gene is drawn as a rectangle
• Several possible variables for position: Ordinal position
• Several possible variables for color:
– Position in one reference genome
– Use a color ramp, for wide range of colors
Albers, D. et al. "Sequence Surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.
Overview Visualization: Sequence Surveyor
• Pixel-density scalable
– High-density representation
– High-detail representation
• Display size scalability
– May be difficult to compare patterns from one side of display to another
• Perceptual Scalability
– Colors allow for pre-attentive identification of patterns
– Avoids visual clutter
Copy number variations on big displays
• Orchestral:
– Visualization of a different data type
– Effective use of color to enable pre-attentive identification of similarities across genomes
– High-density representation
– Details-up-close, overview from a distance
Ruddle, Roy A., et al. "Leveraging wall-sized high-resolution displays for comparative genomics analyses of copy number variation." Biological Data Visualization (BioVis), 2013 IEEE Symposium on. IEEE, 2013.
BactoGeNIE Demo
Program details
• Implemented in C++ using Qt and the QGraphicsView framework
• Upload:
– genome feature files
– FASTA files (raw gene sequences)
• CD-HIT algorithm processes sequence files to compute ortholog ‘clusters’
• MySQL database to store big datasets
– Loads 1000 contigs into memory, rest stored in database
• Optimized for PubMed datasets
• Prototyped on E. coli draft genomes
– Capable of displaying any contigs from thousands of E. coli draft genomes
• On EVL Cyber-commons wall, around 400 contigs in view
BactoGeNIE: High density representation
• Compressed genome encoding
• No text labels, instead ‘on-demand’
• No ‘comparative track’
• Encode orthology using
– User applied color: pre-attentive orthology identification
– Coordinated highlighting: scalable visual query
– Alignment: use space to encode similarity
Use space to encode similarity
• Goals:
– Make it easier to perform comparisons across many genomes (Analytic task scalability)
– Accommodate increased display size (Display Size Scalability)
– Make similarities and differences easy to see (Perceptual Scalability)
• Sorting and Alignment
– Sort by contig length
– Sort by gene content
– Dynamically align against any gene
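Sorting by contig length and aligning against a gene can be sketched in a few lines. This is a minimal illustration, not the BactoGeNIE implementation (which is C++/Qt): the `Gene` record, its field names, and the parameters `nt_per_px` and `center_px` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    family: str  # ortholog cluster label (hypothetical field)
    start: int   # nucleotide coordinate on the contig
    end: int


def align_to_family(contigs, family, nt_per_px=100.0, center_px=500.0):
    """Return {contig name: x offset in pixels} such that the first gene of
    `family` on each contig starts at `center_px`. Contigs without that
    family get no offset."""
    offsets = {}
    for name, genes in contigs.items():
        hit = next((g for g in genes if g.family == family), None)
        if hit is not None:
            offsets[name] = center_px - hit.start / nt_per_px
    return offsets


def sort_by_length(contigs):
    """Contig names, longest first, using the end of the last gene as length."""
    return sorted(
        contigs,
        key=lambda name: max((g.end for g in contigs[name]), default=0),
        reverse=True,
    )


contigs = {
    "c1": [Gene("A", 0, 900), Gene("B", 1000, 2000)],
    "c2": [Gene("B", 5000, 6000), Gene("C", 6100, 7000)],
    "c3": [Gene("D", 0, 300)],
}
offsets = align_to_family(contigs, "B")
order = sort_by_length(contigs)
print(offsets, order)
```

After applying the offsets, the aligned gene sits at the same x position in every row, so differences in the surrounding neighborhood show up as differences in position.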
Interactivity
• On hovering, contig expands in height, so it is easier to select genes of interest in high-density view
• User can modify the contig density, or the gene density (nucleotides per pixel)
• ‘Pop-up’ menu for each gene that gives info and allows for:
– application of color:
  • ‘tagging’ operation
  • Scalable query
– “targeting” operation (described next)
• User can sort genomes by:
– Gene target
– Contig length: to show common assembly break-points in related contigs
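The two density controls reduce to simple arithmetic, sketched below. The function names, the one-pixel gap, and the example numbers are hypothetical and are not the settings used in BactoGeNIE or on any particular display.

```python
def gene_width_px(start, end, nt_per_px):
    """Width of a gene on screen for a given gene density
    (nucleotides per pixel); never narrower than one pixel."""
    return max(1, round((end - start) / nt_per_px))


def rows_in_view(display_height_px, row_height_px, gap_px=1):
    """How many contig rows fit vertically for a given contig density."""
    return display_height_px // (row_height_px + gap_px)


# A 1500-nucleotide gene at 100 nucleotides per pixel:
print(gene_width_px(1000, 2500, 100))
# Rows of 4 px plus a 1 px gap on displays of different heights:
print(rows_in_view(1080, 4), rows_in_view(2160, 4))
```

Raising nucleotides per pixel shrinks every gene horizontally, while shrinking the row height fits more contigs on screen; more vertical pixels raise the second number without changing the encoding.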
‘Gene Targeting’ Function to create high resolution, comparative ‘maps’
• User selects a gene of interest
• This gene is given a base color
• Two color ramps are applied to adjacent genes, one ‘upstream’ and one ‘downstream’
• Orthologous genes in related genomes are given the same colors
• Contigs containing this gene are brought to the top
• The target gene is centered
• Orthologs are aligned to the target
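The coloring step described above can be sketched as follows. This is an illustration only: the RGB values, the `span` parameter and the function names are assumptions, and ‘upstream’ and ‘downstream’ are taken simply as list positions before and after the target, ignoring strand.

```python
def lerp_color(c0, c1, t):
    """Linear interpolation between two RGB tuples, t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))


def target_colors(families, target, span=3,
                  base=(255, 255, 0),
                  upstream=((0, 0, 255), (170, 200, 255)),
                  downstream=((255, 0, 0), (255, 200, 170))):
    """Map ortholog family -> color for the target and up to `span`
    neighbors on each side, using one ramp per side."""
    i = families.index(target)
    colors = {target: base}
    for d in range(1, span + 1):
        t = (d - 1) / max(span - 1, 1)  # 0 next to the target, 1 at the far end
        if i - d >= 0:
            colors.setdefault(families[i - d], lerp_color(*upstream, t))
        if i + d < len(families):
            colors.setdefault(families[i + d], lerp_color(*downstream, t))
    return colors


# Colors are assigned on the contig where the user picked the target...
colors = target_colors(["A", "B", "C", "D", "E"], "C", span=2)
# ...and reused for orthologs on another contig (None = left uncolored).
other = [colors.get(f) for f in ["X", "D", "C", "B"]]
print(colors)
print(other)
```

Because colors follow the ortholog family rather than the position, a rearranged neighborhood in another contig shows up as ramp colors on the unexpected side of the target.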
Gene targeting function
• Clustering to promote direct comparisons
• Overviews at a distance
• Details up close
• Pre-attentive identification of similarities and differences between gene neighborhoods
Lance Long
Examples
Pixel-density Scalability
BactoGeNIE fits the pixel-density scalability criteria: High-density data display, identifier display and orthology encoding
Display Size Scalability
• BactoGeNIE is the only approach to use clustering and show multiple levels of detail
Perceptual Scalability and Analytic Tasks
BactoGeNIE:
• Similarity is pre-attentively accessible
• Avoids visual clutter
• Visual query for orthologs
[Figure: Graphical Scalability: Display Resolution vs Number of Genomes. x-axis: Pixels (480, 720, 1080, 1440, 2160, 2880, 3240, 4320); y-axis: Genomes (0 to 1000); series: BactoGeNIE, GeneRiViT, SynBrowse, SynView, PSAT, Geco, Mauve]
Preliminary User Feedback
• A version of BactoGeNIE used by computational biology team on 2 monitor x 2 monitor tiled display wall
• “This tool has been widely used by members of the team to show the comparative analyses of genomic context for several bacterial genomes”
• “Genome browsers such as JBrowse enable researchers to do comparative genome analyses for nearly 10-50 genomes. But fail to work when we are studying several hundreds of genomes of interest.
• This tool is really unique and it’s the only tool that I am aware of that can scale up to any number of genome comparisons.
• The ability to load multiple tracks of genomes, and the zoom in and out options with color coding, annotation tracks makes it very convenient for scientists to quickly look at patterns.
• This tool has a potential to serve both for visualization as well as data mining needs.”
This feedback is based on a version without the gene targeting approach. Future study will concentrate on this feature with a wider community of users
Summary of contributions
• A novel design that is the first to enable direct comparisons between hundreds of gene neighborhoods in one view
• First interactive, large-scale comparative gene neighborhood approach, with on-the-fly sorting, dynamic alignment, user-selected color and color ramps, as well as upload of custom data
• First to show overviews with gene-neighborhood details that can be accessed through physical movement
• Introduces a novel visualization approach, ‘gene targeting’, that translates genomic data into high-resolution genomic maps
What’s next?
Design
• Multiple color ramps
• Advanced ordering in y, based on similarity to target or strain phylogeny
• Show additional properties, such as pathway membership
Implementation
• Scalability in rendering using parallelization on the GPU
• Port to SAGE
Evaluation
• More feedback, case studies and evaluations of scalability vs other approaches
Scalable Design, Big Data, Big Displays
• Need visualization to provide an interface between automated methods and the expert
• For big data problems, challenge is to represent data effectively, avoiding information overload
• Porting existing visual approaches to big data and big displays will not always work
• Need to design for increased data volumes and
– pixel-density
– display size
– volume of analytical tasks
Thanks!
• Acknowledgements:
– Jason Leigh, Andy Johnson, Khairi Reda, Lance Long, Uthman Shabazz, and everyone in the Electronic Visualization Laboratory
– Barry Goldman, David Bush, Niran Iyer, Shawn Stricklin and the rest of the computational biology team at Monsanto
• Questions?