Thesis proposal

BrainSpan Visualization
Visualizing genome co-expression data from the BrainSpan Atlas

Lennaert van den Brink
Delft University of Technology

March 3, 2014

1 Synopsis

This thesis project will be performed as part of the master's program in Computer Science, track: Media and Knowledge Engineering, specialization: Computer Graphics and Animation - Medical Visualization. In this thesis project, a tool or set of visualizations will be developed to visualize genome data from the BrainSpan Atlas¹. Specifically, the tool will visualize co-expression networks of the human genome and assist in exploring relationships between specific genes. This will allow researchers to investigate specific genes related to diseases such as Alzheimer’s or Autism Spectrum Disorders.

2 Introduction

The Allen Mouse and Human Brain Atlases are projects within the Allen Institute for Brain Science which seek to combine genomics with neuroanatomy by creating gene expression maps for the mouse and human brain. They were initiated in September 2003 with a $100 million donation from Paul G. Allen, and the first atlas went public in September 2006. Data is gathered using various techniques, such as In Situ Hybridization (ISH) and the use of postmortem brain samples and brain scanning technology, to discover where in the brain genes are turned on and off. The resulting atlases allow global genome-scale structural analysis and cross-correlation.

One of the biggest challenges for any researcher working with these atlases is the vast amount of data involved. For example, the BrainSpan atlas, which tracks gene expression throughout development from post-conception to adulthood, is a data set of about 50,000 genes at 31 different ages sampled in 26 different brain structures. Simply visualizing a co-expression network with this amount of data results in incomprehensible “hairballs” (see figure 1). To analyze these types of networks more easily, sub-networks, or modules, are often defined based on correlation [1]. These modules are relevant because they are expected to group together genes that are responsible for individual processes. The modules can then help with several biologically interesting research questions. For example, one could look at modules in which a suspected disease gene is present to gain more insight into the effect of such a gene on the entire system. A certain condition might be caused by a change in co-regulation, but this will not be noticeable if a single suspected gene is analyzed in isolation. By looking at modules where suspected genes are co-expressed, such relationships can be discovered more easily. While these modules help break up the large data set into smaller sub-networks, each sub-network still consists of hundreds of genes and even more connections. New visualization techniques can be used to reduce visual clutter and help identify important relationships more easily.

¹ http://www.brain-map.org/

Figure 1: A typical “hairball” resulting from visualizing a large co-expression network

3 Motivation

The Allen Brain Atlas is a prime example of the large “Big Data” databases that are quickly rising in popularity. Technical advances have made it possible to generate and store massive amounts of data. For the field of genomics, databases such as the Allen Brain Atlas offer researchers a new angle of approach. High-throughput techniques have allowed an expansion from single-gene to genome-wide transcriptional analyses. The medical sector is not the only one to benefit from these techniques: large companies use data gathered on customers to improve their marketing, governments try to identify potential terrorist threats using social networks, and so on. However, one of the biggest challenges with these databases is actually extracting relevant information from them. The potential is huge, but the sheer amount of data makes it hard to identify structural patterns or significant relationships.

As mentioned earlier, high-throughput data collection allows for new paths of research in the field of genomics. In this thesis, we will focus on the use of genomics in neuroscience. In this field, researchers aim to gain better insight into the workings of the human brain. While we know a lot about human anatomy and the inner workings of our bodies, knowledge of the brain is much shallower. Many diseases situated in the brain, such as Alzheimer’s and autism, are shrouded in mystery in both cause and effect. Using the large amounts of data made available by the new techniques, researchers try to understand more about these diseases in the hope of eventually being able to develop treatments for them.

While the techniques in this thesis are primarily aimed at improving data exploration and analysis in the field of brain genomics, they could also be applied to “Big Data” sets in other fields. As collecting data becomes easier and cheaper, the need for visualization techniques targeted at these data sets increases.

4 Description of input data

Figure 2: Expression levels are sampled for each gene in different structures and across multiple timepoints

One of the key concepts in bioinformatics is gene expression. A gene is a stretch of DNA that encodes information. To make use of this information, a gene is transcribed by RNA polymerase (RNAP), producing messenger RNA (mRNA). By measuring how much of this mRNA is present in a cell, we obtain the so-called expression value. Different cells can have different expression patterns, so these measurements are taken across multiple brain structures. In addition, gene expression levels can vary over time. For example, some genes might be highly expressed during the early developmental stages, while other genes are more active during adulthood.

Therefore, the resulting data structure is four-dimensional (see figure 2). By correlating the expression values over time, over structure, or over both, the dimensionality can be reduced to a comprehensible level. In this thesis we will work with these gene correlation networks.
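As an illustration of how correlating expression values yields a network, the minimal Python sketch below computes the Pearson correlation between the expression profiles of gene pairs and keeps the pairs above a threshold as edges. The data layout (one time series per structure for each gene), the gene names, the function names and the threshold are all assumptions made for this example; they do not describe the BrainSpan file format or the method that will eventually be used in the thesis.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return cov / norm if norm else 0.0


def flatten(profile):
    """Concatenate a gene's per-structure time series into one vector."""
    return [value for series in profile for value in series]


def coexpression_edges(expression, threshold):
    """Return (gene_a, gene_b, r) for all pairs with |r| >= threshold.

    `expression` maps a gene name to a list of per-structure time series,
    so correlating the flattened vectors correlates over time and structure.
    """
    edges = []
    for a, b in combinations(sorted(expression), 2):
        r = pearson(flatten(expression[a]), flatten(expression[b]))
        if abs(r) >= threshold:
            edges.append((a, b, r))
    return edges


# Toy data (invented): 3 genes, 2 structures, 3 time points each.
toy_expression = {
    "GENE_A": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
    "GENE_B": [[2.0, 4.0, 6.0], [4.0, 6.0, 8.0]],
    "GENE_C": [[3.0, 2.0, 1.0], [4.0, 3.0, 2.0]],
}

toy_edges = coexpression_edges(toy_expression, threshold=0.9)
```

In this toy example only the pair GENE_A and GENE_B passes the threshold. Note that the number of gene pairs grows quadratically: for roughly 50,000 genes there are about 1.25 billion possible pairs, which is why an unfiltered network is unreadable.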

5 Related work

Several tools have been developed in the past to help visualize genome networks. One of the best-known examples is Cytoscape [2] (figure 3). Cytoscape is an open source visualization tool that allows the user to visualize networks using several different layouts, such as a circular, grid or force-directed layout. In addition, Cytoscape allows for the creation of sub-networks and is capable of computing some advanced network statistics for analysis, such as clustering coefficients and centrality measures. Another strong advantage of Cytoscape is the possibility to create plug-ins for the tool, allowing users with programming knowledge to implement their own layouts and analysis algorithms. The 3.1 release of Cytoscape adds functionality to export to JSON objects, allowing visualizations to be exported and displayed in web browsers using cytoscape.js. The tool was developed with a biomedical purpose in mind and is therefore widely supported by the biomedical community. While Cytoscape allows for some clutter reduction in the form of a basic edge-bundling implementation, this requires many parameters to be set and is therefore only effective if the user has expert knowledge of the visualization algorithm. For large networks, the edge bundling algorithms (as well as the more complex layout algorithms) tend to produce out-of-memory errors and are generally very slow. This is because Cytoscape natively does not utilize the GPU and performs all algorithms on the CPU. GPU acceleration is present in several apps, but is not included in the core Cytoscape distribution.
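To make the network statistics mentioned above more concrete, the following minimal Python sketch computes degree centrality and the local clustering coefficient on a small undirected network. It is not Cytoscape's implementation; the function names and the toy node names are assumptions introduced purely for illustration.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def degree_centrality(adjacency):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


def local_clustering(adjacency):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    result = {}
    for node, nbrs in adjacency.items():
        k = len(nbrs)
        if k < 2:
            # Fewer than two neighbours: no pairs to check, defined as 0 here.
            result[node] = 0.0
            continue
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adjacency[u])
        result[node] = links / (k * (k - 1) / 2)
    return result


# Toy network (invented): a triangle g1-g2-g3 with a pendant node g4 on g3.
toy_graph = build_adjacency(
    [("g1", "g2"), ("g2", "g3"), ("g1", "g3"), ("g3", "g4")]
)
toy_degree = degree_centrality(toy_graph)
toy_clustering = local_clustering(toy_graph)
```

In the toy network, g3 is connected to every other node and so has the highest degree centrality, but only one of its three neighbour pairs is connected, so its clustering coefficient is lower than that of g1 and g2.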

Figure 3: Screenshot of the Cytoscape tool. Cytoscape allows the user to apply several layouts to the network as well as some advanced analytics.

Gephi [3] (figure 4) is similar in functionality to Cytoscape. As in Cytoscape, users of Gephi can apply several different layout algorithms to a graph using a graphical interface. Gephi also provides several advanced graph metrics and analysis tools. The advantage Gephi has over Cytoscape is that it uses a graphics engine that is better equipped to handle very large networks. Where Cytoscape displays a loading bar when applying the heavier layouts, such as a force-directed algorithm, Gephi shows the intermediate steps and allows real-time interaction during the layout process. Gephi is targeted at a broad range of network types and is not as well known or supported in the biomedical community as Cytoscape is. Therefore, field-specific analysis and visualization tools are less developed in Gephi. Also, Gephi requires a dedicated, OpenGL 3D compatible graphics card to be present, whereas Cytoscape can run without one.

Figure 4: Screenshot of the Gephi tool. Gephi possesses similar functionality to Cytoscape, but employs a graph visualization engine that is better equipped to handle very large networks.

An alternative to Cytoscape and Gephi is Circos [4] (figure 5). Circos is a visualization tool that primarily produces network plots in a circular layout. The static images generated by Circos are generally visually pleasing, but lack any form of interaction. Another disadvantage of Circos is that it generates its images based on configuration files that users have to write themselves. There is no GUI available for Circos, which makes the learning curve for the tool rather steep. Circos does allow for different forms of visual representation in addition to the circular layout, such as scatter plots and heat maps, but these are always displayed within the circular layout.

Figure 5: Image generated using Circos. Circos generates network plots with a circular layout, but allows other types of visualizations to be placed along the edges.

Parikshak et al. [5] provide an online visualization² of their research in the form of two network plots. The first network plot shows a set of modules containing suspected Autism Spectrum Disorder (ASD) related genes. The edges between the modules denote the correlation of the modules based on a so-called eigengene, which is essentially an average expression profile for that module. Selecting a module takes the user to the second network plot, which shows the top x connections, ranked by correlation, and the associated nodes (figure 6). The nodes in both plots are placed according to a force-directed algorithm. One big disadvantage of this visualization is the hard threshold on the number of edges and nodes that are shown. By default the top 100 edges, based on correlation strength, and their associated nodes are shown. However, the user has no indication of the actual weights of these edges. The 101st edge might be almost as highly correlated as the 100th edge, but will not be shown. Also, because only the top x edges are shown, not all nodes present in the module are shown, causing loss of structural context. A node might appear to have only a single connection while it actually has a cluster of slightly less correlated neighboring nodes. In the visualization, nodes associated with ASD-related genes are colored red, allowing the user to easily identify them within the cluster. Clicking on any node gives additional information as text.
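The effect of such a hard cutoff can be illustrated with a short Python sketch. It selects the x strongest edges and reports what is hidden as a result: the remaining edges, the nodes that disappear, and how small the weight difference at the cutoff is. This is not the code behind the visualization by Parikshak et al.; the function names, the report fields and the toy weights are assumptions made for this example.

```python
def top_edges(edges, x):
    """Split weighted edges into the x strongest and the rest.

    `edges` is a list of (node_a, node_b, weight); strength is |weight|.
    """
    ranked = sorted(edges, key=lambda e: abs(e[2]), reverse=True)
    return ranked[:x], ranked[x:]


def cutoff_report(edges, x):
    """Summarize what a hard top-x cutoff hides from the viewer."""
    shown, hidden = top_edges(edges, x)
    shown_nodes = {n for a, b, _ in shown for n in (a, b)}
    all_nodes = {n for a, b, _ in edges for n in (a, b)}
    # Difference between the weakest shown and the strongest hidden edge.
    gap = abs(shown[-1][2]) - abs(hidden[0][2]) if shown and hidden else None
    return {
        "shown_edges": len(shown),
        "hidden_edges": len(hidden),
        "hidden_nodes": sorted(all_nodes - shown_nodes),
        "weight_gap_at_cutoff": gap,
    }


# Toy module (invented weights): the third edge barely misses a top-2 cutoff.
toy_module = [
    ("a", "b", 0.95),
    ("a", "c", 0.90),
    ("c", "d", 0.89),
    ("b", "e", 0.50),
]
toy_report = cutoff_report(toy_module, x=2)
```

With a cutoff of two edges, the edge between c and d is dropped although its weight differs from the last shown edge by only about 0.01, and nodes d and e vanish from the plot entirely. Displaying edge weights, or summarizing what lies just below the cutoff, would give the user the context that is currently missing.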

Figure 6: Network plot displaying the top 100 correlated edges between genes in an earlier calculated module.

Another visualization is presented by Pfister³ (figure 7). She visualizes correlations between different brain structures (nodes) based on the gene expression dynamics during embryonic mouse development, with data from the Allen Mouse Brain Atlas. Structures that are highly correlated in time are connected with short, thick lines, whereas poorly correlated structures are distant and connected with weak links. Nodes are placed using a force-directed algorithm. Since correlations are taken across time rather than in a single time snapshot, links connect only structures from consecutive developmental stages. Due to these constraints, the resulting graph structure has biological relevance.

² http://geschwindlab.neurology.ucla.edu/sites/all/files/networkplot/ParikshakDevelopmentalCortexNetwork.html
³ http://n.ethz.ch/student/sabpfist/AllenBrain/timeGraph.html

Figure 7: Network plot displaying correlation of gene expression data within structures over several developmental stages.

6 Research questions and goals

The main goal of this thesis is defined as:

• Designing and implementing a visualization tool or set of visualizations to aid in the exploration and analysis of genome data from the BrainSpan atlas in general and disease-related genes in particular

Ideally, these visualizations will help biological scientists to answer the following questions:

• Which genes show similar expression patterns across the brain?

• Are there sub-networks of the entire genome which are likely influenced or perturbed by certain (disease-related) genes?

Since we are dealing with large amounts of data, resulting in complicated network plots, we will need to research and implement techniques to help navigate these complex visualizations. This leads to the following research questions:

• What network layouts are most suitable for displaying genome correlation data?

• Which clutter reduction and interaction techniques can be applied to aid in the exploration and analysis of such networks?

• What network measurements and analytics are relevant and how can we visualize them?

• How can we visualize the presence, connections and influence of disease-related genes?

This thesis project will mainly deal with the graphics and visualization aspect of this problem. Therefore, I will not focus on the following subjects:

• Finding biologically relevant clusters in the genomic data

• Best or most appropriate clustering methods

• Biological interpretation of results from the visualization

Where applicable, I will instead refer to available literature and to collaborators in the pattern recognition and bioinformatics fields.

7 Initial Planning and staging

The thesis project was started on January 8, 2014. It is equivalent to 45 ECTS, which converts to 32 full-time weeks of 40 hours each. As a result, the project is expected to be finished around August or September 2014. Below is a description of thesis milestones and deliverables as required by the faculty of EEMCS⁴. Rough planning is provided in parentheses.

Milestone M0: Thesis Proposal accepted (February 2014)
Deliverable D0: A thesis proposal conforming to the EEMCS Master Thesis Proposal Guidelines (February 9, 2014)

Milestone M1: Research literature document approved (March 2014)
Deliverable D1: Research assignment report.
Deliverable D2: Plan for full thesis project.

Milestone M2: Presentation given at student colloquium (May 2014)
Deliverable D3: Electronic version of the slides.

Milestone M3: Grade given by supervisor (September 2014)
Deliverable D4: Thesis report (July 2014).

8 Risk Analysis

At the start of this project there are no external risk factors to be reported in the sense of availability. All courses required for graduation have been finished and there are no other projects or jobs interfering with the planning.

One potential risk factor is that I am educated as a computer scientist with a graphics specialization. The field of bioinformatics and systems biology is unknown to me. Therefore, some extra studying is required to be able to understand the needs and challenges of the end-user of the visualization to be designed. To fill this knowledge gap, a course on the subject might be attended (see the section “List of Courses”). In addition, PhD student Ahmed Mahfouz will be consulted in several stages of the project.

⁴ http://studenten.tudelft.nl/en/eemcs/graduation-policy-msc/

9 Expected Deliverables

Besides the deliverables required by the faculty of EEMCS for a thesis project (as described in the section on initial planning and staging), this project will result in code and documentation for a visualization tool.

10 Contact Details

Student
Lennaert van den Brink, BSc
Balthasar van de Polweg 526
2628BT Delft
EEMCS room: HB 12.310
tel: +31 (0)649902900
mail: [email protected]

Supervisor
Dr. A. Vilanova
Mekelweg 4
2628 CD Delft
EEMCS room: HB 11.270
tel: +31 (0)152783107
mail: [email protected]

11 Supervision Details

Weekly meetings are scheduled on Tuesdays 11:00-12:00 in room HB 11.270 of building 36, Mekelweg 4, 2628 CD Delft.

12 Intellectual Property Details

In the field of systems biology and functional genomics, it is customary to release tools and resources as open source. Therefore, the code produced in this project will be released under an open source license yet to be determined.

13 List of Courses

There are no additional courses that need to be followed as part of the individual study program (ISP). However, some lectures of the course IN4176 - Functional Genomics and Systems Biology will be attended to gain better insight into the field of genomics, whose researchers are the target audience of the visualizations to be created during the thesis project.

References

[1] B. Zhang, S. Horvath, et al., “A general framework for weighted gene co-expression network analysis,” Statistical applications in genetics and molecular biology, vol. 4, no. 1, p. 1128, 2005.

[2] M. S. Cline, M. Smoot, E. Cerami, A. Kuchinsky, N. Landys, C. Workman, R. Christmas, I. Avila-Campilo, M. Creech, B. Gross, et al., “Integration of biological networks and gene expression data using cytoscape,” Nature protocols, vol. 2, no. 10, pp. 2366–2382, 2007.

[3] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: an open source software for exploring and manipulating networks,” in ICWSM, pp. 361–362, 2009.

[4] M. Krzywinski, J. Schein, I. Birol, J. Connors, R. Gascoyne, D. Horsman, S. J. Jones, and M. A. Marra, “Circos: an information aesthetic for comparative genomics,” Genome research, vol. 19, no. 9, pp. 1639–1645, 2009.

[5] N. N. Parikshak, R. Luo, A. Zhang, H. Won, J. K. Lowe, V. Chandran, S. Horvath, and D. H. Geschwind, “Integrative functional genomic analyses implicate specific molecular pathways and circuits in autism,” Cell, vol. 155, no. 5, pp. 1008–1021, 2013.
