
AutoGate, an automated approach to flow data analysis

Leonore A. Herzenberg*, Stephen Meehan*#, Guenther Walther#, David Parks*, Wayne Moore*, Connor Meehan&, Megan Philips*, Eliver Ghosn*, and Leonard A. Herzenberg

*Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305 and #Department of Statistics, Stanford University, Stanford, CA 94305

Years ago, when T and B cells had yet to be distinguished and hematopoietic stem cells were barely known, our laboratory conceived of and built the first Fluorescence Activated Cell Sorter (FACS). Now, nearly half a century later, cell biology is flourishing and flow cytometry instruments built by Becton-Dickinson, Beckman Coulter, J-San, Sony, CyTOF and many others are collecting data and sorting all kinds of cells in clinical and research laboratories all over the world. Mirroring this expansion, the diagnosis and treatment of HIV, leukemia, immunodeficiencies and many other diseases now depend heavily on flow cytometry, as do T cell, B cell and hematopoietic stem cell studies and replacement therapy. From medicine to oceanography, the list of uses for flow instruments is already “infinitely” longer than we and other early developers ever imagined, and it is continually being extended as today’s researchers imaginatively develop applications for this still nascent technology.

Of course, flow cytometry instruments as we know them today provide far broader capabilities than the early instruments. Over the years, we and others have turned the original single-laser, single-detector analysis and sorting instruments into a broad array of analysis and sorting instruments capable, at the high end, of detecting up to 18 fluorescence colors and more than double that number of CyTOF mass spec “colors”. In addition, low-end instruments with narrower capabilities and sometimes less sensitivity have been developed to meet needs for simplicity and economy. Thus, the universe of flow instruments and the physicians and scientists who use them (or data collected with them) has grown enormously, both in size and diversity.

This growth means that a high percentage of today’s scientists and clinicians regularly make decisions based on flow cytometry data acquired in experiments or clinical analyses. It also means that an even higher percentage regularly encounter, and often must critically evaluate, conclusions based on the interpretation of flow data published in a wide variety of biomedical research articles. Thus, while flow cytometry was once a niche technology and flow data interpretation was once the domain of experts, understanding and interpreting flow data has now become pretty much everybody’s business. Consequently, the need for simpler and better flow data analysis tools, and for simpler and more transparent flow data displays, has never been greater.

We address these needs with the new, soon to be available, automated flow analysis software that we describe here. Provisionally named AutoGate, this software is similar in many ways to current manual analysis methods. However, it uses newly defined statistical algorithms to automatically detect and delineate subsets and to compare representation of the algorithmically defined subsets in putatively similar samples. In simple terms, AutoGate obviates the need to draw arbitrary gates to define subsets (it defines these algorithmically instead). However, it keeps the user firmly in control of the analysis by requiring the user to choose which of the newly defined subsets to subset further and the axes on which to display the resulting subset(s).

In the sections that follow, we present a bit of background information and then move to a description of AutoGate and the output it generates. Finally, at the end of this article, we discuss our plans for making AutoGate broadly available at low cost (basically to provide funds for support) to users in non-profit institutions.

Flow data collection and storage. The mechanics of flow cytometry data collection have been explained in greater or lesser detail in many publications. Briefly, cells are stained with antibodies and sometimes other reagents that are each associated with a distinctive fluorochrome. When stained cells pass through the flow cytometer, the instrument measures the amount of light each cell scatters (a measure of size and granularity) and the amount of fluorescence emitted by each of the fluorochromes bound to each of the cells (a measure of the amount of each antibody or other reagent bound). FACS instruments examine up to twenty thousand cells per second and record light scatter and fluorochrome levels for each cell. The recorded measurements are stored in a lengthy data file that may, for example, wind up containing eight fluorescence measurements and two light scatter measurements for each of 500,000 cells in a single sample.
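To give a rough sense of scale for the example just described, the following sketch estimates the size of the raw measurement table for one sample. The parameter and cell counts come from the example in the text; the 4 bytes per value is an assumption (one 32-bit number per measurement), and headers and other metadata are ignored.

```python
# Rough size estimate for the raw measurement table of one sample.
N_FLUORESCENCE = 8      # fluorescence measurements per cell (from the text)
N_SCATTER = 2           # light scatter measurements per cell (from the text)
N_CELLS = 500_000       # cells in the sample (from the text)
BYTES_PER_VALUE = 4     # ASSUMPTION: one 32-bit number per measurement


def sample_size_bytes(n_params: int, n_cells: int, bytes_per_value: int) -> int:
    """Return the bytes needed to hold n_params values for each of n_cells."""
    return n_params * n_cells * bytes_per_value


size = sample_size_bytes(N_FLUORESCENCE + N_SCATTER, N_CELLS, BYTES_PER_VALUE)
print(f"{size / 1e6:.0f} MB per sample")  # prints "20 MB per sample"
```

Under these assumptions, an experiment with 50 such samples would occupy about 1 GB before any analysis output is added.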

Once collected, the data files for a given flow experiment are usually packaged into a dataset and either stored centrally or, more often, given to the user to store. Since the full dataset is usually quite large, only a few datasets can readily be stored on typical laboratory computers. Various laboratories and institutions have therefore made provision for dataset storage. Most, however, leave it to the user or the laboratory to record and maintain access to the dataset and to link it to subsequent analysis output and information about the experiment.

In the late 1990s, Wayne Moore in our laboratory designed and built a data archive to provide for these functions. This archive is still at work, automatically archiving flow data collected in the Stanford Shared FACS Facility and several other flow facilities at Stanford. Facility users need only click once or twice to download previously collected datasets in a form that can be used directly by FlowJo, AutoGate or other analysis programs. Other institutions have built systems that provide some of these functions, and several central archiving systems have recently emerged. These latter include CytoBank (cytobank.org), which provides archiving and specialized flow data computations, and CytoGenie (Woodside Logic.com in collaboration with our laboratory), which has yet to be fully implemented.

At this writing, CytoGenie, an extension of AutoGate, is still under development (although some parts have already been released in the current product). The overall system is designed to provide an indexed experiment notebook linking archived datasets with AutoGate and with the experiment design and analysis information (subjects, samples, reagents, etc.) that makes the datasets interpretable. It is also designed to store and index experiment information and to retain AutoGate analysis output (e.g., graphs and tables) and to keep these together with notes and other textual and graphic information commonly recorded in paper notebooks. Hopefully, funding will soon become available to complete and release the entire product, and to support its inexpensive release to the .edu and .gov community.

Fluorescence compensation: the first step in flow data analysis. Data analysis for a flow cytometry dataset begins with a computation to determine the actual amounts of fluorescence emitted by each of the fluorochromes associated with each of the reagents bound to each of the cells in the sample. This step, referred to as fluorescence compensation, is unnecessary if each fluorescent reagent is straightforwardly detectable only by the detector designed to reveal its fluorescence and only one detector is used per reagent. However, in multiparameter flow fluorescence experiments, even when only two or three colors are used, the available fluorochromes commonly have broad emission spectra, such that the light they emit overlaps to some extent onto other detector(s) and spuriously contributes to the total fluorescence recorded by those detector(s). Thus, to accurately measure the fluorescence intended to be detected by a given detector, the overlapping fluorescence (determined from its known spectrum and its recorded value in its own channel) must be subtracted from the recorded value for each channel.

This subtraction process, referred to as fluorescence compensation, was originally accomplished via hardware built into the flow instruments (and still available on some instruments). However, it is much more accurately accomplished by modern fluorescence compensation software implementations. (We discuss this in detail in a 2006 review in Nature Immunology[ref], which also provides key background for the other issues discussed here.)
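The subtraction described above can be sketched with the standard linear spillover model, in which each detector reading is a weighted sum of the true signals from all fluorochromes. This is a minimal illustration only, not the AutoComp procedure or any package's implementation: the function names, the two-color spillover values and the readings are all hypothetical, and the matrix would in practice be estimated from single-stain controls.

```python
from typing import List

Matrix = List[List[float]]


def solve_linear(a: Matrix, b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def compensate(observed: List[float], spill: Matrix) -> List[float]:
    """Recover per-fluorochrome signals from per-detector readings.

    spill[i][j] is the signal that fluorochrome i contributes to detector j,
    relative to its own detector (so spill[i][i] == 1.0). The model is
    observed[j] = sum_i true[i] * spill[i][j].
    """
    n = len(observed)
    transposed = [[spill[i][j] for i in range(n)] for j in range(n)]
    return solve_linear(transposed, observed)


# HYPOTHETICAL two-color example: 20% of dye A spills into detector B,
# 10% of dye B spills into detector A.
spill = [[1.0, 0.2],
         [0.1, 1.0]]
observed = [1050.0, 700.0]  # readings for one cell on detectors A and B
true_signal = compensate(observed, spill)
print([round(v, 1) for v in true_signal])
```

In this example the cell truly carries 1000 units of dye A and 500 units of dye B; the spillover inflates the raw readings to 1050 and 700, and solving the linear system recovers the original values.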

Our laboratory built the original compensation hardware and has led the way in building progressively better software implementations. At present, a variety of implementations are offered by current flow analysis packages and require varying degrees of data manipulation by the user. FlowJo (TreeStar.com), for example, provides a reasonably well automated compensation utility that automatically applies the compensation corrections to the appropriate data sets prior to analysis.

Our most recent fluorescence compensation implementation introduces AutoComp, a fully automated fluorescence compensation utility. AutoComp currently acquires compensation control information either from CytoGenie, our automated protocol design tool, or from protocol information stored in the FlowJo workspace and augmented by the user when AutoComp acquires the information. Thus, when FlowJo is used as the source, AutoComp acquires and automatically displays the reagent and single-stain control information available from the FlowJo workspace and prompts the user to add/verify the reagent, fluorochrome and control information needed to compute accurate compensation values for each stainset in the experiment.

Once this input is complete, AutoComp uses advanced statistical procedures to automatically compute fluorescence overlap corrections that tend to be much less prone to error. These corrections, output as typical flow cytometry compensation matrices, can readily be imported and deployed by flow cytometry data processing programs (e.g., FlowJo) to give more accurate representations of the acquired flow data. Our new automated analysis program (AutoGate), which we describe in the sections that follow, automatically imports and applies these matrices, thereby enabling display of fully and correctly compensated data.


Step 2. Identification and quantification of subsets of cells detectable in flow datasets. Once the compensation corrections are computed and applied, flow datasets are ready for analysis procedures aimed at identifying subsets of cells and determining the frequency and marker representation on the identified subsets. A variety of analysis tools are available for this purpose. However, we remind the reader that no matter how many subsets are identified within a given cell sample, more subsets may be found when additional markers (more colors) are used to analyze these or additional samples from the same or an appropriately related source. Thus, we usually advise our colleagues to stain samples with as many fluorescence colors as are compatible with their budgets and reagent availability.

Although this approach requires additional fluorescence compensation and subset gating skills, we have found it more fruitful in the long run. In addition, we expect these limitations to disappear with the automated fluorescence compensation software that we have recently developed (see above) and the automated analysis software that we are about to introduce (see below).

The CyTOF instruments, which use mass spectrometry measurements rather than fluorescence to identify subsets, offer a much wider range of co-utilizable reagents. However, limitations in the speed of the measurements (the number of cells that can be analyzed per minute) tend to restrict the routine use of these instruments to relatively well represented subsets (or to very patient users). In any event, the automated analysis software discussed in the next section will work equally well with data acquired with mass spectrometry and fluorescence-based reagents.

Step 3. Automating the gating process. Traditional flow analysis software provides tools for specifying display axes and manually drawing gates around visually defined subsets of cells, and for drawing and applying such gates sequentially to define progressively more refined subsets. This process is relatively easy for novices doing 3-color work, but experience indicates that it becomes increasingly difficult as the number of reagents, and hence the number of colors (fluorochromes), in the stainset increases. Beyond six or so colors, gating becomes an art that is usually (and sometimes better) left to the experts.

This limitation is highly unfortunate because, as indicated in the preceding section, the greater the number of reagents used in a stainset, the better the chance of resolving the subsets in a sample, and of determining both the number of cells and subsets in the sample and the expression levels of the determinants on the cells within the subsets. In today’s flow world, even average users working on typical projects with cells from human or animal sources have come to recognize that the complexity of the cell populations with which they work requires them to deal with at least 6-12 color data, and with more than that if their work requires data collection with today’s high powered FACS and CyTOF instruments.

To address these issues, we have developed AutoGate, a statistically reliable automated flow data analysis package that is easier to use than current software and provides simpler and more straightforward output. Basically, with this new software, the user defines a gating sequence much as it is defined with current software, i.e., by choosing X and Y axes (markers/colors) to display the data, selecting a subset of cells to examine further, and repeating the display and selection process until the desired subset is visible or the choices are exhausted. However, while current software requires users to draw arbitrary gates to delineate subsets, AutoGate automatically defines and delineates statistically valid subsets within a given dataset or subset. To delineate similar subsets within like samples, it applies the method used to delineate those subsets rather than the boundaries of the subsets. Applied iteratively, this method enables the sequential definition of subsets much the way FlowJo does, but with one key difference: while FlowJo-style analyses define sequential boundaries (gates) for the identification of a subset and then arbitrarily apply those boundaries to define similar subsets in other samples, AutoGate defines the gating “method” on a model sample and then adapts this gating method to suit each sample, defining like subsets when they exist and using novel statistical procedures to identify absent or additional subsets when they exist.
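The difference between reusing a boundary and reusing a method can be illustrated with a deliberately simple one-dimensional sketch. The two-means split used here is only a stand-in chosen for brevity, not the statistical procedure AutoGate uses (which is not specified in this article); the function names and the sample values are hypothetical.

```python
from typing import List


def two_means_threshold(values: List[float], max_iter: int = 100) -> float:
    """Split 1-D values into a low and a high cluster (k-means with k = 2)
    and return the midpoint between the two cluster means."""
    lo, hi = min(values), max(values)
    for _ in range(max_iter):
        cut = (lo + hi) / 2
        low = [v for v in values if v <= cut]
        high = [v for v in values if v > cut]
        if not low or not high:
            break  # all values fell on one side; nothing left to refine
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # cluster means stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def count_positive(values: List[float], cut: float) -> int:
    """Count the values that fall above the gate boundary."""
    return sum(1 for v in values if v > cut)


# HYPOTHETICAL marker intensities: the second sample has the same two
# populations, but both are shifted upward (e.g., brighter staining).
model_sample = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
shifted_sample = [5.0, 6.0, 7.0, 14.0, 15.0, 16.0]

fixed_cut = two_means_threshold(model_sample)       # boundary from the model
adapted_cut = two_means_threshold(shifted_sample)   # method re-run per sample

print("fixed boundary:", count_positive(shifted_sample, fixed_cut))
print("adapted method:", count_positive(shifted_sample, adapted_cut))
```

With the boundary copied from the model sample (6.5), one cell of the shifted low population is counted as positive, giving four positives instead of three. Re-running the same method on the shifted sample moves the boundary to 10.5 and recovers the three cells of the high population.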

In addition, as the rounds of subsetting proceed, AutoGate offers statistically ranked axis choices for each next iteration, thus guiding the user step by step through the selection of subset-defining methods. The selected methods can be assembled into linear or branched analysis trees (aka gating trees) that can be applied to other samples. Other analysis packages can create superficially similar gating trees, but these are based on literal user-defined boundaries (literal gates) rather than on statistically selected gating methods that adapt to the data structures at each level of the tree.

To date, we have developed and tested this method with FACS data sets that include up to 12 fluorescence and two light scatter measurements. However, we expect the method to be equally usable for analysis of CyTOF and other very high dimensional datasets, including those acquired outside the flow arena. Thus, we see AutoGate, together with CytoGenie and AutoComp, as opening complex high-dimensional flow analysis to a broad group of users, most of whom are better trained to understand how flow data informs biomedical studies than to use current technology to extract or evaluate high dimensional flow data.

Flow Cytometry: a past and a future. When we agreed to do this article, we and the editors had in mind the development of an up-to-date version of our 2006 Nature Immunology article, which focuses on explaining how and why to use a set of then novel and now commonplace technologies that empower high-dimensional flow cytometry use in the laboratory and the clinic. However, when we returned to this previous article, we found that it is really about as up-to-date as it needs to be. Although modern flow users would do well to return to this article for a refresher on the ways that “Logicle” (aka bi-exponential) axes are more useful than logarithmic axes for displaying most types of flow data, there is little a new article could do other than to emphasize the importance of these technologies for data analysis and criticism.

We therefore chose, in this article, to step into the future and describe our latest work, which integrates all of the technologies discussed in the earlier article but brings a wholly new, and quite successful, statistical approach to subset definition, display, gating and functional analysis. The Logicle scales and other technologies that we discussed previously are used to full advantage in the automated flow analysis software that we have now generated. In fact, we have integrated several upgrades to this technology, thereby removing the need for user intervention to correct for inappropriate axis scaling that sometimes occurred.
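Why a bi-exponential style axis helps can be seen in a short sketch. The exact Logicle transform is not reproduced here; the inverse hyperbolic sine is swapped in as a simpler function with similar qualitative behavior (roughly linear near zero, roughly logarithmic for large values, and defined for the zero and negative values that compensated data can contain). The cofactor of 150 and the sample values are assumptions chosen only for illustration.

```python
import math
from typing import Optional

COFACTOR = 150.0  # ASSUMPTION: sets the width of the near-linear region


def log_scale(value: float) -> Optional[float]:
    """Base-10 logarithm; returns None where the logarithm is undefined."""
    return math.log10(value) if value > 0 else None


def asinh_scale(value: float, cofactor: float = COFACTOR) -> float:
    """Inverse hyperbolic sine scale, a stand-in for a Logicle-style axis."""
    return math.asinh(value / cofactor)


# HYPOTHETICAL compensated intensities, including a negative and a zero value.
for v in (-200.0, 0.0, 50.0, 10_000.0):
    print(v, log_scale(v), round(asinh_scale(v), 3))
```

On a logarithmic axis the first two values cannot be drawn at all, whereas the stand-in scale places them symmetrically around zero while still compressing the bright end of the range.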


Overall, we hope that we have conveyed to the reader our excitement at the new capabilities that can now be built with modern software technology. With this and other technology on the verge of being built, we can now expect high-dimensional flow studies to transition fairly rapidly from the current expert domain to one accessible to average users with average computers.

As our contribution to the “new way of doing things”, we plan to make the automated CytoGenie flow cytometry analysis suite broadly available to flow users in non-profit institutions (.edu, .gov, .org). The suite will include:

CytoGenie, which provides integrated tools for experiment planning, optimized stainset design, experiment organization and long-term storage of experiment protocol information;

AutoComp, which fully automates fluorescence compensation;

AutoGate, which empowers automated flow cytometry analysis and application of statistical capabilities, including those discussed here.

We have designed CytoGenie, AutoComp, and AutoGate to download, install, and run without expert intervention on modern laptop and desktop computers. In addition, because we are well aware that funding for basic research is scarce, we have designed this new software to operate in a non-profit mode. Thus, we are currently negotiating to make support for this new software available, either via government support (if such can be obtained) or via a small “service” subscription fee sufficient to provide adequate support for users wanting access to the software. We expect to have more to report on this financial arrangement when the first version of our software is released (hopefully, Q 1-2, 2014).



Figure 1 AutoGate starts with markers for gating out singlets. The user selects the singlets and then clicks the gating tool to subset the chosen cells.


Figure 2 Once the subset is chosen, the user adjusts the X or Y parameter.


Figure 3 As soon as a change to the X or Y axis is made, AutoGate automatically runs the cluster analysis and gives each cluster a different color.


Figure 4 To define a gate, the user clicks on each of the clusters that are part of the same gate.

AutoGate displays also differ in that the gating tree is displayed as linearly connected, live “thumbnail” representations of the subsets defined for each of the nodes in the tree. Any node can be readily queried for statistical and other information about any or all subsets represented in the node, and enlarged versions of any node can be readily displayed and exported for notebook storage and data display purposes. In addition, and most important for analysis of full experiment data, the full AutoGate gating tree, or segments thereof, can be applied to similar datasets stained with the same reagent combination.


Figure 5 As the user defines gates in a sample's plot window, AutoGate captures them in a gating tree window.
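A gating tree of the kind described above can be pictured as a simple recursive data structure in which each node records a subset and the axes on which it was defined. The sketch below is only an illustration of that idea, not AutoGate's internal representation; the class name, the method names, and the subset and marker names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class GateNode:
    """One node of a gating tree: a subset and the axes used to define it."""
    name: str
    x_axis: str
    y_axis: str
    children: List["GateNode"] = field(default_factory=list)

    def add(self, child: "GateNode") -> "GateNode":
        """Attach a child subset and return it, so calls can be chained."""
        self.children.append(child)
        return child

    def paths(self, prefix: str = "") -> Iterator[str]:
        """Yield the full path of this node and of every descendant,
        parents before children."""
        here = f"{prefix}/{self.name}" if prefix else self.name
        yield here
        for child in self.children:
            yield from child.paths(here)


# HYPOTHETICAL branched tree: singlets, then lymphocytes, then B and T cells.
root = GateNode("Singlets", "FSC-A", "FSC-H")
lymph = root.add(GateNode("Lymphocytes", "FSC-A", "SSC-A"))
lymph.add(GateNode("B cells", "CD19", "CD3"))
lymph.add(GateNode("T cells", "CD3", "CD19"))

for path in root.paths():
    print(path)
```

Because every node is itself the root of a subtree, a "segment" of the tree is simply the node at which that segment starts; here, `lymph` is the segment covering lymphocytes and the two subsets beneath it.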

Thus, once a gating tree is defined for a sample stained with a given stainset, AutoGate provides a simple “drag and drop” method for applying it, or segments thereof, to comparably stained samples of comparable but independent origins. When this is done, AutoGate will detect corresponding subsets in the new target sample, automatically adjust the new subset boundaries to those that are statistically valid for the new target sample, and report “missing”, “split”, “merged” and “new” subsets, should these occur.
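The reporting of “missing” and “new” subsets can be illustrated with a greedy nearest-centroid match between the subsets of a model sample and the clusters found in a target sample. This is a simplified stand-in, not the statistical procedure AutoGate uses; it does not attempt to detect “split” or “merged” subsets, and the function name, the distance cutoff and the centroid coordinates are all hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]


def match_subsets(
    model: Dict[str, Point], target: List[Point], max_dist: float
) -> Tuple[Dict[str, int], List[str], List[int]]:
    """Greedy nearest-centroid matching of model subsets to target clusters.

    Returns (matched, missing, new): matched maps a model subset name to the
    index of its target cluster, missing lists model subsets with no unclaimed
    target cluster within max_dist, and new lists the target clusters that no
    model subset claimed.
    """
    # Every candidate (distance, subset name, target index), closest first.
    pairs = sorted(
        (math.hypot(c[0] - t[0], c[1] - t[1]), name, idx)
        for name, c in model.items()
        for idx, t in enumerate(target)
    )
    matched: Dict[str, int] = {}
    claimed: Set[int] = set()
    for dist, name, idx in pairs:
        if dist > max_dist:
            break  # all remaining pairs are farther apart still
        if name not in matched and idx not in claimed:
            matched[name] = idx
            claimed.add(idx)
    missing = [name for name in model if name not in matched]
    new = [idx for idx in range(len(target)) if idx not in claimed]
    return matched, missing, new


# HYPOTHETICAL centroids on two display axes.
model_centroids = {"A": (1.0, 1.0), "B": (5.0, 5.0), "C": (9.0, 1.0)}
target_centroids = [(1.2, 0.9), (5.1, 5.3), (3.0, 8.0)]

matched, missing, new = match_subsets(model_centroids, target_centroids, max_dist=1.0)
print("matched:", matched)
print("missing:", missing)
print("new:", new)
```

In this example subsets A and B find close counterparts in the target sample, subset C is reported as missing, and the third target cluster, which lies far from every model subset, is reported as new.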

Figure 6 After the user defines a gating model, the user drags it onto ungated samples.


Figure 7 When applying the user’s gating model to ungated samples, AutoGate updates the gating tree with matched gates.


Figure 8 When applying the user’s gating model to ungated samples, AutoGate can detect and flag missing gates.

