a method for validation for clustering of phenotypic gene knockdown profiles using protein-protein...

2
BMC Bioinformatics Poster presentation A method for validation for clustering of phenotypic gene knockdown profiles using protein-protein interactions information Nikolay Samusik, Yannis Kalaidzidis and Marino Zerial Address: Max-Plank Institute of Cell Biology and Genetics, Dresden, Germany from Fifth International Society for Computational Biology (ISCB) Student Council Symposium Stockholm, Sweden 27 June 2009 Published: 19 October 2009 BMC Bioinformatics 2009, 10(Suppl 13):P3 doi: 10.1186/1471-2105-10-S13-P3 This article is available from: http://www.biomedcentral.com/1471-2105/10/S13/P3 © 2009 Samusik et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The functional characterization of the molecular com- ponents of biological systems is an outstanding task in the post-genomic era and is often addressed by gene knockdown screening which quantifies phenotypes in the form of multiparametric profiles. The identification of sets of genes producing similar phenotypic effects is fulfilled by the clustering of profiles. Here we address the question of how to quantify the biological significance of clusters of phenotypic profiles with respect to protein- protein interaction (PPI) data. The dataset we used contains 20385 gene knockdown phenotypic profiles resulting from the genome- wide image-based multiparametric siRNA screen on endocytosis in HeLa cells, which has been carried out on in our lab. Each profile is produced from images of cells in a specific gene knockdown condition, and comprises a 40-dimensional vector of phenotypic parameters, which describe the distribution of endo- somes labeled by fluorescent cargo. The profiles capture the morphological changes in the endocytic system resulting from different gene knockdowns. The dataset has been pre-filtered, discarding profiles showing weak effects. Protein-protein interaction data was taken from EMBL STRING database [1]. We demonstrated that the inter- acting proteins are 17% more likely to produce a correlated phenotype in the siRNA screen than non- interacting proteins. Based on this observation, we developed a method for the cross-validation of the clustering of phenotypic profiles using PPI data. The relevance of a cluster with respect to PPI is measured by the ratio of the number of interactions between genes in the cluster to the total number of interactions of these genes, and by the average size of a fully connected PPI module in the cluster. The significance of these values for a cluster of a size N is assessed by bootstrapping, using sets of N nearest neighbors of random phenotypic profiles, and the p-value combines both measures. We have applied this method to the results of clustering of the phenotypic profiles by means of a modified version of the quick shift[2] density-based clustering algorithm equipped with k-nearest neighbor (kNN) adaptive [3] von Mises-Fisher kernel for density estimate. The proposed measure shows significant p-value for the majority of the clusters. The p-value is sensitive to the artificial deteriora- tion of cluster quality by addition of adjacent datapoints or removal of those on the border, and thus must allow optimization of clustering parameters. To demonstrate that, we have created a method for the tree-cutting in hierarchical clustering which maximizes the significance of resulting clusters with respect to the measure we developed. Page 1 of 2 (page number not for citation purposes) BioMed Central Open Access

Upload: nikolay-samusik

Post on 30-Sep-2016

214 views

Category:

Documents


2 download

TRANSCRIPT

BMC Bioinformatics

Poster presentationA method for validation for clustering of phenotypic geneknockdown profiles using protein-protein interactions informationNikolay Samusik, Yannis Kalaidzidis and Marino Zerial

Address: Max-Plank Institute of Cell Biology and Genetics, Dresden, Germany

from Fifth International Society for Computational Biology (ISCB) Student Council SymposiumStockholm, Sweden 27 June 2009

Published: 19 October 2009

BMC Bioinformatics 2009, 10(Suppl 13):P3 doi: 10.1186/1471-2105-10-S13-P3

This article is available from: http://www.biomedcentral.com/1471-2105/10/S13/P3

© 2009 Samusik et al; licensee BioMed Central Ltd.This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The functional characterization of the molecular com-ponents of biological systems is an outstanding task inthe post-genomic era and is often addressed by geneknockdown screening which quantifies phenotypes inthe form of multiparametric profiles. The identificationof sets of genes producing similar phenotypic effects isfulfilled by the clustering of profiles. Here we address thequestion of how to quantify the biological significanceof clusters of phenotypic profiles with respect to protein-protein interaction (PPI) data.

The dataset we used contains 20385 gene knockdownphenotypic profiles resulting from the genome-wide image-based multiparametric siRNA screen onendocytosis in HeLa cells, which has been carried outon in our lab. Each profile is produced from images ofcells in a specific gene knockdown condition, andcomprises a 40-dimensional vector of phenotypicparameters, which describe the distribution of endo-somes labeled by fluorescent cargo. The profiles capturethe morphological changes in the endocytic systemresulting from different gene knockdowns. The datasethas been pre-filtered, discarding profiles showing weakeffects.

Protein-protein interaction data was taken from EMBLSTRING database [1]. We demonstrated that the inter-acting proteins are 17% more likely to produce a

correlated phenotype in the siRNA screen than non-interacting proteins. Based on this observation, wedeveloped a method for the cross-validation of theclustering of phenotypic profiles using PPI data. Therelevance of a cluster with respect to PPI is measured bythe ratio of the number of interactions between genes inthe cluster to the total number of interactions of thesegenes, and by the average size of a fully connected PPImodule in the cluster. The significance of these values fora cluster of a size N is assessed by bootstrapping, usingsets of N nearest neighbors of random phenotypicprofiles, and the p-value combines both measures.

We have applied this method to the results of clustering ofthe phenotypic profiles by means of a modified versionof the “quick shift” [2] density-based clustering algorithmequipped with k-nearest neighbor (kNN) adaptive [3] vonMises-Fisher kernel for density estimate. The proposedmeasure shows significant p-value for the majority of theclusters. The p-value is sensitive to the artificial deteriora-tion of cluster quality by addition of adjacent datapoints orremoval of those on the border, and thus must allowoptimization of clustering parameters.

To demonstrate that, we have created a method for thetree-cutting in hierarchical clustering which maximizesthe significance of resulting clusters with respect to themeasure we developed.

Page 1 of 2(page number not for citation purposes)

BioMed Central

Open Access

References1. Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J,

Doerks T, Julien P, Roth A, Simonovic M, Bork P and von Mering C:STRING 8 – a global view on proteins and their functionalinteractions in 630 organisms. Nucleic Acids Res 2009, 37Database: D412–416, Epub 2008 Oct 21.

2. Vedaldi A and Soatto S: Quick Shift and Kernel Methods forMode Seeking. Proceedings of the European Conference on ComputerVision 2008, 5305:705–718.

3. Georgescu B, Shimshoni I and Meer P: Mean shift basedclustering in high dimensions: a texture classificationexample. Proceedings of Ninth IEEE International Conference onComputer Vision 2003, 1:456–463.

Publish with BioMed Central and every scientist can read your work free of charge

"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."

Sir Paul Nurse, Cancer Research UK

Your research papers will be:

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours — you keep the copyright

Submit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.asp

BioMedcentral

BMC Bioinformatics 2009, 10(Suppl 13):P3 http://www.biomedcentral.com/1471-2105/10/S13/P3

Page 2 of 2(page number not for citation purposes)