foundations 6 data mining

Data Mining
Edward R. Dougherty

Department of Electrical and Computer Engineering
Center for Bioinformatics and Genomic Systems Engineering
Texas A&M University

gsp.tamu.edu

Reading
Book: Chapter 8
Paper: Dougherty, E. R., Prudence, Risk, and Reproducibility in Biomarker Discovery, BioEssays, 34(4), 277-279, 2012.

Knowledge Discovery
Knowing the constitution of scientific knowledge and how to validate it leaves open the question of how to discover knowledge.
Obviously, we need to observe Nature, but in what manner?

Bacon on Planned Experiments
Francis Bacon (Novum Organum, 1620): There remains simple experience which, if taken as it comes, is called accident; if sought for, experiment. But this kind of experience is a mere groping, as of men in the dark. ... But the true method of experience, on the contrary, first lights the candle, and then by means of the candle shows the way; commencing as it does with experience duly ordered and digested, not bungling or erratic, and from it educing axioms, and from established axioms again new experiments.

Experimental Design: The Path of Progress
Immanuel Kant (Critique of Pure Reason, 1781): It is only when experiment is directed by rational principles that it can have any real utility. Reason must approach nature with the view, indeed, of receiving information from it, not, however, in the character of a pupil, who listens to all that his master chooses to tell him, but in that of a judge, who compels the witnesses to reply to those questions which he himself thinks fit to propose. To this single idea must the revolution be ascribed, by which, after groping in the dark for so many centuries, natural science was at length conducted into the path of certain progress.

Judicious Feature Selection
James Clerk Maxwell: The feature which presents itself most forcibly to the untrained inquirer may not be that which is considered most fundamental by the experienced man of science; for the success of any physical investigation depends on the judicious selection of what is to be observed as of primary importance, combined with a voluntary abstraction of the mind from those features which, however attractive they appear, we are not yet sufficiently advanced in science to investigate with profit.

An Experiment is a Question
Hans Reichenbach (Rise of Scientific Philosophy): An experiment is a question addressed to Nature. As long as we depend on the observation of occurrences not involving our assistance, the observable happenings are usually the product of so many factors that we cannot determine the contribution of each individual factor to the total result.

Reasoning to Science
Hans Reichenbach: By means of the artificial occurrences of planned experiments, the complex occurrence of Nature is thus analyzed into its components. That Greek science did not use experiments in any significant way proves how difficult it was to turn from reasoning to empirical science.
Science is not constituted by reasoning about data; it is constituted by pragmatic, predictive models.

Foolish Questions Yield Foolish Answers
Arturo Rosenblueth and Norbert Wiener: An experiment is a question. A precise answer is seldom obtained if the question is not precise; indeed, foolish answers, i.e., inconsistent, discrepant or irrelevant experimental results, are usually indicative of a foolish question.

Models Depend on Questions Asked
Werner Heisenberg: The most important new result of nuclear physics was the recognition of the possibility of applying quite different types of natural laws, without contradiction, to one and the same physical event. This is due to the fact that within a system of laws which are based on certain fundamental ideas only certain quite definite ways of asking questions make sense, and thus, that such a system is separated from others which allow different questions to be put.

Mere Observation
Hannah Arendt: [Natural science] seemed to be liberated by the discovery that our senses by themselves do not tell the truth. Henceforth, sure of the unreliability of sensation and the resulting insufficiency of mere observation, the natural sciences turned toward the experiment, which, by directly interfering with nature, assured the development whose progress has ever since appeared to be limitless.

Answers Without Questions
Hannah Arendt: The experiment being a question put before nature (Galileo), the answers of science will always remain replies to questions asked by men; the confusion in the issue of objectivity was to assume that there could be answers without questions and results independent of a question-asking being.

Efficient Experimentation
Douglas Montgomery: If an experiment is to be performed most efficiently, then a scientific approach to planning the experiment must be considered. By the statistical design of experiments we refer to the process of planning the experiment so that appropriate data will be collected, which may be analyzed by statistical methods resulting in valid and objective conclusions. The statistical approach to experimental design is necessary if we wish to draw meaningful conclusions from the data.

Everyday Classification
Some algorithm is proposed.
The algorithm separates some data set.
We are not told the distribution from which the data come.
An estimation rule is used to estimate the error.
We are given no reason why the estimate should be good.
In fact, often we expect that the estimate is not good.
The estimate is small and the algorithm is claimed to be validated.
We are given no justification for the claim.
We are given no conditions under which it is valid.

Data Mining Definition 1 (Merriam-Webster)
Type of database analysis that attempts to discover useful patterns or relationships in a group of data. The analysis uses advanced statistical methods, such as cluster analysis, and sometimes employs artificial intelligence or neural network techniques. A major goal of data mining is to discover previously unknown relationships among the data, especially when the data come from different databases.
Relations among data, not among variables: no science!

Data Mining Definition 2 (Wikipedia)
Data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, such as neural networks, cluster analysis, genetic algorithms, and support vector machines. Data mining is the process of applying these methods with the intention of uncovering hidden patterns in large data sets.
Uncovering patterns in data sets: no science!

Data Mining
Data mining is a return to pre-Baconian groping, albeit at a much faster groping rate than was then possible.
It suffers from three debilitating properties:
(1) It does not ask precise questions.
(2) There is no statistical characterization of the procedure.
(3) As opposed to pattern recognition, it lacks a characterization of prediction in the context of a distribution.
Sometimes it is justified by large-sample theory, typically absent a rigorous analysis of the problem at hand.

Statistics for the Proletariat
Julian L. Simon (Resampling: The New Statistics, 1997): Monte Carlo resampling simulation takes the mumbo-jumbo out of statistics and enables even beginning students to understand completely everything that is done. Even many experts are unable to understand intuitively the formal mathematical approach to the subject. Clearly, we need a method free of the formulas that bewilder almost everyone.
Everyday common sense should replace the mumbo-jumbo of scientific rigor and, to a great extent, it has.

The Numbers Speak for Themselves
Chris Anderson (The End of Theory: The Data Deluge Makes the Scientific Method Obsolete): The more we learn about biology, the further we find ourselves from a model that can explain it. There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. ... With enough data, the numbers speak for themselves.

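Consistency (Asymptotic Convergence)
For a sample S of size n, there is a design cost: $\Delta_n = \varepsilon_n - \varepsilon_{\mathrm{Bayes}}$, where $\varepsilon_n$ is the error of the classifier designed from S and $\varepsilon_{\mathrm{Bayes}}$ is the Bayes (minimum achievable) error.
A classification rule is consistent if $E[\Delta_n] \to 0$ as $n \to \infty$.
An error estimator is consistent if the estimate converges to the true error as $n \to \infty$.
What good is this for small samples?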

Asymptotic Convergence is Irrelevant
Appeal to laws of large numbers or central limit theorems in small-sample settings is unwarranted.
Training-data-based error estimation methods, such as cross-validation and bootstrap, converge asymptotically as the sample size goes to infinity, but this is of virtually no value for small samples.
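The small-sample point can be illustrated with a toy simulation. The sketch below is not taken from the slides or the cited papers; it assumes a one-dimensional model with class 0 distributed N(0, 1) and class 1 distributed N(d, 1), equal priors, equal class sizes, a midpoint-threshold classifier, and leave-one-out as the training-data-based estimator. All function names are hypothetical. Because the model is known, the true error of each designed classifier can be computed exactly and compared with its estimate.

```python
import math
import random


def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def predicts_one(x, m0, m1):
    """Midpoint-threshold rule: class 1 if x lies on the class-1 side of the midpoint."""
    t = 0.5 * (m0 + m1)
    return x > t if m0 < m1 else x <= t


def true_error(m0, m1, d):
    """Exact error of the designed classifier under N(0,1) vs N(d,1), equal priors."""
    t = 0.5 * (m0 + m1)
    if m0 < m1:
        return 0.5 * (1.0 - phi(t)) + 0.5 * phi(t - d)
    return 0.5 * phi(t) + 0.5 * (1.0 - phi(t - d))


def loo_estimate(x0, x1):
    """Leave-one-out error estimate, using running sums to drop one point at a time."""
    s0, s1, n0, n1 = sum(x0), sum(x1), len(x0), len(x1)
    wrong = 0
    for x in x0:  # a held-out class-0 point is an error if assigned to class 1
        wrong += predicts_one(x, (s0 - x) / (n0 - 1), s1 / n1)
    for x in x1:  # a held-out class-1 point is an error if assigned to class 0
        wrong += not predicts_one(x, s0 / n0, (s1 - x) / (n1 - 1))
    return wrong / (n0 + n1)


def rms_deviation(n, d, reps, rng):
    """RMS of (estimate - true error) over repeated samples with n/2 points per class."""
    total = 0.0
    for _ in range(reps):
        x0 = [rng.gauss(0.0, 1.0) for _ in range(n // 2)]
        x1 = [rng.gauss(d, 1.0) for _ in range(n // 2)]
        err = true_error(sum(x0) / len(x0), sum(x1) / len(x1), d)
        total += (loo_estimate(x0, x1) - err) ** 2
    return math.sqrt(total / reps)
```

Calling `rms_deviation(20, 1.5, 200, random.Random(0))` and then `rms_deviation(1000, 1.5, 200, random.Random(0))` compares the estimator's deviation from the true error at a small and a large sample size. The estimator is consistent in this model, yet consistency says nothing about how far the estimate sits from the truth at n = 20.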

Asymptopia
Edward Leamer: Two of the latest products-to-end-all-suffering are nonparametric estimation and consistent standard errors, which promise results without assumptions, as if we were already in Asymptopia where data are so plentiful that no assumptions are needed. ... By disguising the assumptions on which nonparametric methods and consistent standard errors rely, the purveyors of these methods have made it impossible to have an intelligible conversation about the circumstances in which their gimmicks do not work well and ought not to be used. As for me, I prefer to carry parameters on my journey so I know where I am and where I am going, not travel stoned on the latest euphoria drug.

Tackling Small Sample Problems
Ronald A. Fisher (1925): Little experience is sufficient to show that the traditional [large sample] machinery of statistical processes is wholly unsuited to the needs of practical research. Not only does it take a cannon to shoot a sparrow, but it misses the sparrow! ... The elaborate mechanism built on the theory of infinitely large samples is not accurate enough for simple laboratory data. Only by systematically tackling small sample problems on their merits does it seem possible to apply accurate tests to practical data.

Full Knowledge is in Sampling Distribution
Harald Cramér (1946): It is clear that a knowledge of the exact form of a sampling distribution would be of a far greater value than the knowledge of a number of moment characteristics or a limiting expression for large values of n. Especially when we are dealing with small samples, as is often the case in the applications, the asymptotic expressions are sometimes grossly inadequate, and a knowledge of the exact form of the distribution would then be highly desirable.

Humean Trap: Data Without Reason
Hume (Treatise of Human Nature): The mind is a kind of theatre, where several perceptions successively make their appearance; pass, repass, glide away, and mingle in an infinite variety of postures and situations. There is properly no simplicity in it at one time, nor identity in different [times].
A definition of radical empiricism.
Data mining.

Necessity of an Intelligent Idea
William Barrett (Illusion of Technique): The absence of an intelligent idea in the grasp of a problem cannot be redeemed by the elaborateness of the machinery one subsequently employs.

The Imprint of Mind
William Barrett (Illusion of Technique): The scientist's mind is not a passive mirror that reflects the facts as they are in themselves (whatever that might mean); the scientist constructs models, which are not found among the things given him in his experience, and proceeds to impose those models upon Nature. And he must often construct those models conceptually before they are translated at any point into the material constructions of his apparatus in the laboratory. The imprint of mind is everywhere on the body of this science, and without the founding power of mind it would not exist.

Radical Empiricism Denies Knowledge
Hans Reichenbach (Rise of Scientific Philosophy): A mere report of relations observed in the past cannot be called knowledge. If knowledge is to reveal objective relations of physical objects, it must include reliable predictions. A radical empiricism, therefore, denies the possibility of knowledge.
A collection of measurements, together with statements about the measurements, is not scientific knowledge.

A Huge Challenge
Janet Woodcock (Director, Center for Drug Evaluation and Research, FDA): [As much as 75 percent of published biomarker associations are not replicable.] This poses a huge challenge for industry in biomarker identification and diagnostics development.

Dougherty, E. R., Prudence, Risk, and Reproducibility in Biomarker Discovery, BioEssays, 34(4), 277-279, 2012.
Yousefi, M., and E. R. Dougherty, Performance Reproducibility Index for Classification, Bioinformatics, 28(21), 2824-2833, 2012.

Reporting Bias When Using Real Data
m data sets of size n, LDA with 10-fold cross-validation.
$\varepsilon_{\mathrm{est}}(0)$ and $\varepsilon_{\mathrm{true}}(0)$ are the estimated and true errors for the sample with the lowest error estimate, and $E[\varepsilon_{\mathrm{true}}]$ is the expected true error over all samples.
Left: $\varepsilon_{\mathrm{est}}(0) - \varepsilon_{\mathrm{true}}(0)$; right: $\varepsilon_{\mathrm{est}}(0) - E[\varepsilon_{\mathrm{true}}]$; n = 60, 120.

Yousefi, M. R., Hua, J., Sima, C., and E. R. Dougherty, Reporting Bias When Using Real Data Sets to Analyze Classification Performance, Bioinformatics, 26 (1), 68-76, 2010.

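Multiple-Rule Bias
Use r classification rules and s error estimation rules.
Select the pair with the minimum estimated error, $\varepsilon_{\min,\mathrm{est}}$ ...
$\mathrm{Bias}(m) = E[\varepsilon_{\min,\mathrm{est}} - \varepsilon_{\mathrm{true}}(i_{\min})]$, with the expectation taken over the sampling distribution; m = rs, n = 60.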

Yousefi, M. R., Hua, J., and E. R. Dougherty, Multiple-Rule Bias in the Comparison of Classification Rules, Bioinformatics, 27(12), 1675-1683, 2011.
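The mechanism behind this bias can be shown with a deliberately simplified simulation. The sketch below is not the experimental protocol of the cited paper: it assumes that each of the m candidates has a known true error and that its estimate is an independent, unbiased error count on n points (a binomial proportion). Real classification and estimation rules applied to one sample produce correlated estimates, so the numbers are only indicative. The function names are hypothetical.

```python
import random


def binomial(n, p, rng):
    """Number of successes in n independent trials with success probability p."""
    return sum(rng.random() < p for _ in range(n))


def selection_bias(true_errors, n, reps, rng):
    """Monte Carlo estimate of E[min estimated error - true error of the selected candidate]."""
    total = 0.0
    for _ in range(reps):
        # One unbiased estimate per candidate (assumed independent here).
        estimates = [binomial(n, e, rng) / n for e in true_errors]
        # Report the candidate with the smallest estimate, as in the slide.
        i_min = min(range(len(estimates)), key=estimates.__getitem__)
        total += estimates[i_min] - true_errors[i_min]
    return total / reps
```

With a single candidate, `selection_bias([0.2], 60, 2000, random.Random(0))` stays close to zero. With ten equally good candidates, `selection_bias([0.2] * 10, 60, 2000, random.Random(0))` comes out clearly negative, even though every individual estimate is unbiased: selecting the minimum is what introduces the optimism.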

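Reproducibility Performance Index
A preliminary study of size n is reproducible with accuracy $\rho \ge 0$ if $\varepsilon_n \le \varepsilon_n^{\mathrm{est}} + \rho$.
A follow-on study will be performed if $\varepsilon_n^{\mathrm{est}} \le \tau$.
$R_n(\rho, \tau) = P(\varepsilon_n \le \varepsilon_n^{\mathrm{est}} + \rho \mid \varepsilon_n^{\mathrm{est}} \le \tau)$.
Real data sets: LDA, n = 60, 5 features by t-test.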

Yousefi, M., and E. R. Dougherty, Performance Reproducibility Index for Classification, Bioinformatics, 28(21), 2824-2833, 2012.
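Given paired true and estimated errors from repeated studies (for instance, from a simulation in which the true error is computable), the index can be estimated as a conditional relative frequency. The sketch below assumes such pairs are already available; the function name and argument layout are hypothetical and are not taken from the cited paper.

```python
def reproducibility_index(true_errors, est_errors, rho, tau):
    """Empirical R_n(rho, tau): fraction of studies with true <= est + rho,
    among those whose estimate satisfies est <= tau (follow-on study performed)."""
    followed_up = [(t, e) for t, e in zip(true_errors, est_errors) if e <= tau]
    if not followed_up:
        return None  # conditioning event never observed; the index is undefined
    reproducible = sum(t <= e + rho for t, e in followed_up)
    return reproducible / len(followed_up)
```

For example, with true errors `[0.30, 0.25, 0.20, 0.40]`, estimates `[0.10, 0.24, 0.30, 0.05]`, `rho = 0.05` and `tau = 0.25`, three studies pass the follow-on threshold and only one of them is reproducible to within `rho`, so the function returns one third.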

Reproducibility with Reporting Bias
Reproducibility index for m = 5 data sets, LDA, 5F-CV, 5 features, Gaussian with equal covariance matrices, uncorrelated features.
(a) n = 60, $\rho$ = 0.0005; (b) n = 60, $\rho$ = 0.05; (c) n = 120, $\rho$ = 0.0005; (d) n = 120, $\rho$ = 0.05.

Separate Sampling: Classifier Error
Class sizes, n0 and n1, are pre-determined. Hence, there is no estimate for the class prior probability c = P(Y = 0).
Random sampling: $r = n_0/n \to c$ as $n \to \infty$ (in probability).
Fix c; expected error as a function of r (QDA).
Dark blue (c = 0.3), black (c = 0.4), light blue (c = 0.5), red (c = 0.6), green (c = 0.7).
r* (crossing point) is the minimax value.
Top: equal covariance; bottom: unequal covariance.

Esfahani, M. S., and E. R. Dougherty, Effect of Separate Sampling on Classification Accuracy, Bioinformatics,

Separate Sampling: Error Estimation
Class sizes, n0 and n1, are pre-determined.
Apply classical 5-fold cross-validation on the data set to estimate the error (dashed lines).
Apply separate-sampling 5-fold cross-validation (solid lines).
Fix c; bias as a function of r (L-SVM).
Dark blue (c = 0.3), black (c = 0.4), light blue (c = 0.5), red (c = 0.6), green (c = 0.7).
Top: n = 80; bottom: n = 1000.

Braga-Neto, U. M., Zollanvari, A., and E. R. Dougherty, Cross-Validation Under Separate Sampling: Optimistic Bias and How to Correct It,

Apparent Patterns in Microarray Data
[Figure labels: patterns; Relationship?]

What Does This Mean?
Data are clustered by some clustering algorithm.
Is there scientific knowledge here?

Clustering Algorithm
An algorithm that partitions a set of points into several groups, based on a measure of similarity (or dissimilarity) between the points.
Example: [figure]

Expression Profile Clustering
Cluster expression vectors: clusters indicate potential co-regulation in time-course data analysis.
Cluster samples: clusters indicate potentially similar sources, a sort of classification.
Methods:
Fuzzy c-means
K-means
S.O.M.
Hierarchical clustering (Euclidean distance)
Hierarchical clustering (correlation)

K-means Clustering
Goal: Partition points into tight clusters.

Algorithm:
(1) Randomly initialize with k means $m_1, \ldots, m_k$.
(2) Place x into $C_i$ if $\|x - m_i\| \le \|x - m_j\|$ for $j = 1, \ldots, k$.
(3) Update $m_1, \ldots, m_k$ as the means of $C_1, \ldots, C_k$.
(4) Repeat until the means do not change.
Clusters are determined by the Voronoi diagram of $m_1, \ldots, m_k$.
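The steps of the K-means slide can be sketched in a few lines. This is a minimal illustration rather than a reference implementation; it assumes points are tuples of numbers, that the initial means are k distinct data points drawn at random, and that a cluster that becomes empty simply keeps its previous mean. None of these choices is specified in the slides, and the function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign(points, means):
    """Label each point with the index of its nearest mean (its Voronoi cell)."""
    return [min(range(len(means)), key=lambda i: sq_dist(x, means[i])) for x in points]


def k_means(points, k, rng, max_iter=100):
    """Alternate assignment and mean updates until the means stop changing."""
    means = rng.sample(points, k)  # random initialization from the data (an assumption)
    for _ in range(max_iter):
        labels = assign(points, means)
        new_means = []
        for i in range(k):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:
                new_means.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_means.append(means[i])  # keep the mean of an empty cluster unchanged
        if new_means == means:
            break
        means = new_means
    return assign(points, means), means
```

On two tight, well-separated groups such as `[(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]` with `k = 2`, the procedure recovers the two groups. Nothing in the algorithm says whether two clusters is the right question to ask of the data, which is the point the surrounding slides press.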

Hierarchical Clustering
Iteratively join clusters based on a similarity measure (agglomerative clustering).
Farthest-neighbor similarity measure:

$d(C_i, C_j) = \max \{ \|x - y\| : x \in C_i,\ y \in C_j \}$

Algorithm (complete linkage clustering):
(1) Initialize clusters by $C_i = \{x_i\}$ for $i = 1, \ldots, n$.
(2) Iteratively merge the two clusters for which the greatest distance between points in the two clusters is minimized.
(3) Halt when the similarity measure exceeds a pre-defined threshold.
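A direct, unoptimized sketch of complete linkage as described on the slide follows. It assumes Euclidean distance on tuples and recomputes all inter-cluster distances at every merge, which is adequate for illustration but far too slow for large data sets. The function name and the choice to return clusters as lists of point indices are hypothetical.

```python
import math


def complete_linkage(points, threshold):
    """Agglomerative clustering with the farthest-neighbor distance.

    Returns a list of clusters, each a list of indices into points."""
    clusters = [[i] for i in range(len(points))]  # every point starts as its own cluster

    def farthest(a, b):
        # d(Ci, Cj): greatest distance between a point of Ci and a point of Cj
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the pair of clusters whose farthest-neighbor distance is smallest.
        dist, a, b = min(
            (farthest(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break  # halt: the best available merge already exceeds the threshold
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

The threshold plays the role of the cut height in a dendrogram. Changing it changes the number of clusters returned, and the algorithm itself offers no criterion for choosing it.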

Hierarchical Clustering Example
A. cholesterol biosynthesis
B. cell cycle
C. immediate-early response
D. signaling and angiogenesis
E. wound healing and tissue remodeling

Source: Michael B. Eisen, et al., PNAS 1998, Vol.95

The Clustering Problem
Jain et al.: Clustering is a subjective process; the same set of data items often needs to be partitioned differently for different applications.

Jain, A. K., Murty, M. N., and P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys, 31 (3), 264-323, 1999.

Solution
Mathematical theory
Pattern recognition theory and random set theory

What Are Good Clusters?
Example: 2 or 3 clusters? What is the best separation?

Naive Clustering Error
Generate a set of points from different distributions: $A_1, \ldots, A_k$.
Use a clustering algorithm to form clusters: $C_1, \ldots, C_k$.
Align point sets and clusters, and count errors.
Average over a number of randomly generated sets.

Dougherty, E. R., Barrera, J., Brun, M., Kim, S., Cesar, R. M., Chen, Y., Bittner, M. L., and J. M. Trent, Inference From Clustering with Application to Gene-Expression Microarrays, Journal of Computational Biology, 9 (1), 105-126, 2002.

Synthetic Example
5 synthetic templates
Simulated data from the templates
Different variances
5 different clustering methods

Single Experiment ($\sigma^2$ = 0.25)
No error!
Tighter clusters due to small variance
Results from fuzzy c-means

Experiment ($\sigma^2$ = 3.0)
Many misclassifications
Clusters start mixing
22 misclassifications (8.8%)

Hierarchical Clustering Error!!!
Before clustering
After clustering with a NICE dendrogram
24.5% Error!!
Algorithm: Hierarchical clustering with correlation measure

Clustering Error
Points are a realization S of a labeled random point process.
The clustering algorithm assigns a label function to S.
The error of the algorithm is the expected difference between its labels and the labels generated by the point process.
The error must take into account that we do not care about the ordering of the labels, only the partitions generated.
The expectation is taken with respect to the distribution of the point process.
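For a single realization, the requirement that only the partition matters can be met by minimizing the mismatch count over all relabelings of the clusters. The sketch below assumes labels are integers in 0, ..., k-1 and tries every permutation, which is practical only for small k. The function name is hypothetical. Averaging this quantity over many realizations of the point process gives a Monte Carlo estimate of the expected error.

```python
from itertools import permutations


def clustering_error(true_labels, cluster_labels, k):
    """Fraction of mislabeled points for one realization, minimized over
    all k! ways of matching cluster labels to the process labels."""
    n = len(true_labels)
    best = n
    for perm in permutations(range(k)):
        # perm[c] is the process label that cluster c is matched to.
        mismatches = sum(perm[c] != t for t, c in zip(true_labels, cluster_labels))
        best = min(best, mismatches)
    return best / n
```

For instance, `clustering_error([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0], 2)` is zero because the two labelings define the same partition, while `clustering_error([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0], 2)` is one sixth.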

Example of Clustering Error
Left: Realization of point process
Right: Output of hierarchical clustering
Error: 40%

Clustering Validity
Clustering validity is analogous to classification validity.

Replace classifier with cluster operator and classification error with clustering error.

Validation Indices
Validation indices are meant to judge the validity of a clustering output.
They can be based on a number of heuristic considerations and methodologies.
Do they correspond to scientific validity?
Does a validation index correlate with clustering error?

Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., and E. R. Dougherty, Model-Based Evaluation of Clustering Validation Measures, Pattern Recognition, 40 (3), 807-824, 2007.

Kendall's Correlation for Indices
Top: Realization of point process
Bottom: Kendall's correlation: Dunn's index, D correl, silhouette, figure of merit
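Whether an index tracks clustering error can be checked with Kendall's rank correlation between index values and errors computed over a collection of clustering outputs. The sketch below implements the simplest form (tau-a, in which tied pairs count as neither concordant nor discordant). This choice is an assumption, and the cited study may use a different variant. The function name is hypothetical.

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)  # +1 concordant, -1 discordant, 0 tie
    return score / (n * (n - 1) / 2)
```

If larger index values are meant to signal better clusterings, a useful index should give a correlation with error close to -1. A value near 0 means the index ranks clustering outputs essentially independently of how wrong they are.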

Scientific Knowledge
Requires a mathematical model.
In classification, the model is learned from training data.
Requires a methodology to test the model.
Can inferences be made from the model?

Classification and Knowledge

The model is composed of a classifier (decision function) and an error: a data point is observed and it is assigned to a class.

The model is inferred from data by classification and error-estimation rules.

Model validity is determined by properties of the error estimation rule.

Probabilistic Theory of Clustering
Clustering theory in the context of random sets.

Probabilistic error measure based on points being clustered correctly.

Bayes clusterer (optimal clustering algorithm).

Learning theory for clustering algorithms.

Dougherty, E. R., and M. Brun, A Probabilistic Theory of Clustering, Pattern Recognition, 37 (5), 917-925, 2004.

Data Mining Violates Basic Principles
Data mining violates two basic principles of experimental design: (1) constrain the variables so that the experiment is only minimally affected by external conditions and the results elucidate clear mathematically describable behavior; and (2) all modeling is done within a rigorous statistical setting in which both constraints and the sampling distribution are clearly expressed.

What Data Mining Has Produced
Absent a sound epistemology there is no ground of knowledge and therefore no knowledge.
There are thousands of papers in the literature for which there is no demonstration of any meaning at all.
This has several serious consequences:
Huge waste of resources.
Literature is untrustworthy and much of it is useless.
Propagation of meaningless results on meaningless results.
Lack of progress on consequential problems.

Is Data Mining a Serious Scientific Endeavor?
Dougherty and Bittner (Epistemology of the Cell): Does anyone really believe that data mining could produce the general theory of relativity?
