TRANSCRIPT

Page 1

Multi-Paradigm Data Science
On the many dimensions of Knowledge Discovery

Data Natives, Berlin, November 17th, 2017

Dr. Kai Gansel
ADDITIVE GmbH

[email protected]

Page 2

Dimensions of Knowledge Discovery I and II: Data

[Quadrant chart. Axes: data volume (not-so-big data vs. big data) and data dimensionality (low-dimensional vs. high-dimensional). Quadrants are labeled Statistics & Modeling, Data Mining, Cluster/PC, and ML & NN.]

Page 3

Dimensions of Knowledge Discovery III and IV: Approach

[Quadrant chart. Axes: question type (fuzzy question vs. exact question) and data type (fuzzy data vs. exact data). Regions are labeled Statistics, Data Mining, and ML & NN.]

Page 4

Dimensions of Knowledge Discovery V: Goal

[Spectrum chart ranging from Understanding (- Science -) to Prediction (- Engineering -). Data Mining, Statistics, Machine Learning and Neural Networks are arranged along this axis, with Modeling appearing on both sides.]

Page 5

Example: Statistics

[Figure: Role of genetic variants in health and disease (Kuehn, 2016).]

Page 6

[Figure: Correlation of SNPs with schizophrenic phenotypes (Lencz et al., 2013).]

Page 7

Special Topic: Higher-order correlations

Definition (Schneider & Grün, 2003)

An observed correlation between items or events is called genuine if it cannot be explained by correlations of lower order, i.e. by a random superposition of any of its constituent parts.

Meaning

Genuine higher-order patterns are based on non-random, interacting processes and reflect the correlational structure of these processes. The appearance of such patterns may provide insights into their hidden causes.

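A minimal illustration of this idea (not from the slides): inject coincident triple events into three otherwise independent binary event trains, then compare the observed number of triple coincidences against surrogates in which each train is shuffled independently. Note that shuffling destroys all correlation structure, whereas the method of Schneider & Grün tests specifically against lower-order correlations, so this is only a sketch of the general logic:

SeedRandom[1];
n = 10000;
{a, b, c} = Table[RandomChoice[{0.9, 0.1} → {0, 1}, n], {3}];

(* inject 50 genuine triple events at common positions *)
inject = RandomSample[Range[n], 50];
{a, b, c} = ReplacePart[#, Thread[inject → 1]] & /@ {a, b, c};

(* observed number of triple coincidences *)
observed = Total[a b c];

(* surrogates preserve the event rates but destroy the correlations *)
surrogates = Table[Total[RandomSample[a] RandomSample[b] RandomSample[c]], {1000}];

{observed, Mean[N[surrogates]], Quantile[surrogates, 0.99]}

If the observed count clearly exceeds the upper quantile of the surrogate distribution, the triple pattern cannot be attributed to a random superposition of the individual trains.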

Page 8

General task

W = region defining one data point
τ = class / feature / quality

Application areas: visited websites, market basket analysis... you name it!

The problem

Combinatorial explosion of the number of candidate patterns and tests with an increasing number of dimensions:

n = 20; 2^n - n - 1

1048555
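Tabulating 2^n - n - 1 (all subsets of n items minus the singletons and the empty set) for growing n makes the explosion explicit:

TableForm[Table[{n, 2^n - n - 1}, {n, 10, 50, 10}], TableHeadings → {None, {"n", "patterns"}}]

10   1013
20   1048555
30   1073741793
40   1099511627735
50   1125899906842573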


Page 9

Reducing the complexity of data: DimensionReduce

Advantages of dimensionality reduction:

◼ It reduces the time and storage space required.

◼ Removal of multi-collinearity can improve the performance of machine learning models.

◼ It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D.

Here are some multi-dimensional example data:

data = Import[NotebookDirectory[] <> "Example.dat"];

Rearrange example data to represent individual measurements.

Structure of the data:

ListPlot[Tally[First /@ data], PlotRange → All, Filling → Axis, AxesLabel → {"Sort ID", "Number of measurements"}]

[Plot: number of measurements per Sort ID.]

Page 10

ListLinePlot[data[[346, 2]], PlotRange → All, AxesLabel → {"Mass ID", "Value"}, Epilog → Text["Measurement 346\nSort ID: " <> ToString[data[[346, 1]]], {14000, 6000}]]

[Plot: Value vs. Mass ID for measurement 346 (Sort ID: 55).]

Dimensions[Transpose[data][[2]]]

{346, 25780}

Project the data onto a 3-dimensional subspace:

data3D = DimensionReduce[Transpose[data][[2]], 3];

(* load a precomputed result in place of the computation above *)
data3D = Get[NotebookDirectory[] <> "Data3D.txt"];

ListPlot[data3D[[346]], PlotRange → All, AxesLabel → {"Component", "Value"}, Filling → Axis,
  Epilog → Text["Measurement 346\nSort ID: " <> ToString[data[[346, 1]]], {2, 20}]]

[Plot: the three reduced components of measurement 346 (Sort ID: 55), with values between about -20 and 20.]

Dimensions[data3D]

{346, 3}


Page 11

ListPointPlot3D[data3D]


Page 12

Clustering and classifying data: ClusterClassify

ClusterClassify automatically determines the number of clusters and classifies the data accordingly:

Manipulate[
 With[{CC = ClusterClassify[data3D, Method → method][data3D]},
  ListPointPlot3D[Map[Last, GatherBy[Transpose[{CC, data3D}], First], {2}],
   ImageSize → 500, PlotLegends → SwatchLegend[Union[CC], LegendLabel → "Cluster ID",
     LegendFunction → Panel, LegendMarkers → "SphereBubble"]]],
 {method, {"GaussianMixture", "DBSCAN", "MeanShift", "Agglomerate", "NeighborhoodContraction"}},
 SaveDefinitions → True]

[Interactive 3D scatter plot of the clustered points, with a method selector and a legend showing Cluster IDs 1-3.]
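For a quick non-interactive check (not part of the original demo), one can let ClusterClassify choose everything automatically and simply count the cluster sizes:

cc = ClusterClassify[data3D];
Counts[cc[data3D]]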


Page 13

Classifying data: Classify

Rock-paper-scissors

Click Reset. Hold up a fist in front of the camera and click Rock. Change your hand to paper as you click the Paper button; do the same for Scissors. Capture 10-12 images of each. Click Stop when you are done. Click Train and wait. Then click Watch, hold up some rock-paper-scissors gestures, and it should recognize what you are doing.

[Interactive capture interface: live camera preview, a training-data counter, and buttons Rock / Paper / Scissors / Watch / Stop.]
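The interface above was assembled live; a minimal non-interactive sketch of the same idea, assuming three hypothetical lists of captured frames rockImgs, paperImgs and scissorsImgs, could look like this:

(* capture a single frame from the webcam; repeat per gesture to build the lists *)
frame = CurrentImage[];

(* train on the labeled example images (the three lists are assumed to exist) *)
gestures = Classify[<|"Rock" → rockImgs, "Paper" → paperImgs, "Scissors" → scissorsImgs|>];

(* classify a fresh frame from the camera *)
gestures[CurrentImage[]]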


Page 14

Find the optimal parameters of a classifier

Load a dataset and split it into a training set and a test set.

data = RandomSample[ExampleData[{"MachineLearning", "Titanic"}, "Data"]];
training = data[[ ;; 1000]];
test = data[[1001 ;;]];

Define a function computing the performance of a classifier as a function of its (hyper)parameters.

loss[{c_, gamma_, b_, d_}] := -ClassifierMeasurements[
    Classify[training, Method → {"SupportVectorMachine", "KernelType" → "Polynomial",
      "SoftMarginParameter" → Exp[c], "GammaScalingParameter" → Exp[gamma],
      "BiasParameter" → Exp[b], "PolynomialDegree" → d}], test, "LogLikelihoodRate"];

Define the possible values of the parameters.

region = ImplicitRegion[And[-3. ≤ c ≤ 3., -3. ≤ gamma ≤ 3., -1. ≤ b ≤ 2., 1 ≤ d ≤ 3, d ∈ Integers], {c, gamma, b, d}]

Search for a good set of parameters.

bmo = BayesianMinimization[loss, region]

bmo["MinimumConfiguration"]

Train a classifier with these parameters.

Classify[training, Method → {"SupportVectorMachine", "KernelType" → "Polynomial",
  "SoftMarginParameter" → Exp[2.979837222482109`],
  "GammaScalingParameter" → Exp[-2.1506497693543025`],
  "BiasParameter" → Exp[-0.9038364134482837`], "PolynomialDegree" → 2}]

ClassifierMeasurements[%, test, "Accuracy"]
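As a baseline (not on the slide), the tuned classifier can be compared against Classify's fully automatic method selection:

ClassifierMeasurements[Classify[training], test, "Accuracy"]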


Page 15

Neural Networks: Digit classification

Use the MNIST database of handwritten digits to train a convolutional network to predict the digit given an image.

First obtain the training and validation data.

resource = ResourceObject["MNIST"];
trainingData = ResourceData[resource, "TrainingData"];
testData = ResourceData[resource, "TestData"];

RandomSample[trainingData, 5]

Define a convolutional neural network that takes in 28×28 grayscale images as input.

lenet = NetChain[{
    ConvolutionLayer[20, 5], Ramp, PoolingLayer[2, 2],
    ConvolutionLayer[50, 5], Ramp, PoolingLayer[2, 2],
    FlattenLayer[], 500, Ramp, 10, SoftmaxLayer[]},
   "Output" → NetDecoder[{"Class", Range[0, 9]}],
   "Input" → NetEncoder[{"Image", {28, 28}, "Grayscale"}]]

[NetChain summary: eleven layers mapping a 28×28 grayscale image to one of the classes 0-9.]

Train the network for one training round.

lenet = NetTrain[lenet, trainingData, ValidationSet → testData, MaxTrainingRounds → 1];

Evaluate the trained network directly on images randomly sampled from the validation set.

imgs = Keys@RandomSample[testData, 5];
Thread[imgs → lenet[imgs]]

[Five test images with predicted labels 4, 0, 6, 7, 2.]


Page 16

Create a ClassifierMeasurements object from the trained network and the validation set.

cm = ClassifierMeasurements[lenet, testData]

[ClassifierMeasurementsObject summary.]

Obtain the accuracy of the network on the validation set.

cm["Accuracy"]

0.9801

Obtain a plot of the confusion matrix of the network predictions on the validation set.

cm["ConfusionMatrixPlot"]

[Confusion matrix plot of actual vs. predicted digit classes 0-9; the diagonal dominates, e.g. 963 of 980 actual zeros are predicted correctly.]

Page 17

Neural Networks: Unsupervised learning with autoencoders

Train an autoencoder network to reconstruct images of handwritten digits after projecting them to a lower-dimensional “code” vector space. Use these code vectors to perform clustering and visualization.

First obtain the training data, then select images corresponding to digits 0 through 4.

resource = ResourceObject["MNIST"];
trainingData = ResourceData[resource, "TrainingData"];
trainingSubset = Select[trainingData, Last[#] ≤ 4 &];
testData = ResourceData[resource, "TestData"];
testSubset = Select[testData, Last[#] ≤ 4 &];
RandomSample[trainingSubset, 8]

[Eight sample training images with labels 1, 3, 0, 0, 4, 2, 1, 4.]

Obtain the “mean image” to subtract from the training data.

trainingImages = Keys[trainingSubset];
meanImage = Image[Mean@Map[ImageData, trainingImages]]

Create a network to train that produces both the reconstruction and the reconstruction error.


Page 18

net = NetGraph[
  {FlattenLayer[], 50, Ramp, 784, Tanh, ReshapeLayer[{1, 28, 28}], MeanSquaredLossLayer[]},
  {1 → 2 → 3 → 4 → 5 → 6 → NetPort["Output"], 6 → NetPort[7, "Input"],
   NetPort["Input"] → NetPort[7, "Target"]},
  "Input" → NetEncoder[{"Image", {28, 28}, "Grayscale", "MeanImage" → meanImage}],
  "Output" → NetDecoder[{"Image", "Grayscale"}]]

[NetGraph summary: FlattenLayer → LinearLayer(50) → Ramp → LinearLayer(784) → Tanh → ReshapeLayer(1×28×28) feeds the Output port and a MeanSquaredLossLayer comparing the reconstruction with the input.]

Train the network to minimize the reconstruction error.

trained4 = NetTrain[net, <|"Input" → trainingImages|>, "Loss"];

Obtain a subnetwork that performs only reconstruction.

reconstructor = Take[trained4, {NetPort["Input"], NetPort["Output"]}]

[NetGraph summary of the reconstruction subnetwork (layers 1-6).]

Reconstruct some sample images.

ImageAdd[reconstructor[#], meanImage] & /@ {…}  (* applied to sample digit images *)

[Row of reconstructed sample images.]

Obtain a subnetwork that produces the code vector.


Page 19

encoder = Take[trained4, {NetPort["Input"], 4}]

[NetGraph summary of the encoder subnetwork (layers 1-4).]

Compute codes for all of the test images.

testImages = Keys[testSubset];
features = encoder[testImages];

Project the code vectors to three dimensions and visualize them along with the original classes (not seen by the network). The digit classes tend to cluster together.

coords = DimensionReduce[features, 3];
classes = Values[testSubset];
ListPointPlot3D[Table[Extract[coords, Position[classes, i]], {i, 0, 4}],
  PlotLegends → PointLegend[96, Range[0, 4]], BoxRatios → 1, Axes → None, Boxed → True,
  PlotStyle → Map[ColorData[96], Range[1, 5]], AspectRatio → 1]

[3D scatter plot of the code vectors, colored by digit class 0-4; the classes form distinct clusters.]


Page 20

Visualize a hierarchical clustering of random representatives from each class.

representatives = Catenate@GroupBy[testSubset, Last → First, RandomSample[#, 6] &];
ClusteringTree[encoder[representatives] → Map[ImageCrop, representatives]]


Page 21

Neural Networks: Avoid overfitting using a hold-out set

Use the ValidationSet option of NetTrain to ensure that the trained net does not overfit the input data. The validation data is commonly referred to as a test or hold-out set.

Create synthetic training data based on a Gaussian curve.

data = Table[x → Exp[-x^2] + RandomVariate[NormalDistribution[0, .15]], {x, -3, 3, .2}];

plot = ListPlot[List @@@ data, PlotStyle → Red]

[Scatter plot of the noisy training data, x from -3 to 3, y roughly from -0.2 to 1.0.]

Train a net with a large number of parameters relative to the amount of training data.

net = NetChain[{150, Tanh, 150, Tanh, 1}, "Input" → "Scalar", "Output" → "Scalar"];
net1 = NetTrain[net, data, Method → "ADAM"]

[NetChain summary: scalar input, two 150-unit Tanh layers, scalar output.]

The resulting net overfits the data, learning the noise in addition to the underlying function.


Page 22

Show[Plot[net1[x], {x, -3, 3}], plot]

[Plot: net1's prediction curve overlaid on the data; it chases the individual noisy points.]

Subdivide the data into a training set and a hold-out validation set.

data = RandomSample[data];
{train, test} = TakeDrop[data, 24];

Use the ValidationSet option to have NetTrain select the net that achieved the lowest validation loss during training.

net2 = NetTrain[net, train, ValidationSet → test]

[NetChain summary, identical architecture to the one above.]

The result returned by NetTrain was the net that generalized best to points in the validation set, as measured by validation loss. This penalizes overfitting, as the noise present in the training data is uncorrelated with the noise present in the validation set.


Page 23

Show[Plot[net2[x], {x, -3, 3}], plot]

[Plot: net2's prediction curve overlaid on the data; it follows the underlying Gaussian shape rather than the noise.]


Page 24

Model-based Prediction

Train a Gaussian process predictor on a simple dataset.

data = {-1.2 → 1.2, 1.0 → 1.4, 2.2 → 1.6, 3.1 → 1.8, 4.5 → 1.6};
p = Predict[data, Method → "GaussianProcess"]
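The resulting PredictorFunction can be queried directly, either for a point prediction or for the full predictive distribution at a given input:

p[2.0]

p[2.0, "Distribution"]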

Visualize the predicted values along with a confidence interval.

Show[Plot[{p[x], p[x] + StandardDeviation[p[x, "Distribution"]], p[x] - StandardDeviation[p[x, "Distribution"]]},
   {x, -2, 6}, PlotStyle → {Blue, Gray, Gray}, Filling → {2 → {3}}, Exclusions → False,
   PerformanceGoal → "Speed", PlotLegends → {"Prediction", "Confidence Interval"}],
  ListPlot[List @@@ data, PlotStyle → Red, PlotLegends → {"Data"}]]

[Plot: Gaussian process prediction curve with a gray one-standard-deviation confidence band and the red data points.]

Page 25

Dealing with complexity: Graph analysis

Import a DIMACS file:

gr = Import["http://mat.gsia.cmu.edu/COLOR/instances/homer.col", "DIMACS"]

Get the metadata:

Import["http://mat.gsia.cmu.edu/COLOR/instances/homer.col", "Elements"]

{AdjacencyMatrix, EdgeRules, Graph, Graphics, VertexCount}

Edge rules:

rules = Import["http://mat.gsia.cmu.edu/COLOR/instances/homer.col", "EdgeRules"]

Most frequently occurring character:

Commonest[Flatten[List @@@ rules]]

{452}
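Since every occurrence of a vertex in the edge rules corresponds to one incident edge, ranking vertices by VertexDegree is a quick cross-check (not on the original slide) and should single out the same vertex, 452:

VertexList[gr][[First@Ordering[VertexDegree[gr], -1]]]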

Achilles neighborhood:


Page 26

NeighborhoodGraph[gr, 452]

Highlight the subgraph:


Page 27

HighlightGraph[gr, NeighborhoodGraph[gr, 452], ImageSize → Large]


Page 28

Conclusion

◼ Don't restrict yourself to any particular approach or method without need!

◼ Don’t imply the answer when defining a question!

◼ Stay curious!


Page 29

Thanks for listening!

For questions and suggestions, contact [email protected]
http://www.additive-mathematica.de

ADDITIVE Soft- und Hardware für Technik und Wissenschaft GmbH
Max-Planck-Straße 22b, 61381 Friedrichsdorf
Sales: 06172 - 5905 - 30 // [email protected]
06172 - 5905 - 90 // [email protected]
06172 - 5905 - 20 // [email protected]
