TRANSCRIPT

Page 1

Multi-Paradigm Data Science
On the many dimensions of Knowledge Discovery

Data Natives, Berlin, November 17th, 2017

Dr. Kai Gansel
ADDITIVE GmbH

[email protected]

Page 2

Dimensions of Knowledge Discovery I and II: Data

[Quadrant chart. Axes: data volume (not-so-big data vs. big data) and data dimensionality (low-dimensional vs. high-dimensional). Quadrants are labeled Statistics & Modeling, Data Mining, Cluster/PC, and ML & NN.]

Page 3

Dimensions of Knowledge Discovery III and IV: Approach

[Quadrant chart. Axes: question type (fuzzy question vs. exact question) and data type (fuzzy data vs. exact data). Regions are labeled Statistics, Data Mining, and ML & NN.]

Page 4

Dimensions of Knowledge Discovery V: Goal

[Spectrum chart ranging from Understanding (- Science -) to Prediction (- Engineering -). Data Mining, Statistics, Machine Learning and Neural Networks are arranged along this axis, with Modeling appearing on both sides.]

Page 5

Example: Statistics

[Figure: Role of genetic variants in health and disease (Kuehn, 2016).]

Page 6

[Figure: Correlation of SNPs with schizophrenic phenotypes (Lencz et al., 2013).]

Page 7

Special Topic: Higher-order correlations

Definition (Schneider & Grün, 2003)

An observed correlation between items or events is called genuine if it cannot be explained by correlations of lower order, i.e. by a random superposition of any of its constituent parts.

Meaning

Genuine higher-order patterns are based on non-random, interacting processes and reflect the correlational structure of these processes. The appearance of such patterns may provide insights into their hidden causes.

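A minimal illustration of this idea (not from the slides): inject coincident triple events into three otherwise independent binary event trains, then compare the observed number of triple coincidences against surrogates in which each train is shuffled independently. Note that shuffling destroys all correlation structure, whereas the method of Schneider & Grün tests specifically against lower-order correlations, so this is only a sketch of the general logic:

SeedRandom[1];
n = 10000;
{a, b, c} = Table[RandomChoice[{0.9, 0.1} → {0, 1}, n], {3}];

(* inject 50 genuine triple events at common positions *)
inject = RandomSample[Range[n], 50];
{a, b, c} = ReplacePart[#, Thread[inject → 1]] & /@ {a, b, c};

(* observed number of triple coincidences *)
observed = Total[a b c];

(* surrogates preserve the event rates but destroy the correlations *)
surrogates = Table[Total[RandomSample[a] RandomSample[b] RandomSample[c]], {1000}];

{observed, Mean[N[surrogates]], Quantile[surrogates, 0.99]}

If the observed count clearly exceeds the upper quantile of the surrogate distribution, the triple pattern cannot be attributed to a random superposition of the individual trains.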

Page 8

General task

W = region defining one data point
τ = class / feature / quality

Application areas: visited websites, market basket analysis... you name it!

The problem

Combinatorial explosion of the number of candidate patterns and tests with an increasing number of dimensions:

n = 20; 2^n - n - 1

1048555
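Tabulating 2^n - n - 1 (all subsets of n items minus the singletons and the empty set) for growing n makes the explosion explicit:

TableForm[Table[{n, 2^n - n - 1}, {n, 10, 50, 10}], TableHeadings → {None, {"n", "patterns"}}]

10   1013
20   1048555
30   1073741793
40   1099511627735
50   1125899906842573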


Page 9

Reducing the complexity of data: DimensionReduce

Advantages of dimensionality reduction:

◼ It reduces the time and storage space required.

◼ Removal of multi-collinearity can improve the performance of machine learning models.

◼ It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D.

Here are some multi-dimensional example data:

data = Import[NotebookDirectory[] <> "Example.dat"];

Rearrange example data to represent individual measurements.

Structure of the data:

ListPlot[Tally[First /@ data], PlotRange → All, Filling → Axis, AxesLabel → {"Sort ID", "Number of measurements"}]

[Plot: number of measurements per Sort ID.]

Page 10

ListLinePlot[data[[346, 2]], PlotRange → All, AxesLabel → {"Mass ID", "Value"}, Epilog → Text["Measurement 346\nSort ID: " <> ToString[data[[346, 1]]], {14000, 6000}]]

[Plot: Value vs. Mass ID for measurement 346 (Sort ID: 55).]

Dimensions[Transpose[data][[2]]]

{346, 25780}

Project the data onto a 3-dimensional subspace:

data3D = DimensionReduce[Transpose[data][[2]], 3];

(* load a precomputed result in place of the computation above *)
data3D = Get[NotebookDirectory[] <> "Data3D.txt"];

ListPlot[data3D[[346]], PlotRange → All, AxesLabel → {"Component", "Value"}, Filling → Axis,
  Epilog → Text["Measurement 346\nSort ID: " <> ToString[data[[346, 1]]], {2, 20}]]

[Plot: the three reduced components of measurement 346 (Sort ID: 55), with values between about -20 and 20.]

Dimensions[data3D]

{346, 3}


Page 11

ListPointPlot3D[data3D]


Page 12

Clustering and classifying data: ClusterClassify

ClusterClassify automatically determines the number of clusters and classifies the data accordingly:

Manipulate[
 With[{CC = ClusterClassify[data3D, Method → method][data3D]},
  ListPointPlot3D[Map[Last, GatherBy[Transpose[{CC, data3D}], First], {2}],
   ImageSize → 500, PlotLegends → SwatchLegend[Union[CC], LegendLabel → "Cluster ID",
     LegendFunction → Panel, LegendMarkers → "SphereBubble"]]],
 {method, {"GaussianMixture", "DBSCAN", "MeanShift", "Agglomerate", "NeighborhoodContraction"}},
 SaveDefinitions → True]

[Interactive 3D scatter plot of the clustered points, with a method selector and a legend showing Cluster IDs 1-3.]
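For a quick non-interactive check (not part of the original demo), one can let ClusterClassify choose everything automatically and simply count the cluster sizes:

cc = ClusterClassify[data3D];
Counts[cc[data3D]]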


Page 13

Classifying data: Classify

Rock-paper-scissors

Click Reset. Hold up a fist in front of the camera and click Rock. Change your hand to paper as you click the Paper button; do the same for Scissors. Capture 10-12 images of each. Click Stop when you are done. Click Train and wait. Then click Watch, hold up some rock-paper-scissors gestures, and it should recognize what you are doing.

[Interactive capture interface: live camera preview, a training-data counter, and buttons Rock / Paper / Scissors / Watch / Stop.]
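The interface above was assembled live; a minimal non-interactive sketch of the same idea, assuming three hypothetical lists of captured frames rockImgs, paperImgs and scissorsImgs, could look like this:

(* capture a single frame from the webcam; repeat per gesture to build the lists *)
frame = CurrentImage[];

(* train on the labeled example images (the three lists are assumed to exist) *)
gestures = Classify[<|"Rock" → rockImgs, "Paper" → paperImgs, "Scissors" → scissorsImgs|>];

(* classify a fresh frame from the camera *)
gestures[CurrentImage[]]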


Page 14

Find the optimal parameters of a classifier

Load a dataset and split it into a training set and a test set.

data = RandomSample[ExampleData[{"MachineLearning", "Titanic"}, "Data"]];
training = data[[ ;; 1000]];
test = data[[1001 ;;]];

Define a function computing the performance of a classifier as a function of its (hyper)parameters.

loss[{c_, gamma_, b_, d_}] := -ClassifierMeasurements[
    Classify[training, Method → {"SupportVectorMachine", "KernelType" → "Polynomial",
      "SoftMarginParameter" → Exp[c], "GammaScalingParameter" → Exp[gamma],
      "BiasParameter" → Exp[b], "PolynomialDegree" → d}], test, "LogLikelihoodRate"];

Define the possible values of the parameters.

region = ImplicitRegion[And[-3. ≤ c ≤ 3., -3. ≤ gamma ≤ 3., -1. ≤ b ≤ 2., 1 ≤ d ≤ 3, d ∈ Integers], {c, gamma, b, d}]

Search for a good set of parameters.

bmo = BayesianMinimization[loss, region]

bmo["MinimumConfiguration"]

Train a classifier with these parameters.

Classify[training, Method → {"SupportVectorMachine", "KernelType" → "Polynomial",
  "SoftMarginParameter" → Exp[2.979837222482109`],
  "GammaScalingParameter" → Exp[-2.1506497693543025`],
  "BiasParameter" → Exp[-0.9038364134482837`], "PolynomialDegree" → 2}]

ClassifierMeasurements[%, test, "Accuracy"]
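As a baseline (not on the slide), the tuned classifier can be compared against Classify's fully automatic method selection:

ClassifierMeasurements[Classify[training], test, "Accuracy"]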


Page 15

Neural Networks: Digit classification

Use the MNIST database of handwritten digits to train a convolutional network to predict the digit given an image.

First obtain the training and validation data.

resource = ResourceObject["MNIST"];
trainingData = ResourceData[resource, "TrainingData"];
testData = ResourceData[resource, "TestData"];

RandomSample[trainingData, 5]

Define a convolutional neural network that takes in 28×28 grayscale images as input.

lenet = NetChain[{
    ConvolutionLayer[20, 5], Ramp, PoolingLayer[2, 2],
    ConvolutionLayer[50, 5], Ramp, PoolingLayer[2, 2],
    FlattenLayer[], 500, Ramp, 10, SoftmaxLayer[]},
   "Output" → NetDecoder[{"Class", Range[0, 9]}],
   "Input" → NetEncoder[{"Image", {28, 28}, "Grayscale"}]]

[NetChain summary: eleven layers mapping a 28×28 grayscale image to one of the classes 0-9.]

Train the network for one training round.

lenet = NetTrain[lenet, trainingData, ValidationSet → testData, MaxTrainingRounds → 1];

Evaluate the trained network directly on images randomly sampled from the validation set.

imgs = Keys@RandomSample[testData, 5];
Thread[imgs → lenet[imgs]]

[Five test images with predicted labels 4, 0, 6, 7, 2.]


Page 16

Create a ClassifierMeasurements object from the trained network and the validation set.

cm = ClassifierMeasurements[lenet, testData]

[ClassifierMeasurementsObject summary.]

Obtain the accuracy of the network on the validation set.

cm["Accuracy"]

0.9801

Obtain a plot of the confusion matrix of the network predictions on the validation set.

cm["ConfusionMatrixPlot"]

[Confusion matrix plot of actual vs. predicted digit classes 0-9; the diagonal dominates, e.g. 963 of 980 actual zeros are predicted correctly.]

Page 17

Neural Networks: Unsupervised learning with autoencoders

Train an autoencoder network to reconstruct images of handwritten digits after projecting them to a lower-dimensional “code” vector space. Use these code vectors to perform clustering and visualization.

First obtain the training data, then select images corresponding to digits 0 through 4.

resource = ResourceObject["MNIST"];
trainingData = ResourceData[resource, "TrainingData"];
trainingSubset = Select[trainingData, Last[#] ≤ 4 &];
testData = ResourceData[resource, "TestData"];
testSubset = Select[testData, Last[#] ≤ 4 &];
RandomSample[trainingSubset, 8]

[Eight sample training images with labels 1, 3, 0, 0, 4, 2, 1, 4.]

Obtain the “mean image” to subtract from the training data.

trainingImages = Keys[trainingSubset];
meanImage = Image[Mean@Map[ImageData, trainingImages]]

Create a network to train that produces both the reconstruction and the reconstruction error.


Page 18

net = NetGraph[
  {FlattenLayer[], 50, Ramp, 784, Tanh, ReshapeLayer[{1, 28, 28}], MeanSquaredLossLayer[]},
  {1 → 2 → 3 → 4 → 5 → 6 → NetPort["Output"], 6 → NetPort[7, "Input"],
   NetPort["Input"] → NetPort[7, "Target"]},
  "Input" → NetEncoder[{"Image", {28, 28}, "Grayscale", "MeanImage" → meanImage}],
  "Output" → NetDecoder[{"Image", "Grayscale"}]]

[NetGraph summary: FlattenLayer → LinearLayer(50) → Ramp → LinearLayer(784) → Tanh → ReshapeLayer(1×28×28) feeds the Output port and a MeanSquaredLossLayer comparing the reconstruction with the input.]

Train the network to minimize the reconstruction error.

trained4 = NetTrain[net, <|"Input" → trainingImages|>, "Loss"];

Obtain a subnetwork that performs only reconstruction.

reconstructor = Take[trained4, {NetPort["Input"], NetPort["Output"]}]

[NetGraph summary of the reconstruction subnetwork (layers 1-6).]

Reconstruct some sample images.

ImageAdd[reconstructor[#], meanImage] & /@ {…}  (* applied to sample digit images *)

[Row of reconstructed sample images.]

Obtain a subnetwork that produces the code vector.


Page 19

encoder = Take[trained4, {NetPort["Input"], 4}]

[NetGraph summary of the encoder subnetwork (layers 1-4).]

Compute codes for all of the test images.

testImages = Keys[testSubset];
features = encoder[testImages];

Project the code vectors to three dimensions and visualize them along with the original classes (not seen by the network). The digit classes tend to cluster together.

coords = DimensionReduce[features, 3];
classes = Values[testSubset];
ListPointPlot3D[Table[Extract[coords, Position[classes, i]], {i, 0, 4}],
  PlotLegends → PointLegend[96, Range[0, 4]], BoxRatios → 1, Axes → None, Boxed → True,
  PlotStyle → Map[ColorData[96], Range[1, 5]], AspectRatio → 1]

[3D scatter plot of the code vectors, colored by digit class 0-4; the classes form distinct clusters.]


Page 20

Visualize a hierarchical clustering of random representatives from each class.

representatives = Catenate@GroupBy[testSubset, Last → First, RandomSample[#, 6] &];
ClusteringTree[encoder[representatives] → Map[ImageCrop, representatives]]


Page 21

Neural Networks: Avoid overfitting using a hold-out set

Use the ValidationSet option of NetTrain to ensure that the trained net does not overfit the input data. The validation data is commonly referred to as a test or hold-out set.

Create synthetic training data based on a Gaussian curve.

data = Table[x → Exp[-x^2] + RandomVariate[NormalDistribution[0, .15]], {x, -3, 3, .2}];

plot = ListPlot[List @@@ data, PlotStyle → Red]

[Scatter plot of the noisy training data, x from -3 to 3, y roughly from -0.2 to 1.0.]

Train a net with a large number of parameters relative to the amount of training data.

net = NetChain[{150, Tanh, 150, Tanh, 1}, "Input" → "Scalar", "Output" → "Scalar"];
net1 = NetTrain[net, data, Method → "ADAM"]

[NetChain summary: scalar input, two 150-unit Tanh layers, scalar output.]

The resulting net overfits the data, learning the noise in addition to the underlying function.


Page 22

Show[Plot[net1[x], {x, -3, 3}], plot]

[Plot: net1's prediction curve overlaid on the data; it chases the individual noisy points.]

Subdivide the data into a training set and a hold-out validation set.

data = RandomSample[data];
{train, test} = TakeDrop[data, 24];

Use the ValidationSet option to have NetTrain select the net that achieved the lowest validation loss during training.

net2 = NetTrain[net, train, ValidationSet → test]

[NetChain summary, identical architecture to the one above.]

The result returned by NetTrain was the net that generalized best to points in the validation set, as measured by validation loss. This penalizes overfitting, as the noise present in the training data is uncorrelated with the noise present in the validation set.


Page 23

Show[Plot[net2[x], {x, -3, 3}], plot]

[Plot: net2's prediction curve overlaid on the data; it follows the underlying Gaussian shape rather than the noise.]


Page 24

Model-based Prediction

Train a Gaussian process predictor on a simple dataset.

data = {-1.2 → 1.2, 1.0 → 1.4, 2.2 → 1.6, 3.1 → 1.8, 4.5 → 1.6};
p = Predict[data, Method → "GaussianProcess"]
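The resulting PredictorFunction can be queried directly, either for a point prediction or for the full predictive distribution at a given input:

p[2.0]

p[2.0, "Distribution"]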

Visualize the predicted values along with a confidence interval.

Show[Plot[{p[x], p[x] + StandardDeviation[p[x, "Distribution"]], p[x] - StandardDeviation[p[x, "Distribution"]]},
   {x, -2, 6}, PlotStyle → {Blue, Gray, Gray}, Filling → {2 → {3}}, Exclusions → False,
   PerformanceGoal → "Speed", PlotLegends → {"Prediction", "Confidence Interval"}],
  ListPlot[List @@@ data, PlotStyle → Red, PlotLegends → {"Data"}]]

[Plot: Gaussian process prediction curve with a gray one-standard-deviation confidence band and the red data points.]

Page 25

Dealing with complexity: Graph analysis

Import a DIMACS file:

gr = Import["http://mat.gsia.cmu.edu/COLOR/instances/homer.col", "DIMACS"]

Get the metadata:

Import["http://mat.gsia.cmu.edu/COLOR/instances/homer.col", "Elements"]

{AdjacencyMatrix, EdgeRules, Graph, Graphics, VertexCount}

Edge rules:

rules = Import["http://mat.gsia.cmu.edu/COLOR/instances/homer.col", "EdgeRules"]

Most frequently occurring character:

Commonest[Flatten[List @@@ rules]]

{452}
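Since every occurrence of a vertex in the edge rules corresponds to one incident edge, ranking vertices by VertexDegree is a quick cross-check (not on the original slide) and should single out the same vertex, 452:

VertexList[gr][[First@Ordering[VertexDegree[gr], -1]]]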

Achilles neighborhood:


Page 26

NeighborhoodGraph[gr, 452]

Highlight the subgraph:


Page 27

HighlightGraph[gr, NeighborhoodGraph[gr, 452], ImageSize → Large]


Page 28

Conclusion

◼ Don't restrict yourself to any particular approach or method without need!

◼ Don’t imply the answer when defining a question!

◼ Stay curious!


Page 29

Thanks for listening!

For questions and suggestions, contact [email protected]
http://www.additive-mathematica.de

ADDITIVE Soft- und Hardware für Technik und Wissenschaft GmbH
Max-Planck-Straße 22b, 61381 Friedrichsdorf
Sales: 06172 - 5905 - 30 // [email protected]
06172 - 5905 - 90 // [email protected]
06172 - 5905 - 20 // [email protected]
