TRANSCRIPT
Meta-Learning:
the future of data mining
Włodzisław Duch, Norbert Jankowski, Krzysztof Grąbczewski
+ Tomasz Maszczyk + Marek Grochowski
Department of Informatics, Nicolaus Copernicus University, Toruń, Poland
Google: W. Duch. WCCI 2010, Barcelona
Plan
• Problems with Computational intelligence (CI)
• Problems with current approaches to data mining/pattern recognition.
• Meta-learning as search in the space of all models.
• First attempt: similarity-based framework for meta-learning.
• Heterogeneous systems.
• Hard problems and support features.
• More components to build algorithms.
• Real meta-learning, or algorithms on demand.
What is there to learn?
Brains ... what is in EEG? What happens in the brain?
Industry: what happens with our machines?
Cognitive robotics: vision, perception, language.
Bioinformatics, life sciences.
What can we learn?
What can we learn using pattern recognition, machine learning, and computational intelligence techniques? Everything?
Neural networks are universal approximators and evolutionary algorithms solve global optimization problems – so everything can be learned? Not at all! All non-trivial problems are hard, need deep transformations.
Duda, Hart & Stork, Ch. 9, No Free Lunch + Ugly Duckling Theorems:
• Uniformly averaged over all target functions the expected error for all learning algorithms [predictions by economists] is the same.
• Averaged over all target functions no learning algorithm yields generalization error that is superior to any other.
• There is no problem-independent or “best” set of features.
“Experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems.”
In practice: try as many models as you can, rely on your experience and intuition. There is no free lunch, but do we have to cook ourselves?
Data mining packages
• No free lunch => provide different types of tools for knowledge discovery: decision trees, neural and neurofuzzy methods, similarity-based methods, SVMs, committees, tools for visualization of data.
• Support the process of knowledge discovery/model building and evaluating, organizing it into projects.
• Many other interesting DM packages of this sort exist: Weka, Yale, Orange, Knime ... 168 packages on the the-data-mine.com list!
• We are building Intemi, radically new DM tools.
GhostMiner, data mining tools from our lab + Fujitsu: http://www.fqspl.com.pl/ghostminer/
• Separate the process of model building (hackers) and knowledge discovery from model use (lamers) => GM Developer & Analyzer.
What DM packages do?
Hundreds of components ... transforming, visualizing ...
Rapid Miner 5.0, type and # components
Process control 34
Data transformations 111
Data modeling 231
Clustering & segmentation 19
Performance evaluation 30
Text, series, web ... specific transformations. Visualization, presentation, plugin extensions ... ~ billions of models!
Visual “knowledge flow” to link components, or script languages (XML) to define complex experiments.
With all these tools, are we really so good?
Surprise!
Almost nothing can be learned using such tools!
May the force be with you
Hundreds of components ... billions of combinations ...
Our treasure box is full! We can publish forever! Specialized transformations are still missing in many packages.
What would we really like to have?
Press the button and wait for the truth!
Computer power is with us; meta-learning should replace us in finding all interesting data models = sequences of transformations/procedures.
Many considerations: optimal cost solutions, various costs of using feature subsets; simple & easy to understand vs optimal accuracy; various representations of knowledge: crisp, fuzzy or prototype rules, visualization, confidence in predictions ...
Meta-learning
Meta-learning means different things for different people.
Some will call "meta" the learning of many models, ranking them, boosting, bagging, or creating an ensemble in many ways; here meta means optimization of parameters to integrate models.
Landmarking: characterize many datasets and remember which method worked the best on each dataset. Compare new dataset to the reference ones; define various measures (not easy) and use similarity-based methods.
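Landmarking can be sketched in a few lines: describe each reference dataset by meta-features and recommend whatever worked best on the most similar one. A minimal sketch; the meta-features, reference datasets and the plain Euclidean distance below are made-up illustrations, not the measures used in actual landmarking systems:

```python
import math

# Hypothetical reference datasets described by meta-features
# (n_samples, n_features, class_entropy) and the method that won on each.
reference = {
    (150, 4, 1.58): "kNN",
    (3772, 21, 0.35): "decision tree",
    (699, 9, 0.93): "SVM",
}

def recommend(meta):
    """Landmarking sketch: pick the method that won on the most similar
    reference dataset (real systems normalize the meta-features first)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = min(reference, key=lambda ref: dist(ref, meta))
    return reference[nearest]

print(recommend((200, 5, 1.5)))  # nearest reference is (150, 4, 1.58) -> kNN
```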
Regression models: created for each algorithm on parameters that describe data to predict expected accuracy, ranking potentially useful algorithms.
Stacking: learn new models on errors of the previous ones.
Deep learning: DARPA 2009 call; current methods are „flat”, shallow; the goal is a universal machine learning engine that generates progressively more sophisticated representations of patterns, invariants, correlations from data. Rather limited success so far ...
Meta-learning: learning how to learn.
Similarity-based framework
(Dis)similarity:
• more general than feature-based description,
• no need for vector spaces (structured objects),
• more general than fuzzy approach (F-rules are reduced to P-rules),
• includes nearest neighbor algorithms, MLPs, RBFs, separable function networks, SVMs, kernel methods, specialized kernels, and many others!
Similarity-Based Methods (SBMs) are organized in a framework: p(Ci|X;M) posterior classification probability or y(X;M) approximators; models M are parameterized in increasingly sophisticated ways.
A systematic search (greedy, beam, evolutionary) in the space of all SBM models is used to select optimal combination of parameters and procedures, opening different types of optimization channels, trying to discover appropriate bias for a given problem.
Results: several candidate models are created; even a very limited version gives the best results in 7 out of 12 Statlog problems.
SBM framework components
• Pre-processing: objects O => features X, or (dis)similarities D(O,O').
• Calculation of similarity between features d(xi,yi) and objects D(X,Y).
• Reference (or prototype) vector R selection/creation/optimization.
• Weighted influence of reference vectors G(D(Ri,X)), i=1..k.
• Functions/procedures to estimate p(C|X;M) or y(X;M).
• Cost functions E[DT;M] and model selection/validation procedures.
• Optimization procedures for the whole model Ma.
• Search control procedures to create more complex models Ma+1.
• Creation of ensembles of (local, competent) models.
• M = {X(O), d(.,.), D(.,.), k, G(D), {R}, {pi(R)}, E[.], K(.), S(.,.)}, where:
  S(Ci,Cj) is a matrix evaluating similarity of the classes;
  a vector of observed probabilities pi(X) may be used instead of hard labels.
The kNN model p(Ci|X;kNN) = p(Ci|X;k,D(.),{DT}); the RBF model: p(Ci|X;RBF) = p(Ci|X;D(.),G(D),{R}); MLP, SVM and many other models may all be "re-discovered" as part of the SBM framework.
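The SBM view of kNN can be illustrated directly: the distance function is just another model parameter that the meta-search may swap out. A minimal sketch with toy data and hypothetical labels:

```python
from collections import Counter

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def canberra(x, y):
    # Canberra distance: sum of |a-b| / (|a|+|b|) over non-zero coordinates
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(x, y)
               if abs(a) + abs(b) > 0)

def knn_predict(train, x, k=1, dist=euclidean):
    """One SBM 'optimization channel': the distance function is a parameter."""
    nearest = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.0), "B")]
print(knn_predict(train, (1.1, 1.0), k=1, dist=canberra))  # -> A
```

Swapping `dist=euclidean` for `dist=canberra` is exactly the kind of move the meta-learner makes when it improves a kNN model by changing the metric.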
Meta-learning in SBM scheme
Start from kNN, k=1, all data & features, Euclidean distance, end with a model that is a novel combination of procedures and parameterizations.
k-NN: 67.5/76.6%
+ d(x,y) = Canberra: 89.9/90.7%
+ si=(0,0,1,0,1,1): 71.6/64.4%
+ selection: 67.5/76.6%
+ k opt: 67.5/76.6%
+ d(x,y) + si=(1,0,1,0.6,0.9,1), Canberra: 74.6/72.9%
+ d(x,y) + selection, Canberra: 89.9/90.7%
Real meta-learning!
Meta-learning: learning how to learn, replacing experts who search for the best models by running many experiments.
The search space of models is too large to explore exhaustively; design the system architecture to support knowledge-based search.
• Abstract view, uniform I/O, uniform results management.
• Directed acyclic graphs (DAG) of boxes representing scheme placeholders and particular models, interconnected through I/O.
• Configuration level for meta-schemes, expanded at runtime level.
An exercise in software engineering for data mining!
Intemi, Intelligent Miner
Meta-schemes: templates with placeholders.
• May be nested; the role decided by the input/output types.
• Machine learning generators based on meta-schemes.
• Granulation level allows creation of novel methods.
• Complexity control: Length + log(time)
• A unified meta-parameters description, defining the range of sensible values and the type of the parameter changes.
Advanced meta-learning
• Extracting meta-rules, describing interesting search directions.
• Finding the correlations occurring among different items in most accurate results, identifying different machine (algorithmic) structures with similar behavior in an area of the model space.
• Depositing the knowledge they gain in a reusable meta-knowledge repository (for meta-learning experience exchange between different meta-learners).
• A uniform representation of the meta-knowledge, extending expert knowledge, adjusting the prior knowledge according to performed tests.
• Finding new successful complex structures and converting them into meta-schemes (which we call meta abstraction) by replacing proper substructures by placeholders.
• Beyond transformations & feature spaces: actively search for info.
Intemi software (N. Jankowski and K. Grąbczewski) incorporating these ideas and more is coming “soon” ...
Meta-learning architecture
Inside the meta-parameter search, a repeater machine composed of distribution and test schemes is placed.
Generating machines
The search process is controlled by a variant of approximated Levin complexity: an estimate of program complexity combined with time. Simpler machines are evaluated first; machines that work too long (approximations may be wrong) are put into quarantine.
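The complexity-guided ordering can be sketched as follows. The candidate machines, their description lengths and runtimes are invented for illustration, and the quarantine is reduced to a simple time budget:

```python
import math

def levin_cost(program_length, runtime):
    """Approximate Levin complexity: description length plus log of time."""
    return program_length + math.log2(runtime)

# Hypothetical candidates: (name, config length in bits, observed runtime)
candidates = [("kNN", 8, 2.0), ("SVM+RBF", 20, 64.0), ("tree", 12, 4.0)]

# Evaluate simpler machines first; quarantine those exceeding a time budget.
TIME_BUDGET = 32.0
order = sorted(candidates, key=lambda c: levin_cost(c[1], c[2]))
active = [name for name, _, t in order if t <= TIME_BUDGET]
quarantined = [name for name, _, t in order if t > TIME_BUDGET]
print(active, quarantined)
```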
Pre-compute what you can, and use "machine unification" to get substantial savings!
Complexities on vowel data
[Table of ranked machine complexities omitted.]
Simple machines on vowel data
Left: final ranking, gray bar=accuracy, small bars: memory, time & total complexity, middle numbers = process id (models in previous table).
Complex machines on vowel data
Left: final ranking, gray bar=accuracy, small bars: memory, time & total complexity, middle numbers = process id (models in previous table).
Principles: information compression
Neural information processing in perception and cognition: information compression, or algorithmic complexity. In computing: minimum length (message, description) encoding.
Wolff (2006): all cognition and computation is information compression! Analysis and production of natural language, fuzzy pattern recognition, probabilistic reasoning and unsupervised inductive learning.
Talks about multiple alignment, unification and search, but
so far only models for sequential data and 1D alignment.
Information compression: encoding new information in terms of old. This has been used to define a measure of syntactic and semantic information (Duch, Jankowski 1994), based on the size of the minimal graph representing a given data structure or knowledge-base specification; thus it goes beyond alignment.
Knowledge transfer
Brains learn new concepts in terms of old; use large semantic network and add new concepts linking them to the known.
Knowledge should be transferred between the tasks. Not just learned from a single dataset.
Need to discover good building blocks for higher level concepts/features.
Maximization of margin/regularization
Among all discriminating hyperplanes there is one defined by support vectors that is clearly better.
LDA in larger space
Suppose that strongly non-linear borders are needed.
Use LDA, but add new dimensions, functions of your inputs!
Add squares Xi² and products XiXj as new features.
Example: 2D => 5D case Z = {z1...z5} = {X1, X2, X1², X2², X1X2}
The number of such tensor products grows exponentially – no good.
[Fig. 4.1, Hastie et al.]
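The 2D => 5D expansion can be written down directly; a minimal sketch showing how a circular border in X becomes a hyperplane in Z:

```python
def expand_2d(x1, x2):
    """Map a 2D input to the 5D space Z = (x1, x2, x1^2, x2^2, x1*x2),
    in which a linear discriminant can express quadratic borders in X."""
    return (x1, x2, x1 ** 2, x2 ** 2, x1 * x2)

# The circle x1^2 + x2^2 = 1 becomes the hyperplane z3 + z4 = 1 in Z space.
z = expand_2d(0.6, 0.8)
print(round(z[2] + z[3], 6))  # 1.0 -> the point lies on the border
```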
Kernels = similarity functions
Gaussian kernels in SVM: zi(X) = G(X; Xi, σ) radial features, X => Z. Gaussian mixtures are close to optimal Bayesian errors. The solution requires continuous deformation of decision borders and is therefore rather easy.
Support Feature Machines (SFM): construct features based on projections, restricted linear combinations, kernel features, use feature selection.
Gaussian kernel, C=1. In the kernel space Z decision borders are flat, but in the X space highly non-linear!
SVM is based on a quadratic solver, without explicit features, but using Z features explicitly has some advantages. Multiresolution: different σ for different support features, or several kernels zi(X) = K(X; Xi, σ) in one set of features. Linear solvers, Naive Bayes, or any other algorithms may be used.
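Using explicit kernel features with a linear solver can be sketched with a plain perceptron; the XOR-style data, the choice of reference vectors and the value of σ are illustrative assumptions:

```python
import math

def gaussian_feature(x, ref, sigma):
    """Explicit kernel feature z(x) = exp(-||x - ref||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, ref))
    return math.exp(-d2 / (2 * sigma ** 2))

def to_z(x, refs, sigma):
    return [gaussian_feature(x, r, sigma) for r in refs]

# XOR-like data: not linearly separable in X, but flat in the Z space.
data = [((0, 0), -1), ((1, 1), -1), ((0, 1), 1), ((1, 0), 1)]
refs = [x for x, _ in data]   # multiresolution: sigma could differ per reference
w, b, sigma = [0.0] * len(refs), 0.0, 0.5

for _ in range(50):           # a simple linear solver (perceptron) in Z space
    for x, y in data:
        z = to_z(x, refs, sigma)
        if y * (sum(wi * zi for wi, zi in zip(w, z)) + b) <= 0:
            w = [wi + y * zi for wi, zi in zip(w, z)]
            b += y

preds = [1 if sum(wi * zi for wi, zi in zip(w, to_z(x, refs, sigma))) + b > 0 else -1
         for x, _ in data]
print(preds)  # matches the XOR labels [-1, -1, 1, 1]
```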
Neural networks: thyroid screening
Garavan Institute, Sydney, Australia
15 binary, 6 continuous
Training: 93+191+3488 Validate: 73+177+3178
• Determine important clinical factors
• Calculate prob. of each diagnosis.
[Network diagram: clinical findings (age, sex, ..., TSH, T3, TT4, T4U, TBG) feed hidden units that produce the final diagnoses: normal, hyperthyroid, hypothyroid.]
Poor results of SBL or SVM ... see the summary at http://www.is.umk.pl/projects/datasets.html#Hypothyroid
SVNT algorithm
Initialize the network parameters W; set Δε=0.01, εmin=0, SV=T (training set).
Until no improvement is found in the last Nlast iterations do:
• Optimize network parameters for Nopt steps on the SV data.
• Run a feedforward step on SV to determine overall accuracy and errors; make new SV = {X | e(X) ∈ [εmin, 1−εmin]}.
• If the accuracy increases: compare the current network with the previous best one, choose the better one as the current best; increase εmin = εmin + Δε and make a forward step selecting SVs.
• If the number of support vectors |SV| increases: decrease εmin = εmin − Δε; decrease Δε = Δε/1.2 to avoid large changes.
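The support-vector selection step of SVNT can be sketched in isolation; the per-vector errors below are invented, and the network optimization steps are omitted:

```python
def select_sv(errors, eps_min):
    """SVNT selection step sketch: keep vectors whose error lies in
    [eps_min, 1 - eps_min]; confidently learned vectors drop out."""
    return [i for i, e in enumerate(errors) if eps_min <= e <= 1 - eps_min]

# Hypothetical per-vector errors after a feedforward pass.
errors = [0.001, 0.40, 0.95, 0.02, 0.65, 0.999]

eps_min, d_eps = 0.0, 0.01
sv = select_sv(errors, eps_min)        # initially every vector is selected
eps_min += d_eps                       # tighten the band as training proceeds
sv_next = select_sv(errors, eps_min)
if len(sv_next) > len(sv):             # cannot happen when tightening, but
    eps_min -= d_eps                   # SVNT backs off when |SV| grows...
    d_eps /= 1.2                       # ...and shrinks the step to avoid jumps
print(sv, sv_next)
```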
Hypothyroid data
2 years of real medical screening tests for thyroid diseases: 3772 cases with 93 primary hypothyroid and 191 compensated hypothyroid; the remaining 3488 cases are healthy; 3428 test cases with a similar class distribution.
21 attributes (15 binary, 6 continuous) are given, but only two of the binary attributes (on thyroxine, and thyroid surgery) contain useful information, therefore the number of attributes has been reduced to 8.
Method % train % test
SFM, SSV+2 B1 features ------- 99.6
SFM, SVMlin+2 B1 features ------- 99.5
MLP+SCG, 4 neurons 99.8 99.2
Cascade correlation 100 98.5
MLP+backprop 99.6 98.5
SVM Gaussian kernel 99.8 98.4
SVM lin 94.1 93.3
Heterogeneous systems
Problems requiring different scales (multiresolution).
2-class problems, two situations:
C1 inside the sphere, C2 outside. MLP: at least N+1 hyperplanes, O(N²) parameters. RBF: 1 Gaussian, O(N) parameters.
C1 in the corner defined by the (1,1...1) hyperplane, C2 outside. MLP: 1 hyperplane, O(N) parameters. RBF: many Gaussians, O(N²) parameters, poor approximation.
Combination: needs both hyperplane and hypersphere!
Logical rule: IF x1>0 & x2>0 THEN C1 ELSE C2
is not represented properly by either MLP or RBF!
Different types of functions in one model, first step beyond inspirations from single neurons => heterogeneous models.
Heterogeneous everything
Homogeneous systems: one type of "building blocks", same type of decision borders, e.g. neural networks, SVMs, decision trees, kNNs.
Committees combine many models together, but lead to complex models that are difficult to understand.
Ockham razor: simpler systems are better. Discovering simplest class structures, inductive bias of the data, requires Heterogeneous Adaptive Systems (HAS).
HAS examples:
• NN with different types of neuron transfer functions.
• k-NN with different distance functions for each prototype.
• Decision trees with different types of test criteria.
1. Start from large networks, use regularization to prune.
2. Construct a network adding nodes selected from a candidate pool.
3. Use very flexible functions, force them to specialize.
Taxonomy - TF
HAS decision trees
Decision trees select the best feature/threshold value for univariate and multivariate trees:

X_i < T_k   or   Σ_i W_i X_i < T_k

Decision borders: hyperplanes.
Introducing tests based on the L_α (Minkovsky) metric:

D(X, R; α) = ( Σ_i |X_i − R_i|^α )^{1/α} < T_R

Such DT use kernel features!
For L2 spherical decision borders are produced.
For L∞ rectangular borders are produced.
For large databases, first clusterize the data to get candidate references R.
SSV HAS DT example
SSV HAS tree in GhostMiner 3.0, Wisconsin breast cancer (UCI): 699 cases, 9 features (cell parameters, 1..10). Classes: benign 458 (65.5%) & malignant 241 (34.5%).
Single rule gives simplest known description of this data:
IF ||X-R303|| < 20.27 THEN malignant ELSE benign (the rule coming up most often in 10xCV).
Accuracy = 97.4%, good prototype for malignant case!
Gives simple thresholds; that's what MDs like most!
Best 10CV around 97.5±1.8% (Naïve Bayes + kernel, or SVM)
SSV without distances: 96.4±2.1%
C4.5 gives 94.7±2.0%
Several simple rules of similar accuracy but different specificity or sensitivity may be created using HAS DT. Need to select or weight features and select good prototypes.
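A prototype-threshold rule of this form is easy to apply; a minimal sketch in which the prototype values and the test vectors are hypothetical (only the 20.27 threshold and the rule's shape come from the slide):

```python
def prototype_rule(x, prototype, threshold):
    """Single prototype-distance rule of the kind found by SSV HAS trees:
    IF ||x - prototype|| < threshold THEN malignant ELSE benign."""
    d = sum((a - b) ** 2 for a, b in zip(x, prototype)) ** 0.5
    return "malignant" if d < threshold else "benign"

# Hypothetical prototype standing in for R303 (9 cell features, scale 1..10).
R303 = (10, 10, 10, 10, 10, 10, 10, 10, 10)

print(prototype_rule((9, 9, 10, 10, 9, 10, 9, 10, 10), R303, 20.27))  # close
print(prototype_rule((1, 1, 1, 1, 2, 1, 1, 1, 1), R303, 20.27))       # far
```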
How much can we learn?
Linearly separable or almost separable problems are relatively simple – deform or add dimensions to make data separable.
How to define “slightly non-separable”?
There is only separable and the vast realm of the rest.
Linear separability
QPC projection used to visualize Leukemia microarray data.
2-separable data, separated in vertical dimension.
Approximate separability
QPC visualization of Heart dataset: overlapping clusters, information in the data is insufficient for perfect classification, approximately 2-separable.
Easy problems
• Approximately linearly separable problems in the original feature space: linear discrimination is sufficient (always worth trying!).
• Simple topological deformation of decision borders is sufficient – linear separation is then possible in extended/transformed spaces.
This is frequently sufficient for pattern recognition problems (half of UCI problems?).
• RBF/MLP networks with one hidden layer also solve such problems easily, but convergence/generalization for anything more complex than XOR is problematic.
SVM adds new features to “flatten” the decision border:
( )
1 2( , ,... ); ,i
n ix x x z K X X X X
achieving larger margins/separability in the X+Z space.
Neurons learning complex logic
Boolean functions are difficult to learn: n bits, but 2^n nodes => combinatorial complexity; similarity is not useful, since for parity all neighbors are from the wrong class. MLP networks have difficulty learning functions that are highly non-separable.
Projection on W=(111 ... 111) gives clusters with 0, 1, 2 ... n bits;
easy categorization in (n+1)-separable sense.
Ex. of 2-4D parity problems.
Neural logic can solve it without counting; find a good point of view.
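The diagonal projection argument can be checked by brute force: for parity, every cluster of vectors with the same number of 1-bits is class-pure, giving the (n+1)-separable structure. A minimal sketch:

```python
from itertools import product

def project_diagonal(bits):
    """Project a binary vector on W = (1, 1, ..., 1): y = number of 1-bits."""
    return sum(bits)

n = 4
clusters = {}
for bits in product([0, 1], repeat=n):
    y = project_diagonal(bits)
    clusters.setdefault(y, set()).add(sum(bits) % 2)   # parity label

# Every projected cluster contains a single class -> n+1 pure intervals.
print(all(len(labels) == 1 for labels in clusters.values()))  # True
```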
Easy and difficult problems
Linear separation: good goal if simple topological deformation of decision borders is sufficient.
Linear separation of such data is possible in higher dimensional spaces; this is frequently the case in pattern recognition problems.
RBF/MLP networks with one hidden layer solve such problems.
Difficult problems: disjoint clusters, complex logic.
Continuous deformation is not sufficient; networks with localized functions need exponentially large number of nodes.
Boolean functions: for n bits there are K=2^n binary vectors that can be represented as vertices of an n-dimensional hypercube.
Each Boolean function is identified by K bits.
BoolF(Bi) = 0 or 1 for i=1..K leads to 2^K Boolean functions.
Ex: n=2 functions, vectors {00,01,10,11},
Boolean functions {0000, 0001 ... 1111}, e.g. 0001 = AND, 0110 = XOR;
each function is identified by a number from 0 to 15 = 2^K − 1.
Boolean functions
n=2: 16 functions, 14 separable, 2 not separable (XOR and XNOR).
n=3: 256 functions, 104 separable (41%), 152 not separable.
n=4: 64K = 65536, only 1882 separable (~3%).
n=5, 4G, but << 1% separable ... bad news!
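The n=2 count can be verified by brute force over a small integer weight grid (sufficient for two inputs); only XOR and XNOR resist a linear threshold:

```python
from itertools import product

def is_linearly_separable(truth, grid=range(-2, 3)):
    """Brute-force check: does some w1*x1 + w2*x2 > t reproduce the
    4-bit truth table? A small integer grid suffices for n = 2."""
    points = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for w1, w2, t in product(grid, grid, range(-3, 4)):
        if all((w1 * x1 + w2 * x2 > t) == out
               for (x1, x2), out in zip(points, truth)):
            return True
    return False

separable = sum(is_linearly_separable(truth)
                for truth in product([False, True], repeat=4))
print(separable)  # 14 of the 16 Boolean functions of 2 variables
```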
Existing methods may learn some non-separable functions, but most functions cannot be learned !
Example: n-bit parity problem; many papers in top journals.
No off-the-shelf systems are able to solve such problems.
For all parity problems SVM is below base rate!
Such problems are solved only by special neural architectures or special classifiers – if the type of function is known.
But parity is still trivial ... solved by a single periodic node:

y = cos( β Σ_{i=1..n} b_i )
Goal of learning
If simple topological deformation of decision borders is sufficient linear separation is possible in higher dimensional spaces, “flattening” non-linear decision borders, kernel approaches are sufficient. RBF/MLP networks with one hidden layer solve the problem.
This is frequently the case in pattern recognition problems.
For complex logic this is not sufficient; networks with localized functions need exponentially large number of nodes.
Such situations arise in AI reasoning problems, real perception, object recognition, text analysis, bioinformatics ...
Linear separation is too difficult, set an easier goal. Linear separation: projection on 2 half-lines in the kernel space:
line y=WX, with y<0 for class – and y>0 for class +.
Simplest extension: separation into k-intervals, or k-separability.
For parity: find direction W with minimum # of intervals, y=W.X
What NN components really do?
Vector mappings from the input space to hidden space(s) and to the output space + adapt parameters to improve cost functions.
Hidden-Output mapping done by MLPs:
T = {Xi} training data, N-dimensional.
H = {hj(T)} X image in the hidden space, j =1 .. NH-dim.
... more transformations in hidden layers
Y = {yk(H )} X image in the output space, k =1 .. NC-dim.
ANN goal:
data image H in the last hidden space should be linearly separable; internal representations will determine network generalization.
But we never look at these representations!
8-bit parity solution
QPC solution to 8-bit parity data: projection on the W=[1,1,...,1] diagonal.
k-separability is much easier to achieve than full linear separability.
Network solution
Can one learn a simplest model for arbitrary Boolean function?
2-separable (linearly separable) problems are easy; non separable problems may be broken into k-separable, k>2.
Blue: sigmoidal neurons with threshold; brown: linear neurons.
[Diagram: inputs X1...X4 feed a linear node y = W·X; sigmoidal nodes σ(βy + θ1), σ(βy + θ2), σ(βy + θ4) are combined with ±1 weights into the output.]
Neural architecture for k=4 intervals, or 4-separable problems.
QPC Projection Pursuit
What is needed to learn data with complex logic?
• cluster non-local areas in the X space, use W·X
• capture local clusters after transformation, use G(W·X − θ)
SVMs fail because the number of directions W that should be
considered grows exponentially with the size of the problem n.
What will solve it? Projected clusters!
1. A class of constructive neural network solutions with G(W·X − θ) functions combining non-local/local projections, with special training algorithms.
2. Maximize the leave-one-out error after projection: take some localized function G, count in a soft way cases from the same class as Xk.
Grouping and separation; projection may be done directly to 1 or 2D for visualization, or higher D for dimensionality reduction, if W has d columns.
Q(W) = Σ_X [ A+ Σ_{Xk ∈ CX} G(W·(X − Xk)) − A− Σ_{Xk ∉ CX} G(W·(X − Xk)) ]

where CX is the class of X and G is a localized (e.g. Gaussian) window function.
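A QPC-style projection index can be sketched directly from this description; the Gaussian window, the A+/A− weights and the toy data are assumptions:

```python
import math

def qpc(w, data, a_plus=1.0, a_minus=1.0, sigma=1.0):
    """Sketch of a QPC-style projection index: reward same-class cases
    that project close together, penalize close projections of
    different classes. G is a Gaussian window (an assumption here)."""
    def g(d):
        return math.exp(-d * d / (2 * sigma ** 2))
    def proj(x):
        return sum(wi * xi for wi, xi in zip(w, x))
    q = 0.0
    for x, cx in data:
        for xk, ck in data:
            if xk is x:
                continue
            d = proj(x) - proj(xk)
            q += a_plus * g(d) if ck == cx else -a_minus * g(d)
    return q

data = [((0, 0), 0), ((0.1, 0), 0), ((3, 3), 1), ((3.1, 3), 1)]
good = qpc((1, 1), data)    # this direction separates the two classes
bad = qpc((1, -1), data)    # this direction collapses both classes together
print(good > bad)  # True
```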
Parity n=9
Simple gradient learning; quality index shown below.
Learning hard functions
Training almost perfect for parity, with linear growth in the number of vectors for k-sep. solution created by the constructive neural algorithm.
Real data
On simple data, results are similar to SVM (because SVM results are almost optimal there), but the models are much simpler.
Rules
QPC visualization of Monks artificial symbolic dataset, => two logical rules are needed.
Complex distribution
QPC visualization of concentric rings in 2D with strong noise in remaining 2D; transform: nearest neighbor solutions, combinations of ellipsoidal densities.
Example: aRPM
Almost Random Projection Machine (with Hebbian learning):
• generate random combinations of inputs (line projections) z(X) = W·X;
• find and isolate pure clusters h(X) = G(z(X)); estimate the relevance of h(X), e.g. MI(h(X),C), leaving only good nodes;
• continue until each vector activates at least k nodes.
Count how many nodes vote for each class and plot.
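The aRPM loop can be sketched with random projections and a crude purity test standing in for the relevance estimate (the slide suggests e.g. mutual information); all thresholds below are illustrative:

```python
import random

def arpm_features(data, labels, n_candidates=30, purity=0.95, seed=0):
    """Almost Random Projection Machine sketch: draw random projections
    z(x) = w . x and keep a node only if the top interval of z isolates
    a nearly pure cluster of one class."""
    rng = random.Random(seed)
    dim = len(data[0])
    kept = []
    for _ in range(n_candidates):
        w = [rng.uniform(-1, 1) for _ in range(dim)]
        z = [sum(wi * xi for wi, xi in zip(w, x)) for x in data]
        cut = sorted(z)[2 * len(z) // 3]        # top third of projections
        cluster = [y for zi, y in zip(z, labels) if zi >= cut]
        if cluster and max(cluster.count(c) for c in set(cluster)) / len(cluster) >= purity:
            kept.append((w, cut))               # node isolates a pure cluster
    return kept

data = [(0, 0), (0.2, 0.1), (0.1, 0.3), (3, 3), (3.2, 2.9), (2.9, 3.1)]
labels = [0, 0, 0, 1, 1, 1]
nodes = arpm_features(data, labels)
print(len(nodes) > 0)   # some random projections isolate a pure cluster
```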
Learning from others …
Learn to transfer knowledge by extracting interesting features created by different systems. Ex. prototypes, combinations of features with thresholds …
=> Universal Learning Machines.
Example of feature types:
B1: Binary – unrestricted projections b1.
B2: Binary – complexes b1 ∧ b2 ... ∧ bk.
B3: Binary – restricted by distance.
R1: Line – original real features ri; non-linear thresholds for "contrast enhancement" σ(ri − bi); intervals (k-sep).
R4: Line – restricted by distance, original feature; thresholds; intervals (k-sep); more general 1D patterns.
P1: Prototypes: general q-separability, weighted distance functions or specialized kernels.
M1: Motifs, based on correlations between elements rather than input values.
(B3 example: bi = 0 ∧ r1 ∈ [r1′, r1″] ∧ r2 ∈ [r2′, r2″] ...)
B1/B2 Features
Dataset | B1 feature | B2 feature
Australian | F8 < 0.5 | F8 ≥ 0.5 ∧ F9 ≥ 0.5
Appendicitis | F7 ≥ 7520.5 | F7 < 7520.5 ∧ F4 < 12
Heart | F13 < 4.5 ∧ F12 < 0.5 | F13 ≥ 4.5 ∧ F3 ≥ 3.5
Diabetes | F2 < 123.5 | F2 ≥ 143.5
Wisconsin | F2 < 2.5 | F2 ≥ 4.5
Hypothyroid | F17 < 0.00605 | F17 ≥ 0.00605 ∧ F21 < 0.06472
Example of B1 features taken from segments of decision trees. These features, used in various learning systems, greatly simplify their models and increase their accuracy. Almost all systems reach similar accuracy!
Dataset Classifier
SVM (#SV) SSV (#Leafs) NB
Australian 84.9±5.6 (203) 84.9±3.9 (4) 80.3±3.8
ULM 86.8±5.3(166) 87.1±2.5(4) 85.5±3.4
Features B1(2) + P1(3) B1(2) + R1(1) + P1(3) B1(2)
Appendicitis 87.8±8.7 (31) 88.0±7.4 (4) 86.7±6.6
ULM 91.4±8.2(18) 91.7±6.7(3) 91.4±8.2
Features B1(2) B1(2) B1(2)
Heart 82.1±6.7 (101) 76.8±9.6 (6) 84.2±6.1
ULM 83.4±3.5(98) 79.2±6.3(6) 84.5±6.8
Features Data + R1(3) Data + R1(3) Data + B1(2)
Diabetes 77.0±4.9 (361) 73.6±3.4 (4) 75.3±4.7
ULM 78.5±3.6(338) 75.0±3.3(3) 76.5±2.9
Features Data + R1(3) + P1(4) B1(2) Data + B1(2)
Wisconsin 96.6±1.6 (46) 95.2±1.5 (8) 96.0±1.5
ULM 97.2±1.8(45) 97.4±1.6(2) 97.2±2.0
Features Data + R1(1) + P1(4) R1(1) R1(1)
Hypothyroid 94.1±0.6 (918) 99.7±0.5 (12) 41.3±8.3
ULM 99.5±0.4(80) 99.6±0.4(8) 98.1±0.7
Features Data + B1(2) Data + B1(2) Data + B1(2)
T-based metalearning
To create successful meta-learning through search in the model space, fine granulation of methods is needed: extracting information using support features, learning from others, knowledge transfer and deep learning.
Learn to compose, using complexity guided search, various transformations (neural or processing layers), for example:
• Creation of new support features: linear, radial, cylindrical, restricted localized projections, binarized … feature selection or weighting.
• Specialized transformations in a given field: text, bio, signal analysis, ….
• Matching pursuit networks for signal decomposition, QPC index, PCA or ICA components, LDA, FDA, max. of mutual information etc.
• Transfer learning, granular computing, learning from successes: discovering interesting higher-order patterns created by initial models of the data.
• Stacked models: learning from the failures of other methods.
• Schemes constraining search, learning from the history of previous runs at the meta-level.
Summary
• Challenging data cannot be handled with existing DM tools.
• Similarity-based framework enables meta-learning as search in the model space, heterogeneous systems add fine granularity.
• No off-the-shelf classifiers are able to learn difficult Boolean functions.
• Visualization of hidden neurons shows that frequently perfect but non-separable solutions are found despite base-rate outputs.
• Linear separability is not the best goal of learning, other targets that allow for easy handling of final non-linearities should be defined.
• k-separability defines complexity classes for non-separable data.
• Transformation-based learning shows the need for component-based approach to DM, discovery of simplest models and support features.
• Meta-learning replaces data miners automatically creating new optimal learning methods on demand.
Is this the final word in data mining? Only the future will tell.
Meta-learning research survey
W. Duch, K. Grąbczewski, N. Jankowski
Contents
1 What is meta-learning?
2 Committees of decision models
3 Meta-level regression
4 Rankings of algorithms
5 Meta-learning as a search process
What is meta-learning?
• Generally, meta-learning encompasses all efforts to learn how to learn, including gathering meta-knowledge and using meta-knowledge in further learning.
• Meta-knowledge is knowledge about learning processes, about the influence of machine parameters on final results, etc.
During the last two decades, the term meta-learning has been used in many different contexts:
• building committees of decision models,
• building regression models predicting machine accuracy,
• building algorithm rankings for given datasets,
• searching through spaces of learning machines' parameters, augmented by meta-knowledge, and gathering new meta-knowledge.
Committees of decision models
[Diagram: a decision module combines Member 1 ... Member k.]
• Simple committees do not learn at the meta-level: e.g. simple majority voting.
• Some "intelligent" decision modules perform meta-analysis.
  - Bagging, arcing, boosting perform some meta-analysis to build more stable decision makers (Dietterich 1997) and are very popular, but this is not exactly what we would call "meta-learning".
  - Stacking: the decision module is a meta-level learner.
  - Many advanced, heterogeneous, undemocratic committees have been published.
Stacking
• Learning machines are trained on the results of a group of models.
• Stolfo et al. (1997), Prodromidis et al. (2000): JAM (Java Agents for Meta-learning), a parallel, distributed system for scalable computing.
• Todorovski and Dzeroski (2003): Meta Decision Trees; properly adapted C4.5 decision trees determine which model to use.
• NOEMON: Kalousis and Theoharis (1999), Kalousis and Hilario (2000); here stacking is also called meta-learning.
Undemocratic committees
Meta-analysis may lead to estimation of the areas and degrees of competence of each base learner, to provide more reasonable decisions of the decision module.
• Chan and Stolfo (1993, 1998): meta-learning by arbitration and combining.
  - Arbiters: a binary tree of arbiters (members organized in pairs, an arbiter for each pair, arbiters in pairs, and so on).
  - Combiners: a sort of stacking.
  - Combiners compute a prediction that may be entirely different from any proposed by the base models, whereas arbiters choose one of the predictions of the base models.
• Duch and Itert (2003) define incompetence functions that describe member (in)competence in particular points of the data space.
• Jankowski and Grąbczewski (2005) reflect global and local competence in final ensemble decisions.
Learnt Topology GANNs
Kadlec and Gabrys (2008): LTGANN – Learnt Topology Gating Artificial Neural Networks
I Each committee member consists of two networks: a local expert and a gating network.
I The local expert learns the task given to the committee.
I The gating network learns the performance of the local expert, linking expert performance to the positions of samples in the input space.
I The final decision is made by weighting.
I Flexibility of ANN topology selection: there is no need to define the exact number of hidden units either for the local experts or for the gating networks.
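The weighted decision can be sketched as below; the experts and gating functions are hand-made stand-ins for the LTGANN networks, purely for illustration:

```python
# Each committee member pairs a local expert with a gating function
# that estimates the expert's competence at the given input; the
# final decision is a competence-weighted combination.

def committee_predict(members, x):
    # members: list of (expert, gate) pairs; gate(x) >= 0.
    weights = [gate(x) for _, gate in members]
    total = sum(weights) or 1.0
    return sum(w * expert(x)
               for (expert, _), w in zip(members, weights)) / total

# Toy experts on 1-D input, each "competent" on one half of the space.
members = [
    (lambda x: 0.0, lambda x: 1.0 if x < 0.5 else 0.0),   # expert for x < 0.5
    (lambda x: 1.0, lambda x: 1.0 if x >= 0.5 else 0.0),  # expert for x >= 0.5
]
print(committee_predict(members, 0.9))  # second expert dominates: 1.0
```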
Meta-level regression
I Regression methods predict accuracies of different learning machines on the basis of dataset descriptions.
I Koepf et al. (2000), Bensusan and Kalousis (2001):
• Input: dataset description as a series of values derived from information theory and statistics.
• Output: accuracy of the model (usually a classifier).
I Ranking learning machines:
• One regression model for each algorithm to rank.
• Machines are ranked in decreasing order of predicted accuracy.
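The scheme can be sketched as follows; the per-algorithm regressors are hypothetical linear fits on a single meta-feature, and the history values are invented for illustration:

```python
# One regression model per algorithm: given a dataset description
# (meta-features), each model predicts that algorithm's accuracy;
# algorithms are then ranked by predicted accuracy, descending.

def fit_linear(xs, ys):
    # Least-squares line through (meta-feature, accuracy) pairs.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return lambda x: (my - b * mx) + b * x

# Hypothetical history: meta-feature = log(#instances), accuracies seen.
history = {
    "kNN": ([2.0, 3.0, 4.0], [0.70, 0.80, 0.90]),
    "NBC": ([2.0, 3.0, 4.0], [0.85, 0.82, 0.79]),
}
regressors = {name: fit_linear(xs, ys) for name, (xs, ys) in history.items()}

def rank_algorithms(meta_feature):
    preds = {name: r(meta_feature) for name, r in regressors.items()}
    return sorted(preds, key=preds.get, reverse=True)

print(rank_algorithms(5.0))  # larger dataset favors kNN in this toy history
```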
Rankings of algorithms
I The most popular approach, initiated by the (probably largest so far) meta-learning project MetaL (1998–2002).
I Rankings are learned from simple descriptions of data.
Data → meta-features describing the data → ranking of algorithms
I Meta-attributes are basic data characteristics: number of instances, number of features, types of features (continuous or discrete, how many of each), data statistics, etc.
I Rankings are generated by meta-learners:
• for each pair of algorithms to be ranked, a classification algorithm is trained on two-class datasets describing wins and losses of the algorithms on some collection of datasets,
• decisions of the meta-classifiers are combined to build the final ranking.
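The pairwise scheme can be sketched like this; the meta-classifier is a trivial one-nearest-neighbor on a single meta-feature, and all win/loss records are invented for illustration:

```python
# For each pair of algorithms, a meta-classifier is trained on
# datasets labeled with which of the two "won"; its decisions are
# combined (here: win counts) into a final ranking.

def nn1(train, x):
    # 1-nearest-neighbor on a single meta-feature.
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Hypothetical win/loss records: (meta-feature, winner) per dataset.
pairwise = {
    ("kNN", "NBC"): [(10, "NBC"), (100, "kNN"), (1000, "kNN")],
    ("kNN", "SVM"): [(10, "SVM"), (100, "SVM"), (1000, "kNN")],
    ("NBC", "SVM"): [(10, "NBC"), (100, "SVM"), (1000, "SVM")],
}

def rank(meta_feature):
    wins = {"kNN": 0, "NBC": 0, "SVM": 0}
    for record in pairwise.values():
        wins[nn1(record, meta_feature)] += 1
    return sorted(wins, key=wins.get, reverse=True)

print(rank(900))  # each pairwise meta-classifier votes, wins are counted
```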
Algorithm selection problem
Algorithm selection problem (ASP)
I May be regarded as equivalent to building algorithm
rankings.
I ASP was addressed already by Rice (1974):
• D ∈ 𝒟 – problem space,
• A ∈ 𝒜 – algorithm space,
• m ∈ R^n – performance measure space,
• ||m|| ∈ R – performance norm,
with a selection mapping S(D) = A and performance mapping p(A, D) = m, evaluated through the norm || · ||.
I Most often, ASP gets reduced to the problem of assigning an optimal algorithm to a vector of features describing the data, which is quite restrictive.
No Free Lunch theorems
No free lunch theorems, in this context, may be expressed as:
Each single learning algorithm, tested on all possible datasets, will be, on average, as accurate as random choice.
So does building learning machines make any sense?
I Yes, because “all possible datasets” is exactly what makes NFL provable but useless!
I In the context of training and testing, “all possible” includes datasets where the training and test data come from completely different distributions and are completely unrelated.
I We expect training data to be representative of the population, and NFL does not care about representativeness.
I The inductive bias of algorithms is not an explanation.
Conclusion: let's not pay much attention to NFL!
Landmarking
I Pfahringer et al. (2000) – the idea of landmarking: using meta-features measuring the performance of some simple and efficient learning algorithms (landmarkers):
• linear discriminant learner,
• naive Bayes learner,
• C5.0 tree learner.
Meta-learners used:
• C5.0 trees and rules,
• boosted C5.0,
• RIPPER,
• LTREE,
• linear discriminant,
• naive Bayes,
• nearest neighbor.
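A sketch of the landmarking idea, with two cheap stand-in landmarkers (a majority-class baseline and leave-one-out 1-NN) in place of the learners listed above; the dataset is a toy:

```python
# Landmarking: run cheap learners on the data and use their
# accuracies as meta-features describing the dataset.

def majority_landmarker(X, y):
    # Accuracy of always predicting the most frequent class.
    best = max(set(y), key=y.count)
    return sum(t == best for t in y) / len(y)

def nn_landmarker(X, y):
    # Leave-one-out accuracy of 1-nearest-neighbor (1-D inputs).
    hits = 0
    for i, x in enumerate(X):
        j = min((k for k in range(len(X)) if k != i),
                key=lambda k: abs(X[k] - x))
        hits += y[j] == y[i]
    return hits / len(y)

def landmark_features(X, y):
    return {"majority": majority_landmarker(X, y),
            "1nn": nn_landmarker(X, y)}

X = [0.1, 0.2, 0.3, 0.8, 0.9]
y = [0, 0, 0, 1, 1]
print(landmark_features(X, y))  # accuracies become meta-features
```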
Landmarking continued
Fürnkranz and Petrak (2001):
I Relative landmarking – meta-attributes describe relations between results instead of accuracies:
• ranks of landmarkers,
• order of landmarkers (inverse of ranks),
• pairwise comparisons between accuracies of landmarkers (+1, -1, ?),
• pairwise accuracy ratios (continuous).
I Subsampling – original datasets are reduced to facilitate landmarking by algorithms of larger computational complexity.
Landmarking continued
Soares et al. (2001)
I Relative landmarking and subsampling combined.
I Adjusted ratio of ratios (ARR) index – a combination of accuracy and time to assess relative performance:

ARR^d_{i,j} = (A^d_i / A^d_j) / (1 + log(T^d_i / T^d_j) · X)

A^d_i and T^d_i are the accuracy and time of the i-th landmarker on data d; X is a parameter: “the amount of accuracy we are willing to trade for a 10-times speed-up”.
I When n > 2 algorithms are involved, they calculate the relative landmark:

rld_i = ( Σ_{j≠i} ARR^d_{i,j} ) / (n − 1).
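A direct computation of the index; the base-10 logarithm is an assumption here, consistent with the “10-times speed-up” reading of X:

```python
from math import log10

def arr(acc_i, acc_j, t_i, t_j, x):
    # ARR^d_{i,j}: accuracy ratio discounted by the log-time ratio.
    return (acc_i / acc_j) / (1 + log10(t_i / t_j) * x)

def relative_landmark(accs, times, i, x):
    # rld_i: average ARR of landmarker i against all others.
    n = len(accs)
    return sum(arr(accs[i], accs[j], times[i], times[j], x)
               for j in range(n) if j != i) / (n - 1)

# Landmarker i is as accurate as j but 10x slower:
# with x = 0.1 its ARR drops from 1.0 to 1/1.1.
print(round(arr(0.9, 0.9, 10.0, 1.0, 0.1), 4))
```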
Still landmarking. . .
Brazdil and Soares (2000), Brazdil et al. (2003)
I more advanced statistical measures of datasets (including histogram analysis and information-theory-based indices) as meta-attributes,
I the k nearest neighbors (kNN) algorithm chooses similar datasets; the ranking is created from results obtained by the ranked algorithms on the nearest neighbors,
I methods of combining results to create rankings:
• ARR – adjusted ratio of ratios,
• counting statistically significant differences in results: average ranks (AR) and significant wins (SW),
• ranking methods estimated by comparison to the ideal ranking:
  – Spearman's rank correlation coefficient,
  – Friedman's significance test,
  – Dunn's multiple comparison technique.
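Ranking quality against the ideal ranking can be checked with Spearman's coefficient; a small self-contained version (no ties assumed), with invented algorithm names:

```python
def spearman(rank_a, rank_b):
    # Spearman's rho for two rankings given as ordered lists of
    # item names (best first); assumes no ties.
    n = len(rank_a)
    pos_b = {item: i for i, item in enumerate(rank_b)}
    d2 = sum((i - pos_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1 - 6 * d2 / (n * (n * n - 1))

ideal = ["SVM", "kNN", "NBC", "LVQ"]
predicted = ["SVM", "NBC", "kNN", "LVQ"]
print(spearman(ideal, predicted))  # one adjacent swap lowers rho
```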
NOEMON
Kalousis and Theoharis (1999), Kalousis and Hilario (2000)
I NOEMON is a system advising which classifier to use, and performing the classification.
I The meta-learning space contains data characteristics (statistics, concentration histograms) and performance measures of algorithms.
I Different performance measures are gathered: accuracy, training time, execution time, resource demand.
I Meta-learners trained on results of pairs of algorithms:
• kNN,
• decision trees.
I Model selection is based on comparisons of results of pairs of algorithms.
Other landmarking related approaches
I DecT by Peng et al. (2002)
• Data characteristics derived from the structure of C5.0 decision trees built on the data.
• Like in other approaches:
  – kNN to select similar datasets,
  – rankings by ARR,
  – Spearman's correlation coefficient to estimate rankings.
I Bensusan et al. (2000)
• Landmarking and decision tree techniques combined.
• Typed higher-order inductive learning directly from decision trees instead of tree characteristics.
I Todorovski et al. (2002)
• Meta-data obtained from statistics, information theory and landmarking.
• Predictive Clustering Trees – multi-split decision trees:
  – minimization of intra-cluster variance and maximization of inter-cluster variance; clusters contain data with similar relative performance of algorithms,
  – ranks instead of accuracies – ranking trees.
Rankings of algorithms – general remarks
I A very naive approach: a simple data transformation may completely change the ranking.
I Resembles the common approach of splitting the data analysis process into a data preprocessing stage and final learning.
I We are not interested in raw rankings, but in complex machine combinations that model the data as accurately as possible.
I Even very accurate rankings give no hints about data transformations that could improve the results.
I No human expert would use such a technique to select the most promising learning algorithms.
Meta-learning as an advanced search process
I The fundamental aim of meta-learning is to be more successful in object-level (base-level) learning.
I What do human experts do to obtain an optimal model for given data?
• search for solutions by testing subsequent candidates,
• test candidates not at random, but after selection and in an order based on some meta-knowledge,
• gain new meta-knowledge (general and specific to the task being solved) while learning.
I Automated meta-learning should mimic the behavior of human experts. Therefore, in our approach, we:
• generate candidate machine configurations according to meta-knowledge (initially from human experts),
• order candidates with a special complexity measure,
• test candidates to create a ranking,
• gather new meta-knowledge and refine the human experts' meta-knowledge to successfully drive the search process.
Universal meta-learning architecture and algorithms
(DI/NCU) Meta-learning—WCCI 2010 1 / 48
Learning and meta-learning
A learning problem can be defined as P = 〈D, M〉, where D ⊆ 𝒟 is a learning dataset and M is a model space. The learning process of a learning machine L is a function A(L):

A(L) : K_L × D → M    (1)

Diagram: a machine with inputs 1…n, outputs 1…m, machine process parameters and a results repository.

Meta-learning is another, or rather a specific kind of, machine learning. ML = learn how to learn, to learn as well as possible.
Decomposition of learning problem
To decompose the problem P = 〈D, M〉 into subproblems:

P = [P_1, . . . , P_n],   P_i = 〈D_i, M_i〉    (2)

As a result, the model for the main problem P is

m = [m_1, . . . , m_n],    (3)

and the model space gets the form

M = M_1 × . . . × M_n.    (4)

The solution constructed by decomposition is often much easier to find.
The model m_i solving the subproblem P_i is the result of the learning process

A(L_i) : K_{L_i} × D_i → M_i,   i = 1, . . . , n,    (5)

where

D_i = ∏_{k ∈ K_i} M_k,    (6)

with K_i ⊆ {0, 1, . . . , i − 1} and M_0 = D. The main learning process L is decomposed into the vector

[L_1, . . . , L_n].    (7)
Data loader → Vector selector → Classifier

Such decomposition is often very natural: standardization or feature selection naturally precedes classification, a number of classifiers precede a committee module, etc.
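The chain above can be sketched in code; a toy illustration of the decomposition with hypothetical sub-machines, not the system's actual API:

```python
# Each sub-machine L_i consumes outputs of earlier machines
# (its K_i set) and produces a model m_i; the composite model is
# the list [m_1, ..., m_n].

def data_loader():
    # m_1: the dataset itself, as (feature, target) pairs.
    return [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1), (9.0, 1)]

def vector_selector(data):
    # m_2: keep vectors with |x| < 5 (toy outlier removal).
    return [(x, t) for x, t in data if abs(x) < 5]

def train_classifier(data):
    # m_3: threshold at the midpoint between the class means.
    mean = lambda c: sum(x for x, t in data if t == c) / \
                     sum(1 for _, t in data if t == c)
    thr = (mean(0) + mean(1)) / 2
    return lambda x: int(x > thr)

m1 = data_loader()          # data loader
m2 = vector_selector(m1)    # vector selector, fed by m1
m3 = train_classifier(m2)   # classifier, fed by m2
print(m3(0.5))
```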
Goal of meta-learning
Finding an optimal model for a given data mining problem P is almost always NP-hard. Meta-learning could find solutions which are at least not worse than the ones that can be found by human experts in a given, limited amount of time. Experts usually restrict their tests to a subset of algorithms.

The goal of meta-learning is to maximize the probability of finding the best possible solution within the search space of a given problem P, in as short a time as possible.

Complexity is used to order machines.
Configuring the Meta-learning Problem
An MLA need NOT be a mystery!
Appropriate configuration of the MLA should enable finding a solution of any P. E.g.: if we want to use it for classification or approximation, don't fix it inside the MLA!
The configuration of an MLA is specified by:
• stating the functional search space,
• defining the goal of meta-learning (of problem P),
• defining the stop condition,
• defining the attractiveness module,
• defining initial meta-knowledge.
The above configuration elements guarantee that the algorithm is universal.
Search space of MLA
Machine Configuration Generators & Generators Flow
A fixed set of LMs inside the MLA is not flexible and strongly limits the MLA! The goal of machine configuration generators (MCGs) is to provide/produce machine configurations. The goal of the generators flow is to provide machine configurations to the MLA. A generators flow is a graph (DAG) of machine configuration generators. Each MCG may be based on different meta-knowledge and may reveal different behavior, which may even change in time (during the meta-learning progress).

Simplest generators flow: Classifiers set Generator → output
Set-based Generators
The simplest and very useful machine generator is based on the idea of providing just an arbitrary (initial) sequence of machine configurations. Usually it is convenient to have a few set-based generators in a single generators flow:
• set-based generator of simple classifiers,
• set-based generator of classifiers,
• set-based generator of feature rankings,
• set-based generator of prototype selectors,
• set-based generator of committees, . . .
Schemes and Machine Configuration Templates
A machine configuration with an empty scheme (or empty schemes) as a child machine configuration is called a machine configuration template.

Diagram: a feature selection template – data passes through a Transformation built from a Feature ranking (an empty Ranking scheme) followed by Feature selection.

The role of an empty scheme is defined by the types of its inputs and outputs.
Template-based Generators
Diagram: a “transform and classify” template – an empty Transformation scheme feeds an empty Classifier scheme; filling the classifier scheme with e.g. kNN yields a concrete machine configuration.

Simple generators flow: a Classifiers set Generator and a Transformers set Generator feed a Transform-and-classify machine Generator, which feeds the output.
Generator flow example

Diagram: generators flow with a Transformer Generator, Ranking Generator, Classifiers Generator, Feature Selection of Rankings Generator, Transform and Classify Generators (I and II), MPS/FS of Transform & Classify Generator and MPS for Classifiers Generator, feeding the flow output.
A part of the generators flow (1) . . .

Diagram: part of the generators flow with a Ranking Discr Generator, Ranking Generator, Classifiers Generator, Discr & Rank Generator, Feature Selection of Rankings Generator, Transform and Classify Generator II and MPS/FS of Transform & Classify Generator, feeding the flow output.
Schemes of a committee & generator flow with a committee

Diagram: committee scheme – a Committee machine combines several child Classifier schemes into a single classifier.

Part of the generators flow (2): a Committee modules Generator, Classifier Sequences Generator and Committee Schemes Generator feed the flow output.
Advanced Generators
Advanced generators can learn from meta-knowledge, half-independently of the heart of the meta-learning algorithm, and build their own specialized meta-knowledge. Advanced generators are informed each time a test task is finished. The generators may read the results of the test task and the current machine configuration ranking. Different generators may realize different strategies; there is no single perfect strategy.

Part of the generators flow with an intelligent committee: a Committee modules Generator and an Intelligent Committee Generator feed the flow output.

The intelligent committee generator observes the progress of meta-learning. Model diversity may easily be applied.
Generation of ε-machines

Part of the generators flow with ε-machines: a Generator of some base classifiers and an ε-generator feed the flow output.

In some cases the original data are huge. Then it is important to start with simplified machines based on shrunk data: filter by the number of vectors and the number of features, and provide independent configurations.
Configuring of meta-learning
Defining the Goal of Meta-learning for a particular problem P:
test procedure + evaluation of results.
Strict and flexible definition of the goal!

Diagram: a Repeater machine – a Distributor scheme (CV distributor) splits the data into training and test data; a Test scheme trains a Classifier on the training data and evaluates it with a Classification test on the test data.

Test procedure finished ⇒ results evaluation ≡ query.
Query definition = evaluation of results: which nodes, which values, how to transform.
Defining Initial Meta-knowledge
Initial meta-knowledge comes in different kinds. It touches many substructures, which are another aspect of the configuration: for example, MPS and its meta-knowledge, the knowledge of advanced generators, and more. But there is no common, single meta-knowledge!
Meta-learning algorithm
General meta-learning scheme

START → initialize → check stop condition:
• no → start some test tasks → wait for any task → evaluate results → check stop condition again,
• yes → finalize → STOP.
Test Tasks Starting
From the generators flow to the heap. The heap is ordered by complexity.

procedure startTasksIfPossible;
  while (¬ machinesHeap.Empty() ∧ ¬ mlTaskSpooler.Full())
  {
    <mc, cmplx> = machinesHeap.ExtractMinimum();
    timeLimit = τ · cmplx.time / cmplx.q;
    mlTaskSpooler.Add(mc, limiter(timeLimit), priority−−);
  }
end

Machine heap vs. task spooler. The complexity is approximated for a given configuration. To bypass the halting problem and the problem of (possibly) inaccurate approximation, each test task gets a time limit.
Analysis of Finished Tasks
Tasks finish either normally or halted by the (time-)limiter.

procedure analyzeFinishedTasks;
  foreach (t in mlTaskSpooler.finishedTasks) {
    mc = t.configuration;
    if (t.status = finished_normally) {
      qt = computeQualityTest(t, queryDefinition);
      machinesRanking.Add(qt, mc);
      if (attractivenessModule is defined)
        attractivenessModule.Analyze(t, qt, machinesRanking);
      machineGeneratorsFlow.Analyze(t, qt, machinesRanking);
    } else { // task broken by the limiter
      mc.cmplx.q = mc.cmplx.q / 4;
      machinesHeap.Quarantine(mc);
    }
    mlTaskSpooler.RemoveTask(t);
  }
end
Machine Complexity Evaluation
Which complexity?
C_L(P) = min_p { c*(p) : program p prints P in time t_p }    (8)

Kolmogorov:
c_k(p) = l(p)    (9)

Levin:
c_l(p) = l(p) + log t_p    (10)

The problem with the above definitions: the time influence is too small. E.g.: a program running 1024 times longer than another one may have only slightly bigger complexity, just +10.

A more realistic version:

c_a(p) = l(p) + t_p / log t_p    (11)
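The objection to Levin's log-time term can be checked numerically; base-2 logarithms and the program length are assumptions here:

```python
from math import log2

def c_levin(length, t):
    # c_l(p) = l(p) + log t_p
    return length + log2(t)

def c_realistic(length, t):
    # c_a(p) = l(p) + t_p / log t_p
    return length + t / log2(t)

l, t = 100, 2 ** 10
# A 1024x longer run adds only +10 to Levin's measure ...
print(c_levin(l, 1024 * t) - c_levin(l, t))
# ... while the t / log t term grows by orders of magnitude.
print(c_realistic(l, 1024 * t) / c_realistic(l, t))
```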
Quarantine and penalty term
Complexity approximation ⇒ errors. Quarantine + penalty term:

c_b(p) = [l(p) + t_p / log t_p] / q(p)    (12)

where q(p) reflects an estimate of the reliability of p. There is no halting problem inside tested machines: this mechanism prevents running test tasks for unpredictably long times. If a test task breaks its time limit, its reliability decreases and the task goes back to the machine heap.
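The effect of the penalty term can be sketched as follows; code length and time are in assumed units, and the reliability update mirrors the q := q/4 step applied when a task breaks its time limit:

```python
from math import log2

def c_b(length, t, q):
    # c_b(p) = [l(p) + t_p / log t_p] / q(p); a lower reliability q
    # means a higher effective complexity, so a later start from
    # the heap of waiting test tasks.
    return (length + t / log2(t)) / q

q = 1.0
base = c_b(100, 1024, q)
q /= 4            # test task broke its time limit: reliability drops
penalized = c_b(100, 1024, q)
print(penalized / base)  # the quarantined task moves down the queue
```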
Attractiveness
Meta-knowledge may influence the order of the test tasks waiting in the machine heap and the machine configurations provided during the process. The optimal way of doing this is to add a new term to c_b(p), shifting the start time of a given test in the appropriate direction:

c_c(p) = [l(p) + t_p / log t_p] / [q(p) · a(p)],    (13)

where a(p) reflects the attractiveness of the test task p.
Complexities of What Machines are We Interested in?
We are interested in the complexity of the test task (from the test template!) with the nested learning machine. This complexity reflects the complete behavior of the machine:
• a part of the test scheme reflects the complexity of learning of the given machine,
• the rest reflects the complexity of computing the test (for example a classification test).
The cost of learning reflects the cost of creating a given machine; the cost of testing reflects the cost of using it:

learning costs + testing costs = costs of machine use.
The needs of complexity computation
To provide a learning machine (a simple one or a complex machine), its configuration and inputs must be specified:

A(L_i) : K_{L_i} × D_i → M_i,   i = 1, . . . , n.    (14)

Complexity computation must reflect the information from the configuration and the inputs, as well as the recursive nature of configurations together with the input–output connections.
Meta-input descriptions, meta-outputs

Diagram: a test scheme over a “transform and classify” machine – PCA fills the transformation scheme, kNN fills the classifier scheme; an SVM classifier is evaluated by a classification test on the test data.

It would be impossible to compute the complexity of kNN before PCA is finished. To make complexity computation possible we use proper meta-input descriptions. Meta-inputs are counterparts of inputs in the “meta-world”. Additionally, complexity computation must also provide meta-outputs.
Computing the complexity
A function computing the complexity for machine L should be a transformation

D_L : K_L × M+ → R^2 × M+,    (15)

where K_L is the configuration space and M+ is the space of meta-inputs (and meta-outputs); R^2 reflects the time complexity and the memory complexity. The problem is not as easy as the above form of the function suggests: finding the right function for a given learning machine L may be impossible. Configuration elements are not always as simple as scalar values; in some cases they are represented by functions or by subconfigurations. A similar problem concerns the meta-inputs.
Meta Evaluators
For such a high level of generality we use meta-evaluators. The general goal of a meta-evaluator is to exhibit a functional description of complexity aspects, useful for further reuse by other meta-evaluators. In the case of a machine, the meta-outputs are exhibited to provide a complexity-information source for their input readers.

Learning machine → Meta evaluator

Meta-evaluators are constructed not only for machines, but also for machine outputs and other nontrivial objects. Each evaluator needs adaptation, which can be seen as an initialization and can be compared to the learning of a machine.
Output evaluators
Thanks to an output evaluator

D_o : I_1 → M_1,    (16)

where I_1 is the space of a (single) output and M_1 is the space of a single meta-output, we do not need a machine evaluator of the form

D'_L : K_L × I+ → R^2 × M+,    (17)

where I+ is the space of machine L inputs.

Nontrivial object evaluators
Sometimes machine complexity depends on nontrivial elements of the configuration, for example the metric in configurations of kNN or SVM. We use a metric evaluator to make kNN's evaluator independent of the metric:

D_obj : OBJ → M_obj.    (18)
Machine Evaluator
Additional functionality of machine evaluators:
• Declarations of output descriptions: if a given machine provides outputs, then the output evaluators devoted to this machine type must provide meta-descriptions of the outputs. The descriptions of outputs are meta-evaluators of the appropriate kind (for example meta-classifier, meta-transformer, meta-data, etc.).
• Time & memory: these two quantities must be provided by each machine evaluator to enable proper computation of time and memory complexity.
• Child evaluators: advanced analysis of the complexities of complex machines requires access to the meta-evaluators of submachines. Child evaluators provide this functionality.
Classifier Evaluator
The evaluator of a classifier output has to provide the time complexity of classifying an instance:

real ClassifyCmplx(DataEvaluator dtm);

This is the time consumed by the instance classification routine.

Approximator Evaluator
The evaluator of an approximation machine has exactly the same functionality as that of a classifier, except that approximation time is considered in place of classification time:

real ApproximationCmplx(DataEvaluator dtm);
Data Evaluators
In the context of data tables, data evaluators should provide information like:
• the number of instances,
• the number of features,
• descriptions of features (ordered/unordered, number of values, etc.),
• descriptions of targets,
• statistical information per feature,
• statistical information per dataset,
• other information useful for computing the complexities of different machines learning from the data.
Learning Evaluators
Approximation framework
If evaluators cannot be defined in an analytical way, the approximation framework is used to construct them.

Diagram: a Plain Evaluator is ready to use after adaptation; a Learnable Evaluator requires data collection and evaluator learning before adaptation.

Before an evaluator is used by a meta-learning process, all its approximators must be trained. Each evaluator is learned once.
Environment for machine behavior observation
To collect the learning data, proper information has to be extracted from observations of “machine behavior”. The “environment” for machine monitoring must be defined; the environment configuration is sequentially adjusted, realized and observed. The environment is defined by the initial configuration of the observing scheme. Changes of the environment are realized by specialized scenarios, which define how to modify the environment. Each scene is the source of a single learning pair for a single approximator. Each evaluator may consist of a few independent approximators.

initial environment + scenes scenario
Data collection functionality
Full control of data acquisition is possible thanks to the collection of methods implemented by each evaluator:
• EvaluatorBase prepares the evaluator for analysis of the environment being observed,
• GetMethodsForApprox declares additional approximation tasks,
• ApproximatorDataIn and ApproximatorDataOut prepare the sequence of input–output vectors.
Diagram: the data collection loop – START; init {startEnv, envScenario}; try to generate the next configuration ‘oc’ using startEnv & envScenario; if it succeeds, run machines according to ‘oc’, extract in–out pairs from the project for each approximator, and loop; otherwise train each approximator, return the series of approximators for the evaluator, and STOP.
Elements of the MLA and their connections

Diagram: the Meta-learning Search Loop connects the Test Task Heap, Complexity Approximator, Evaluator Repository, Test Task Generator, Quarantine, Attractiveness Module and Generators Flow, and produces the Meta-learning Results.
Meta-learning in Action – Generators flow
To make the results comparable, all tests were prepared using the generators flow below:

Diagram: generators flow with a Ranking Generator, Feature Selection of Rankings Generator, Classifiers Generator, MPS for Classifiers Generator, Transform and Classify Generator and MPS/FS of Transform & Classify Generator, feeding the output.
Classifier set:
• kNN (Euclidean) – k nearest neighbors with the Euclidean metric,
• kNN [MetricMachine (EuclideanOUO)] – kNN with Euclidean and Hamming metrics,
• kNN [MetricMachine (Mahalanobis)],
• NBC – naive Bayes classifier,
• SVMClassifier – support vector machine with a Gaussian kernel,
• LinearSVMClassifier – SVM with a linear kernel,
• [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]],
• [LVQ, kNN (Euclidean)] – learning vector quantization algorithm,
• Boosting (10x) [NBC] – boosting algorithm with 10 NBCs.

Ranking set:
• RankingCC – correlation-coefficient-based feature ranking,
• RankingFScore – Fisher-score-based feature ranking.
1 kNN (Euclidean)
2 kNN [MetricMachine (EuclideanOUO)]
3 kNN [MetricMachine (Mahalanobis)]
4 NBC
5 SVMClassifier [KernelProvider]
6 LinearSVMClassifier [LinearKernelProvider]
7 [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]]
8 [LVQ, kNN (Euclidean)]
9 Boosting (10x) [NBC]
10 [[[RankingCC], FeatureSelection], [kNN (Euclidean)], TransformAndClassify]
11 [[[RankingCC], FeatureSelection], [kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
12 [[[RankingCC], FeatureSelection], [kNN [MetricMachine (Mahalanobis)]], TransformAndClassify]
13 [[[RankingCC], FeatureSelection], [NBC], TransformAndClassify]
14 [[[RankingCC], FeatureSelection], [SVMClassifier [KernelProvider]], TransformAndClassify]
15 [[[RankingCC], FeatureSelection], [LinearSVMClassifier [LinearKernelProvider]], TransformAndClassify]
16 [[[RankingCC], FeatureSelection], [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
17 [[[RankingCC], FeatureSelection], [LVQ, kNN (Euclidean)], TransformAndClassify]
18 [[[RankingCC], FeatureSelection], [Boosting (10x) [NBC]], TransformAndClassify]
19 [[[RankingFScore], FeatureSelection], [kNN (Euclidean)], TransformAndClassify]
20 [[[RankingFScore], FeatureSelection], [kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
21 [[[RankingFScore], FeatureSelection], [kNN [MetricMachine (Mahalanobis)]], TransformAndClassify]
22 [[[RankingFScore], FeatureSelection], [NBC], TransformAndClassify]
23 [[[RankingFScore], FeatureSelection], [SVMClassifier [KernelProvider]], TransformAndClassify]
24 [[[RankingFScore], FeatureSelection], [LinearSVMClassifier [LinearKernelProvider]], TransformAndClassify]
25 [[[RankingFScore], FeatureSelection], [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
26 [[[RankingFScore], FeatureSelection], [LVQ, kNN (Euclidean)], TransformAndClassify]
27 [[[RankingFScore], FeatureSelection], [Boosting (10x) [NBC]], TransformAndClassify]
28 ParamSearch [kNN (Euclidean)]
29 ParamSearch [kNN [MetricMachine (EuclideanOUO)]]
30 ParamSearch [kNN [MetricMachine (Mahalanobis)]]
31 ParamSearch [NBC]
32 ParamSearch [SVMClassifier [KernelProvider]]
33 ParamSearch [LinearSVMClassifier [LinearKernelProvider]]
34 ParamSearch [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]]
35 ParamSearch [LVQ, kNN (Euclidean)]
36 ParamSearch [Boosting (10x) [NBC]]
37 ParamSearch [[[RankingCC], FeatureSelection], [kNN (Euclidean)], TransformAndClassify]
38 ParamSearch [[[RankingCC], FeatureSelection], [kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
39 ParamSearch [[[RankingCC], FeatureSelection], [kNN [MetricMachine (Mahalanobis)]], TransformAndClassify]
40 ParamSearch [[[RankingCC], FeatureSelection], [NBC], TransformAndClassify]
41 ParamSearch [[[RankingCC], FeatureSelection], [SVMClassifier [KernelProvider]], TransformAndClassify]
42 ParamSearch [[[RankingCC], FeatureSelection], [LinearSVMClassifier [LinearKernelProvider]], TransformAndClassify]
43 ParamSearch [[[RankingCC], FeatureSelection], [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
44 ParamSearch [[[RankingCC], FeatureSelection], [LVQ, kNN (Euclidean)], TransformAndClassify]
45 ParamSearch [[[RankingCC], FeatureSelection], [Boosting (10x) [NBC]], TransformAndClassify]
46 ParamSearch [[[RankingFScore], FeatureSelection], [kNN (Euclidean)], TransformAndClassify]
47 ParamSearch [[[RankingFScore], FeatureSelection], [kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
48 ParamSearch [[[RankingFScore], FeatureSelection], [kNN [MetricMachine (Mahalanobis)]], TransformAndClassify]
49 ParamSearch [[[RankingFScore], FeatureSelection], [NBC], TransformAndClassify]
50 ParamSearch [[[RankingFScore], FeatureSelection], [SVMClassifier [KernelProvider]], TransformAndClassify]
51 ParamSearch [[[RankingFScore], FeatureSelection], [LinearSVMClassifier [LinearKernelProvider]], TransformAndClassify]
52 ParamSearch [[[RankingFScore], FeatureSelection], [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
53 ParamSearch [[[RankingFScore], FeatureSelection], [LVQ, kNN (Euclidean)], TransformAndClassify]
54 ParamSearch [[[RankingFScore], FeatureSelection], [Boosting (10x) [NBC]], TransformAndClassify]
[Figure: two panels showing meta-learning test runs in real time, one per dataset; each panel plots the accuracy (0.0–1.0) of task ids 1–54 along the search time line (0.0–1.0), together with per-task complexity estimates.]
Mushroom & German numeric
[Figure: two panels showing meta-learning test runs in real time, one per dataset; each panel plots the accuracy (0.0–1.0) of task ids 1–54 along the search time line (0.0–1.0), together with per-task complexity estimates.]
Glass & Thyroid
Summary
• Appropriate abstract-level configuration of meta-learning = a universal MLA.
• Flexible generator flows for defining a smart search space.
• Complexity-controlled order of searching.
• The system is additionally supported by meta-knowledge (from experts and collected during learning).
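The complexity-controlled search order can be sketched as a priority queue: candidate machines are ordered by an estimated complexity, so cheap configurations are evaluated first and expensive ones only as the time budget allows. The complexity numbers below are made-up illustrative values, not those computed by the system.

```python
# Sketch of complexity-controlled search order: pop candidates
# cheapest-first from a heap keyed by estimated complexity.
# Complexity estimates are illustrative, not the system's own.
import heapq

candidates = [
    (0.05, "NBC"),
    (0.2, "kNN (Euclidean)"),
    (1.5, "SVMClassifier [KernelProvider]"),
    (6.0, "ParamSearch [SVMClassifier [KernelProvider]]"),
]

heap = list(candidates)
heapq.heapify(heap)

order = []
while heap:
    complexity, machine = heapq.heappop(heap)
    order.append(machine)  # evaluation order: cheapest first

print(order)
```

In the real system the queue is refilled by the machine generators, so complex configurations are only ever reached if simpler ones have not already consumed the budget.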
Google: W. Duch => Papers & presentations
Norbert: http://www.is.umk.pl/~norbert/metalearning.html
KIS: http://www.is.umk.pl => On-line publications
Book: Meta-learning in Computational Intelligence (coming . . . ).
Have a nice week in Barcelona!