TRANSCRIPT
Meta-Learning:
the future of data mining
Włodzisław Duch, Norbert Jankowski, Krzysztof Grąbczewski
+ Tomasz Maszczyk + Marek Grochowski
Department of Informatics, Nicolaus Copernicus University, Toruń, Poland
Google: W. Duch. WCCI 2010, Barcelona
Plan
• Problems with Computational intelligence (CI)
• Problems with current approaches to data mining/pattern recognition.
• Meta-learning as search in the space of all models.
• First attempt: similarity-based framework for meta-learning.
• Heterogeneous systems.
• Hard problems and support features.
• More components to build algorithms.
• Real meta-learning, or algorithms on demand.
What is there to learn?
Brains ... what is in EEG? What happens in the brain?
Industry: what happens with our machines?
Cognitive robotics: vision, perception, language.
Bioinformatics, life sciences.
What can we learn?
What can we learn using pattern recognition, machine learning, and computational intelligence techniques? Everything?
Neural networks are universal approximators and evolutionary algorithms solve global optimization problems – so everything can be learned? Not at all! All non-trivial problems are hard, need deep transformations.
Duda, Hart & Stork, Ch. 9, No Free Lunch + Ugly Duckling Theorems:
• Uniformly averaged over all target functions the expected error for all learning algorithms [predictions by economists] is the same.
• Averaged over all target functions no learning algorithm yields generalization error that is superior to any other.
• There is no problem-independent or “best” set of features.
“Experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems.”
In practice: try as many models as you can, rely on your experience and intuition. There is no free lunch, but do we have to cook ourselves?
Data mining packages
• No free lunch => provide different types of tools for knowledge discovery: decision trees, neural and neurofuzzy methods, similarity-based methods, SVMs, committees, tools for visualization of data.
• Support the process of knowledge discovery/model building and evaluating, organizing it into projects.
• Many other interesting DM packages of this sort exist: Weka, Yale, Orange, Knime ... 168 packages on the the-data-mine.com list!
• We are building Intemi, radically new DM tools.
GhostMiner, data mining tools from our lab + Fujitsu: http://www.fqspl.com.pl/ghostminer/
• Separate the process of model building (hackers) and knowledge discovery from model use (lamers) => GM Developer & Analyzer.
What DM packages do?
Hundreds of components ... transforming, visualizing ...
Rapid Miner 5.0, type and # components
Process control 34
Data transformations 111
Data modeling 231
Clustering & segmentation 19
Performance evaluation 30
Text, series, web ... specific transformations. Visualization, presentation, plugin extensions ... ~ billions of models!
Visual “knowledge flow” to link components, or script languages (XML) to define complex experiments.
With all these tools, are we really so good?
Surprise!
Almost nothing can be learned using such tools!
May the force be with you
Hundreds of components ... billions of combinations ...
Our treasure box is full! We can publish forever! Specialized transformations are still missing in many packages.
What would we really like to have?
Press the button and wait for the truth!
Computer power is with us; meta-learning should replace us in finding all interesting data models = sequences of transformations/procedures.
Many considerations: optimal cost solutions, various costs of using feature subsets; simple & easy to understand vs optimal accuracy; various representations of knowledge: crisp, fuzzy or prototype rules, visualization, confidence in predictions ...
Meta-learning
Meta-learning means different things for different people.
Some will call "meta" the learning of many models, ranking them, boosting, bagging, or creating an ensemble in many ways; here meta means optimization of parameters to integrate models.
Landmarking: characterize many datasets and remember which method worked the best on each dataset. Compare new dataset to the reference ones; define various measures (not easy) and use similarity-based methods.
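Landmarking can be sketched in a few lines: describe each reference dataset by meta-features and recommend whatever worked best on the most similar one. A minimal sketch; the meta-features, reference datasets and the plain Euclidean distance below are made-up illustrations, not the measures used in actual landmarking systems:

```python
import math

# Hypothetical reference datasets described by meta-features
# (n_samples, n_features, class_entropy) and the method that won on each.
reference = {
    (150, 4, 1.58): "kNN",
    (3772, 21, 0.35): "decision tree",
    (699, 9, 0.93): "SVM",
}

def recommend(meta):
    """Landmarking sketch: pick the method that won on the most similar
    reference dataset (real systems normalize the meta-features first)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = min(reference, key=lambda ref: dist(ref, meta))
    return reference[nearest]

print(recommend((200, 5, 1.5)))  # nearest reference is (150, 4, 1.58) -> kNN
```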
Regression models: created for each algorithm on parameters that describe data to predict expected accuracy, ranking potentially useful algorithms.
Stacking: learn new models on errors of the previous ones.
Deep learning: DARPA 2009 call; current methods are „flat”, shallow; the goal is a universal machine learning engine that generates progressively more sophisticated representations of patterns, invariants, correlations from data. Rather limited success so far ...
Meta-learning: learning how to learn.
Similarity-based framework
(Dis)similarity:
• more general than feature-based description,
• no need for vector spaces (structured objects),
• more general than fuzzy approach (F-rules are reduced to P-rules),
• includes nearest neighbor algorithms, MLPs, RBFs, separable function networks, SVMs, kernel methods, specialized kernels, and many others!
Similarity-Based Methods (SBMs) are organized in a framework: p(Ci|X;M) posterior classification probability or y(X;M) approximators; models M are parameterized in increasingly sophisticated ways.
A systematic search (greedy, beam, evolutionary) in the space of all SBM models is used to select optimal combination of parameters and procedures, opening different types of optimization channels, trying to discover appropriate bias for a given problem.
Results: several candidate models are created; even a very limited version gives the best results in 7 out of 12 Statlog problems.
SBM framework components
• Pre-processing: objects O => features X, or (dis)similarities D(O,O').
• Calculation of similarity between features d(xi,yi) and objects D(X,Y).
• Reference (or prototype) vector R selection/creation/optimization.
• Weighted influence of reference vectors G(D(Ri,X)), i=1..k.
• Functions/procedures to estimate p(C|X;M) or y(X;M).
• Cost functions E[DT;M] and model selection/validation procedures.
• Optimization procedures for the whole model Ma.
• Search control procedures to create more complex models Ma+1.
• Creation of ensembles of (local, competent) models.
• M = {X(O), d(.,.), D(.,.), k, G(D), {R}, {pi(R)}, E[.], K(.), S(.,.)}, where:
  S(Ci,Cj) is a matrix evaluating similarity of the classes;
  a vector of observed probabilities pi(X) may be used instead of hard labels.
The kNN model p(Ci|X;kNN) = p(Ci|X;k,D(.),{DT}); the RBF model: p(Ci|X;RBF) = p(Ci|X;D(.),G(D),{R}); MLP, SVM and many other models may all be "re-discovered" as part of the SBM framework.
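The SBM view of kNN can be illustrated directly: the distance function is just another model parameter that the meta-search may swap out. A minimal sketch with toy data and hypothetical labels:

```python
from collections import Counter

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def canberra(x, y):
    # Canberra distance: sum of |a-b| / (|a|+|b|) over non-zero coordinates
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(x, y)
               if abs(a) + abs(b) > 0)

def knn_predict(train, x, k=1, dist=euclidean):
    """One SBM 'optimization channel': the distance function is a parameter."""
    nearest = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.0), "B")]
print(knn_predict(train, (1.1, 1.0), k=1, dist=canberra))  # -> A
```

Swapping `dist=euclidean` for `dist=canberra` is exactly the kind of move the meta-learner makes when it improves a kNN model by changing the metric.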
Meta-learning in SBM scheme
Start from kNN, k=1, all data & features, Euclidean distance, end with a model that is a novel combination of procedures and parameterizations.
k-NN: 67.5/76.6%
+ d(x,y) = Canberra: 89.9/90.7%
+ si=(0,0,1,0,1,1): 71.6/64.4%
+ selection: 67.5/76.6%
+ k opt: 67.5/76.6%
+ d(x,y) + si=(1,0,1,0.6,0.9,1), Canberra: 74.6/72.9%
+ d(x,y) + selection, Canberra: 89.9/90.7%
Real meta-learning!
Meta-learning: learning how to learn, replacing experts who search for the best models by running many experiments.
The search space of models is too large to explore exhaustively; design the system architecture to support knowledge-based search.
• Abstract view, uniform I/O, uniform results management.
• Directed acyclic graphs (DAG) of boxes representing scheme placeholders and particular models, interconnected through I/O.
• Configuration level for meta-schemes, expanded at runtime level.
An exercise in software engineering for data mining!
Intemi, Intelligent Miner
Meta-schemes: templates with placeholders.
• May be nested; the role decided by the input/output types.
• Machine learning generators based on meta-schemes.
• Granulation level allows creation of novel methods.
• Complexity control: Length + log(time)
• A unified meta-parameters description, defining the range of sensible values and the type of the parameter changes.
Advanced meta-learning
• Extracting meta-rules, describing interesting search directions.
• Finding the correlations occurring among different items in most accurate results, identifying different machine (algorithmic) structures with similar behavior in an area of the model space.
• Depositing the knowledge they gain in a reusable meta-knowledge repository (for meta-learning experience exchange between different meta-learners).
• A uniform representation of the meta-knowledge, extending expert knowledge, adjusting the prior knowledge according to performed tests.
• Finding new successful complex structures and converting them into meta-schemes (which we call meta abstraction) by replacing proper substructures by placeholders.
• Beyond transformations & feature spaces: actively search for info.
Intemi software (N. Jankowski and K. Grąbczewski) incorporating these ideas and more is coming “soon” ...
Meta-learning architecture
Inside the meta-parameter search, a repeater machine composed of distribution and test schemes is placed.
Generating machines
The search process is controlled by a variant of approximated Levin complexity: an estimate of program complexity combined with time. Simpler machines are evaluated first; machines that work too long (approximations may be wrong) are put into quarantine.
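The complexity-guided ordering can be sketched as follows. The candidate machines, their description lengths and runtimes are invented for illustration, and the quarantine is reduced to a simple time budget:

```python
import math

def levin_cost(program_length, runtime):
    """Approximate Levin complexity: description length plus log of time."""
    return program_length + math.log2(runtime)

# Hypothetical candidates: (name, config length in bits, observed runtime)
candidates = [("kNN", 8, 2.0), ("SVM+RBF", 20, 64.0), ("tree", 12, 4.0)]

# Evaluate simpler machines first; quarantine those exceeding a time budget.
TIME_BUDGET = 32.0
order = sorted(candidates, key=lambda c: levin_cost(c[1], c[2]))
active = [name for name, _, t in order if t <= TIME_BUDGET]
quarantined = [name for name, _, t in order if t > TIME_BUDGET]
print(active, quarantined)
```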
Pre-compute what you can, and use "machine unification" to get substantial savings!
Complexities on vowel data
[Table of ranked machine complexities omitted.]
Simple machines on vowel data
Left: final ranking, gray bar=accuracy, small bars: memory, time & total complexity, middle numbers = process id (models in previous table).
Complex machines on vowel data
Left: final ranking, gray bar=accuracy, small bars: memory, time & total complexity, middle numbers = process id (models in previous table).
Principles: information compression
Neural information processing in perception and cognition: information compression, or algorithmic complexity. In computing: minimum length (message, description) encoding.
Wolff (2006): all cognition and computation is information compression! Analysis and production of natural language, fuzzy pattern recognition, probabilistic reasoning and unsupervised inductive learning.
Talks about multiple alignment, unification and search, but
so far only models for sequential data and 1D alignment.
Information compression: encoding new information in terms of old. This has been used to define a measure of syntactic and semantic information (Duch, Jankowski 1994), based on the size of the minimal graph representing a given data structure or knowledge-base specification; thus it goes beyond alignment.
Knowledge transfer
Brains learn new concepts in terms of old; use large semantic network and add new concepts linking them to the known.
Knowledge should be transferred between the tasks. Not just learned from a single dataset.
Need to discover good building blocks for higher level concepts/features.
Maximization of margin/regularization
Among all discriminating hyperplanes there is one defined by support vectors that is clearly better.
LDA in larger space
Suppose that strongly non-linear borders are needed.
Use LDA, but add new dimensions, functions of your inputs!
Add squares Xi² and products XiXj as new features.
Example: 2D => 5D case Z = {z1...z5} = {X1, X2, X1², X2², X1X2}
The number of such tensor products grows exponentially – no good.
[Fig. 4.1, Hastie et al.]
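The 2D => 5D expansion can be written down directly; a minimal sketch showing how a circular border in X becomes a hyperplane in Z:

```python
def expand_2d(x1, x2):
    """Map a 2D input to the 5D space Z = (x1, x2, x1^2, x2^2, x1*x2),
    in which a linear discriminant can express quadratic borders in X."""
    return (x1, x2, x1 ** 2, x2 ** 2, x1 * x2)

# The circle x1^2 + x2^2 = 1 becomes the hyperplane z3 + z4 = 1 in Z space.
z = expand_2d(0.6, 0.8)
print(round(z[2] + z[3], 6))  # 1.0 -> the point lies on the border
```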
Kernels = similarity functions
Gaussian kernels in SVM: zi(X) = G(X; Xi, σ) radial features, X => Z. Gaussian mixtures are close to optimal Bayesian errors. The solution requires continuous deformation of decision borders and is therefore rather easy.
Support Feature Machines (SFM): construct features based on projections, restricted linear combinations, kernel features, use feature selection.
Gaussian kernel, C=1. In the kernel space Z decision borders are flat, but in the X space highly non-linear!
SVM is based on a quadratic solver, without explicit features, but using Z features explicitly has some advantages. Multiresolution: different σ for different support features, or several kernels zi(X) = K(X; Xi, σ) in one set of features. Linear solvers, Naive Bayes, or any other algorithms may be used.
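Using explicit kernel features with a linear solver can be sketched with a plain perceptron; the XOR-style data, the choice of reference vectors and the value of σ are illustrative assumptions:

```python
import math

def gaussian_feature(x, ref, sigma):
    """Explicit kernel feature z(x) = exp(-||x - ref||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, ref))
    return math.exp(-d2 / (2 * sigma ** 2))

def to_z(x, refs, sigma):
    return [gaussian_feature(x, r, sigma) for r in refs]

# XOR-like data: not linearly separable in X, but flat in the Z space.
data = [((0, 0), -1), ((1, 1), -1), ((0, 1), 1), ((1, 0), 1)]
refs = [x for x, _ in data]   # multiresolution: sigma could differ per reference
w, b, sigma = [0.0] * len(refs), 0.0, 0.5

for _ in range(50):           # a simple linear solver (perceptron) in Z space
    for x, y in data:
        z = to_z(x, refs, sigma)
        if y * (sum(wi * zi for wi, zi in zip(w, z)) + b) <= 0:
            w = [wi + y * zi for wi, zi in zip(w, z)]
            b += y

preds = [1 if sum(wi * zi for wi, zi in zip(w, to_z(x, refs, sigma))) + b > 0 else -1
         for x, _ in data]
print(preds)  # matches the XOR labels [-1, -1, 1, 1]
```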
Neural networks: thyroid screening
Garavan Institute, Sydney, Australia
15 binary, 6 continuous
Training: 93+191+3488 Validate: 73+177+3178
• Determine important clinical factors
• Calculate prob. of each diagnosis.
[Network diagram: clinical findings (age, sex, ..., TSH, T3, TT4, T4U, TBG) feed hidden units that produce the final diagnoses: normal, hyperthyroid, hypothyroid.]
Poor results of SBL or SVM ... see the summary at http://www.is.umk.pl/projects/datasets.html#Hypothyroid
SVNT algorithm
Initialize the network parameters W; set Δε=0.01, εmin=0, SV=T (training set).
Until no improvement is found in the last Nlast iterations do:
• Optimize network parameters for Nopt steps on the SV data.
• Run a feedforward step on SV to determine overall accuracy and errors; make new SV = {X | e(X) ∈ [εmin, 1−εmin]}.
• If the accuracy increases: compare the current network with the previous best one, choose the better one as the current best; increase εmin = εmin + Δε and make a forward step selecting SVs.
• If the number of support vectors |SV| increases: decrease εmin = εmin − Δε; decrease Δε = Δε/1.2 to avoid large changes.
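The support-vector selection step of SVNT can be sketched in isolation; the per-vector errors below are invented, and the network optimization steps are omitted:

```python
def select_sv(errors, eps_min):
    """SVNT selection step sketch: keep vectors whose error lies in
    [eps_min, 1 - eps_min]; confidently learned vectors drop out."""
    return [i for i, e in enumerate(errors) if eps_min <= e <= 1 - eps_min]

# Hypothetical per-vector errors after a feedforward pass.
errors = [0.001, 0.40, 0.95, 0.02, 0.65, 0.999]

eps_min, d_eps = 0.0, 0.01
sv = select_sv(errors, eps_min)        # initially every vector is selected
eps_min += d_eps                       # tighten the band as training proceeds
sv_next = select_sv(errors, eps_min)
if len(sv_next) > len(sv):             # cannot happen when tightening, but
    eps_min -= d_eps                   # SVNT backs off when |SV| grows...
    d_eps /= 1.2                       # ...and shrinks the step to avoid jumps
print(sv, sv_next)
```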
Hypothyroid data
2 years of real medical screening tests for thyroid diseases: 3772 cases with 93 primary hypothyroid and 191 compensated hypothyroid; the remaining 3488 cases are healthy; 3428 test cases with a similar class distribution.
21 attributes (15 binary, 6 continuous) are given, but only two of the binary attributes (on thyroxine, and thyroid surgery) contain useful information, therefore the number of attributes has been reduced to 8.
Method % train % test
SFM, SSV+2 B1 features ------- 99.6
SFM, SVMlin+2 B1 features ------- 99.5
MLP+SCG, 4 neurons 99.8 99.2
Cascade correlation 100 98.5
MLP+backprop 99.6 98.5
SVM Gaussian kernel 99.8 98.4
SVM lin 94.1 93.3
Heterogeneous systems
Problems requiring different scales (multiresolution).
2-class problems, two situations:
C1 inside the sphere, C2 outside. MLP: at least N+1 hyperplanes, O(N²) parameters. RBF: 1 Gaussian, O(N) parameters.
C1 in the corner defined by the (1,1...1) hyperplane, C2 outside. MLP: 1 hyperplane, O(N) parameters. RBF: many Gaussians, O(N²) parameters, poor approximation.
Combination: needs both hyperplane and hypersphere!
Logical rule: IF x1>0 & x2>0 THEN C1 ELSE C2
is not represented properly by either MLP or RBF!
Different types of functions in one model, first step beyond inspirations from single neurons => heterogeneous models.
Heterogeneous everything
Homogeneous systems: one type of "building blocks", same type of decision borders, e.g. neural networks, SVMs, decision trees, kNNs.
Committees combine many models together, but lead to complex models that are difficult to understand.
Ockham razor: simpler systems are better. Discovering simplest class structures, inductive bias of the data, requires Heterogeneous Adaptive Systems (HAS).
HAS examples:
• NN with different types of neuron transfer functions.
• k-NN with different distance functions for each prototype.
• Decision trees with different types of test criteria.
1. Start from large networks, use regularization to prune.
2. Construct a network adding nodes selected from a candidate pool.
3. Use very flexible functions, force them to specialize.
Taxonomy - TF
HAS decision trees
Decision trees select the best feature/threshold value for univariate and multivariate trees:

X_i < T_k   or   Σ_i W_i X_i < T_k

Decision borders: hyperplanes.
Introducing tests based on the L_α (Minkovsky) metric:

D(X, R; α) = ( Σ_i |X_i − R_i|^α )^{1/α} < T_R

Such DT use kernel features!
For L2 spherical decision borders are produced.
For L∞ rectangular borders are produced.
For large databases, first clusterize the data to get candidate references R.
SSV HAS DT example
SSV HAS tree in GhostMiner 3.0, Wisconsin breast cancer (UCI): 699 cases, 9 features (cell parameters, 1..10). Classes: benign 458 (65.5%) & malignant 241 (34.5%).
Single rule gives simplest known description of this data:
IF ||X-R303|| < 20.27 THEN malignant ELSE benign (the rule coming up most often in 10xCV).
Accuracy = 97.4%, good prototype for malignant case!
Gives simple thresholds; that's what MDs like most!
Best 10CV around 97.5±1.8% (Naïve Bayes + kernel, or SVM)
SSV without distances: 96.4±2.1%
C4.5 gives 94.7±2.0%
Several simple rules of similar accuracy but different specificity or sensitivity may be created using HAS DT. Need to select or weight features and select good prototypes.
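A prototype-threshold rule of this form is easy to apply; a minimal sketch in which the prototype values and the test vectors are hypothetical (only the 20.27 threshold and the rule's shape come from the slide):

```python
def prototype_rule(x, prototype, threshold):
    """Single prototype-distance rule of the kind found by SSV HAS trees:
    IF ||x - prototype|| < threshold THEN malignant ELSE benign."""
    d = sum((a - b) ** 2 for a, b in zip(x, prototype)) ** 0.5
    return "malignant" if d < threshold else "benign"

# Hypothetical prototype standing in for R303 (9 cell features, scale 1..10).
R303 = (10, 10, 10, 10, 10, 10, 10, 10, 10)

print(prototype_rule((9, 9, 10, 10, 9, 10, 9, 10, 10), R303, 20.27))  # close
print(prototype_rule((1, 1, 1, 1, 2, 1, 1, 1, 1), R303, 20.27))       # far
```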
How much can we learn?
Linearly separable or almost separable problems are relatively simple – deform or add dimensions to make data separable.
How to define “slightly non-separable”?
There is only separable and the vast realm of the rest.
Linear separability
QPC projection used to visualize Leukemia microarray data.
2-separable data, separated in vertical dimension.
Approximate separability
QPC visualization of Heart dataset: overlapping clusters, information in the data is insufficient for perfect classification, approximately 2-separable.
Easy problems
• Approximately linearly separable problems in the original feature space: linear discrimination is sufficient (always worth trying!).
• Simple topological deformation of decision borders is sufficient – linear separation is then possible in extended/transformed spaces.
This is frequently sufficient for pattern recognition problems (half of UCI problems?).
• RBF/MLP networks with one hidden layer also solve such problems easily, but convergence/generalization for anything more complex than XOR is problematic.
SVM adds new features to “flatten” the decision border:
( )
1 2( , ,... ); ,i
n ix x x z K X X X X
achieving larger margins/separability in the X+Z space.
Neurons learning complex logic
Boolean functions are difficult to learn: n bits, but 2^n nodes => combinatorial complexity; similarity is not useful, since for parity all neighbors are from the wrong class. MLP networks have difficulty learning functions that are highly non-separable.
Projection on W=(111 ... 111) gives clusters with 0, 1, 2 ... n bits;
easy categorization in (n+1)-separable sense.
Ex. of 2-4D parity problems.
Neural logic can solve it without counting; find a good point of view.
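The diagonal projection argument can be checked by brute force: for parity, every cluster of vectors with the same number of 1-bits is class-pure, giving the (n+1)-separable structure. A minimal sketch:

```python
from itertools import product

def project_diagonal(bits):
    """Project a binary vector on W = (1, 1, ..., 1): y = number of 1-bits."""
    return sum(bits)

n = 4
clusters = {}
for bits in product([0, 1], repeat=n):
    y = project_diagonal(bits)
    clusters.setdefault(y, set()).add(sum(bits) % 2)   # parity label

# Every projected cluster contains a single class -> n+1 pure intervals.
print(all(len(labels) == 1 for labels in clusters.values()))  # True
```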
Easy and difficult problems
Linear separation: good goal if simple topological deformation of decision borders is sufficient.
Linear separation of such data is possible in higher dimensional spaces; this is frequently the case in pattern recognition problems.
RBF/MLP networks with one hidden layer solve such problems.
Difficult problems: disjoint clusters, complex logic.
Continuous deformation is not sufficient; networks with localized functions need exponentially large number of nodes.
Boolean functions: for n bits there are K=2^n binary vectors that can be represented as vertices of an n-dimensional hypercube.
Each Boolean function is identified by K bits.
BoolF(Bi) = 0 or 1 for i=1..K leads to 2^K Boolean functions.
Ex: n=2 functions, vectors {00,01,10,11},
Boolean functions {0000, 0001 ... 1111}, e.g. 0001 = AND, 0110 = XOR;
each function is identified by a number from 0 to 15 = 2^K − 1.
Boolean functions
n=2: 16 functions, 14 separable, 2 not separable (XOR and XNOR).
n=3: 256 functions, 104 separable (41%), 152 not separable.
n=4: 64K = 65536, only 1882 separable (~3%).
n=5, 4G, but << 1% separable ... bad news!
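The n=2 count can be verified by brute force over a small integer weight grid (sufficient for two inputs); only XOR and XNOR resist a linear threshold:

```python
from itertools import product

def is_linearly_separable(truth, grid=range(-2, 3)):
    """Brute-force check: does some w1*x1 + w2*x2 > t reproduce the
    4-bit truth table? A small integer grid suffices for n = 2."""
    points = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for w1, w2, t in product(grid, grid, range(-3, 4)):
        if all((w1 * x1 + w2 * x2 > t) == out
               for (x1, x2), out in zip(points, truth)):
            return True
    return False

separable = sum(is_linearly_separable(truth)
                for truth in product([False, True], repeat=4))
print(separable)  # 14 of the 16 Boolean functions of 2 variables
```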
Existing methods may learn some non-separable functions, but most functions cannot be learned !
Example: n-bit parity problem; many papers in top journals.
No off-the-shelf systems are able to solve such problems.
For all parity problems SVM is below base rate!
Such problems are solved only by special neural architectures or special classifiers – if the type of function is known.
But parity is still trivial ... solved by a single periodic node:

y = cos( β Σ_{i=1..n} b_i )
Goal of learning
If simple topological deformation of decision borders is sufficient linear separation is possible in higher dimensional spaces, “flattening” non-linear decision borders, kernel approaches are sufficient. RBF/MLP networks with one hidden layer solve the problem.
This is frequently the case in pattern recognition problems.
For complex logic this is not sufficient; networks with localized functions need exponentially large number of nodes.
Such situations arise in AI reasoning problems, real perception, object recognition, text analysis, bioinformatics ...
Linear separation is too difficult, set an easier goal. Linear separation: projection on 2 half-lines in the kernel space:
line y=WX, with y<0 for class – and y>0 for class +.
Simplest extension: separation into k-intervals, or k-separability.
For parity: find direction W with minimum # of intervals, y=W.X
What NN components really do?
Vector mappings from the input space to hidden space(s) and to the output space + adapt parameters to improve cost functions.
Hidden-Output mapping done by MLPs:
T = {Xi} training data, N-dimensional.
H = {hj(T)} X image in the hidden space, j =1 .. NH-dim.
... more transformations in hidden layers
Y = {yk(H )} X image in the output space, k =1 .. NC-dim.
ANN goal:
data image H in the last hidden space should be linearly separable; internal representations will determine network generalization.
But we never look at these representations!
8-bit parity solution
QPC solution to 8-bit parity data: projection on the W=[1,1,...,1] diagonal.
k-separability is much easier to achieve than full linear separability.
Network solution
Can one learn a simplest model for arbitrary Boolean function?
2-separable (linearly separable) problems are easy; non separable problems may be broken into k-separable, k>2.
Blue: sigmoidal neurons with threshold; brown: linear neurons.
[Diagram: inputs X1...X4 feed a linear node y = W·X; sigmoidal nodes σ(βy + θ1), σ(βy + θ2), σ(βy + θ4) are combined with ±1 weights into the output.]
Neural architecture for k=4 intervals, or 4-separable problems.
QPC Projection Pursuit
What is needed to learn data with complex logic?
• cluster non-local areas in the X space, use W·X
• capture local clusters after transformation, use G(W·X − θ)
SVMs fail because the number of directions W that should be
considered grows exponentially with the size of the problem n.
What will solve it? Projected clusters!
1. A class of constructive neural network solutions with G(W·X − θ) functions combining non-local/local projections, with special training algorithms.
2. Maximize the leave-one-out error after projection: take some localized function G, count in a soft way cases from the same class as Xk.
Grouping and separation; projection may be done directly to 1 or 2D for visualization, or higher D for dimensionality reduction, if W has d columns.
Q(W) = Σ_X [ A+ Σ_{Xk ∈ CX} G(W·(X − Xk)) − A− Σ_{Xk ∉ CX} G(W·(X − Xk)) ]

where CX is the class of X and G is a localized (e.g. Gaussian) window function.
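A QPC-style projection index can be sketched directly from this description; the Gaussian window, the A+/A− weights and the toy data are assumptions:

```python
import math

def qpc(w, data, a_plus=1.0, a_minus=1.0, sigma=1.0):
    """Sketch of a QPC-style projection index: reward same-class cases
    that project close together, penalize close projections of
    different classes. G is a Gaussian window (an assumption here)."""
    def g(d):
        return math.exp(-d * d / (2 * sigma ** 2))
    def proj(x):
        return sum(wi * xi for wi, xi in zip(w, x))
    q = 0.0
    for x, cx in data:
        for xk, ck in data:
            if xk is x:
                continue
            d = proj(x) - proj(xk)
            q += a_plus * g(d) if ck == cx else -a_minus * g(d)
    return q

data = [((0, 0), 0), ((0.1, 0), 0), ((3, 3), 1), ((3.1, 3), 1)]
good = qpc((1, 1), data)    # this direction separates the two classes
bad = qpc((1, -1), data)    # this direction collapses both classes together
print(good > bad)  # True
```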
Parity n=9
Simple gradient learning; quality index shown below.
Learning hard functions
Training almost perfect for parity, with linear growth in the number of vectors for k-sep. solution created by the constructive neural algorithm.
Real data
On simple data, results are similar to SVM (because SVM results are almost optimal there), but the models are much simpler.
Rules
QPC visualization of Monks artificial symbolic dataset, => two logical rules are needed.
Complex distribution
QPC visualization of concentric rings in 2D with strong noise in remaining 2D; transform: nearest neighbor solutions, combinations of ellipsoidal densities.
Example: aRPM
Almost Random Projection Machine (with Hebbian learning):
• generate random combinations of inputs (line projections) z(X) = W·X;
• find and isolate pure clusters h(X) = G(z(X)); estimate the relevance of h(X), e.g. MI(h(X),C), leaving only good nodes;
• continue until each vector activates at least k nodes.
Count how many nodes vote for each class and plot.
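The aRPM loop can be sketched with random projections and a crude purity test standing in for the relevance estimate (the slide suggests e.g. mutual information); all thresholds below are illustrative:

```python
import random

def arpm_features(data, labels, n_candidates=30, purity=0.95, seed=0):
    """Almost Random Projection Machine sketch: draw random projections
    z(x) = w . x and keep a node only if the top interval of z isolates
    a nearly pure cluster of one class."""
    rng = random.Random(seed)
    dim = len(data[0])
    kept = []
    for _ in range(n_candidates):
        w = [rng.uniform(-1, 1) for _ in range(dim)]
        z = [sum(wi * xi for wi, xi in zip(w, x)) for x in data]
        cut = sorted(z)[2 * len(z) // 3]        # top third of projections
        cluster = [y for zi, y in zip(z, labels) if zi >= cut]
        if cluster and max(cluster.count(c) for c in set(cluster)) / len(cluster) >= purity:
            kept.append((w, cut))               # node isolates a pure cluster
    return kept

data = [(0, 0), (0.2, 0.1), (0.1, 0.3), (3, 3), (3.2, 2.9), (2.9, 3.1)]
labels = [0, 0, 0, 1, 1, 1]
nodes = arpm_features(data, labels)
print(len(nodes) > 0)   # some random projections isolate a pure cluster
```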
Learning from others …
Learn to transfer knowledge by extracting interesting features created by different systems. Ex. prototypes, combinations of features with thresholds …
=> Universal Learning Machines.
Example of feature types:
B1: Binary – unrestricted projections b1.
B2: Binary – complexes b1 ∧ b2 ... ∧ bk.
B3: Binary – restricted by distance.
R1: Line – original real features ri; non-linear thresholds for "contrast enhancement" σ(ri − bi); intervals (k-sep).
R4: Line – restricted by distance, original feature; thresholds; intervals (k-sep); more general 1D patterns.
P1: Prototypes: general q-separability, weighted distance functions or specialized kernels.
M1: Motifs, based on correlations between elements rather than input values.
(B3 example: bi = 0 ∧ r1 ∈ [r1′, r1″] ∧ r2 ∈ [r2′, r2″] ...)
B1/B2 Features
Dataset | B1 feature | B2 feature
Australian | F8 < 0.5 | F8 ≥ 0.5 ∧ F9 ≥ 0.5
Appendicitis | F7 ≥ 7520.5 | F7 < 7520.5 ∧ F4 < 12
Heart | F13 < 4.5 ∧ F12 < 0.5 | F13 ≥ 4.5 ∧ F3 ≥ 3.5
Diabetes | F2 < 123.5 | F2 ≥ 143.5
Wisconsin | F2 < 2.5 | F2 ≥ 4.5
Hypothyroid | F17 < 0.00605 | F17 ≥ 0.00605 ∧ F21 < 0.06472
Example of B1 features taken from segments of decision trees. These features, used in various learning systems, greatly simplify their models and increase their accuracy. Almost all systems reach similar accuracy!
Dataset Classifier
SVM (#SV) SSV (#Leafs) NB
Australian 84.9±5.6 (203) 84.9±3.9 (4) 80.3±3.8
ULM 86.8±5.3(166) 87.1±2.5(4) 85.5±3.4
Features B1(2) + P1(3) B1(2) + R1(1) + P1(3) B1(2)
Appendicitis 87.8±8.7 (31) 88.0±7.4 (4) 86.7±6.6
ULM 91.4±8.2(18) 91.7±6.7(3) 91.4±8.2
Features B1(2) B1(2) B1(2)
Heart 82.1±6.7 (101) 76.8±9.6 (6) 84.2±6.1
ULM 83.4±3.5(98) 79.2±6.3(6) 84.5±6.8
Features Data + R1(3) Data + R1(3) Data + B1(2)
Diabetes 77.0±4.9 (361) 73.6±3.4 (4) 75.3±4.7
ULM 78.5±3.6(338) 75.0±3.3(3) 76.5±2.9
Features Data + R1(3) + P1(4) B1(2) Data + B1(2)
Wisconsin 96.6±1.6 (46) 95.2±1.5 (8) 96.0±1.5
ULM 97.2±1.8(45) 97.4±1.6(2) 97.2±2.0
Features Data + R1(1) + P1(4) R1(1) R1(1)
Hypothyroid 94.1±0.6 (918) 99.7±0.5 (12) 41.3±8.3
ULM 99.5±0.4(80) 99.6±0.4(8) 98.1±0.7
Features Data + B1(2) Data + B1(2) Data + B1(2)
T-based metalearning
To create successful meta-learning through search in the model space, fine granulation of methods is needed: extracting information using support features, learning from others, knowledge transfer and deep learning.
Learn to compose, using complexity guided search, various transformations (neural or processing layers), for example:
• Creation of new support features: linear, radial, cylindrical, restricted localized projections, binarized … feature selection or weighting.
• Specialized transformations in a given field: text, bio, signal analysis, ….
• Matching pursuit networks for signal decomposition, QPC index, PCA or ICA components, LDA, FDA, max. of mutual information etc.
• Transfer learning, granular computing, learning from successes: discovering interesting higher-order patterns created by initial models of the data.
• Stacked models: learning from the failures of other methods.
• Schemes constraining search, learning from the history of previous runs at the meta-level.
Summary
• Challenging data cannot be handled with existing DM tools.
• Similarity-based framework enables meta-learning as search in the model space, heterogeneous systems add fine granularity.
• No off-the-shelf classifiers are able to learn difficult Boolean functions.
• Visualization of hidden neurons shows that frequently perfect but non-separable solutions are found despite base-rate outputs.
• Linear separability is not the best goal of learning, other targets that allow for easy handling of final non-linearities should be defined.
• k-separability defines complexity classes for non-separable data.
• Transformation-based learning shows the need for component-based approach to DM, discovery of simplest models and support features.
• Meta-learning replaces data miners automatically creating new optimal learning methods on demand.
Is this the final word in data mining? Only the future will tell.
Meta-learning research survey
W. Duch, K. Grąbczewski, N. Jankowski
Contents
1 What is meta-learning?
2 Committees of decision models
3 Meta-level regression
4 Rankings of algorithms
5 Meta-learning as a search process
What is meta-learning?
• Generally, meta-learning encompasses all efforts to learn how to learn, including gathering meta-knowledge and using meta-knowledge in further learning.
• Meta-knowledge is knowledge about learning processes, about the influence of machine parameters on final results, etc.
During the last two decades, the term meta-learning has been used in many different contexts:
• building committees of decision models,
• building regression models predicting machine accuracy,
• building algorithm rankings for given datasets,
• searching through spaces of learning machines' parameters, augmented by meta-knowledge, and gathering new meta-knowledge.
Committees of decision models
[Diagram: a decision module combines Member 1 ... Member k.]
• Simple committees do not learn at the meta-level: e.g. simple majority voting.
• Some "intelligent" decision modules perform meta-analysis.
  - Bagging, arcing, boosting perform some meta-analysis to build more stable decision makers (Dietterich 1997) and are very popular, but this is not exactly what we would call "meta-learning".
  - Stacking: the decision module is a meta-level learner.
  - Many advanced, heterogeneous, undemocratic committees have been published.
Stacking
• Learning machines are trained on the results of a group of models.
• Stolfo et al. (1997), Prodromidis et al. (2000): JAM (Java Agents for Meta-learning), a parallel, distributed system for scalable computing.
• Todorovski and Dzeroski (2003): Meta Decision Trees; properly adapted C4.5 decision trees determine which model to use.
• NOEMON: Kalousis and Theoharis (1999), Kalousis and Hilario (2000); here stacking is also called meta-learning.
Undemocratic committees
Meta-analysis may lead to estimation of the areas and degrees of competence of each base learner, to provide more reasonable decisions of the decision module.
• Chan and Stolfo (1993, 1998): meta-learning by arbitration and combining.
  - Arbiters: a binary tree of arbiters (members organized in pairs, an arbiter for each pair, arbiters in pairs, and so on).
  - Combiners: a sort of stacking.
  - Combiners compute a prediction that may be entirely different from any proposed by the base models, whereas arbiters choose one of the predictions of the base models.
• Duch and Itert (2003) define incompetence functions that describe member (in)competence in particular points of the data space.
• Jankowski and Grąbczewski (2005) reflect global and local competence in final ensemble decisions.
Learnt Topology GANNs
Kadlec and Gabrys (2008): LTGANN – Learnt Topology Gating Artificial Neural Networks
I Each committee member consists of two networks: a local expert and a gating network.
I The local expert learns the task given to the committee.
I The gating network learns the performance of the local expert, linking expert performance to the positions of samples in the input space.
I The final decision is made by weighting.
I Flexibility of ANN topology selection: there is no need to define the exact number of hidden units either for the local experts or for the gating networks.
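The weighted decision can be sketched as below; the experts and gating functions are hand-made stand-ins for the LTGANN networks, purely for illustration:

```python
# Each committee member pairs a local expert with a gating function
# that estimates the expert's competence at the given input; the
# final decision is a competence-weighted combination.

def committee_predict(members, x):
    # members: list of (expert, gate) pairs; gate(x) >= 0.
    weights = [gate(x) for _, gate in members]
    total = sum(weights) or 1.0
    return sum(w * expert(x)
               for (expert, _), w in zip(members, weights)) / total

# Toy experts on 1-D input, each "competent" on one half of the space.
members = [
    (lambda x: 0.0, lambda x: 1.0 if x < 0.5 else 0.0),   # expert for x < 0.5
    (lambda x: 1.0, lambda x: 1.0 if x >= 0.5 else 0.0),  # expert for x >= 0.5
]
print(committee_predict(members, 0.9))  # second expert dominates: 1.0
```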
Meta-level regression
I Regression methods predict accuracies of different learning machines on the basis of dataset descriptions.
I Koepf et al. (2000), Bensusan and Kalousis (2001):
• Input: dataset description as a series of values derived from information theory and statistics.
• Output: accuracy of the model (usually a classifier).
I Ranking learning machines:
• One regression model for each algorithm to rank.
• Machines are ranked in decreasing order of predicted accuracy.
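The scheme can be sketched as follows; the per-algorithm regressors are hypothetical linear fits on a single meta-feature, and the history values are invented for illustration:

```python
# One regression model per algorithm: given a dataset description
# (meta-features), each model predicts that algorithm's accuracy;
# algorithms are then ranked by predicted accuracy, descending.

def fit_linear(xs, ys):
    # Least-squares line through (meta-feature, accuracy) pairs.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return lambda x: (my - b * mx) + b * x

# Hypothetical history: meta-feature = log(#instances), accuracies seen.
history = {
    "kNN": ([2.0, 3.0, 4.0], [0.70, 0.80, 0.90]),
    "NBC": ([2.0, 3.0, 4.0], [0.85, 0.82, 0.79]),
}
regressors = {name: fit_linear(xs, ys) for name, (xs, ys) in history.items()}

def rank_algorithms(meta_feature):
    preds = {name: r(meta_feature) for name, r in regressors.items()}
    return sorted(preds, key=preds.get, reverse=True)

print(rank_algorithms(5.0))  # larger dataset favors kNN in this toy history
```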
Rankings of algorithms
I The most popular approach, initiated by the (probably largest so far) meta-learning project MetaL (1998–2002).
I Rankings are learned from simple descriptions of data.
Data → meta-features describing the data → ranking of algorithms
I Meta-attributes are basic data characteristics: number of instances, number of features, types of features (continuous or discrete, how many of each), data statistics, etc.
I Rankings are generated by meta-learners:
• for each pair of algorithms to be ranked, a classification algorithm is trained on two-class datasets describing wins and losses of the algorithms on some collection of datasets,
• decisions of the meta-classifiers are combined to build the final ranking.
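The pairwise scheme can be sketched like this; the meta-classifier is a trivial one-nearest-neighbor on a single meta-feature, and all win/loss records are invented for illustration:

```python
# For each pair of algorithms, a meta-classifier is trained on
# datasets labeled with which of the two "won"; its decisions are
# combined (here: win counts) into a final ranking.

def nn1(train, x):
    # 1-nearest-neighbor on a single meta-feature.
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Hypothetical win/loss records: (meta-feature, winner) per dataset.
pairwise = {
    ("kNN", "NBC"): [(10, "NBC"), (100, "kNN"), (1000, "kNN")],
    ("kNN", "SVM"): [(10, "SVM"), (100, "SVM"), (1000, "kNN")],
    ("NBC", "SVM"): [(10, "NBC"), (100, "SVM"), (1000, "SVM")],
}

def rank(meta_feature):
    wins = {"kNN": 0, "NBC": 0, "SVM": 0}
    for record in pairwise.values():
        wins[nn1(record, meta_feature)] += 1
    return sorted(wins, key=wins.get, reverse=True)

print(rank(900))  # each pairwise meta-classifier votes, wins are counted
```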
Algorithm selection problem
Algorithm selection problem (ASP)
I May be regarded as equivalent to building algorithm
rankings.
I ASP was addressed already by Rice (1974):
• D ∈ 𝒟 – problem space,
• A ∈ 𝒜 – algorithm space,
• m ∈ R^n – performance measure space,
• ||m|| ∈ R – performance norm,
with a selection mapping S(D) = A and performance mapping p(A, D) = m, evaluated through the norm || · ||.
I Most often, ASP gets reduced to the problem of assigning an optimal algorithm to a vector of features describing the data, which is quite restrictive.
No Free Lunch theorems
No free lunch theorems, in this context, may be expressed as:
Each single learning algorithm, tested on all possible datasets, will be, on average, as accurate as random choice.
So does building learning machines make any sense?
I Yes, because “all possible datasets” is exactly what makes NFL provable but useless!
I In the context of training and testing, “all possible” includes datasets where the training and test data come from completely different distributions and are completely unrelated.
I We expect training data to be representative of the population, and NFL does not care about representativeness.
I The inductive bias of algorithms is not an explanation.
Conclusion: let's not pay much attention to NFL!
Landmarking
I Pfahringer et al. (2000) – the idea of landmarking: using meta-features measuring the performance of some simple and efficient learning algorithms (landmarkers):
• linear discriminant learner,
• naive Bayes learner,
• C5.0 tree learner.
Meta-learners used:
• C5.0 trees and rules,
• boosted C5.0,
• RIPPER,
• LTREE,
• linear discriminant,
• naive Bayes,
• nearest neighbor.
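A sketch of the landmarking idea, with two cheap stand-in landmarkers (a majority-class baseline and leave-one-out 1-NN) in place of the learners listed above; the dataset is a toy:

```python
# Landmarking: run cheap learners on the data and use their
# accuracies as meta-features describing the dataset.

def majority_landmarker(X, y):
    # Accuracy of always predicting the most frequent class.
    best = max(set(y), key=y.count)
    return sum(t == best for t in y) / len(y)

def nn_landmarker(X, y):
    # Leave-one-out accuracy of 1-nearest-neighbor (1-D inputs).
    hits = 0
    for i, x in enumerate(X):
        j = min((k for k in range(len(X)) if k != i),
                key=lambda k: abs(X[k] - x))
        hits += y[j] == y[i]
    return hits / len(y)

def landmark_features(X, y):
    return {"majority": majority_landmarker(X, y),
            "1nn": nn_landmarker(X, y)}

X = [0.1, 0.2, 0.3, 0.8, 0.9]
y = [0, 0, 0, 1, 1]
print(landmark_features(X, y))  # accuracies become meta-features
```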
Landmarking continued
Fürnkranz and Petrak (2001):
I Relative landmarking – meta-attributes describe relations between results instead of accuracies:
• ranks of landmarkers,
• order of landmarkers (inverse of ranks),
• pairwise comparisons between accuracies of landmarkers (+1, -1, ?),
• pairwise accuracy ratios (continuous).
I Subsampling – original datasets are reduced to facilitate landmarking by algorithms of larger computational complexity.
Landmarking continued
Soares et al. (2001)
I Relative landmarking and subsampling combined.
I Adjusted ratio of ratios (ARR) index – a combination of accuracy and time to assess relative performance:

ARR^d_{i,j} = (A^d_i / A^d_j) / (1 + log(T^d_i / T^d_j) · X)

A^d_i and T^d_i are the accuracy and time of the i-th landmarker on data d; X is a parameter: “the amount of accuracy we are willing to trade for a 10-times speed-up”.
I When n > 2 algorithms are involved, they calculate the relative landmark:

rld_i = ( Σ_{j≠i} ARR^d_{i,j} ) / (n − 1).
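A direct computation of the index; the base-10 logarithm is an assumption here, consistent with the “10-times speed-up” reading of X:

```python
from math import log10

def arr(acc_i, acc_j, t_i, t_j, x):
    # ARR^d_{i,j}: accuracy ratio discounted by the log-time ratio.
    return (acc_i / acc_j) / (1 + log10(t_i / t_j) * x)

def relative_landmark(accs, times, i, x):
    # rld_i: average ARR of landmarker i against all others.
    n = len(accs)
    return sum(arr(accs[i], accs[j], times[i], times[j], x)
               for j in range(n) if j != i) / (n - 1)

# Landmarker i is as accurate as j but 10x slower:
# with x = 0.1 its ARR drops from 1.0 to 1/1.1.
print(round(arr(0.9, 0.9, 10.0, 1.0, 0.1), 4))
```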
Still landmarking. . .
Brazdil and Soares (2000), Brazdil et al. (2003)
I more advanced statistical measures of datasets (including histogram analysis and information-theory-based indices) as meta-attributes,
I the k nearest neighbors (kNN) algorithm chooses similar datasets; the ranking is created from results obtained by the ranked algorithms on the nearest neighbors,
I methods of combining results to create rankings:
• ARR – adjusted ratio of ratios,
• counting statistically significant differences in results: average ranks (AR) and significant wins (SW),
• ranking methods estimated by comparison to the ideal ranking:
  – Spearman's rank correlation coefficient,
  – Friedman's significance test,
  – Dunn's multiple comparison technique.
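Ranking quality against the ideal ranking can be checked with Spearman's coefficient; a small self-contained version (no ties assumed), with invented algorithm names:

```python
def spearman(rank_a, rank_b):
    # Spearman's rho for two rankings given as ordered lists of
    # item names (best first); assumes no ties.
    n = len(rank_a)
    pos_b = {item: i for i, item in enumerate(rank_b)}
    d2 = sum((i - pos_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1 - 6 * d2 / (n * (n * n - 1))

ideal = ["SVM", "kNN", "NBC", "LVQ"]
predicted = ["SVM", "NBC", "kNN", "LVQ"]
print(spearman(ideal, predicted))  # one adjacent swap lowers rho
```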
NOEMON
Kalousis and Theoharis (1999), Kalousis and Hilario (2000)
I NOEMON is a system advising which classifier to use, and performing the classification.
I The meta-learning space contains data characteristics (statistics, concentration histograms) and performance measures of algorithms.
I Different performance measures are gathered: accuracy, training time, execution time, resource demand.
I Meta-learners trained on results of pairs of algorithms:
• kNN,
• decision trees.
I Model selection is based on comparisons of results of pairs of algorithms.
Other landmarking related approaches
I DecT by Peng et al. (2002)
• Data characteristics derived from the structure of C5.0 decision trees built on the data.
• Like in other approaches:
  – kNN to select similar datasets,
  – rankings by ARR,
  – Spearman's correlation coefficient to estimate rankings.
I Bensusan et al. (2000)
• Landmarking and decision tree techniques combined.
• Typed higher-order inductive learning directly from decision trees instead of tree characteristics.
I Todorovski et al. (2002)
• Meta-data obtained from statistics, information theory and landmarking.
• Predictive Clustering Trees – multi-split decision trees:
  – minimization of intra-cluster variance and maximization of inter-cluster variance; clusters contain data with similar relative performance of algorithms,
  – ranks instead of accuracies – ranking trees.
Rankings of algorithms – general remarks
I A very naive approach: a simple data transformation may completely change the ranking.
I Resembles the common approach of splitting the data analysis process into a data preprocessing stage and final learning.
I We are not interested in raw rankings, but in complex machine combinations that model the data as accurately as possible.
I Even very accurate rankings give no hints about data transformations that could improve the results.
I No human expert would use such a technique to select the most promising learning algorithms.
Meta-learning as an advanced search process
I The fundamental aim of meta-learning is to be more successful in object-level (base-level) learning.
I What do human experts do to obtain an optimal model for given data?
• search for solutions by testing subsequent candidates,
• test candidates not at random, but after selection and in an order based on some meta-knowledge,
• gain new meta-knowledge (general and specific to the task being solved) while learning.
I Automated meta-learning should mimic the behavior of human experts. Therefore, in our approach, we:
• generate candidate machine configurations according to meta-knowledge (initially from human experts),
• order candidates with a special complexity measure,
• test candidates to create a ranking,
• gather new meta-knowledge and refine the human experts' meta-knowledge to successfully drive the search process.
Universal meta-learning architecture and algorithms
(DI/NCU) Meta-learning—WCCI 2010 1 / 48
Learning and meta-learning
A learning problem can be defined as P = 〈D, M〉, where D ⊆ 𝒟 is a learning dataset and M is a model space. The learning process of a learning machine L is a function A(L):

A(L) : K_L × D → M    (1)

Diagram: a machine with inputs 1…n, outputs 1…m, machine process parameters and a results repository.

Meta-learning is another, or rather a specific kind of, machine learning. ML = learn how to learn, to learn as well as possible.
Decomposition of learning problem
To decompose the problem P = 〈D, M〉 into subproblems:

P = [P_1, . . . , P_n],   P_i = 〈D_i, M_i〉    (2)

As a result, the model for the main problem P is

m = [m_1, . . . , m_n],    (3)

and the model space gets the form

M = M_1 × . . . × M_n.    (4)

The solution constructed by decomposition is often much easier to find.
The model m_i solving the subproblem P_i is the result of the learning process

A(L_i) : K_{L_i} × D_i → M_i,   i = 1, . . . , n,    (5)

where

D_i = ∏_{k ∈ K_i} M_k,    (6)

with K_i ⊆ {0, 1, . . . , i − 1} and M_0 = D. The main learning process L is decomposed into the vector

[L_1, . . . , L_n].    (7)
Data loader → Vector selector → Classifier

Such decomposition is often very natural: standardization or feature selection naturally precedes classification, a number of classifiers precede a committee module, etc.
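The chain above can be sketched in code; a toy illustration of the decomposition with hypothetical sub-machines, not the system's actual API:

```python
# Each sub-machine L_i consumes outputs of earlier machines
# (its K_i set) and produces a model m_i; the composite model is
# the list [m_1, ..., m_n].

def data_loader():
    # m_1: the dataset itself, as (feature, target) pairs.
    return [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1), (9.0, 1)]

def vector_selector(data):
    # m_2: keep vectors with |x| < 5 (toy outlier removal).
    return [(x, t) for x, t in data if abs(x) < 5]

def train_classifier(data):
    # m_3: threshold at the midpoint between the class means.
    mean = lambda c: sum(x for x, t in data if t == c) / \
                     sum(1 for _, t in data if t == c)
    thr = (mean(0) + mean(1)) / 2
    return lambda x: int(x > thr)

m1 = data_loader()          # data loader
m2 = vector_selector(m1)    # vector selector, fed by m1
m3 = train_classifier(m2)   # classifier, fed by m2
print(m3(0.5))
```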
Goal of meta-learning
Finding an optimal model for a given data mining problem P is almost always NP-hard. Meta-learning could find solutions which are at least not worse than the ones that can be found by human experts in a given, limited amount of time. Experts usually restrict their tests to a subset of algorithms.

The goal of meta-learning is to maximize the probability of finding the best possible solution within the search space of a given problem P, in as short a time as possible.

Complexity is used to order machines.
Configuring the Meta-learning Problem
An MLA need NOT be a mystery!
Appropriate configuration of the MLA should enable finding a solution of any P. E.g.: if we want to use it for classification or approximation, don't fix it inside the MLA!
The configuration of an MLA is specified by:
• stating the functional search space,
• defining the goal of meta-learning (of problem P),
• defining the stop condition,
• defining the attractiveness module,
• defining initial meta-knowledge.
The above configuration elements guarantee that the algorithm is universal.
Search space of MLA
Machine Configuration Generators & Generators Flow
A fixed set of LMs inside the MLA is not flexible and strongly limits the MLA! The goal of machine configuration generators (MCGs) is to provide/produce machine configurations. The goal of the generators flow is to provide machine configurations to the MLA. A generators flow is a graph (DAG) of machine configuration generators. Each MCG may be based on different meta-knowledge and may reveal different behavior, which may even change in time (during the meta-learning progress).

Simplest generators flow: Classifiers set Generator → output
Set-based Generators
The simplest and very useful machine generator is based on the idea of providing just an arbitrary (initial) sequence of machine configurations. Usually it is convenient to have a few set-based generators in a single generators flow:
• set-based generator of simple classifiers,
• set-based generator of classifiers,
• set-based generator of feature rankings,
• set-based generator of prototype selectors,
• set-based generator of committees, . . .
Schemes and Machine Configuration Templates
A machine configuration with an empty scheme (or empty schemes) as a child machine configuration is called a machine configuration template.

Diagram: a feature selection template – data passes through a Transformation built from a Feature ranking (an empty Ranking scheme) followed by Feature selection.

The role of an empty scheme is defined by the types of its inputs and outputs.
Template-based Generators
Diagram: a “transform and classify” template – an empty Transformation scheme feeds an empty Classifier scheme; filling the classifier scheme with e.g. kNN yields a concrete machine configuration.

Simple generators flow: a Classifiers set Generator and a Transformers set Generator feed a Transform-and-classify machine Generator, which feeds the output.
Generator flow example

Diagram: generators flow with a Transformer Generator, Ranking Generator, Classifiers Generator, Feature Selection of Rankings Generator, Transform and Classify Generators (I and II), MPS/FS of Transform & Classify Generator and MPS for Classifiers Generator, feeding the flow output.
A part of the generators flow (1) . . .

Diagram: part of the generators flow with a Ranking Discr Generator, Ranking Generator, Classifiers Generator, Discr & Rank Generator, Feature Selection of Rankings Generator, Transform and Classify Generator II and MPS/FS of Transform & Classify Generator, feeding the flow output.
Schemes of a committee & generator flow with a committee

Diagram: committee scheme – a Committee machine combines several child Classifier schemes into a single classifier.

Part of the generators flow (2): a Committee modules Generator, Classifier Sequences Generator and Committee Schemes Generator feed the flow output.
Advanced Generators
Advanced generators can learn from meta-knowledge, half-independently of the heart of the meta-learning algorithm, and build their own specialized meta-knowledge. Advanced generators are informed each time a test task is finished. The generators may read the results of the test task and the current machine configuration ranking. Different generators may realize different strategies; there is no single perfect strategy.

Part of the generators flow with an intelligent committee: a Committee modules Generator and an Intelligent Committee Generator feed the flow output.

The intelligent committee generator observes the progress of meta-learning. Model diversity may easily be applied.
Generation of ε-machines

Part of the generators flow with ε-machines: a Generator of some base classifiers and an ε-generator feed the flow output.

In some cases the original data are huge. Then it is important to start with simplified machines based on shrunk data: filter by the number of vectors and the number of features, and provide independent configurations.
Configuring of meta-learning
Defining the Goal of Meta-learning for a particular problem P:
test procedure + evaluation of results.
Strict and flexible definition of the goal!

Diagram: a Repeater machine – a Distributor scheme (CV distributor) splits the data into training and test data; a Test scheme trains a Classifier on the training data and evaluates it with a Classification test on the test data.

Test procedure finished ⇒ results evaluation ≡ query.
Query definition = evaluation of results: which nodes, which values, how to transform.
Defining Initial Meta-knowledge
Initial meta-knowledge comes in different kinds. It touches many substructures, which are another aspect of the configuration: for example, MPS and its meta-knowledge, the knowledge of advanced generators, and more. But there is no common, single meta-knowledge!
Meta-learning algorithm
General meta-learning scheme

START → initialize → check stop condition:
• no → start some test tasks → wait for any task → evaluate results → check stop condition again,
• yes → finalize → STOP.
Test Tasks Starting
From the generators flow to the heap. The heap is ordered by complexity.

procedure startTasksIfPossible;
  while (¬ machinesHeap.Empty() ∧ ¬ mlTaskSpooler.Full())
  {
    <mc, cmplx> = machinesHeap.ExtractMinimum();
    timeLimit = τ · cmplx.time / cmplx.q;
    mlTaskSpooler.Add(mc, limiter(timeLimit), priority−−);
  }
end

Machine heap vs. task spooler. The complexity is approximated for a given configuration. To bypass the halting problem and the problem of (possibly) inaccurate approximation, each test task gets a time limit.
Analysis of Finished Tasks
Tasks finish either normally or halted by the (time-)limiter.

procedure analyzeFinishedTasks;
  foreach (t in mlTaskSpooler.finishedTasks) {
    mc = t.configuration;
    if (t.status = finished_normally) {
      qt = computeQualityTest(t, queryDefinition);
      machinesRanking.Add(qt, mc);
      if (attractivenessModule is defined)
        attractivenessModule.Analyze(t, qt, machinesRanking);
      machineGeneratorsFlow.Analyze(t, qt, machinesRanking);
    } else { // task broken by the limiter
      mc.cmplx.q = mc.cmplx.q / 4;
      machinesHeap.Quarantine(mc);
    }
    mlTaskSpooler.RemoveTask(t);
  }
end
Machine Complexity Evaluation
Which complexity?
C_L(P) = min_p { c*(p) : program p prints P in time t_p }    (8)

Kolmogorov:
c_k(p) = l(p)    (9)

Levin:
c_l(p) = l(p) + log t_p    (10)

The problem with the above definitions: the time influence is too small. E.g.: a program running 1024 times longer than another one may have only slightly bigger complexity, just +10.

A more realistic version:

c_a(p) = l(p) + t_p / log t_p    (11)
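The objection to Levin's log-time term can be checked numerically; base-2 logarithms and the program length are assumptions here:

```python
from math import log2

def c_levin(length, t):
    # c_l(p) = l(p) + log t_p
    return length + log2(t)

def c_realistic(length, t):
    # c_a(p) = l(p) + t_p / log t_p
    return length + t / log2(t)

l, t = 100, 2 ** 10
# A 1024x longer run adds only +10 to Levin's measure ...
print(c_levin(l, 1024 * t) - c_levin(l, t))
# ... while the t / log t term grows by orders of magnitude.
print(c_realistic(l, 1024 * t) / c_realistic(l, t))
```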
Quarantine and penalty term
Complexity approximation ⇒ errors. Quarantine + penalty term:

c_b(p) = [l(p) + t_p / log t_p] / q(p)    (12)

where q(p) reflects an estimate of the reliability of p. There is no halting problem inside tested machines: this mechanism prevents running test tasks for unpredictably long times. If a test task breaks its time limit, its reliability decreases and the task goes back to the machine heap.
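The effect of the penalty term can be sketched as follows; code length and time are in assumed units, and the reliability update mirrors the q := q/4 step applied when a task breaks its time limit:

```python
from math import log2

def c_b(length, t, q):
    # c_b(p) = [l(p) + t_p / log t_p] / q(p); a lower reliability q
    # means a higher effective complexity, so a later start from
    # the heap of waiting test tasks.
    return (length + t / log2(t)) / q

q = 1.0
base = c_b(100, 1024, q)
q /= 4            # test task broke its time limit: reliability drops
penalized = c_b(100, 1024, q)
print(penalized / base)  # the quarantined task moves down the queue
```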
Attractiveness
Meta-knowledge may influence the order of the test tasks waiting in the machine heap and the machine configurations provided during the process. The optimal way of doing this is to add a new term to c_b(p), shifting the start time of a given test in the appropriate direction:

c_c(p) = [l(p) + t_p / log t_p] / [q(p) · a(p)],    (13)

where a(p) reflects the attractiveness of the test task p.
Complexities of What Machines are We Interested in?
We are interested in the complexity of the test task (from the test template!) with the nested learning machine. This complexity reflects the complete behavior of the machine:
• a part of the test scheme reflects the complexity of learning of the given machine,
• the rest reflects the complexity of computing the test (for example a classification test).
The cost of learning reflects the cost of creating a given machine; the cost of testing reflects the cost of using it:

learning costs + testing costs = costs of machine use.
The needs of complexity computation
To provide a learning machine (a simple one or a complex machine), its configuration and inputs must be specified:

A(L_i) : K_{L_i} × D_i → M_i,   i = 1, . . . , n.    (14)

Complexity computation must reflect the information from the configuration and the inputs, as well as the recursive nature of configurations together with the input–output connections.
Meta-input descriptions, meta-outputs

Diagram: a test scheme over a “transform and classify” machine – PCA fills the transformation scheme, kNN fills the classifier scheme; an SVM classifier is evaluated by a classification test on the test data.

It would be impossible to compute the complexity of kNN before PCA is finished. To make complexity computation possible we use proper meta-input descriptions. Meta-inputs are counterparts of inputs in the “meta-world”. Additionally, complexity computation must also provide meta-outputs.
Computing the complexity
A function computing the complexity for machine L should be a transformation

D_L : K_L × M+ → R^2 × M+,    (15)

where K_L is the configuration space and M+ is the space of meta-inputs (and meta-outputs); R^2 reflects the time complexity and the memory complexity. The problem is not as easy as the above form of the function suggests: finding the right function for a given learning machine L may be impossible. Configuration elements are not always as simple as scalar values; in some cases they are represented by functions or by subconfigurations. A similar problem concerns the meta-inputs.
Meta Evaluators
For such a high level of generality we use meta-evaluators. The general goal of a meta-evaluator is to exhibit a functional description of complexity aspects, useful for further reuse by other meta-evaluators. In the case of a machine, the meta-outputs are exhibited to provide a complexity-information source for their input readers.

Learning machine → Meta evaluator

Meta-evaluators are constructed not only for machines, but also for machine outputs and other nontrivial objects. Each evaluator needs adaptation, which can be seen as an initialization and can be compared to the learning of a machine.
Output evaluators
Thanks to an output evaluator

D_o : I_1 → M_1,    (16)

where I_1 is the space of a (single) output and M_1 is the space of a single meta-output, we do not need a machine evaluator of the form

D'_L : K_L × I+ → R^2 × M+,    (17)

where I+ is the space of machine L inputs.

Nontrivial object evaluators
Sometimes machine complexity depends on nontrivial elements of the configuration, for example the metric in configurations of kNN or SVM. We use a metric evaluator to make kNN's evaluator independent of the metric:

D_obj : OBJ → M_obj.    (18)
Machine Evaluator
Additional functionality of machine evaluators:
• Declarations of output descriptions: if a given machine provides outputs, then the output evaluators devoted to this machine type must provide meta-descriptions of the outputs. The descriptions of outputs are meta-evaluators of the appropriate kind (for example meta-classifier, meta-transformer, meta-data, etc.).
• Time & memory: these two quantities must be provided by each machine evaluator to enable proper computation of time and memory complexity.
• Child evaluators: advanced analysis of the complexities of complex machines requires access to the meta-evaluators of submachines. Child evaluators provide this functionality.
Classifier Evaluator
The evaluator of a classifier output has to provide the time complexity of classifying an instance:

real ClassifyCmplx(DataEvaluator dtm);

This is the time consumed by the instance classification routine.

Approximator Evaluator
The evaluator of an approximation machine has exactly the same functionality as that of a classifier, except that approximation time is considered in place of classification time:

real ApproximationCmplx(DataEvaluator dtm);
Data Evaluators
In the context of data tables, data evaluators should provide information like:
• the number of instances,
• the number of features,
• descriptions of features (ordered/unordered, number of values, etc.),
• descriptions of targets,
• statistical information per feature,
• statistical information per dataset,
• other information useful for computing the complexities of different machines learning from the data.
Learning Evaluators
Approximation framework
If evaluators cannot be defined in an analytical way, the approximation framework is used to construct them.

Diagram: a Plain Evaluator is ready to use after adaptation; a Learnable Evaluator requires data collection and evaluator learning before adaptation.

Before an evaluator is used by a meta-learning process, all its approximators must be trained. Each evaluator is learned once.
Environment for machine behavior observation
To collect the learning data, proper information has to be extracted from observations of “machine behavior”. The “environment” for machine monitoring must be defined; the environment configuration is sequentially adjusted, realized and observed. The environment is defined by the initial configuration of the observing scheme. Changes of the environment are realized by specialized scenarios, which define how to modify the environment. Each scene is the source of a single learning pair for a single approximator. Each evaluator may consist of a few independent approximators.

initial environment + scenes scenario
Data collection functionality
Full control of data acquisition is possible thanks to the collection of methods implemented by each evaluator:
• EvaluatorBase prepares the evaluator for analysis of the environment being observed,
• GetMethodsForApprox declares additional approximation tasks,
• ApproximatorDataIn and ApproximatorDataOut prepare the sequence of input–output vectors.
Diagram: the data collection loop – START; init {startEnv, envScenario}; try to generate the next configuration ‘oc’ using startEnv & envScenario; if it succeeds, run machines according to ‘oc’, extract in–out pairs from the project for each approximator, and loop; otherwise train each approximator, return the series of approximators for the evaluator, and STOP.
Elements of the MLA and their connections

Diagram: the Meta-learning Search Loop connects the Test Task Heap, Complexity Approximator, Evaluator Repository, Test Task Generator, Quarantine, Attractiveness Module and Generators Flow, and produces the Meta-learning Results.
Meta-learning in Action – Generators flow
To make the results comparable, all tests were prepared using the generators flow below:

Diagram: generators flow with a Ranking Generator, Feature Selection of Rankings Generator, Classifiers Generator, MPS for Classifiers Generator, Transform and Classify Generator and MPS/FS of Transform & Classify Generator, feeding the output.
Classifier set:
• kNN (Euclidean) – k nearest neighbors with the Euclidean metric,
• kNN [MetricMachine (EuclideanOUO)] – kNN with Euclidean and Hamming metrics,
• kNN [MetricMachine (Mahalanobis)],
• NBC – naive Bayes classifier,
• SVMClassifier – support vector machine with a Gaussian kernel,
• LinearSVMClassifier – SVM with a linear kernel,
• [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]],
• [LVQ, kNN (Euclidean)] – learning vector quantization algorithm,
• Boosting (10x) [NBC] – boosting algorithm with 10 NBCs.

Ranking set:
• RankingCC – correlation-coefficient-based feature ranking,
• RankingFScore – Fisher-score-based feature ranking.
1 kNN (Euclidean)
2 kNN [MetricMachine (EuclideanOUO)]
3 kNN [MetricMachine (Mahalanobis)]
4 NBC
5 SVMClassifier [KernelProvider]
6 LinearSVMClassifier [LinearKernelProvider]
7 [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]]
8 [LVQ, kNN (Euclidean)]
9 Boosting (10x) [NBC]
10 [[[RankingCC], FeatureSelection], [kNN (Euclidean)], TransformAndClassify]
11 [[[RankingCC], FeatureSelection], [kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
12 [[[RankingCC], FeatureSelection], [kNN [MetricMachine (Mahalanobis)]], TransformAndClassify]
13 [[[RankingCC], FeatureSelection], [NBC], TransformAndClassify]
14 [[[RankingCC], FeatureSelection], [SVMClassifier [KernelProvider]], TransformAndClassify]
15 [[[RankingCC], FeatureSelection], [LinearSVMClassifier [LinearKernelProvider]], TransformAndClassify]
16 [[[RankingCC], FeatureSelection], [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
17 [[[RankingCC], FeatureSelection], [LVQ, kNN (Euclidean)], TransformAndClassify]
18 [[[RankingCC], FeatureSelection], [Boosting (10x) [NBC]], TransformAndClassify]
19 [[[RankingFScore], FeatureSelection], [kNN (Euclidean)], TransformAndClassify]
20 [[[RankingFScore], FeatureSelection], [kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
21 [[[RankingFScore], FeatureSelection], [kNN [MetricMachine (Mahalanobis)]], TransformAndClassify]
22 [[[RankingFScore], FeatureSelection], [NBC], TransformAndClassify]
23 [[[RankingFScore], FeatureSelection], [SVMClassifier [KernelProvider]], TransformAndClassify]
24 [[[RankingFScore], FeatureSelection], [LinearSVMClassifier [LinearKernelProvider]], TransformAndClassify]
25 [[[RankingFScore], FeatureSelection], [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
26 [[[RankingFScore], FeatureSelection], [LVQ, kNN (Euclidean)], TransformAndClassify]
27 [[[RankingFScore], FeatureSelection], [Boosting (10x) [NBC]], TransformAndClassify]
28 ParamSearch [kNN (Euclidean)]
29 ParamSearch [kNN [MetricMachine (EuclideanOUO)]]
30 ParamSearch [kNN [MetricMachine (Mahalanobis)]]
31 ParamSearch [NBC]
32 ParamSearch [SVMClassifier [KernelProvider]]
33 ParamSearch [LinearSVMClassifier [LinearKernelProvider]]
34 ParamSearch [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]]
35 ParamSearch [LVQ, kNN (Euclidean)]
36 ParamSearch [Boosting (10x) [NBC]]
37 ParamSearch [[[RankingCC], FeatureSelection], [kNN (Euclidean)], TransformAndClassify]
38 ParamSearch [[[RankingCC], FeatureSelection], [kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
39 ParamSearch [[[RankingCC], FeatureSelection], [kNN [MetricMachine (Mahalanobis)]], TransformAndClassify]
40 ParamSearch [[[RankingCC], FeatureSelection], [NBC], TransformAndClassify]
41 ParamSearch [[[RankingCC], FeatureSelection], [SVMClassifier [KernelProvider]], TransformAndClassify]
42 ParamSearch [[[RankingCC], FeatureSelection], [LinearSVMClassifier [LinearKernelProvider]], TransformAndClassify]
43 ParamSearch [[[RankingCC], FeatureSelection], [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
44 ParamSearch [[[RankingCC], FeatureSelection], [LVQ, kNN (Euclidean)], TransformAndClassify]
45 ParamSearch [[[RankingCC], FeatureSelection], [Boosting (10x) [NBC]], TransformAndClassify]
46 ParamSearch [[[RankingFScore], FeatureSelection], [kNN (Euclidean)], TransformAndClassify]
47 ParamSearch [[[RankingFScore], FeatureSelection], [kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
48 ParamSearch [[[RankingFScore], FeatureSelection], [kNN [MetricMachine (Mahalanobis)]], TransformAndClassify]
49 ParamSearch [[[RankingFScore], FeatureSelection], [NBC], TransformAndClassify]
50 ParamSearch [[[RankingFScore], FeatureSelection], [SVMClassifier [KernelProvider]], TransformAndClassify]
51 ParamSearch [[[RankingFScore], FeatureSelection], [LinearSVMClassifier [LinearKernelProvider]], TransformAndClassify]
52 ParamSearch [[[RankingFScore], FeatureSelection], [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
53 ParamSearch [[[RankingFScore], FeatureSelection], [LVQ, kNN (Euclidean)], TransformAndClassify]
54 ParamSearch [[[RankingFScore], FeatureSelection], [Boosting (10x) [NBC]], TransformAndClassify]
[Figure: two panels showing meta-learning test runs in real time, one per dataset; each panel plots the accuracy (0.0–1.0) of task ids 1–54 along the search time line (0.0–1.0), together with per-task complexity estimates.]
Mushroom & German numeric
[Figure: two panels showing meta-learning test runs in real time, one per dataset; each panel plots the accuracy (0.0–1.0) of task ids 1–54 along the search time line (0.0–1.0), together with per-task complexity estimates.]
Glass & Thyroid
Summary
• Appropriate abstract-level configuration of meta-learning = a universal MLA.
• Flexible generator flows for defining a smart search space.
• Complexity-controlled order of searching.
• The system is additionally supported by meta-knowledge (from experts and collected during learning).
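The complexity-controlled search order can be sketched as a priority queue: candidate machines are ordered by an estimated complexity, so cheap configurations are evaluated first and expensive ones only as the time budget allows. The complexity numbers below are made-up illustrative values, not those computed by the system.

```python
# Sketch of complexity-controlled search order: pop candidates
# cheapest-first from a heap keyed by estimated complexity.
# Complexity estimates are illustrative, not the system's own.
import heapq

candidates = [
    (0.05, "NBC"),
    (0.2, "kNN (Euclidean)"),
    (1.5, "SVMClassifier [KernelProvider]"),
    (6.0, "ParamSearch [SVMClassifier [KernelProvider]]"),
]

heap = list(candidates)
heapq.heapify(heap)

order = []
while heap:
    complexity, machine = heapq.heappop(heap)
    order.append(machine)  # evaluation order: cheapest first

print(order)
```

In the real system the queue is refilled by the machine generators, so complex configurations are only ever reached if simpler ones have not already consumed the budget.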
Google: W. Duch => Papers & presentations
Norbert: http://www.is.umk.pl/~norbert/metalearning.html
KIS: http://www.is.umk.pl => On-line publications
Book: Meta-learning in Computational Intelligence (coming . . . ).
Have a nice week in Barcelona!