lr1. summary day 1
TRANSCRIPT
State of the Art in ML
• History
• Machine Learning problems and Tasks
➔ Supervised Learning: Classification, Regression, Multi-label classification
➔ Unsupervised Learning: Clusters, Anomaly Detectors
➔ Semi-supervised Learning: Inference from partially labeled data
• Features: numeric, categorical, date-time, text
text analysis: frequency-weighted bag of words
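A minimal sketch of a frequency-weighted bag of words, assuming whitespace tokenization, lower-casing, and relative term frequency as the weight. The function name `bag_of_words` is illustrative and does not come from the talk.

```python
from collections import Counter

def bag_of_words(text):
    """Map each token to its relative frequency in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}

weights = bag_of_words("the cat sat on the mat")
```

Word order is discarded; only how often each token occurs is kept as a numeric feature.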
Poul Petersen (BigML)
Explicit rules
Difficult to find and re-train
Implicit rules (data rules)
Easy to re-train
• Technology
• Teaching computers to learn:
too general vs. too specific (under-fitting vs. over-fitting)
Missing values handling: new category, averages, multiple choices
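Two of the missing-value strategies listed above (a new category, and averages) could be sketched as follows. The sketch assumes that missing values are represented as `None`; the function names and the `"MISSING"` label are hypothetical.

```python
def fill_categorical(values, missing_label="MISSING"):
    """Treat a missing categorical value as a category of its own."""
    return [missing_label if v is None else v for v in values]

def fill_numeric(values):
    """Replace missing numeric values with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

colors = fill_categorical(["red", None, "blue"])
ages = fill_numeric([20.0, None, 40.0])
```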
Storage: low prices, big data
APIs: combination and accessibility
Cloud: computational power
Predictive APIs
• Supervised learning:
Classification (output in a set of classes)
Regression (output is a number)
• Unsupervised learning: no output info
• Training / Test separation: partitioning data, bootstrap or
cross-validation
• Classification: Confusion Matrix
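As an illustration of the confusion matrix and the metrics usually derived from it (accuracy, precision, recall, F-measure), here is a minimal sketch for a binary problem. The toy labels and function names are made up, and the F-measure shown is F1, which is an assumption about the variant meant.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

actual    = ["spam", "spam", "ham", "ham",  "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "ham", "spam"]
counts = confusion_counts(actual, predicted, positive="spam")
accuracy, precision, recall, f1 = classification_metrics(*counts)
```

For multi-class problems the same counts can be computed once per class, treating each class in turn as the positive one, and then averaged.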
Evaluating ML Algorithms
Cèsar Ferri (UPV)
• Classification metrics: Accuracy, Precision, Recall, F-measure
Extending to multi-class problems (averaging)
• Regression metrics: Mean Absolute error
Mean Squared error (more sensitive to extreme errors)
Root Mean Squared Error
Normalized for comparing models:
Relative Mean Squared Error
Relative Mean Absolute Error
R²
• Unsupervised evaluation: no estimations, association rules,
support
• Clustering: distance and shape based evaluation (border, centers, distribution)
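The regression metrics listed above can be sketched in a few lines. This sketch assumes that the relative errors are normalized by a baseline that always predicts the mean of the actual values, which is one common definition and may differ from the one used in the talk. It also assumes the actual values are not all identical, otherwise the baseline error is zero.

```python
import math

def regression_metrics(actual, predicted):
    """Absolute, squared and baseline-relative errors for a regressor."""
    n = len(actual)
    mean = sum(actual) / n
    abs_err = sum(abs(a - p) for a, p in zip(actual, predicted))
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Baseline errors: always predict the mean of the actual values.
    base_abs = sum(abs(a - mean) for a in actual)
    base_sq = sum((a - mean) ** 2 for a in actual)
    return {
        "mae": abs_err / n,
        "mse": sq_err / n,
        "rmse": math.sqrt(sq_err / n),
        "relative_abs": abs_err / base_abs,
        "relative_sq": sq_err / base_sq,
        "r2": 1 - sq_err / base_sq,
    }

metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```

In this toy case a single error of 2 gives a mean absolute error of 0.5 but a mean squared error of 1.0, which shows why squared errors react more strongly to extreme errors.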
• History
• Classification and Regression Trees
A structure in which data is repeatedly split into groups according to attribute values, so as to minimize error or maximize information gain (split criterion: Gini impurity)
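To make the split criterion concrete, the following sketch computes the Gini impurity of a set of labels and searches for the threshold on one numeric attribute that minimizes the weighted impurity of the two resulting groups. The function names and toy data are illustrative; real implementations handle many attributes, categorical values and ties.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) for the best split value <= threshold."""
    best = (None, gini(labels))  # no split: impurity of the whole node
    n = len(values)
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

threshold, impurity = best_threshold(
    [1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
```

A tree is grown by applying this search recursively to each of the two groups.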
Decision Trees
Gonzalo Martinez (UAM)
Expert-Based Systems
Human experts' rules
MYCIN: 600 rules
XCON: 2500 rules
Automatized Knowledge Acquisition
Mining archives of cases (scalable)
Rules: CHAID, CART, ID3, C4.5
PROs
● Convertible to rules
● Categorical and numeric attributes
● Handle uninformative or redundant attributes
● Handle missing values
● Non-parametric (no predefined idea of concept to learn)
● Easy to tune (small number of parameters)
CONs
● Complex feature interactions
● Replication problem
Predicates
Rules are based on the split predicates
Missing values
Oblique splits (compare features)
Stopping criteria
All instances in one class
No split found
Small number of instances
Gain below threshold
Maximum depth
Pruning
To avoid over-fitting
CART is slower (more trees needed, avoids complexity)
C4.5 faster but no confidence threshold (avoids small nodes)
Parameters
Number of nodes, depth, pruning (on/off and confidence), minimum number of instances to split
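The stopping criteria listed above could be checked at each node roughly as follows. This is only a sketch: the function name, the argument names and the default thresholds are assumptions, not values from the talk.

```python
def should_stop(labels, depth, best_gain,
                min_instances=2, min_gain=1e-3, max_depth=10):
    """Return the reason to stop splitting a node, or None to keep growing.

    best_gain is the gain of the best candidate split, or None if no
    candidate split exists. The default thresholds are illustrative.
    """
    if len(set(labels)) <= 1:
        return "all instances in one class"
    if best_gain is None:
        return "no split found"
    if len(labels) < min_instances:
        return "small number of instances"
    if best_gain < min_gain:
        return "gain below threshold"
    if depth >= max_depth:
        return "maximum depth"
    return None
```

The three keyword arguments correspond to the tunable parameters mentioned above: minimum number of instances to split, gain threshold, and depth.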
Ensembles of Decision Trees
Gonzalo Martinez (UAM)
• Ensembles of models
Randomizing to decrease errors and over-fitting: data, features or algorithms
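[Figure: a new instance x is passed to every model in the ensemble; the individual predictions (1, 1, 2, 1, 2, 1) are combined into the final output (1)]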
Combined with voting or non-voting strategies (aggregators)
Best overall performance (SVM)
Almost parameter-less
On trees, very fast to train and test
Slower than a single classifier (mitigated with pruning)
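A minimal sketch of two aggregators: majority voting for class predictions, and plain averaging as one possible non-voting strategy for numeric outputs. Averaging is an assumption here, since the talk does not say which non-voting strategies were meant.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class predicted by most models (first seen wins ties)."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Non-voting aggregation for numeric outputs: the plain mean."""
    return sum(predictions) / len(predictions)

label = majority_vote([1, 1, 2, 1, 2, 1])
value = average([2.0, 4.0, 6.0])
```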
BAGGING
• Robust
• Improves error
• Parallelizable
[Figure: original dataset and bootstrap samples 1 … T; each sample contains repeated examples and leaves out (removes) others]
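The bootstrap sampling behind bagging can be sketched as below: each sample has the same size as the original dataset and is drawn with replacement, so some examples are repeated and others are left out. The helper names and the toy dataset are illustrative.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) examples from data, with replacement."""
    return [rng.choice(data) for _ in data]

def repeated_and_removed(sample, data):
    """Examples drawn more than once, and examples never drawn."""
    repeated = {x for x in sample if sample.count(x) > 1}
    removed = set(data) - set(sample)
    return repeated, removed

rng = random.Random(0)   # fixed seed only so the sketch is repeatable
data = list(range(10))   # toy dataset of 10 example ids
samples = [bootstrap_sample(data, rng) for _ in range(3)]  # T = 3
```

One model is trained per sample. Because the samples are independent of each other, the models can be trained in parallel.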
BOOSTING
[Figure: original dataset, iteration 1, iteration 2, …]
Good average generalization error
Not robust (noise)
Can increase the error of the base classifier
Not parallelizable
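The talk does not name a specific boosting algorithm. As an illustration, the sketch below shows one weight update in the style of discrete AdaBoost, where misclassified instances gain weight for the next iteration. The function name is hypothetical, and returning the weights unchanged for a perfect or too-weak classifier is a simplification.

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style update of the instance weights.

    weights: current instance weights, summing to 1
    correct: booleans, True where the base classifier was right
    Returns the new weights and the weight (alpha) of the classifier.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if error == 0 or error >= 0.5:
        # Simplification: leave the weights untouched in these edge cases.
        return list(weights), 0.0
    alpha = 0.5 * math.log((1 - error) / error)
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha

weights = [0.25, 0.25, 0.25, 0.25]
new_weights, alpha = reweight(weights, [True, True, True, False])
```

Each iteration needs the weights produced by the previous one, which is why the procedure cannot be parallelized. Mislabeled (noisy) examples keep being misclassified and keep gaining weight, which is one way to see the lack of robustness.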
RANDOM FORESTS
• Robust
• Improves error
• Parallelizable
• Better than boosting
• Very fast to train
[Figure: original dataset and bootstrap samples 1 … T with repeated and removed examples; each sample is paired with a random feature subset]
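A sketch of the two sources of randomness in random forests: a bootstrap sample of the rows plus a random subset of the features. For simplicity this sketch draws the feature subset once per tree, as the figure suggests; standard random forests draw a fresh subset at every split. Names and toy data are illustrative.

```python
import random

def forest_inputs(rows, n_trees, n_features, rng):
    """Yield (bootstrap rows, feature indices) for each tree."""
    all_features = list(range(len(rows[0])))
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]             # with replacement
        features = sorted(rng.sample(all_features, n_features))  # without
        yield sample, features

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
rows = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12)]
inputs = list(forest_inputs(rows, n_trees=5, n_features=2, rng=rng))
```

Each tree only evaluates splits on its feature subset, which reduces the work per tree and decorrelates the trees.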
CLASS SWITCHING
[Figure: original dataset and T copies with random noise (random noise 1 … random noise T), p = 30%]
Can improve results for cases where normal decision trees are not especially good
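Class switching can be sketched as follows: each training copy keeps the original attributes, but every label is replaced by a different class with probability p (30% in the figure). The function name and toy labels are illustrative, and choosing the replacement class uniformly is an assumption.

```python
import random

def switch_classes(labels, classes, p, rng):
    """Replace each label with a different class with probability p."""
    noisy = []
    for label in labels:
        if rng.random() < p:
            others = [c for c in classes if c != label]
            noisy.append(rng.choice(others))
        else:
            noisy.append(label)
    return noisy

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
labels = ["a", "b"] * 50
noisy_sets = [switch_classes(labels, ["a", "b"], p=0.3, rng=rng)
              for _ in range(3)]  # T = 3
```

One tree is trained on each noisy copy and the trees are then combined by voting.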
• Human knowledge used to compensate for data problems: broken data (remove corner cases, defaults), missing
values (have meaning), reduce complexity (grouping classes), distances
• Discretization: significant bins against concrete values
• Delta: difference or distance between features can be significant
• Standardization: Mean of zero and standard deviation of one
• Normalizing: Feature vectors with unit norm
• Windowing: Previous points distributed in time
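Three of the transformations above (standardization, normalization and windowing) can be sketched with the standard library only. The sketch assumes the population standard deviation and the Euclidean norm, and it does not guard against constant features or zero vectors; the function names are illustrative.

```python
import math

def standardize(values):
    """Shift and scale a feature to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def normalize(vector):
    """Scale one feature vector to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

def windows(series, size):
    """Pair each point with the `size` points that precede it in time."""
    return [(series[i - size:i], series[i])
            for i in range(size, len(series))]

z = standardize([2.0, 4.0, 6.0])
u = normalize([3.0, 4.0])
w = windows([1, 2, 3, 4, 5], size=2)
```

Standardization works on one feature across all instances, while normalization works on one instance across all its features.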
Data Transformations and FE
Charles Parker (BigML)
• Projections: Combining to have a new feature basis (lowering
dimensionality)
New axis: Principal component analysis
Keep neighbours: Spectral embeddings, Combination methods (Large Margin Nearest Neighbor, Xing’s Method)
• Sparsity: compressing sparse text and image data by sampling and
grouping
• Sub-sampling and Over-sampling: Restore balance by
removing instances of over-represented categories or giving higher weight to under-represented categories
• Evaluating Unbalanced Datasets: good accuracy is not enough. Look at precision and recall
Precision vs. Recall trade-off: you must define the cost for each
(letting out positives, i.e. false negatives, against letting in negatives, i.e. false positives)
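The two balancing strategies above could be sketched as follows: random sub-sampling down to the size of the smallest class, and class weights inversely proportional to class frequency. The weighting formula is one common convention and an assumption here; the labels `"fraud"` and `"ok"` and the function names are made up.

```python
import random
from collections import Counter

def undersample(rows, labels, rng):
    """Randomly drop instances until every class has the size of the smallest."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        balanced += [(row, label) for row in rng.sample(group, target)]
    return balanced

def class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n = len(labels)
    return {label: n / (len(counts) * count)
            for label, count in counts.items()}

labels = ["fraud"] * 5 + ["ok"] * 95
rows = list(range(100))  # toy instance ids
balanced = undersample(rows, labels, random.Random(0))
weights = class_weights(labels)
```

On this toy dataset a model that always answers "ok" would reach 95% accuracy while finding no fraud at all, which is why precision and recall are needed.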
Unbalanced Datasets
Poul Petersen (BigML)
[Figure: bar chart of the number of instances in the Fraud and Not Fraud classes; axis from 0 to 3750]