deﬁnition of univariate b-splines - uni-hamburg.de · deﬁnition of univariate b-splines the...

Definition of univariate B-splines

The B-splines are employed to specify the linguistic terms, and the knots are chosen to be different from each other (periodical model). Visually, the selection of k (the order of the B-splines) determines the following factors of the fuzzy sets for modeling the linguistic terms.

Assume x is a general input variable of a control system that is defined on the universe of discourse $[x_1, x_m]$. Given a sequence of ordered parameters (knots) $x_1, x_2, \ldots$, the i-th B-spline $N_{i,k}$ of order k (degree k − 1) is recursively defined as follows:

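$$
N_{i,k}(x) =
\begin{cases}
\begin{cases} 1 & \text{for } x \in [x_i, x_{i+1}) \\ 0 & \text{otherwise} \end{cases} & \text{if } k = 1 \\[2ex]
\dfrac{x - x_i}{x_{i+k-1} - x_i}\, N_{i,k-1}(x) + \dfrac{x_{i+k} - x}{x_{i+k} - x_{i+1}}\, N_{i+1,k-1}(x) & \text{if } k > 1
\end{cases}
\tag{1}
$$

with $i = 1, \ldots, m - k$.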
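The recursion in (1) can be sketched directly in Python. This is a minimal illustration, not the lecture's implementation: the function name `bspline`, the 0-based index (the slide's $N_{1,k}$ is `i = 0` here) and the knot values are assumptions made for the example.

```python
def bspline(i, k, x, knots):
    """Value at x of the i-th B-spline of order k (degree k - 1), eq. (1).

    `knots` must be strictly increasing; i is 0-based in this sketch.
    """
    if k == 1:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = (x - knots[i]) / (knots[i + k - 1] - knots[i])
    right = (knots[i + k] - x) / (knots[i + k] - knots[i + 1])
    return (left * bspline(i, k - 1, x, knots)
            + right * bspline(i + 1, k - 1, x, knots))


knots = [0.0, 1.0, 2.0, 3.5, 5.0, 6.0]   # m = 6 distinct knots (assumed values)
k = 3                                     # order 3, degree 2
num_splines = len(knots) - k              # l = m - k = 3
values = [bspline(i, k, 2.5, knots) for i in range(num_splines)]
print(values, sum(values))
```

At x = 2.5 all three splines of this small example overlap, so the printed values sum to 1 up to rounding.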

Angewandte Sensorik, J. Zhang Lernmethoden, W4/2003, 21. Januar 2003 263

Therefore, m knots $x_i$ ($i = 1, \ldots, m$) form $l = m - k$ B-splines (Figure 1).

Figure 1: Nine B-splines of order 3 defined over 12 non-uniformly distributed knots.


Examples of B-splines of order 1, 2, 3 and 4 with their knots are shown in Figure 2.

Figure 2: Nonuniform univariate B-splines of order 1 to 4 defined on a parameter x.

In each interval [xj, xj+1], k non-zero B-splines overlap.


The example of order 3 (degree 2, i.e. quadratic B-splines) is shown in Figure 3.

Figure 3: B-splines of order 3 overlapping on an interval $[x_j, x_{j+1}]$, defined on a parameter x.


Properties of B-Splines

Recursive definition is one basic feature of B-splines, which enables the generation of B-splines of arbitrary order with incremental smoothness for a given set of knots. The other most important properties of B-splines with respect to modeling and control are:

Partition of unity: $\sum_{i=0}^{l} N_{i,k}(x) = 1$.

Positivity: $N_{i,k}(x) \ge 0$ for all x.

Local support: $N_{i,k}(x) = 0$ for $x \notin [x_i, x_{i+k}]$.

$C^{k-2}$ continuity: If the knots $\{x_i\}$ are pairwise different from each other, then $N_{i,k}(x) \in C^{k-2}$, i.e., $N_{i,k}(x)$ is (k − 2) times continuously differentiable.
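The first three properties can be checked numerically. The sketch below assumes the recursive definition (1), 0-based indices, and an arbitrary set of distinct knots; partition of unity is tested only on the part of the knot range where k splines overlap.

```python
def bspline(i, k, x, knots):
    """B-spline of order k by the recursion (1); i is 0-based here."""
    if k == 1:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = (x - knots[i]) / (knots[i + k - 1] - knots[i])
    right = (knots[i + k] - x) / (knots[i + k] - knots[i + 1])
    return (left * bspline(i, k - 1, x, knots)
            + right * bspline(i + 1, k - 1, x, knots))


knots = [0.0, 0.7, 1.5, 2.0, 3.2, 4.0, 5.1, 6.0]   # assumed, pairwise distinct
k = 3
num = len(knots) - k                                # l = m - k
lo, hi = knots[k - 1], knots[num]                   # range where k splines overlap
xs = [lo + (hi - lo) * j / 200 for j in range(200)] # sample points in [lo, hi)

unity_ok = all(
    abs(sum(bspline(i, k, x, knots) for i in range(num)) - 1.0) < 1e-12
    for x in xs
)
positive_ok = all(bspline(i, k, x, knots) >= 0.0
                  for i in range(num) for x in xs)
support_ok = all(bspline(i, k, x, knots) == 0.0
                 for i in range(num) for x in xs
                 if not (knots[i] <= x <= knots[i + k]))
print(unity_ok, positive_ok, support_ok)
```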


Lattice

Figure 4: The B-spline model: a two-dimensional illustration.


Each n-dimensional rectangle (n > 1) of the lattice is covered by the j-th multivariate B-spline $N^j_k(x)$, which is formed by taking the tensor product of n univariate B-splines:

$$
N^j_k(x) = \prod_{j=1}^{n} N^j_{i_j,k_j}(x_j) \tag{2}
$$

Therefore the shape of each B-spline, and thus the shape of the multivariate ones (Figure 5), is implicitly set by their order and their given knot distribution on each input interval.
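The tensor product in (2) is simply a product of univariate values, one per input. A minimal sketch follows, with the helper name `tensor_bspline`, the 0-based indices and the knot vectors all chosen for illustration only.

```python
def bspline(i, k, x, knots):
    """Univariate B-spline of order k by the recursion (1); i is 0-based."""
    if k == 1:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = (x - knots[i]) / (knots[i + k - 1] - knots[i])
    right = (knots[i + k] - x) / (knots[i + k] - knots[i + 1])
    return (left * bspline(i, k - 1, x, knots)
            + right * bspline(i + 1, k - 1, x, knots))


def tensor_bspline(indices, orders, point, knot_vectors):
    """Multivariate B-spline: product of one univariate B-spline per input."""
    value = 1.0
    for i, k, x, knots in zip(indices, orders, point, knot_vectors):
        value *= bspline(i, k, x, knots)
    return value


knots_x1 = [0.0, 1.0, 2.0, 3.0, 4.0]   # assumed knots for input 1
knots_x2 = [0.0, 1.0, 2.0, 3.0]        # assumed knots for input 2
# One order-3 and one order-2 spline, as in panel (b) of Figure 5.
inside = tensor_bspline((0, 0), (3, 2), (1.5, 1.0), (knots_x1, knots_x2))
outside = tensor_bspline((0, 0), (3, 2), (3.5, 1.0), (knots_x1, knots_x2))
print(inside, outside)
```

The second evaluation lies outside the support of the first factor, so the product vanishes there.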


(a) Tensor product of two univariate B-splines of order 2.

(b) Tensor product of one order-3 and one order-2 univariate B-spline.

(c) Tensor product of two univariate B-splines of order 3.

Figure 5: Bivariate B-splines formed by taking the tensor product of two univariate B-splines.


Fuzzy Controller of a MISO System - I

Conditions of B-spline Fuzzy Controllers:

• periodical B-spline basis functions as membership functions for inputs,

• fuzzy singletons as membership functions for outputs,

• “product” as fuzzy conjunctions,

• “centroid” as defuzzification method,

• addition of “virtual linguistic terms” at both ends of each input variable, and

• extension of the rule base for the “virtual linguistic terms” by copying the output values of the “nearest” neighbourhood.


B-Spline Fuzzy Controller of a MISO System - II

For a MISO system with n inputs $x_1, x_2, \ldots, x_n$, rules with n conjunctive terms in the premise are given in the following form:

{Rule($i_1, i_2, \ldots, i_n$): IF ($x_1$ is $N^1_{i_1,k_1}$) and ($x_2$ is $N^2_{i_2,k_2}$) and . . . and ($x_n$ is $N^n_{i_n,k_n}$) THEN y is $Y_{i_1 i_2 \ldots i_n}$},

where

• $x_j$: the j-th input (j = 1, . . . , n),

• $k_j$: the order of the B-spline basis functions used for $x_j$,

• $N^j_{i_j,k_j}$: the $i_j$-th linguistic term of $x_j$ defined by B-spline basis functions,

• $i_j = 0, \ldots, m_j$, representing how finely the j-th input is fuzzy partitioned,

• $Y_{i_1 i_2 \ldots i_n}$: the control vertex (de Boor point) of Rule($i_1, i_2, \ldots, i_n$).


Fuzzy Controller of a MISO System - III

The output y of a MISO fuzzy controller is:

$$
y = \frac{\sum_{i_1=0}^{m_1} \cdots \sum_{i_n=0}^{m_n} \left( Y_{i_1,\ldots,i_n} \prod_{j=1}^{n} N^j_{i_j,k_j}(x_j) \right)}{\sum_{i_1=0}^{m_1} \cdots \sum_{i_n=0}^{m_n} \prod_{j=1}^{n} N^j_{i_j,k_j}(x_j)} \tag{3}
$$

$$
= \sum_{i_1=0}^{m_1} \cdots \sum_{i_n=0}^{m_n} \left( Y_{i_1,\ldots,i_n} \prod_{j=1}^{n} N^j_{i_j,k_j}(x_j) \right) \tag{4}
$$

The step from (3) to (4) uses the partition of unity: the denominator of (3) equals 1.

This is called a general NUBS hypersurface, which possesses the following properties:

• If B-spline basis functions of order $k_1, k_2, \ldots, k_n$ are employed to specify the linguistic terms of the input variables $x_1, x_2, \ldots, x_n$, it can be guaranteed that the output variable y is $(k_j - 2)$ times continuously differentiable with respect to the input variable $x_j$, j = 1, . . . , n.

• If the input space is partitioned finely enough and at the correct positions, the interpolation with the B-spline hypersurface can reach a given precision.
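The controller output in (4) can be sketched as a sum over all rules of control vertex times tensor-product activation. The names `controller_output` and `vertices`, the dictionary layout and the knot values are assumptions for this illustration; indices are 0-based.

```python
from itertools import product
from math import prod


def bspline(i, k, x, knots):
    """Univariate B-spline of order k by the recursion (1); i is 0-based."""
    if k == 1:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = (x - knots[i]) / (knots[i + k - 1] - knots[i])
    right = (knots[i + k] - x) / (knots[i + k] - knots[i + 1])
    return (left * bspline(i, k - 1, x, knots)
            + right * bspline(i + 1, k - 1, x, knots))


def controller_output(x, knot_vectors, orders, vertices):
    """y = sum over all rules of Y[i1..in] * prod_j N^j_{ij,kj}(x_j), eq. (4)."""
    counts = [len(t) - k for t, k in zip(knot_vectors, orders)]
    y = 0.0
    for idx in product(*(range(c) for c in counts)):
        weight = prod(bspline(i, k, xj, t)
                      for i, k, xj, t in zip(idx, orders, x, knot_vectors))
        y += vertices[idx] * weight
    return y


knots = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]            # assumed knots, same for both inputs
orders = (3, 3)
vertices = {(i, j): 2.5 for i in range(3) for j in range(3)}   # constant rule outputs
y = controller_output((2.4, 2.6), (knots, knots), orders, vertices)
print(y)
```

With all control vertices equal, the output reproduces that constant wherever the partition of unity holds, which is a quick sanity check for an implementation.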


B-spline Type: SISO Systems

A SISO system with B-functions of order 2 ($X_i(x)$: firing strength of rule i; $y_i$: the contribution of rule i to the output).
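For order 2 the basis functions are triangular, so the controller output interpolates linearly between neighbouring control vertices. A small sketch under assumed knots and vertex values (both hypothetical, indices 0-based):

```python
def bspline(i, k, x, knots):
    """Univariate B-spline of order k by the recursion (1); i is 0-based."""
    if k == 1:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = (x - knots[i]) / (knots[i + k - 1] - knots[i])
    right = (knots[i + k] - x) / (knots[i + k] - knots[i + 1])
    return (left * bspline(i, k - 1, x, knots)
            + right * bspline(i + 1, k - 1, x, knots))


knots = [0.0, 1.0, 2.0, 3.0, 4.0]   # assumed knots: three triangular terms
vertices = [10.0, 20.0, 40.0]       # assumed control vertices, one per rule


def siso_output(x):
    firing = [bspline(i, 2, x, knots) for i in range(len(vertices))]  # X_i(x)
    weighted = [f * v for f, v in zip(firing, vertices)]              # per-rule share
    return sum(weighted)


print(siso_output(1.0), siso_output(1.5), siso_output(2.5))
```

At a peak of a triangular term only one rule fires; halfway between two peaks both neighbouring rules fire with strength 0.5.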


MISO Systems - A 2D Example

An example with two input variables (x and y) and an output z. The control vertices of the output are Z1, Z2, Z3, Z4.

The linguistic terms of the inputs:

The linguistic terms of the output:


A 2D Example - The Rule Base

The rule base consists of four rules:

Rule 1) IF x is X1 and y is Y1 THEN z is Z1





A 2D Example - Inference


A 2D Example - Defuzzification


Supervised Learning

Supervised learning assumes that a “teacher” provides the complete desired system output for each input datum.

Based on the complete set of these input/output vectors, B-spline type fuzzy controllers can be trained very rapidly.

Computing the parameters of such a B-spline fuzzy system is divided into two steps: one for the IF-part and one for the THEN-part.

Considering the granularity of the input space and the maximal point distribution of the control space, if known, the fuzzy sets can be generated using the recursive computation of B-spline basis functions.

The control vertices of the THEN-parts can be obtained automatically through a learning procedure.


Learning algorithm - I

Assume {(X, yd)} is a set of training data, where

• X = (x1, x2, . . . , xn) : the input data vector,

• yd : the desired output for X.

The squared error is computed as:

$$
E = \tfrac{1}{2}(y_r - y_d)^2, \tag{5}
$$

where yr is the current real output value during training.


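The parameters to be found are the $Y_{i_1,i_2,\ldots,i_n}$ which make the error in (5) as small as possible, i.e.

$$
E = \tfrac{1}{2}(y_r - y_d)^2 \equiv \mathrm{MIN}. \tag{6}
$$

Each control vertex $Y_{i_1,\ldots,i_n}$ can be modified by using the gradient descent method. Since $y_r$ is given by (4), $\partial y_r / \partial Y_{i_1,\ldots,i_n} = \prod_{j=1}^{n} N^j_{i_j,k_j}(x_j)$, so that

$$
\Delta Y_{i_1,\ldots,i_n} = -\varepsilon \frac{\partial E}{\partial Y_{i_1,\ldots,i_n}} \tag{7}
$$

$$
= -\varepsilon\,(y_r - y_d) \prod_{j=1}^{n} N^j_{i_j,k_j}(x_j) = \varepsilon\,(y_d - y_r) \prod_{j=1}^{n} N^j_{i_j,k_j}(x_j) \tag{8}
$$

where $0 < \varepsilon \le 1$.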
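The learning rule can be sketched for a single-input controller. The update below is written as $\varepsilon (y_d - y_r) N_{i,k}(x)$, which is what $-\varepsilon\,\partial E / \partial Y_i$ evaluates to for the error (5). The knots, the target function `sin`, the learning rate and the number of passes are assumptions chosen for the example.

```python
import math


def bspline(i, k, x, knots):
    """Univariate B-spline of order k by the recursion (1); i is 0-based."""
    if k == 1:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = (x - knots[i]) / (knots[i + k - 1] - knots[i])
    right = (knots[i + k] - x) / (knots[i + k] - knots[i + 1])
    return (left * bspline(i, k - 1, x, knots)
            + right * bspline(i + 1, k - 1, x, knots))


knots = [float(j) for j in range(8)]   # assumed uniform knots 0..7
k = 3
num = len(knots) - k                   # 5 control vertices
Y = [0.0] * num                        # control vertices, initialised to zero
eps = 0.5                              # learning rate, 0 < eps <= 1

# Training pairs (x, y_d) on [2, 5), where the partition of unity holds.
samples = [(2.0 + 3.0 * j / 30, math.sin(2.0 + 3.0 * j / 30)) for j in range(30)]


def output(x):
    return sum(Y[i] * bspline(i, k, x, knots) for i in range(num))


def mse():
    return sum((output(x) - yd) ** 2 for x, yd in samples) / len(samples)


mse_before = mse()
for _ in range(200):                   # passes over the training data
    for x, yd in samples:
        err = yd - output(x)           # (y_d - y_r)
        for i in range(num):
            Y[i] += eps * err * bspline(i, k, x, knots)
mse_after = mse()
print(mse_before, mse_after)
```

Only the few vertices whose basis functions are non-zero at x actually change in each step, which is one reason such controllers train quickly.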


The gradient descent method guarantees that the learning algorithm converges to the global minimum of the error function, because the second partial derivative of the quadratic error function with respect to $Y_{i_1,i_2,\ldots,i_n}$ is constant (it does not depend on the control vertices) and non-negative:

$$
\frac{\partial^2 E}{\partial Y_{i_1,\ldots,i_n}^2} = \left( \prod_{j=1}^{n} N^j_{i_j,k_j}(x_j) \right)^2 \ge 0. \tag{9}
$$

This means that the error function (5) is convex in the space of the $Y_{i_1,i_2,\ldots,i_n}$ and therefore possesses only one (global) minimum.


Immediate learning by self-evaluation

A fuzzy system can learn under supervision.

Such a learning process needs a teacher, i.e. for each input vector, the desired output should be known. Then the fuzzy controller attempts to interpolate these input/output vectors to provide a continuous (hyper-)surface for the whole control space.

In reality, it is not always simple to find the goal function of the output for a complex system. An unsupervised learning approach should therefore be developed.

Based on a B-spline fuzzy controller, the parameters to be learned are still mainly the control vertices of the “THEN” part.

The key problem of unsupervised learning with such a model is then how to modify the control vertices after each learning step, i.e. the change direction (+ or −) and the change magnitude.


Inspiration by Supervised Learning

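We first discuss a control system with $(X_1, X_2, \ldots, X_n)$ as input and Y as output. Let us rewrite the modification of the control vertices for supervised learning:

$$
\begin{aligned}
\Delta y_{i_1,\ldots,i_m} &= -\varepsilon \frac{\partial E}{\partial y_{i_1,\ldots,i_m}} \\
&= \varepsilon\,(y_d - y_r) \prod_{j=1}^{m} X_{i_j,k_j}(x_j) \\
&= \operatorname{sign}(y_d - y_r)\;\varepsilon\;|y_r - y_d| \prod_{j=1}^{m} X_{i_j,k_j}(x_j)
\end{aligned}
\tag{10}
$$

Here $\operatorname{sign}(y_d - y_r)$ indicates the direction of the modification of $y_{i_1,\ldots,i_m}$ in each learning step, while the product $\varepsilon \cdot |y_r - y_d| \cdot \prod_{j=1}^{m} X_{i_j,k_j}(x_j)$ determines the magnitude of the modification.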


Evaluation Function - I

In unsupervised learning, it is usually possible to define an “evaluation function”. Such an evaluation function should describe how “good” the current system state $((x_1, x_2, \ldots, x_n), y)$ is.

For each input vector, an output is generated. With this output, the system transits to another state. The new state is compared with the old one; an adaptation is performed if necessary.

Assume the evaluation function, denoted by $V(\cdot)$, possesses a bigger value for a better state, i.e. for two states $s_t$ and $s_{t+1}$, if $s_t$ is better than $s_{t+1}$, then $V(s_t) \ge V(s_{t+1})$. The adaptation of the control vertices can be performed with a similar representation as in supervised learning.


Evaluation Function - II

Let us reconsider the modification of the control vertices through equation (10). State $s_t$ transits to $s_{t+1}$ by the output $y_r$. The desired state is $s_d$. We replace $y_r$ in (10) with $V(s_{t+1})$, and $y_d$ with $V(s_d)$.

Assume two system states $s_t$ and $s_{t+1}$, where $s_t$ is better than $s_{t+1}$, i.e. $V(s_t) \ge V(s_{t+1})$, where $V(\cdot)$ is the evaluation function.

We consider those systems for which a function $V(\cdot)$ can be found that fulfills the following condition:

Assume $s_t$ is the current state and y an arbitrary output. With y the system transits to the state $s_{t+1}$. If another output $y'$ fulfills $y \times y' \le 0$, and with $y'$ the system transits to $s'_{t+1}$, the following relation of the evaluation functions is valid:

$$
\left( V(s_{t+1}) - V(s_t) \right) \times \left( V(s'_{t+1}) - V(s_t) \right) \le 0. \tag{11}
$$


Modifying Control Vertices in Reinforcement Learning - I

At the moment t the system has the state $s_t$. The ideal state at the moment t + 1 would be $s_d$.

With the controller output $y_r$ generated at the moment t, the system transits to the state $s_{t+1}$.

Considering the state transition from $s_t$ to $s_{t+1}$, the constellation of $s_t$, $s_{t+1}$ and $s_d$:

(a) (b) (c)


Modifying Control Vertices in Reinforcement Learning - II

(a): The system state becomes worse, i.e. the system acts incorrectly. According to the condition in (11) the change direction is −sign(y).

(b): The system acts in the correct direction. The value of the output should be enlarged. The change direction is then sign(y).

(c): This case is the inverse of case (b). The change direction should be −sign(y).


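These three cases can be synthesized by

$$
S = \operatorname{sign}(V(s_t) - V(s_{t+1})) \cdot \operatorname{sign}(V(s_{t+1}) - V(s_d)) \cdot \operatorname{sign}(y). \tag{12}
$$

The change of the control vertices can finally be written as:

$$
\Delta y_{i_1,\ldots,i_m} = S \cdot \varepsilon \cdot |V(s_{t+1}) - V(s_d)| \cdot \prod_{j=1}^{m} X_{i_j,k_j}(x_j). \tag{13}
$$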
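Equations (12) and (13) translate into a few lines of code. The function names and the numeric values of the evaluation function below are hypothetical; `activation` stands for the product $\prod_j X_{i_j,k_j}(x_j)$ of the rule being updated.

```python
def sign(v):
    """Return -1, 0 or +1."""
    return (v > 0) - (v < 0)


def change_direction(v_t, v_next, v_desired, y):
    """S from eq. (12), built from evaluation values and the output y."""
    return sign(v_t - v_next) * sign(v_next - v_desired) * sign(y)


def vertex_change(v_t, v_next, v_desired, y, eps, activation):
    """Change of one control vertex, eq. (13)."""
    s = change_direction(v_t, v_next, v_desired, y)
    return s * eps * abs(v_next - v_desired) * activation


# Case (a): the state got worse, so the direction is -sign(y).
case_a = change_direction(v_t=-1.0, v_next=-2.0, v_desired=0.0, y=1.0)
# Case (b): correct direction but not far enough, so the direction is sign(y).
case_b = change_direction(v_t=-2.0, v_next=-1.0, v_desired=0.0, y=1.0)
# Case (c): the desired value was overshot, so the direction is -sign(y).
case_c = change_direction(v_t=-2.0, v_next=-0.5, v_desired=-1.0, y=1.0)
print(case_a, case_b, case_c)
print(vertex_change(-2.0, -1.0, 0.0, 1.0, eps=0.5, activation=0.8))
```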


Learning of Cart-Pole Balancing

The pendulum possesses an initial state $(\theta, \dot{\theta})$. The task is to find the force f to be exerted that brings the cart-pole system to the balanced final state $\theta = 0$ and $\dot{\theta} = 0$.

The inputs of the system are:

• angle: $\theta \in [-15^\circ, +15^\circ]$ and

• angular velocity: $\dot{\theta} \in [-20, +20]$ °/s.

Each of the two input variables is covered with 7 B-spline basis functions of order 3.

The output of the system is the force f to be exerted on the cart.


For learning we choose the evaluation function as:

$V(s_t) = V(\theta, \dot{\theta}) \stackrel{\mathrm{def}}{=} -|2\theta + \dot{\theta}|$, and the evaluation function of the desired state $s_d$ is defined relative to that of $s_t$: $V(s_d) \stackrel{\mathrm{def}}{=} 0.5 \cdot V(s_t)$.


CP-Balancing: Control Surfaces

Control surfaces at the beginning, after 100 learning steps, and after 3000 learning steps.


CP-Balancing - Validation

The motion profiles of the pendulum from the starting state ($\theta = -10$, $\dot{\theta} = 10$):

angle:

angle velocity:

applied force:


Inverted Pendulum: I

Problem: balance the pendulum P by controlling the motor M

Input: two state variables:

• angle $\theta$;

• angular velocity $\dot{\theta}$, taken as the difference $\Delta\theta_t = \theta_t - \theta_{t-1}$


Output: one control variable: motor current → motor speed v

Quantization of the three linguistic variables into seven fuzzy sets (linguistic terms) each:

{NB, NM, NS, Z, PS, PM, PB}

Example: Rule (NM, Z; PM)

If the angle $\theta$ is in its medium negative range and the angular velocity $\dot{\theta}$ is approximately zero,

then the motor speed v should be in its medium positive range.

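The rule base in table form (rows: Δθ, columns: θ; "-" marks a combination without a rule):


Δθ \ θ   NB   NM   NS   Z    PS   PM   PB
NB       -    -    -    PB   -    -    -
NM       -    -    -    PM   -    -    -
NS       -    -    -    PS   NS   -    -
Z        PB   PM   PS   Z    NS   NM   NB
PS       -    -    PS   NS   -    -    -
PM       -    -    -    NM   -    -    -
PB       -    -    -    NB   -    -    -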
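Such a rule base can be stored as a lookup table from a pair of input terms to an output term. The sketch below fills in the complete Δθ = Z row and θ = Z column; the placement of the two remaining rules in the cells next to the centre is an assumption about the flattened table, and the dictionary name `RULES` is hypothetical.

```python
TERMS = ["NB", "NM", "NS", "Z", "PS", "PM", "PB"]
OUTPUT_ALONG_AXIS = ["PB", "PM", "PS", "Z", "NS", "NM", "NB"]

# (theta_term, dtheta_term) -> term of the motor speed v
RULES = {}
for theta, v in zip(TERMS, OUTPUT_ALONG_AXIS):
    RULES[(theta, "Z")] = v          # row dtheta = Z
for dtheta, v in zip(TERMS, OUTPUT_ALONG_AXIS):
    RULES[("Z", dtheta)] = v         # column theta = Z

# Two off-axis rules; their exact cells are assumed, not confirmed by the slide.
RULES[("PS", "NS")] = "NS"
RULES[("NS", "PS")] = "PS"

print(len(RULES), RULES[("NM", "Z")])
```

The lookup for (NM, Z) returns PM, which agrees with the example rule (NM, Z; PM).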


Miniature Robot KHEPERA

• Motorola 68331 microcontroller

• 128 KByte RAM, 128 KByte ROM

• Connection to the outside world via a serial cable

• 2 stepper motors, 600 steps/revolution, i.e. one step corresponds to 1/12 mm

• 8 proximity sensors (infrared), Siemens SFH900, sensitive up to a maximum of 5 cm


KHEPERA: Sensors

Input for the controller: IR sensors

0: SL85, 1: SL45, mean of 2 and 3: SLR0,

4: SR45, 5: SR85

Sensor readings versus distance:


The Obstacle Avoidance Problem

Output: speeds of the left and right motors ⇒ robot speed v, steering angle s

Goal: collision avoidance, i.e. driving around obstacles as “smoothly” as possible

Structure of the fuzzy controller:


Membership Functions of the Inputs and Outputs

IR sensor values:

Robot speed v:


Steering angle s:


The Rules of the System: I

Evasive maneuver in free space when an obstacle is detected from the right:


Fuzzy input variables               Output variables
SL85  SL45  SLR0  SR45  SR85        speed  steer
vl    vl    vl    vl    low         high   n
vl    vl    vl    low   low         low    nm
vl    vl    low   low   low         low    nb
vl    low   low   low   low         low    nb
vl    vl    vl    vl    high        low    nm
vl    vl    vl    low   high        vl     nb
vl    vl    low   low   high        vl     nb
vl    vl    vl    high  high        vl     nb
vl    vl    high  high  high        vl     nb
vl    vl    vl    vl    vh          vl     nb
vl    vl    vl    low   vh          vl     nb
vl    vl    vl    high  vh          vl     nb
vl    vl    low   high  vh          vl     nb
vl    low   high  high  vh          vl     nb
vl    vl    vl    vh    vh          vl     nb
vl    vl    low   vh    vh          vl     nb
vl    vl    vh    vh    vh          vl     nb
vl    low   vh    vh    vh          vl     nb
low   high  vh    vh    vh          vl     nb


Autonomous Mobile Robots: 1

Goal: goal-directed driving and collision avoidance

Special features:

• Fuzzification of the sensor signals;

(b) Laser range finder



• Fuzzy rules for realizing behavior patterns (“behaviors”);

GO → SC 1 rule

OP → SC 4 rules

GO → TC 3 rules

“Far” OP → TC 2 rules

“Near” OP → TC 2 rules

“Very close” OP → TC 3 rules

where SC (“speed control”) and TC (“turn control”) are functions of GO (“goal orientation”) and OP (“obstacle proximity”).



• Representation of the “goal-tracking” behavior:


• On-board VLSI chip

→ All rules can be processed in 30 µs.


Reinforcement Learning

In each control cycle the robot receives both sensor data and a reinforcement signal; it then executes an action that changes its state.

Reinforcement learning lies between supervised learning and unsupervised learning.

The robot agent can also learn from a “delayed reinforcement” signal. In this case an action of the robot agent is rewarded even if it led to the goal only indirectly. This can happen when the action had to be executed in order to make further actions toward the goal state possible.


Skill Acquisition by a Robot

Skill acquisition: “Improvement of motor or cognitive abilities through training. Reading a set of instructions provides only the initial knowledge, which must then be successively improved and refined.”

(Carbonell et al. 1983)

Illustration of the reinforcement learning problem:


Markov Decision Process (MDP)

An MDP is given by

• a set S of discrete states,

• a set A of possible actions,

• a reward function $r_t = r(s_t, a_t)$,

• a successor function $s_{t+1} = \delta(s_t, a_t)$.

The functions r and δ are part of the environment and are not necessarily known to the agent.


Graph of a Markov Decision Process


The Problem of Incomplete State Information

One also speaks of hidden states.

Example of incomplete state information:

a) b) c) d)


Course of the MDP

At every time step t the agent goes through the following steps:

1. Determine the current state $s_t$.

2. Choose an action $a_t$.

3. Execute $a_t$.

4. Receive the reward $r_t = r(s_t, a_t)$.

In reaction to $a_t$ the environment moves to a new state $s_{t+1} = \delta(s_t, a_t)$.
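The four steps can be sketched as a loop over a small deterministic MDP. The three-state chain, its reward and all names below are hypothetical and serve only to make the loop runnable.

```python
GOAL = 2
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]


def delta(s, a):
    """Successor function s_{t+1} = delta(s_t, a_t); the goal is absorbing."""
    if s == GOAL:
        return GOAL
    return s + 1 if a == "right" else max(s - 1, 0)


def reward(s, a):
    """Reward r_t = r(s_t, a_t): 1 for the step that enters the goal, else 0."""
    return 1.0 if s != GOAL and delta(s, a) == GOAL else 0.0


def run(policy, s0, steps):
    """Let an agent with the given policy act for a number of time steps."""
    s, history = s0, []
    for _ in range(steps):
        # 1. the current state s is known to the agent in this toy example
        a = policy(s)             # 2. choose an action
        r = reward(s, a)          # 3./4. execute it and receive the reward
        history.append((s, a, r))
        s = delta(s, a)           # the environment moves to the next state
    return history


print(run(lambda s: "right", 0, 3))
```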


Policy

A function $\pi : S \to A$

is called a policy.

It represents a strategy by which the agent selects an action $a = \pi(s)$ in a given state s.

The task is to learn this function π.


Cumulative Value

The cumulative value $V^\pi(s_t)$

is the cumulative reward that the agent obtains when it starts from a state $s_t$ and follows a policy π.

There are different definitions of $V^\pi(s_t)$, which take future rewards into account in different ways.


Definitions of $V^\pi(s_t)$

• “Discounted cumulative reward”: $V^\pi(s_t) = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$

• “Finite horizon reward”: $V^\pi(s_t) = \sum_{i=0}^{h} r_{t+i}$

• “Average reward”: $V^\pi(s_t) = \lim_{h \to \infty} \frac{1}{h} \sum_{i=0}^{h} r_{t+i}$
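For a recorded reward sequence the three definitions can be evaluated as follows. This is only a finite sketch: the infinite sum is truncated to the available rewards, the average is the expression inside the limit for a finite h, and the function names are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^i * r_{t+i} over the available rewards."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))


def finite_horizon_return(rewards, h):
    """Sum of r_{t+i} for i = 0..h."""
    return sum(rewards[: h + 1])


def average_reward(rewards, h):
    """(1/h) * sum of r_{t+i} for i = 0..h, the expression inside the limit."""
    return sum(rewards[: h + 1]) / h


rewards = [0.0, 1.0, 0.0, 1.0, 0.0]   # assumed reward sequence r_t, r_{t+1}, ...
print(discounted_return(rewards, 0.5))
print(finite_horizon_return(rewards, 1))
print(average_reward(rewards, 4))
```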


Optimal Policy

A policy that maximizes $V^\pi(s)$ for all states s is called an optimal policy $\pi^*$:

$$
\pi^* \equiv \arg\max_{\pi} V^\pi(s), \quad \forall s
$$

The cumulative value of an optimal policy is also denoted by $V^*(s)$:

$$
V^*(s) \equiv V^{\pi^*}(s)
$$


Learning the Optimal Policy

From the definition of $V^\pi(s_t)$

$$
V^\pi(s_t) = \sum_{i=0}^{\infty} \gamma^i r_{t+i}
$$

it follows immediately for $\pi^*(s)$:

$$
\pi^*(s) = \arg\max_{a} \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]
$$

That is, the optimal policy can be learned by learning $V^*$, provided that r and δ are known.

But this is often not the case!
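When r and δ are known, $V^*$ can be computed by repeatedly applying the maximization on the right-hand side, and the policy then follows from the arg max. The sketch below does this for a hypothetical three-state chain; the iteration scheme (plain value iteration), the discount factor and all names are assumptions, not taken from the slides.

```python
GOAL = 2
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
GAMMA = 0.9


def delta(s, a):
    """Known successor function; the goal state is absorbing."""
    if s == GOAL:
        return GOAL
    return s + 1 if a == "right" else max(s - 1, 0)


def reward(s, a):
    """Known reward function: 1 for the step that enters the goal, else 0."""
    return 1.0 if s != GOAL and delta(s, a) == GOAL else 0.0


# Value iteration: V(s) <- max_a [ r(s, a) + gamma * V(delta(s, a)) ]
V = {s: 0.0 for s in STATES}
for _ in range(100):
    V = {s: max(reward(s, a) + GAMMA * V[delta(s, a)] for a in ACTIONS)
         for s in STATES}


def optimal_action(s):
    """pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]."""
    return max(ACTIONS, key=lambda a: reward(s, a) + GAMMA * V[delta(s, a)])


print(V, optimal_action(0), optimal_action(1))
```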


Model-Based or Model-Free?

Model-based reinforcement learning:

e.g. with dynamic programming.

Comparison with A* search.

Application example: e.g. collision-free path planning with a known representation of the environment.

Model-free reinforcement learning:

r and δ are unknown.

⇒ Q-learning
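A minimal sketch of Q-learning for a deterministic environment is given below. The update rule $Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a')$ is the common textbook form for deterministic MDPs and is not taken from the slides; the toy chain, the random exploration and all names are assumptions. The agent only observes rewards and successor states and never reads the definitions of `reward` and `delta`.

```python
import random

GOAL = 2
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
GAMMA = 0.9


def delta(s, a):
    """Environment dynamics (hidden from the learner); the goal is absorbing."""
    if s == GOAL:
        return GOAL
    return s + 1 if a == "right" else max(s - 1, 0)


def reward(s, a):
    """Environment reward (hidden from the learner)."""
    return 1.0 if s != GOAL and delta(s, a) == GOAL else 0.0


rng = random.Random(0)                       # fixed seed for repeatability
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

for _ in range(500):                         # episodes
    s = rng.choice([0, 1])
    for _ in range(20):                      # at most 20 steps per episode
        a = rng.choice(ACTIONS)              # purely random exploration
        r, s_next = reward(s, a), delta(s, a)
        Q[(s, a)] = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
        s = s_next
        if s == GOAL:
            break


def greedy(s):
    """Policy read off from the learned Q-values."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])


print(Q[(0, "right")], Q[(1, "right")], greedy(0), greedy(1))
```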

