optimal transport methods in operations research and

170
Optimal Transport Methods in Operations Research and Statistics Jose Blanchet (based on work with F. He, Y. Kang, K. Murthy, F. Zhang). Stanford University (Management Science and Engineering), and Columbia University (Department of Statistics and Department of IEOR). Blanchet (Columbia U. and Stanford U.) 1 / 60

Upload: others

Post on 13-Jan-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Optimal Transport Methods in Operations Research and

Optimal Transport Methods in Operations Research andStatistics

Jose Blanchet (based on work with F. He, Y. Kang, K. Murthy, F.Zhang).

Stanford University (Management Science and Engineering), and Columbia University(Department of Statistics and Department of IEOR).

Blanchet (Columbia U. and Stanford U.) 1 / 60

Page 2: Optimal Transport Methods in Operations Research and

Goal:

Goal: Introduce optimal transport techniquesand applications in OR & Statistics

Optimal transport is useful tool in model robustness, equilibrium,and machine learning!

Blanchet (Columbia U. and Stanford U.) 2 / 60

Page 3: Optimal Transport Methods in Operations Research and

Agenda

Introduction to Optimal Transport

Economic Interpretations and Wasserstein Distances

Applications in Stochastic Operations Research

Applications in Distributionally Robust Optimization

Applications in Statistics

Blanchet (Columbia U. and Stanford U.) 3 / 60

Page 4: Optimal Transport Methods in Operations Research and

Agenda

Introduction to Optimal Transport

Economic Interpretations and Wasserstein Distances

Applications in Stochastic Operations Research

Applications in Distributionally Robust Optimization

Applications in Statistics

Blanchet (Columbia U. and Stanford U.) 3 / 60

Page 5: Optimal Transport Methods in Operations Research and

Agenda

Introduction to Optimal Transport

Economic Interpretations and Wasserstein Distances

Applications in Stochastic Operations Research

Applications in Distributionally Robust Optimization

Applications in Statistics

Blanchet (Columbia U. and Stanford U.) 3 / 60

Page 6: Optimal Transport Methods in Operations Research and

Agenda

Introduction to Optimal Transport

Economic Interpretations and Wasserstein Distances

Applications in Stochastic Operations Research

Applications in Distributionally Robust Optimization

Applications in Statistics

Blanchet (Columbia U. and Stanford U.) 3 / 60

Page 7: Optimal Transport Methods in Operations Research and

Agenda

Introduction to Optimal Transport

Economic Interpretations and Wasserstein Distances

Applications in Stochastic Operations Research

Applications in Distributionally Robust Optimization

Applications in Statistics

Blanchet (Columbia U. and Stanford U.) 3 / 60

Page 8: Optimal Transport Methods in Operations Research and

Introduction to Optimal Transport

Monge-Kantorovich Problem & Duality(see e.g. C. Villani’s 2008 textbook)

Blanchet (Columbia U. and Stanford U.) 4 / 60

Page 9: Optimal Transport Methods in Operations Research and

Monge Problem

What’s the cheapest way to transport a pile of sand to cover asinkhole?

Blanchet (Columbia U. and Stanford U.) 5 / 60

Page 10: Optimal Transport Methods in Operations Research and

Monge Problem

What’s the cheapest way to transport a pile of sand to cover asinkhole?

minT (·):T (X )∼v

Eµ {c (X ,T (X ))} ,

where c (x , y) ≥ 0 is the cost of transporting x to y .T (X ) ∼ v means T (X ) follows distribution v (·).Problem is highly non-linear, not much progress for about 160 yrs!

Blanchet (Columbia U. and Stanford U.) 6 / 60

Page 11: Optimal Transport Methods in Operations Research and

Monge Problem

What’s the cheapest way to transport a pile of sand to cover asinkhole?

minT (·):T (X )∼v

Eµ {c (X ,T (X ))} ,

where c (x , y) ≥ 0 is the cost of transporting x to y .

T (X ) ∼ v means T (X ) follows distribution v (·).Problem is highly non-linear, not much progress for about 160 yrs!

Blanchet (Columbia U. and Stanford U.) 6 / 60

Page 12: Optimal Transport Methods in Operations Research and

Monge Problem

What’s the cheapest way to transport a pile of sand to cover asinkhole?

minT (·):T (X )∼v

Eµ {c (X ,T (X ))} ,

where c (x , y) ≥ 0 is the cost of transporting x to y .T (X ) ∼ v means T (X ) follows distribution v (·).

Problem is highly non-linear, not much progress for about 160 yrs!

Blanchet (Columbia U. and Stanford U.) 6 / 60

Page 13: Optimal Transport Methods in Operations Research and

Monge Problem

What’s the cheapest way to transport a pile of sand to cover asinkhole?

minT (·):T (X )∼v

Eµ {c (X ,T (X ))} ,

where c (x , y) ≥ 0 is the cost of transporting x to y .T (X ) ∼ v means T (X ) follows distribution v (·).Problem is highly non-linear, not much progress for about 160 yrs!

Blanchet (Columbia U. and Stanford U.) 6 / 60

Page 14: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Primal Problem

Let Π (µ, v) be the class of joint distributions π of random variables(X ,Y ) such that

πX = marginal of X = µ, πY = marginal of Y = v .

Solvemin{Eπ [c (X ,Y )] : π ∈ Π (µ, v)}

Linear programming (infinite dimensional):

Dc (µ, v) : = minπ(dx ,dy )≥0

∫X×Y

c (x , y)π (dx , dy)∫Y

π (dx , dy) = µ (dx) ,∫X

π (dx , dy) = v (dy) .

If c (x , y) = dp (x , y) (d-metric) then D1/pc (µ, v) is a p-Wasserstein

metric.

Blanchet (Columbia U. and Stanford U.) 7 / 60

Page 15: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Primal Problem

Let Π (µ, v) be the class of joint distributions π of random variables(X ,Y ) such that

πX = marginal of X = µ, πY = marginal of Y = v .

Solvemin{Eπ [c (X ,Y )] : π ∈ Π (µ, v)}

Linear programming (infinite dimensional):

Dc (µ, v) : = minπ(dx ,dy )≥0

∫X×Y

c (x , y)π (dx , dy)∫Y

π (dx , dy) = µ (dx) ,∫X

π (dx , dy) = v (dy) .

If c (x , y) = dp (x , y) (d-metric) then D1/pc (µ, v) is a p-Wasserstein

metric.

Blanchet (Columbia U. and Stanford U.) 7 / 60

Page 16: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Primal Problem

Let Π (µ, v) be the class of joint distributions π of random variables(X ,Y ) such that

πX = marginal of X = µ, πY = marginal of Y = v .

Solvemin{Eπ [c (X ,Y )] : π ∈ Π (µ, v)}

Linear programming (infinite dimensional):

Dc (µ, v) : = minπ(dx ,dy )≥0

∫X×Y

c (x , y)π (dx , dy)∫Y

π (dx , dy) = µ (dx) ,∫X

π (dx , dy) = v (dy) .

If c (x , y) = dp (x , y) (d-metric) then D1/pc (µ, v) is a p-Wasserstein

metric.

Blanchet (Columbia U. and Stanford U.) 7 / 60

Page 17: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Primal Problem

Let Π (µ, v) be the class of joint distributions π of random variables(X ,Y ) such that

πX = marginal of X = µ, πY = marginal of Y = v .

Solvemin{Eπ [c (X ,Y )] : π ∈ Π (µ, v)}

Linear programming (infinite dimensional):

Dc (µ, v) : = minπ(dx ,dy )≥0

∫X×Y

c (x , y)π (dx , dy)∫Y

π (dx , dy) = µ (dx) ,∫X

π (dx , dy) = v (dy) .

If c (x , y) = dp (x , y) (d-metric) then D1/pc (µ, v) is a p-Wasserstein

metric.

Blanchet (Columbia U. and Stanford U.) 7 / 60

Page 18: Optimal Transport Methods in Operations Research and

Illustration of Optimal Transport Costs

Monge’s solution would take the form

π∗ (dx , dy) = δ{T (x )} (dy) µ (dx) .

Blanchet (Columbia U. and Stanford U.) 8 / 60

Page 19: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Dual Problem

Primal has always a solution for c (·) ≥ 0 lower semicontinuous.

Linear programming (Dual):

supα,β

∫X

α (x) µ (dx) +∫Y

β (y) v (dy)

α (x) + β (y) ≤ c (x , y) ∀ (x , y) ∈ X ×Y .

Dual α and β can be taken over continuous functions.

Complementary slackness: Equality holds on the support of π∗

(primal optimizer).

Blanchet (Columbia U. and Stanford U.) 9 / 60

Page 20: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Dual Problem

Primal has always a solution for c (·) ≥ 0 lower semicontinuous.Linear programming (Dual):

supα,β

∫X

α (x) µ (dx) +∫Y

β (y) v (dy)

α (x) + β (y) ≤ c (x , y) ∀ (x , y) ∈ X ×Y .

Dual α and β can be taken over continuous functions.

Complementary slackness: Equality holds on the support of π∗

(primal optimizer).

Blanchet (Columbia U. and Stanford U.) 9 / 60

Page 21: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Dual Problem

Primal has always a solution for c (·) ≥ 0 lower semicontinuous.Linear programming (Dual):

supα,β

∫X

α (x) µ (dx) +∫Y

β (y) v (dy)

α (x) + β (y) ≤ c (x , y) ∀ (x , y) ∈ X ×Y .

Dual α and β can be taken over continuous functions.

Complementary slackness: Equality holds on the support of π∗

(primal optimizer).

Blanchet (Columbia U. and Stanford U.) 9 / 60

Page 22: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Dual Problem

Primal has always a solution for c (·) ≥ 0 lower semicontinuous.Linear programming (Dual):

supα,β

∫X

α (x) µ (dx) +∫Y

β (y) v (dy)

α (x) + β (y) ≤ c (x , y) ∀ (x , y) ∈ X ×Y .

Dual α and β can be taken over continuous functions.

Complementary slackness: Equality holds on the support of π∗

(primal optimizer).

Blanchet (Columbia U. and Stanford U.) 9 / 60

Page 23: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Primal Interpretation

John wants to remove of a pile of sand, µ (·).

Peter wants to cover a sinkhole, v (·).Cost for John and Peter to transport the sand to cover the sinkhole is

Dc (µ, v) =∫X×Y

c (x , y)π∗ (dx , dy) .

Now comes Maria, who has a business...

Maria promises to transport on behalf of John and Peter the wholeamount.

Blanchet (Columbia U. and Stanford U.) 10 / 60

Page 24: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Primal Interpretation

John wants to remove of a pile of sand, µ (·).Peter wants to cover a sinkhole, v (·).

Cost for John and Peter to transport the sand to cover the sinkhole is

Dc (µ, v) =∫X×Y

c (x , y)π∗ (dx , dy) .

Now comes Maria, who has a business...

Maria promises to transport on behalf of John and Peter the wholeamount.

Blanchet (Columbia U. and Stanford U.) 10 / 60

Page 25: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Primal Interpretation

John wants to remove of a pile of sand, µ (·).Peter wants to cover a sinkhole, v (·).Cost for John and Peter to transport the sand to cover the sinkhole is

Dc (µ, v) =∫X×Y

c (x , y)π∗ (dx , dy) .

Now comes Maria, who has a business...

Maria promises to transport on behalf of John and Peter the wholeamount.

Blanchet (Columbia U. and Stanford U.) 10 / 60

Page 26: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Primal Interpretation

John wants to remove of a pile of sand, µ (·).Peter wants to cover a sinkhole, v (·).Cost for John and Peter to transport the sand to cover the sinkhole is

Dc (µ, v) =∫X×Y

c (x , y)π∗ (dx , dy) .

Now comes Maria, who has a business...

Maria promises to transport on behalf of John and Peter the wholeamount.

Blanchet (Columbia U. and Stanford U.) 10 / 60

Page 27: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Primal Interpretation

John wants to remove of a pile of sand, µ (·).Peter wants to cover a sinkhole, v (·).Cost for John and Peter to transport the sand to cover the sinkhole is

Dc (µ, v) =∫X×Y

c (x , y)π∗ (dx , dy) .

Now comes Maria, who has a business...

Maria promises to transport on behalf of John and Peter the wholeamount.

Blanchet (Columbia U. and Stanford U.) 10 / 60

Page 28: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Primal Interpretation

Maria charges John α (x) per-unit of mass at x (similarly to Peter).

For Peter and John to agree we must have

α (x) + β (y) ≤ c (x , y) .

Maria wishes to maximize her profit∫α (x) µ (dx) +

∫β (y) v (dy) .

Kantorovich duality says primal and dual optimal values coincide and(under mild regularity)

α∗ (x) = infy{c (x , y)− β∗ (y)}

β∗ (y) = infx{c (x , y)− α∗ (x)} .

Blanchet (Columbia U. and Stanford U.) 11 / 60

Page 29: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Primal Interpretation

Maria charges John α (x) per-unit of mass at x (similarly to Peter).

For Peter and John to agree we must have

α (x) + β (y) ≤ c (x , y) .

Maria wishes to maximize her profit∫α (x) µ (dx) +

∫β (y) v (dy) .

Kantorovich duality says primal and dual optimal values coincide and(under mild regularity)

α∗ (x) = infy{c (x , y)− β∗ (y)}

β∗ (y) = infx{c (x , y)− α∗ (x)} .

Blanchet (Columbia U. and Stanford U.) 11 / 60

Page 30: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Primal Interpretation

Maria charges John α (x) per-unit of mass at x (similarly to Peter).

For Peter and John to agree we must have

α (x) + β (y) ≤ c (x , y) .

Maria wishes to maximize her profit∫α (x) µ (dx) +

∫β (y) v (dy) .

Kantorovich duality says primal and dual optimal values coincide and(under mild regularity)

α∗ (x) = infy{c (x , y)− β∗ (y)}

β∗ (y) = infx{c (x , y)− α∗ (x)} .

Blanchet (Columbia U. and Stanford U.) 11 / 60

Page 31: Optimal Transport Methods in Operations Research and

Kantorovich Relaxation: Primal Interpretation

Maria charges John α (x) per-unit of mass at x (similarly to Peter).

For Peter and John to agree we must have

α (x) + β (y) ≤ c (x , y) .

Maria wishes to maximize her profit∫α (x) µ (dx) +

∫β (y) v (dy) .

Kantorovich duality says primal and dual optimal values coincide and(under mild regularity)

α∗ (x) = infy{c (x , y)− β∗ (y)}

β∗ (y) = infx{c (x , y)− α∗ (x)} .

Blanchet (Columbia U. and Stanford U.) 11 / 60

Page 32: Optimal Transport Methods in Operations Research and

Proof Techniques

Suppose X and Y compact

supπ≥0,

infα,β

{∫X×Y

c (x , y)π (dx , dy)

−∫X×Y

α (x)π (dx , dy) +∫X

α (x) µ (dx)

−∫X×Y

β (y)π (dx , dy) +∫Y

β (y) v (dy)}

Swap sup and inf using Sion’s min-max theorem by a compactnessargument and conclude.

Significant amount of work needed to extend to general Polish spacesand construct the dual optimizers (primal a bit easier).

Blanchet (Columbia U. and Stanford U.) 12 / 60

Page 33: Optimal Transport Methods in Operations Research and

Proof Techniques

Suppose X and Y compact

supπ≥0,

infα,β

{∫X×Y

c (x , y)π (dx , dy)

−∫X×Y

α (x)π (dx , dy) +∫X

α (x) µ (dx)

−∫X×Y

β (y)π (dx , dy) +∫Y

β (y) v (dy)}

Swap sup and inf using Sion’s min-max theorem by a compactnessargument and conclude.

Significant amount of work needed to extend to general Polish spacesand construct the dual optimizers (primal a bit easier).

Blanchet (Columbia U. and Stanford U.) 12 / 60

Page 34: Optimal Transport Methods in Operations Research and

Proof Techniques

Suppose X and Y compact

supπ≥0,

infα,β

{∫X×Y

c (x , y)π (dx , dy)

−∫X×Y

α (x)π (dx , dy) +∫X

α (x) µ (dx)

−∫X×Y

β (y)π (dx , dy) +∫Y

β (y) v (dy)}

Swap sup and inf using Sion’s min-max theorem by a compactnessargument and conclude.

Significant amount of work needed to extend to general Polish spacesand construct the dual optimizers (primal a bit easier).

Blanchet (Columbia U. and Stanford U.) 12 / 60

Page 35: Optimal Transport Methods in Operations Research and

Optimal Transport Applications

Optimal Transport has gained popularity in many areasincluding: image analysis, economics, statistics, machine learning...

The rest of the talk mostly concerns applications to OR and Statisticsbut we’ll briefly touch upon others, including economics...

Blanchet (Columbia U. and Stanford U.) 13 / 60

Page 36: Optimal Transport Methods in Operations Research and

Illustration of Optimal Transport in Image Analysis

Santambrogio (2010)’s illustration

Blanchet (Columbia U. and Stanford U.) 14 / 60

Page 37: Optimal Transport Methods in Operations Research and

Application of Optimal Transport in Economics

Economic Interpretations(see e.g. A. Galichon’s 2016 textbook & McCaan 2013 notes).

Blanchet (Columbia U. and Stanford U.) 15 / 60

Page 38: Optimal Transport Methods in Operations Research and

Applications in Labor Markets

Worker with skill x & company with technology y have surplusΨ (x , y).

The population of workers is given by µ (x).

The population of companies is given by v (y).

The salary of worker x is α (x) & cost of technology y is β (y)

α (x) + β (y) ≥ Ψ (x , y) .

Companies want to minimize total production cost∫α (x) µ (x) dx +

∫β (y) v (y) dy

Blanchet (Columbia U. and Stanford U.) 16 / 60

Page 39: Optimal Transport Methods in Operations Research and

Applications in Labor Markets

Worker with skill x & company with technology y have surplusΨ (x , y).The population of workers is given by µ (x).

The population of companies is given by v (y).

The salary of worker x is α (x) & cost of technology y is β (y)

α (x) + β (y) ≥ Ψ (x , y) .

Companies want to minimize total production cost∫α (x) µ (x) dx +

∫β (y) v (y) dy

Blanchet (Columbia U. and Stanford U.) 16 / 60

Page 40: Optimal Transport Methods in Operations Research and

Applications in Labor Markets

Worker with skill x & company with technology y have surplusΨ (x , y).The population of workers is given by µ (x).

The population of companies is given by v (y).

The salary of worker x is α (x) & cost of technology y is β (y)

α (x) + β (y) ≥ Ψ (x , y) .

Companies want to minimize total production cost∫α (x) µ (x) dx +

∫β (y) v (y) dy

Blanchet (Columbia U. and Stanford U.) 16 / 60

Page 41: Optimal Transport Methods in Operations Research and

Applications in Labor Markets

Worker with skill x & company with technology y have surplusΨ (x , y).The population of workers is given by µ (x).

The population of companies is given by v (y).

The salary of worker x is α (x) & cost of technology y is β (y)

α (x) + β (y) ≥ Ψ (x , y) .

Companies want to minimize total production cost∫α (x) µ (x) dx +

∫β (y) v (y) dy

Blanchet (Columbia U. and Stanford U.) 16 / 60

Page 42: Optimal Transport Methods in Operations Research and

Applications in Labor Markets

Worker with skill x & company with technology y have surplusΨ (x , y).The population of workers is given by µ (x).

The population of companies is given by v (y).

The salary of worker x is α (x) & cost of technology y is β (y)

α (x) + β (y) ≥ Ψ (x , y) .

Companies want to minimize total production cost∫α (x) µ (x) dx +

∫β (y) v (y) dy

Blanchet (Columbia U. and Stanford U.) 16 / 60

Page 43: Optimal Transport Methods in Operations Research and

Applications in Labor Markets

Letting a central planner organize the Labor market

The planner wishes to maximize total surplus∫Ψ (x , y)π (dx , dy)

Over assignments π (·) which satisfy market clearing∫Y

π (dx , dy) = µ (dx) ,∫X

π (dx , dy) = v (dy) .

Blanchet (Columbia U. and Stanford U.) 17 / 60

Page 44: Optimal Transport Methods in Operations Research and

Applications in Labor Markets

Letting a central planner organize the Labor market

The planner wishes to maximize total surplus∫Ψ (x , y)π (dx , dy)

Over assignments π (·) which satisfy market clearing∫Y

π (dx , dy) = µ (dx) ,∫X

π (dx , dy) = v (dy) .

Blanchet (Columbia U. and Stanford U.) 17 / 60

Page 45: Optimal Transport Methods in Operations Research and

Applications in Labor Markets

Letting a central planner organize the Labor market

The planner wishes to maximize total surplus∫Ψ (x , y)π (dx , dy)

Over assignments π (·) which satisfy market clearing∫Y

π (dx , dy) = µ (dx) ,∫X

π (dx , dy) = v (dy) .

Blanchet (Columbia U. and Stanford U.) 17 / 60

Page 46: Optimal Transport Methods in Operations Research and

Solving for Optimal Transport Coupling

Suppose that Ψ (x , y) = xy , µ (x) = I (x ∈ [0, 1]),v (y) = e−y I (y > 0).

Solve primal by sampling: Let {X ni }ni=1 and {Y ni }

ni=1 both i.i.d. from

µ and v , respectively.

Fµn (x) =1n

n

∑i=1I (X ni ≤ x) , Fvn (y) =

1n

n

∑j=1I(Y nj ≤ y

)Consider

maxπ(x ni ,x nj )≥0

∑i ,j

Ψ(xni , y

nj

)π(xni , y

nj

)∑j

π(xni , y

nj

)=1n∀xi , ∑

iπ(xni , y

nj

)=1n∀yj .

Clearly, simply sort and match is the solution!

Blanchet (Columbia U. and Stanford U.) 18 / 60

Page 47: Optimal Transport Methods in Operations Research and

Solving for Optimal Transport Coupling

Suppose that Ψ (x , y) = xy , µ (x) = I (x ∈ [0, 1]),v (y) = e−y I (y > 0).

Solve primal by sampling: Let {X ni }ni=1 and {Y ni }

ni=1 both i.i.d. from

µ and v , respectively.

Fµn (x) =1n

n

∑i=1I (X ni ≤ x) , Fvn (y) =

1n

n

∑j=1I(Y nj ≤ y

)

Consider

maxπ(x ni ,x nj )≥0

∑i ,j

Ψ(xni , y

nj

)π(xni , y

nj

)∑j

π(xni , y

nj

)=1n∀xi , ∑

iπ(xni , y

nj

)=1n∀yj .

Clearly, simply sort and match is the solution!

Blanchet (Columbia U. and Stanford U.) 18 / 60

Page 48: Optimal Transport Methods in Operations Research and

Solving for Optimal Transport Coupling

Suppose that Ψ (x , y) = xy , µ (x) = I (x ∈ [0, 1]),v (y) = e−y I (y > 0).

Solve primal by sampling: Let {X ni }ni=1 and {Y ni }

ni=1 both i.i.d. from

µ and v , respectively.

Fµn (x) =1n

n

∑i=1I (X ni ≤ x) , Fvn (y) =

1n

n

∑j=1I(Y nj ≤ y

)Consider

maxπ(x ni ,x nj )≥0

∑i ,j

Ψ(xni , y

nj

)π(xni , y

nj

)∑j

π(xni , y

nj

)=1n∀xi , ∑

iπ(xni , y

nj

)=1n∀yj .

Clearly, simply sort and match is the solution!

Blanchet (Columbia U. and Stanford U.) 18 / 60

Page 49: Optimal Transport Methods in Operations Research and

Solving for Optimal Transport Coupling

Suppose that Ψ (x , y) = xy , µ (x) = I (x ∈ [0, 1]),v (y) = e−y I (y > 0).

Solve primal by sampling: Let {X ni }ni=1 and {Y ni }

ni=1 both i.i.d. from

µ and v , respectively.

Fµn (x) =1n

n

∑i=1I (X ni ≤ x) , Fvn (y) =

1n

n

∑j=1I(Y nj ≤ y

)Consider

maxπ(x ni ,x nj )≥0

∑i ,j

Ψ(xni , y

nj

)π(xni , y

nj

)∑j

π(xni , y

nj

)=1n∀xi , ∑

iπ(xni , y

nj

)=1n∀yj .

Clearly, simply sort and match is the solution!

Blanchet (Columbia U. and Stanford U.) 18 / 60

Page 50: Optimal Transport Methods in Operations Research and

Solving for Optimal Transport Coupling

Think of Y nj = − log(1− Unj

)for Unj s i.i.d. uniform(0, 1).

The j-th order statistic X n(j) is matched to Yn(j).

As n→ ∞, X n(nt) → t, so Y n(nt) → − log (1− t).Thus, the optimal coupling as n→ ∞ is X = U andY = − log (1− U) (comonotonic coupling).

Blanchet (Columbia U. and Stanford U.) 19 / 60

Page 51: Optimal Transport Methods in Operations Research and

Solving for Optimal Transport Coupling

Think of Y nj = − log(1− Unj

)for Unj s i.i.d. uniform(0, 1).

The j-th order statistic X n(j) is matched to Yn(j).

As n→ ∞, X n(nt) → t, so Y n(nt) → − log (1− t).Thus, the optimal coupling as n→ ∞ is X = U andY = − log (1− U) (comonotonic coupling).

Blanchet (Columbia U. and Stanford U.) 19 / 60

Page 52: Optimal Transport Methods in Operations Research and

Solving for Optimal Transport Coupling

Think of Y nj = − log(1− Unj

)for Unj s i.i.d. uniform(0, 1).

The j-th order statistic X n(j) is matched to Yn(j).

As n→ ∞, X n(nt) → t, so Y n(nt) → − log (1− t).

Thus, the optimal coupling as n→ ∞ is X = U andY = − log (1− U) (comonotonic coupling).

Blanchet (Columbia U. and Stanford U.) 19 / 60

Page 53: Optimal Transport Methods in Operations Research and

Solving for Optimal Transport Coupling

Think of Y nj = − log(1− Unj

)for Unj s i.i.d. uniform(0, 1).

The j-th order statistic X n(j) is matched to Yn(j).

As n→ ∞, X n(nt) → t, so Y n(nt) → − log (1− t).Thus, the optimal coupling as n→ ∞ is X = U andY = − log (1− U) (comonotonic coupling).

Blanchet (Columbia U. and Stanford U.) 19 / 60

Page 54: Optimal Transport Methods in Operations Research and

Identities for Wasserstein Distances

Comonotonic coupling is the solution if ∂2x ,yΨ (x , y) ≥ 0 -supermodularity.

Of for costs c (x , y) = −Ψ (x , y) if ∂2x ,y c (x , y) ≤ 0 (submodularity).Corollary: Suppose c (x , y) = |x − y | then X = F−1µ (U) andY = F−1v (U) thus

Dc(Fµ,Fv

)=∫ 1

0

∣∣∣F−1µ (u)− F−1v (u)∣∣∣ du.

Similar identities are common for Wasserstein distances...

Blanchet (Columbia U. and Stanford U.) 20 / 60

Page 55: Optimal Transport Methods in Operations Research and

Identities for Wasserstein Distances

Comonotonic coupling is the solution if ∂2x ,yΨ (x , y) ≥ 0 -supermodularity.

Of for costs c (x , y) = −Ψ (x , y) if ∂2x ,y c (x , y) ≤ 0 (submodularity).

Corollary: Suppose c (x , y) = |x − y | then X = F−1µ (U) andY = F−1v (U) thus

Dc(Fµ,Fv

)=∫ 1

0

∣∣∣F−1µ (u)− F−1v (u)∣∣∣ du.

Similar identities are common for Wasserstein distances...

Blanchet (Columbia U. and Stanford U.) 20 / 60

Page 56: Optimal Transport Methods in Operations Research and

Identities for Wasserstein Distances

Comonotonic coupling is the solution if ∂2x ,yΨ (x , y) ≥ 0 -supermodularity.

Of for costs c (x , y) = −Ψ (x , y) if ∂2x ,y c (x , y) ≤ 0 (submodularity).Corollary: Suppose c (x , y) = |x − y | then X = F−1µ (U) andY = F−1v (U) thus

Dc(Fµ,Fv

)=∫ 1

0

∣∣∣F−1µ (u)− F−1v (u)∣∣∣ du.

Similar identities are common for Wasserstein distances...

Blanchet (Columbia U. and Stanford U.) 20 / 60

Page 57: Optimal Transport Methods in Operations Research and

Identities for Wasserstein Distances

Comonotonic coupling is the solution if ∂2x ,yΨ (x , y) ≥ 0 -supermodularity.

Of for costs c (x , y) = −Ψ (x , y) if ∂2x ,y c (x , y) ≤ 0 (submodularity).Corollary: Suppose c (x , y) = |x − y | then X = F−1µ (U) andY = F−1v (U) thus

Dc(Fµ,Fv

)=∫ 1

0

∣∣∣F−1µ (u)− F−1v (u)∣∣∣ du.

Similar identities are common for Wasserstein distances...

Blanchet (Columbia U. and Stanford U.) 20 / 60

Page 58: Optimal Transport Methods in Operations Research and

Interesting Insight on Salary Effects

In equilibrium, by the envelope theorem

β̇∗(y) =

ddysupx[Ψ (x , y)− λ∗ (x)] =

∂yΨ (xy , y) = xy .

We also know that y = − log (1− x), or x = 1− exp (−y)

β∗ (y) = y + exp (−y)− 1+ β∗ (0) .

α∗ (x) + β∗ (− log (1− x)) = xy .

What if Ψ (x , y)→ Ψ (x , y) + f (x)? (i.e. productivity grows).Answer: salaries grows if f (·) is increasing.

Blanchet (Columbia U. and Stanford U.) 21 / 60

Page 59: Optimal Transport Methods in Operations Research and

Interesting Insight on Salary Effects

In equilibrium, by the envelope theorem

β̇∗(y) =

ddysupx[Ψ (x , y)− λ∗ (x)] =

∂yΨ (xy , y) = xy .

We also know that y = − log (1− x), or x = 1− exp (−y)

β∗ (y) = y + exp (−y)− 1+ β∗ (0) .

α∗ (x) + β∗ (− log (1− x)) = xy .

What if Ψ (x , y)→ Ψ (x , y) + f (x)? (i.e. productivity grows).Answer: salaries grows if f (·) is increasing.

Blanchet (Columbia U. and Stanford U.) 21 / 60

Page 60: Optimal Transport Methods in Operations Research and

Interesting Insight on Salary Effects

In equilibrium, by the envelope theorem

β̇∗(y) =

ddysupx[Ψ (x , y)− λ∗ (x)] =

∂yΨ (xy , y) = xy .

We also know that y = − log (1− x), or x = 1− exp (−y)

β∗ (y) = y + exp (−y)− 1+ β∗ (0) .

α∗ (x) + β∗ (− log (1− x)) = xy .

What if Ψ (x , y)→ Ψ (x , y) + f (x)? (i.e. productivity grows).

Answer: salaries grows if f (·) is increasing.

Blanchet (Columbia U. and Stanford U.) 21 / 60

Page 61: Optimal Transport Methods in Operations Research and

Interesting Insight on Salary Effects

In equilibrium, by the envelope theorem

β̇∗(y) =

ddysupx[Ψ (x , y)− λ∗ (x)] =

∂yΨ (xy , y) = xy .

We also know that y = − log (1− x), or x = 1− exp (−y)

β∗ (y) = y + exp (−y)− 1+ β∗ (0) .

α∗ (x) + β∗ (− log (1− x)) = xy .

What if Ψ (x , y)→ Ψ (x , y) + f (x)? (i.e. productivity grows).Answer: salaries grows if f (·) is increasing.

Blanchet (Columbia U. and Stanford U.) 21 / 60

Page 62: Optimal Transport Methods in Operations Research and

Applications of Optimal Transport in Stochastic OR

Application of Optimal Transport in Stochastic ORBlanchet and Murthy (2016)

https://arxiv.org/abs/1604.01446.

Insight: Diffusion approximations and optimal transport

Blanchet (Columbia U. and Stanford U.) 22 / 60

Page 63: Optimal Transport Methods in Operations Research and

A Distributionally Robust Performance Analysis

In Stochastic OR we are often interested in evaluating

EPtrue (f (X ))

for a complex model Ptrue

Moreover, we wish to control / optimize it

minθEPtrue (h (X , θ)) .

Model Ptrue might be unknown or too diffi cult to work with.

So, we introduce a proxy P0 which provides a good trade-off betweentractability and model fidelity (e.g. Brownian motion for heavy-traffi capproximations).

Blanchet (Columbia U. and Stanford U.) 23 / 60

Page 64: Optimal Transport Methods in Operations Research and

A Distributionally Robust Performance Analysis

In Stochastic OR we are often interested in evaluating

EPtrue (f (X ))

for a complex model PtrueMoreover, we wish to control / optimize it

minθEPtrue (h (X , θ)) .

Model Ptrue might be unknown or too diffi cult to work with.

So, we introduce a proxy P0 which provides a good trade-off betweentractability and model fidelity (e.g. Brownian motion for heavy-traffi capproximations).

Blanchet (Columbia U. and Stanford U.) 23 / 60

Page 65: Optimal Transport Methods in Operations Research and

A Distributionally Robust Performance Analysis

In Stochastic OR we are often interested in evaluating

EPtrue (f (X ))

for a complex model PtrueMoreover, we wish to control / optimize it

minθEPtrue (h (X , θ)) .

Model Ptrue might be unknown or too diffi cult to work with.

So, we introduce a proxy P0 which provides a good trade-off betweentractability and model fidelity (e.g. Brownian motion for heavy-traffi capproximations).

Blanchet (Columbia U. and Stanford U.) 23 / 60

Page 66: Optimal Transport Methods in Operations Research and

A Distributionally Robust Performance Analysis

In Stochastic OR we are often interested in evaluating

EPtrue (f (X ))

for a complex model PtrueMoreover, we wish to control / optimize it

minθEPtrue (h (X , θ)) .

Model Ptrue might be unknown or too diffi cult to work with.

So, we introduce a proxy P0 which provides a good trade-off betweentractability and model fidelity (e.g. Brownian motion for heavy-traffi capproximations).

Blanchet (Columbia U. and Stanford U.) 23 / 60

Page 67: Optimal Transport Methods in Operations Research and

A Distributionally Robust Performance Analysis

For f (·) upper semicontinuous with EP0 |f (X )| < ∞

supEP (f (Y ))

Dc (P,P0) ≤ δ ,

X takes values on a Polish space and c (·) is lower semi-continuous.

Also an infinite dimensional linear program

sup∫X×Y

f (y)π (dx , dy)

s.t.∫X×Y

c (x , y)π (dx , dy) ≤ δ∫Y

π (dx , dy) = P0 (dx) .

Blanchet (Columbia U. and Stanford U.) 24 / 60

Page 68: Optimal Transport Methods in Operations Research and

A Distributionally Robust Performance Analysis

For f (·) upper semicontinuous with EP0 |f (X )| < ∞

supEP (f (Y ))

Dc (P,P0) ≤ δ ,

X takes values on a Polish space and c (·) is lower semi-continuous.Also an infinite dimensional linear program

sup∫X×Y

f (y)π (dx , dy)

s.t.∫X×Y

c (x , y)π (dx , dy) ≤ δ∫Y

π (dx , dy) = P0 (dx) .

Blanchet (Columbia U. and Stanford U.) 24 / 60

Page 69: Optimal Transport Methods in Operations Research and

A Distributionally Robust Performance Analysis

Formal duality:

Dual = infλ≥0,α

{λδ+

∫α (x)P0 (dx)

}λc (x , y) + α (x) ≥ f (y) .

B. & Murthy (2016) - No duality gap:

Dual = infλ≥0

[λδ+ E0

(supy{f (y)− λc (X , y)}

)].

We refer to this as RoPA Duality in this talk.

Let us consider the important case f (y) = I (y ∈ A) & c (x , x) = 0.

Blanchet (Columbia U. and Stanford U.) 25 / 60

Page 70: Optimal Transport Methods in Operations Research and

A Distributionally Robust Performance Analysis

Formal duality:

Dual = infλ≥0,α

{λδ+

∫α (x)P0 (dx)

}λc (x , y) + α (x) ≥ f (y) .

B. & Murthy (2016) - No duality gap:

Dual = infλ≥0

[λδ+ E0

(supy{f (y)− λc (X , y)}

)].

We refer to this as RoPA Duality in this talk.

Let us consider the important case f (y) = I (y ∈ A) & c (x , x) = 0.

Blanchet (Columbia U. and Stanford U.) 25 / 60

Page 71: Optimal Transport Methods in Operations Research and

A Distributionally Robust Performance Analysis

Formal duality:

Dual = infλ≥0,α

{λδ+

∫α (x)P0 (dx)

}λc (x , y) + α (x) ≥ f (y) .

B. & Murthy (2016) - No duality gap:

Dual = infλ≥0

[λδ+ E0

(supy{f (y)− λc (X , y)}

)].

We refer to this as RoPA Duality in this talk.

Let us consider the important case f (y) = I (y ∈ A) & c (x , x) = 0.

Blanchet (Columbia U. and Stanford U.) 25 / 60

Page 72: Optimal Transport Methods in Operations Research and

A Distributionally Robust Performance Analysis

Formal duality:

Dual = infλ≥0,α

{λδ+

∫α (x)P0 (dx)

}λc (x , y) + α (x) ≥ f (y) .

B. & Murthy (2016) - No duality gap:

Dual = infλ≥0

[λδ+ E0

(supy{f (y)− λc (X , y)}

)].

We refer to this as RoPA Duality in this talk.

Let us consider the important case f (y) = I (y ∈ A) & c (x , x) = 0.

Blanchet (Columbia U. and Stanford U.) 25 / 60

Page 73: Optimal Transport Methods in Operations Research and

A Distributionally Robust Performance Analysis

So, if f (y) = I (y ∈ A) and cA (X ) = inf{y ∈ A : c (x , y)}, then

Dual = infλ≥0

[λδ+ E0 (1− λcA (X ))

+]= P0 (cA (X ) ≤ 1/λ∗) .

If cA (X ) is continuous under P0 & E0 (cA (X )) ≥ δ, then

δ = E0 [cA (X ) I (cA (X ) ≤ 1/λ∗)] .

Blanchet (Columbia U. and Stanford U.) 26 / 60

Page 74: Optimal Transport Methods in Operations Research and

A Distributionally Robust Performance Analysis

So, if f (y) = I (y ∈ A) and cA (X ) = inf{y ∈ A : c (x , y)}, then

Dual = infλ≥0

[λδ+ E0 (1− λcA (X ))

+]= P0 (cA (X ) ≤ 1/λ∗) .

If cA (X ) is continuous under P0 & E0 (cA (X )) ≥ δ, then

δ = E0 [cA (X ) I (cA (X ) ≤ 1/λ∗)] .

Blanchet (Columbia U. and Stanford U.) 26 / 60

Page 75: Optimal Transport Methods in Operations Research and

Example: Model Uncertainty in Bankruptcy Calculations

R (t) = the reserve (perhaps multiple lines) at time t.

Bankruptcy probability (in finite time horizon T )

uT = Ptrue (R (t) ∈ B for some t ∈ [0,T ]) .

B is a set which models bankruptcy.

Problem: Model (Ptrue) may be complex, intractable or simplyunknown...

Blanchet (Columbia U. and Stanford U.) 27 / 60

Page 76: Optimal Transport Methods in Operations Research and

Example: Model Uncertainty in Bankruptcy Calculations

R (t) = the reserve (perhaps multiple lines) at time t.

Bankruptcy probability (in finite time horizon T )

uT = Ptrue (R (t) ∈ B for some t ∈ [0,T ]) .

B is a set which models bankruptcy.

Problem: Model (Ptrue) may be complex, intractable or simplyunknown...

Blanchet (Columbia U. and Stanford U.) 27 / 60

Page 77: Optimal Transport Methods in Operations Research and

Example: Model Uncertainty in Bankruptcy Calculations

R (t) = the reserve (perhaps multiple lines) at time t.

Bankruptcy probability (in finite time horizon T )

uT = Ptrue (R (t) ∈ B for some t ∈ [0,T ]) .

B is a set which models bankruptcy.

Problem: Model (Ptrue) may be complex, intractable or simplyunknown...

Blanchet (Columbia U. and Stanford U.) 27 / 60

Page 78: Optimal Transport Methods in Operations Research and

Example: Model Uncertainty in Bankruptcy Calculations

R (t) = the reserve (perhaps multiple lines) at time t.

Bankruptcy probability (in finite time horizon T )

uT = Ptrue (R (t) ∈ B for some t ∈ [0,T ]) .

B is a set which models bankruptcy.

Problem: Model (Ptrue) may be complex, intractable or simplyunknown...

Blanchet (Columbia U. and Stanford U.) 27 / 60

Page 79: Optimal Transport Methods in Operations Research and

A Distributionally Robust Risk Analysis Formulation

Our solution: Estimate uT by solving

supDc (P0,P )≤δ

Ptrue (R (t) ∈ B for some t ∈ [0,T ]) ,

where P0 is a suitable model.

P0 = proxy for Ptrue .

P0 right trade-off between fidelity and tractability.

δ is the distributional uncertainty size.

Dc (·) is the distributional uncertainty region.

Blanchet (Columbia U. and Stanford U.) 28 / 60

Page 80: Optimal Transport Methods in Operations Research and

A Distributionally Robust Risk Analysis Formulation

Our solution: Estimate uT by solving

supDc (P0,P )≤δ

Ptrue (R (t) ∈ B for some t ∈ [0,T ]) ,

where P0 is a suitable model.

P0 = proxy for Ptrue .

P0 right trade-off between fidelity and tractability.

δ is the distributional uncertainty size.

Dc (·) is the distributional uncertainty region.

Blanchet (Columbia U. and Stanford U.) 28 / 60

Page 81: Optimal Transport Methods in Operations Research and

A Distributionally Robust Risk Analysis Formulation

Our solution: Estimate uT by solving

supDc (P0,P )≤δ

Ptrue (R (t) ∈ B for some t ∈ [0,T ]) ,

where P0 is a suitable model.

P0 = proxy for Ptrue .

P0 right trade-off between fidelity and tractability.

δ is the distributional uncertainty size.

Dc (·) is the distributional uncertainty region.

Blanchet (Columbia U. and Stanford U.) 28 / 60

Page 82: Optimal Transport Methods in Operations Research and

A Distributionally Robust Risk Analysis Formulation

Our solution: Estimate uT by solving

supDc (P0,P )≤δ

Ptrue (R (t) ∈ B for some t ∈ [0,T ]) ,

where P0 is a suitable model.

P0 = proxy for Ptrue .

P0 right trade-off between fidelity and tractability.

δ is the distributional uncertainty size.

Dc (·) is the distributional uncertainty region.

Blanchet (Columbia U. and Stanford U.) 28 / 60

Page 83: Optimal Transport Methods in Operations Research and

A Distributionally Robust Risk Analysis Formulation

Our solution: Estimate uT by solving

supDc (P0,P )≤δ

Ptrue (R (t) ∈ B for some t ∈ [0,T ]) ,

where P0 is a suitable model.

P0 = proxy for Ptrue .

P0 right trade-off between fidelity and tractability.

δ is the distributional uncertainty size.

Dc (·) is the distributional uncertainty region.

Blanchet (Columbia U. and Stanford U.) 28 / 60

Page 84: Optimal Transport Methods in Operations Research and

Desirable Elements of Distributionally Robust Formulation

Would like Dc (·) to have wide flexibility (even non-parametric).

Want optimization to be tractable.

Want to preserve advantages of using P0.

Want a way to estimate δ.

Blanchet (Columbia U. and Stanford U.) 29 / 60

Page 85: Optimal Transport Methods in Operations Research and

Desirable Elements of Distributionally Robust Formulation

Would like Dc (·) to have wide flexibility (even non-parametric).Want optimization to be tractable.

Want to preserve advantages of using P0.

Want a way to estimate δ.

Blanchet (Columbia U. and Stanford U.) 29 / 60

Page 86: Optimal Transport Methods in Operations Research and

Desirable Elements of Distributionally Robust Formulation

Would like Dc (·) to have wide flexibility (even non-parametric).Want optimization to be tractable.

Want to preserve advantages of using P0.

Want a way to estimate δ.

Blanchet (Columbia U. and Stanford U.) 29 / 60

Page 87: Optimal Transport Methods in Operations Research and

Desirable Elements of Distributionally Robust Formulation

Would like Dc (·) to have wide flexibility (even non-parametric).Want optimization to be tractable.

Want to preserve advantages of using P0.

Want a way to estimate δ.

Blanchet (Columbia U. and Stanford U.) 29 / 60

Page 88: Optimal Transport Methods in Operations Research and

Connections to Distributionally Robust Optimization

Standard choices based on divergence (such as Kullback-Leibler) -Hansen & Sargent (2016)

D (v ||µ) = Ev(log(dvdµ

)).

Robust Optimization: Ben-Tal, El Ghaoui, Nemirovski (2009).

Big problem: Absolute continuity may typically be violated...Think of using Brownian motion as a proxy model for R (t)...

Optimal transport is a natural option!

Blanchet (Columbia U. and Stanford U.) 30 / 60

Page 89: Optimal Transport Methods in Operations Research and

Connections to Distributionally Robust Optimization

Standard choices based on divergence (such as Kullback-Leibler) -Hansen & Sargent (2016)

D (v ||µ) = Ev(log(dvdµ

)).

Robust Optimization: Ben-Tal, El Ghaoui, Nemirovski (2009).

Big problem: Absolute continuity may typically be violated...Think of using Brownian motion as a proxy model for R (t)...

Optimal transport is a natural option!

Blanchet (Columbia U. and Stanford U.) 30 / 60

Page 90: Optimal Transport Methods in Operations Research and

Connections to Distributionally Robust Optimization

Standard choices based on divergence (such as Kullback-Leibler) -Hansen & Sargent (2016)

D (v ||µ) = Ev(log(dvdµ

)).

Robust Optimization: Ben-Tal, El Ghaoui, Nemirovski (2009).

Big problem: Absolute continuity may typically be violated...

Think of using Brownian motion as a proxy model for R (t)...

Optimal transport is a natural option!

Blanchet (Columbia U. and Stanford U.) 30 / 60

Page 91: Optimal Transport Methods in Operations Research and

Connections to Distributionally Robust Optimization

Standard choices based on divergence (such as Kullback-Leibler) -Hansen & Sargent (2016)

D (v ||µ) = Ev(log(dvdµ

)).

Robust Optimization: Ben-Tal, El Ghaoui, Nemirovski (2009).

Big problem: Absolute continuity may typically be violated...Think of using Brownian motion as a proxy model for R (t)...

Optimal transport is a natural option!

Blanchet (Columbia U. and Stanford U.) 30 / 60

Page 92: Optimal Transport Methods in Operations Research and

Connections to Distributionally Robust Optimization

Standard choices based on divergence (such as Kullback-Leibler) -Hansen & Sargent (2016)

D (v ||µ) = Ev(log(dvdµ

)).

Robust Optimization: Ben-Tal, El Ghaoui, Nemirovski (2009).

Big problem: Absolute continuity may typically be violated...Think of using Brownian motion as a proxy model for R (t)...

Optimal transport is a natural option!

Blanchet (Columbia U. and Stanford U.) 30 / 60

Page 93: Optimal Transport Methods in Operations Research and

Application 1: Back to Classical Risk Problem

Suppose that

c (x , y) = dJ (x (·) , y (·)) = Skorokhod J1 metric.= inf

φ(·) bijection{ supt∈[0,1]

|x (t)− y (φ (t))| , supt∈[0,1]

|φ (t)− t|}.

If R (t) = b− Z (t), then ruin during time interval [0, 1] is

Bb = {R (·) : 0 ≥ inft∈[0,1]

R (t)} = {Z (·) : b ≤ supt∈[0,1]

Z (t)}.

Let P0 (·) be the Wiener measure want to compute

supDc (P0,P )≤δ

P (Z ∈ Bb) .

Blanchet (Columbia U. and Stanford U.) 31 / 60

Page 94: Optimal Transport Methods in Operations Research and

Application 1: Back to Classical Risk Problem

Suppose that

c (x , y) = dJ (x (·) , y (·)) = Skorokhod J1 metric.= inf

φ(·) bijection{ supt∈[0,1]

|x (t)− y (φ (t))| , supt∈[0,1]

|φ (t)− t|}.

If R (t) = b− Z (t), then ruin during time interval [0, 1] is

Bb = {R (·) : 0 ≥ inft∈[0,1]

R (t)} = {Z (·) : b ≤ supt∈[0,1]

Z (t)}.

Let P0 (·) be the Wiener measure want to compute

supDc (P0,P )≤δ

P (Z ∈ Bb) .

Blanchet (Columbia U. and Stanford U.) 31 / 60

Page 95: Optimal Transport Methods in Operations Research and

Application 1: Back to Classical Risk Problem

Suppose that

c (x , y) = dJ (x (·) , y (·)) = Skorokhod J1 metric.= inf

φ(·) bijection{ supt∈[0,1]

|x (t)− y (φ (t))| , supt∈[0,1]

|φ (t)− t|}.

If R (t) = b− Z (t), then ruin during time interval [0, 1] is

Bb = {R (·) : 0 ≥ inft∈[0,1]

R (t)} = {Z (·) : b ≤ supt∈[0,1]

Z (t)}.

Let P0 (·) be the Wiener measure want to compute

supDc (P0,P )≤δ

P (Z ∈ Bb) .

Blanchet (Columbia U. and Stanford U.) 31 / 60

Page 96: Optimal Transport Methods in Operations Research and

Application 1: Computing Distance to Bankruptcy

So: {cBb (Z ) ≤ 1/λ∗} = {supt∈[0,1] Z (t) ≥ b− 1/λ∗}, and

supDc (P0,P )≤δ

P (Z ∈ Bb) = P0

(supt∈[0,1]

Z (t) ≥ b− 1/λ∗).

Blanchet (Columbia U. and Stanford U.) 32 / 60

Page 97: Optimal Transport Methods in Operations Research and

Application 1: Computing Uncertainty Size

Note any coupling π so that πX = P0 and πY = P satisfies

Dc (P0,P) ≤ Eπ [c (X ,Y )] ≈ δ.

So use any coupling between evidence and P0 or expert knowledge.

We discuss choosing δ non-parametrically momentarily.

Blanchet (Columbia U. and Stanford U.) 33 / 60

Page 98: Optimal Transport Methods in Operations Research and

Application 1: Computing Uncertainty Size

Note any coupling π so that πX = P0 and πY = P satisfies

Dc (P0,P) ≤ Eπ [c (X ,Y )] ≈ δ.

So use any coupling between evidence and P0 or expert knowledge.

We discuss choosing δ non-parametrically momentarily.

Blanchet (Columbia U. and Stanford U.) 33 / 60

Page 99: Optimal Transport Methods in Operations Research and

Application 1: Computing Uncertainty Size

Note any coupling π so that πX = P0 and πY = P satisfies

Dc (P0,P) ≤ Eπ [c (X ,Y )] ≈ δ.

So use any coupling between evidence and P0 or expert knowledge.

We discuss choosing δ non-parametrically momentarily.

Blanchet (Columbia U. and Stanford U.) 33 / 60

Page 100: Optimal Transport Methods in Operations Research and

Application 1: Illustration of Coupling

Given arrivals and claim sizes let Z (t) = m−1/22 ∑

N (t)k=1 (Xk −m1)

Blanchet (Columbia U. and Stanford U.) 34 / 60

Page 101: Optimal Transport Methods in Operations Research and

Application 1: Coupling in Action

Blanchet (Columbia U. and Stanford U.) 35 / 60

Page 102: Optimal Transport Methods in Operations Research and

Application 1: Numerical Example

Assume Poisson arrivals.

Pareto claim sizes with index 2.2 —(P (V > t) = 1/(1+ t)2.2).Cost c (x , y) = dJ (x , y)

2 <—note power of 2.

Used Algorithm 1 to calibrate (estimating means and variances fromdata).

b P0(Ruin)Ptrue (Ruin)

P ∗robust (Ruin)Ptrue (Ruin)

100 1.07× 10−1 12.28150 2.52× 10−4 10.65200 5.35× 10−8 10.80250 1.15× 10−12 10.98

Blanchet (Columbia U. and Stanford U.) 36 / 60

Page 103: Optimal Transport Methods in Operations Research and

Additional Applications: Multidimensional Ruin Problems

https://arxiv.org/abs/1604.01446 contains more applications.

Control: minθ supP :D (P ,P0)≤δ E [L (θ,Z )] <—robust optimalreinsurance.

Multidimensional risk processes (explicit evaluation of cB (x) for dJmetric).Key insight: Geometry of target set often remains largely thesame!

Blanchet (Columbia U. and Stanford U.) 37 / 60

Page 104: Optimal Transport Methods in Operations Research and

Additional Applications: Multidimensional Ruin Problems

https://arxiv.org/abs/1604.01446 contains more applications.Control: minθ supP :D (P ,P0)≤δ E [L (θ,Z )] <—robust optimalreinsurance.

Multidimensional risk processes (explicit evaluation of cB (x) for dJmetric).Key insight: Geometry of target set often remains largely thesame!

Blanchet (Columbia U. and Stanford U.) 37 / 60

Page 105: Optimal Transport Methods in Operations Research and

Additional Applications: Multidimensional Ruin Problems

https://arxiv.org/abs/1604.01446 contains more applications.Control: minθ supP :D (P ,P0)≤δ E [L (θ,Z )] <—robust optimalreinsurance.

Multidimensional risk processes (explicit evaluation of cB (x) for dJmetric).

Key insight: Geometry of target set often remains largely thesame!

Blanchet (Columbia U. and Stanford U.) 37 / 60

Page 106: Optimal Transport Methods in Operations Research and

Additional Applications: Multidimensional Ruin Problems

https://arxiv.org/abs/1604.01446 contains more applications.Control: minθ supP :D (P ,P0)≤δ E [L (θ,Z )] <—robust optimalreinsurance.

Multidimensional risk processes (explicit evaluation of cB (x) for dJmetric).Key insight: Geometry of target set often remains largely thesame!Blanchet (Columbia U. and Stanford U.) 37 / 60

Page 107: Optimal Transport Methods in Operations Research and

Connections to Distributionally Robust Optimization

Based on:Robust Wasserstein Profile Inference (B., Murthy & Kang ’16)

https://arxiv.org/abs/1610.05627

Highlight: Additional insights into why optimal transport...

Blanchet (Columbia U. and Stanford U.) 38 / 60

Page 108: Optimal Transport Methods in Operations Research and

Distributionally Robust Optimization in Machine Learning

Consider estimating β∗ ∈ Rm in linear regression

Yi = βXi + ei ,

where {(Yi ,Xi )}ni=1 are data points.

Optimal Least Squares approach consists in estimating β∗ via

minβEPn

[(Y − βTX

)2]= min

β

1n

n

∑i=1

(Yi − βTXi

)2=

Apply the distributionally robust estimator based on optimaltransport.

Blanchet (Columbia U. and Stanford U.) 39 / 60

Page 109: Optimal Transport Methods in Operations Research and

Distributionally Robust Optimization in Machine Learning

Consider estimating β∗ ∈ Rm in linear regression

Yi = βXi + ei ,

where {(Yi ,Xi )}ni=1 are data points.Optimal Least Squares approach consists in estimating β∗ via

minβEPn

[(Y − βTX

)2]= min

β

1n

n

∑i=1

(Yi − βTXi

)2=

Apply the distributionally robust estimator based on optimaltransport.

Blanchet (Columbia U. and Stanford U.) 39 / 60

Page 110: Optimal Transport Methods in Operations Research and

Distributionally Robust Optimization in Machine Learning

Consider estimating β∗ ∈ Rm in linear regression

Yi = βXi + ei ,

where {(Yi ,Xi )}ni=1 are data points.Optimal Least Squares approach consists in estimating β∗ via

minβEPn

[(Y − βTX

)2]= min

β

1n

n

∑i=1

(Yi − βTXi

)2=

Apply the distributionally robust estimator based on optimaltransport.

Blanchet (Columbia U. and Stanford U.) 39 / 60

Page 111: Optimal Transport Methods in Operations Research and

Connection to Sqrt-Lasso

Theorem (B., Kang, Murthy (2016)) Suppose that

c((x , y) ,

(x ′, y ′

))=

{‖x − x ′‖2q if y = y ′

∞ if y 6= y ′ .

Then, if 1/p + 1/q = 1

maxP :Dc (P ,Pn)≤δ

E 1/2P

((Y − βTX

)2)= E 1/2

Pn

[(Y − βTX

)2]+√

δ ‖β‖p .

Remark 1: This is sqrt-Lasso (Belloni et al. (2011)).Remark 2: Uses RoPA duality theorem & "judicious choice of c (·) ”

Blanchet (Columbia U. and Stanford U.) 40 / 60

Page 112: Optimal Transport Methods in Operations Research and

Connection to Regularized Logistic Regression

Theorem (B., Kang, Murthy (2016)) Suppose that

c((x , y) ,

(x ′, y ′

))=

{‖x − x ′‖q if y = y ′

∞ if y 6= y ′ .

Then,

supP : Dc (P ,Pn)≤δ

EP[log(1+ e−Y βTX )

]= EPn

[log(1+ e−Y βTX )

]+ δ ‖β‖p .

Remark 1: Approximate connection studied in Esfahani and Kuhn (2015).

Blanchet (Columbia U. and Stanford U.) 41 / 60

Page 113: Optimal Transport Methods in Operations Research and

Unification and Extensions of Regularized Estimators

Distributionally Robust Optimization using Optimal Transportrecovers many other estimators...

Support Vector Machines: B., Kang, Murthy (2016) -https://arxiv.org/abs/1610.05627Group Lasso: B., & Kang (2016):https://arxiv.org/abs/1705.04241Generalized adaptive ridge: B., Kang, Murthy, Zhang (2017):https://arxiv.org/abs/1705.07152Semisupervised learning: B., and Kang (2016):https://arxiv.org/abs/1702.08848

Blanchet (Columbia U. and Stanford U.) 42 / 60

Page 114: Optimal Transport Methods in Operations Research and

Unification and Extensions of Regularized Estimators

Distributionally Robust Optimization using Optimal Transportrecovers many other estimators...

Support Vector Machines: B., Kang, Murthy (2016) -https://arxiv.org/abs/1610.05627

Group Lasso: B., & Kang (2016):https://arxiv.org/abs/1705.04241Generalized adaptive ridge: B., Kang, Murthy, Zhang (2017):https://arxiv.org/abs/1705.07152Semisupervised learning: B., and Kang (2016):https://arxiv.org/abs/1702.08848

Blanchet (Columbia U. and Stanford U.) 42 / 60

Page 115: Optimal Transport Methods in Operations Research and

Unification and Extensions of Regularized Estimators

Distributionally Robust Optimization using Optimal Transportrecovers many other estimators...

Support Vector Machines: B., Kang, Murthy (2016) -https://arxiv.org/abs/1610.05627Group Lasso: B., & Kang (2016):https://arxiv.org/abs/1705.04241

Generalized adaptive ridge: B., Kang, Murthy, Zhang (2017):https://arxiv.org/abs/1705.07152Semisupervised learning: B., and Kang (2016):https://arxiv.org/abs/1702.08848

Blanchet (Columbia U. and Stanford U.) 42 / 60

Page 116: Optimal Transport Methods in Operations Research and

Unification and Extensions of Regularized Estimators

Distributionally Robust Optimization using Optimal Transportrecovers many other estimators...

Support Vector Machines: B., Kang, Murthy (2016) -https://arxiv.org/abs/1610.05627Group Lasso: B., & Kang (2016):https://arxiv.org/abs/1705.04241Generalized adaptive ridge: B., Kang, Murthy, Zhang (2017):https://arxiv.org/abs/1705.07152

Semisupervised learning: B., and Kang (2016):https://arxiv.org/abs/1702.08848

Blanchet (Columbia U. and Stanford U.) 42 / 60

Page 117: Optimal Transport Methods in Operations Research and

Unification and Extensions of Regularized Estimators

Distributionally Robust Optimization using Optimal Transportrecovers many other estimators...

Support Vector Machines: B., Kang, Murthy (2016) -https://arxiv.org/abs/1610.05627Group Lasso: B., & Kang (2016):https://arxiv.org/abs/1705.04241Generalized adaptive ridge: B., Kang, Murthy, Zhang (2017):https://arxiv.org/abs/1705.07152Semisupervised learning: B., and Kang (2016):https://arxiv.org/abs/1702.08848

Blanchet (Columbia U. and Stanford U.) 42 / 60

Page 118: Optimal Transport Methods in Operations Research and

How Regularization and Dual Norms Arise?

Let us work out a simple example...

Recall RoPA Duality: Pick c ((x , y) , (x ′, y ′)) = ‖(x , y)− (x ′, y ′)‖2q

maxP :Dc (P ,Pn)≤δ

EP(((X ,Y ) · (β, 1))2

)= min

λ≥0

{λδ+ EPn sup

(x ′,y ′)

[((x ′, y ′

)· (β, 1)

)2 − λ∥∥(X ,Y )− (x ′, y ′)∥∥2q]

}.

Let’s focus on the inside EPn ...

Blanchet (Columbia U. and Stanford U.) 43 / 60

Page 119: Optimal Transport Methods in Operations Research and

How Regularization and Dual Norms Arise?

Let us work out a simple example...

Recall RoPA Duality: Pick c ((x , y) , (x ′, y ′)) = ‖(x , y)− (x ′, y ′)‖2q

maxP :Dc (P ,Pn)≤δ

EP(((X ,Y ) · (β, 1))2

)= min

λ≥0

{λδ+ EPn sup

(x ′,y ′)

[((x ′, y ′

)· (β, 1)

)2 − λ∥∥(X ,Y )− (x ′, y ′)∥∥2q]

}.

Let’s focus on the inside EPn ...

Blanchet (Columbia U. and Stanford U.) 43 / 60

Page 120: Optimal Transport Methods in Operations Research and

How Regularization and Dual Norms Arise?

Let us work out a simple example...

Recall RoPA Duality: Pick c ((x , y) , (x ′, y ′)) = ‖(x , y)− (x ′, y ′)‖2q

maxP :Dc (P ,Pn)≤δ

EP(((X ,Y ) · (β, 1))2

)= min

λ≥0

{λδ+ EPn sup

(x ′,y ′)

[((x ′, y ′

)· (β, 1)

)2 − λ∥∥(X ,Y )− (x ′, y ′)∥∥2q]

}.

Let’s focus on the inside EPn ...

Blanchet (Columbia U. and Stanford U.) 43 / 60

Page 121: Optimal Transport Methods in Operations Research and

How Regularization and Dual Norms Arise?

Let ∆ = (X ,Y )− (x ′, y ′)

sup(x ′,y ′)

[((x ′, y ′

)· (β, 1)

)2 − λ∥∥(X ,Y )− (x ′, y ′)∥∥2q]

= sup∆

[((X ,Y ) · (β, 1)− ∆ · (β, 1))2 − λ ‖∆‖2q

]= sup

‖∆‖q

[(|(X ,Y ) · (β, 1)|+ ‖∆‖q ‖(β, 1)‖p)2 − λ ‖∆‖2q

]

Last equality uses z → z2 is symmetric around origin and|a · b| ≤ ‖a‖p ‖b‖q .Note problem is now one-dimensional (easily computable).

Blanchet (Columbia U. and Stanford U.) 44 / 60

Page 122: Optimal Transport Methods in Operations Research and

How Regularization and Dual Norms Arise?

Let ∆ = (X ,Y )− (x ′, y ′)

sup(x ′,y ′)

[((x ′, y ′

)· (β, 1)

)2 − λ∥∥(X ,Y )− (x ′, y ′)∥∥2q]

= sup∆

[((X ,Y ) · (β, 1)− ∆ · (β, 1))2 − λ ‖∆‖2q

]= sup

‖∆‖q

[(|(X ,Y ) · (β, 1)|+ ‖∆‖q ‖(β, 1)‖p)2 − λ ‖∆‖2q

]Last equality uses z → z2 is symmetric around origin and|a · b| ≤ ‖a‖p ‖b‖q .

Note problem is now one-dimensional (easily computable).

Blanchet (Columbia U. and Stanford U.) 44 / 60

Page 123: Optimal Transport Methods in Operations Research and

How Regularization and Dual Norms Arise?

Let ∆ = (X ,Y )− (x ′, y ′)

sup(x ′,y ′)

[((x ′, y ′

)· (β, 1)

)2 − λ∥∥(X ,Y )− (x ′, y ′)∥∥2q]

= sup∆

[((X ,Y ) · (β, 1)− ∆ · (β, 1))2 − λ ‖∆‖2q

]= sup

‖∆‖q

[(|(X ,Y ) · (β, 1)|+ ‖∆‖q ‖(β, 1)‖p)2 − λ ‖∆‖2q

]Last equality uses z → z2 is symmetric around origin and|a · b| ≤ ‖a‖p ‖b‖q .Note problem is now one-dimensional (easily computable).

Blanchet (Columbia U. and Stanford U.) 44 / 60

Page 124: Optimal Transport Methods in Operations Research and

On Role of Transport Cost...

https://arxiv.org/abs/1705.07152: Data-driven chose of c (·).

Suppose that ‖x − x ′‖2A = (x − x ′)A (x − x) with A positive definite(Mahalanobis distance).

Then,

maxP :Dc (P ,Pn)≤δ

E 1/2P

((Y − βTX

)2)= min

βE 1/2Pn

[(Y − βTX

)2]+√

δ ‖β‖A−1 .

Intuition: Think of A diagonal, encoding inverse variability of Xi s...

High variability – > cheap transportation – > high impact inrisk estimation.

Blanchet (Columbia U. and Stanford U.) 45 / 60

Page 125: Optimal Transport Methods in Operations Research and

On Role of Transport Cost...

https://arxiv.org/abs/1705.07152: Data-driven chose of c (·).Suppose that ‖x − x ′‖2A = (x − x ′)A (x − x) with A positive definite(Mahalanobis distance).

Then,

maxP :Dc (P ,Pn)≤δ

E 1/2P

((Y − βTX

)2)= min

βE 1/2Pn

[(Y − βTX

)2]+√

δ ‖β‖A−1 .

Intuition: Think of A diagonal, encoding inverse variability of Xi s...

High variability – > cheap transportation – > high impact inrisk estimation.

Blanchet (Columbia U. and Stanford U.) 45 / 60

Page 126: Optimal Transport Methods in Operations Research and

On Role of Transport Cost...

https://arxiv.org/abs/1705.07152: Data-driven chose of c (·).Suppose that ‖x − x ′‖2A = (x − x ′)A (x − x) with A positive definite(Mahalanobis distance).

Then,

maxP :Dc (P ,Pn)≤δ

E 1/2P

((Y − βTX

)2)= min

βE 1/2Pn

[(Y − βTX

)2]+√

δ ‖β‖A−1 .

Intuition: Think of A diagonal, encoding inverse variability of Xi s...

High variability – > cheap transportation – > high impact inrisk estimation.

Blanchet (Columbia U. and Stanford U.) 45 / 60

Page 127: Optimal Transport Methods in Operations Research and

On Role of Transport Cost...

https://arxiv.org/abs/1705.07152: Data-driven chose of c (·).Suppose that ‖x − x ′‖2A = (x − x ′)A (x − x) with A positive definite(Mahalanobis distance).

Then,

maxP :Dc (P ,Pn)≤δ

E 1/2P

((Y − βTX

)2)= min

βE 1/2Pn

[(Y − βTX

)2]+√

δ ‖β‖A−1 .

Intuition: Think of A diagonal, encoding inverse variability of Xi s...

High variability – > cheap transportation – > high impact inrisk estimation.

Blanchet (Columbia U. and Stanford U.) 45 / 60

Page 128: Optimal Transport Methods in Operations Research and

On Role of Transport Cost...

https://arxiv.org/abs/1705.07152: Data-driven chose of c (·).Suppose that ‖x − x ′‖2A = (x − x ′)A (x − x) with A positive definite(Mahalanobis distance).

Then,

maxP :Dc (P ,Pn)≤δ

E 1/2P

((Y − βTX

)2)= min

βE 1/2Pn

[(Y − βTX

)2]+√

δ ‖β‖A−1 .

Intuition: Think of A diagonal, encoding inverse variability of Xi s...

High variability – > cheap transportation – > high impact inrisk estimation.

Blanchet (Columbia U. and Stanford U.) 45 / 60

Page 129: Optimal Transport Methods in Operations Research and

On Role of Transport Cost...

https://arxiv.org/abs/1705.07152: Data-driven chose of c (·).

Suppose that ‖x − x ′‖2Λ = (x − x ′)Λ (x − x) with Λ positivedefinite (Mahalanobis distance).

Then,

maxP :Dc (P ,Pn)≤δ

E 1/2P

((Y − βTX

)2)= min

βE 1/2Pn

[(Y − βTX

)2]+√

δ ‖β‖Λ−1 .

Intuition: Think of Λ diagonal, encoding inverse variability of Xi s...

High variability – > cheap transportation – > high impact inrisk estimation.

Blanchet (Columbia U. and Stanford U.) 46 / 60

Page 130: Optimal Transport Methods in Operations Research and

On Role of Transport Cost...

https://arxiv.org/abs/1705.07152: Data-driven chose of c (·).Suppose that ‖x − x ′‖2Λ = (x − x ′)Λ (x − x) with Λ positivedefinite (Mahalanobis distance).

Then,

maxP :Dc (P ,Pn)≤δ

E 1/2P

((Y − βTX

)2)= min

βE 1/2Pn

[(Y − βTX

)2]+√

δ ‖β‖Λ−1 .

Intuition: Think of Λ diagonal, encoding inverse variability of Xi s...

High variability – > cheap transportation – > high impact inrisk estimation.

Blanchet (Columbia U. and Stanford U.) 46 / 60

Page 131: Optimal Transport Methods in Operations Research and

On Role of Transport Cost...

https://arxiv.org/abs/1705.07152: Data-driven chose of c (·).Suppose that ‖x − x ′‖2Λ = (x − x ′)Λ (x − x) with Λ positivedefinite (Mahalanobis distance).

Then,

maxP :Dc (P ,Pn)≤δ

E 1/2P

((Y − βTX

)2)= min

βE 1/2Pn

[(Y − βTX

)2]+√

δ ‖β‖Λ−1 .

Intuition: Think of Λ diagonal, encoding inverse variability of Xi s...

High variability – > cheap transportation – > high impact inrisk estimation.

Blanchet (Columbia U. and Stanford U.) 46 / 60

Page 132: Optimal Transport Methods in Operations Research and

On Role of Transport Cost...

https://arxiv.org/abs/1705.07152: Data-driven chose of c (·).Suppose that ‖x − x ′‖2Λ = (x − x ′)Λ (x − x) with Λ positivedefinite (Mahalanobis distance).

Then,

maxP :Dc (P ,Pn)≤δ

E 1/2P

((Y − βTX

)2)= min

βE 1/2Pn

[(Y − βTX

)2]+√

δ ‖β‖Λ−1 .

Intuition: Think of Λ diagonal, encoding inverse variability of Xi s...

High variability – > cheap transportation – > high impact inrisk estimation.

Blanchet (Columbia U. and Stanford U.) 46 / 60

Page 133: Optimal Transport Methods in Operations Research and

On Role of Transport Cost...

https://arxiv.org/abs/1705.07152: Data-driven chose of c (·).Suppose that ‖x − x ′‖2Λ = (x − x ′)Λ (x − x) with Λ positivedefinite (Mahalanobis distance).

Then,

maxP :Dc (P ,Pn)≤δ

E 1/2P

((Y − βTX

)2)= min

βE 1/2Pn

[(Y − βTX

)2]+√

δ ‖β‖Λ−1 .

Intuition: Think of Λ diagonal, encoding inverse variability of Xi s...

High variability – > cheap transportation – > high impact inrisk estimation.

Blanchet (Columbia U. and Stanford U.) 46 / 60

Page 134: Optimal Transport Methods in Operations Research and

On Role of Transport Cost...

Comparing L1 regularization vs data-driven cost regularization:real data

BC BN QSAR Magic3*LRL1 Train .185± .123 .080± .030 .614± .038 .548± .087

Test .428± .338 .340± .228 .755± .019 .610± .050Accur .929± .023 .930± .042 .646± .036 .665± .045

3*DRO-NL Train .032± .015 .113± .035 .339± .044 .381± .084Test .119± .044 .194± .067 .554± .032 .576± .049Accur .955± .016 .931± .036 .736± .027 .730± .043

Num Predictors 30 4 30 10Train Size 40 20 80 30Test Size 329 752 475 9990

Table: Numerical results for real data sets.

Blanchet (Columbia U. and Stanford U.) 47 / 60

Page 135: Optimal Transport Methods in Operations Research and

Connections to Statistical Analysis

Based on:Robust Wasserstein Profile Inference (B., Murthy & Kang ’16)

https://arxiv.org/abs/1610.05627

Highlight: How to choose size of uncertainty?

Blanchet (Columbia U. and Stanford U.) 48 / 60

Page 136: Optimal Transport Methods in Operations Research and

Towards an Optimal Choice of Uncertainty Size

How to choose uncertainty size in a data-driven way?

Once again, consider Lasso as example:

minβ

maxP :Dc (P ,Pn)≤δ

EP

((Y − βTX

)2)= min

β

{E 1/2Pn

[(Y − βTX

)2]+√

δ ‖β‖p}2.

Use left hand side to define a statistical principle to choose δ.

Important: Optimizing δ is equivalent to optimizing regularization!

Blanchet (Columbia U. and Stanford U.) 49 / 60

Page 137: Optimal Transport Methods in Operations Research and

Towards an Optimal Choice of Uncertainty Size

How to choose uncertainty size in a data-driven way?

Once again, consider Lasso as example:

minβ

maxP :Dc (P ,Pn)≤δ

EP

((Y − βTX

)2)= min

β

{E 1/2Pn

[(Y − βTX

)2]+√

δ ‖β‖p}2.

Use left hand side to define a statistical principle to choose δ.

Important: Optimizing δ is equivalent to optimizing regularization!

Blanchet (Columbia U. and Stanford U.) 49 / 60

Page 138: Optimal Transport Methods in Operations Research and

Towards an Optimal Choice of Uncertainty Size

How to choose uncertainty size in a data-driven way?

Once again, consider Lasso as example:

minβ

maxP :Dc (P ,Pn)≤δ

EP

((Y − βTX

)2)= min

β

{E 1/2Pn

[(Y − βTX

)2]+√

δ ‖β‖p}2.

Use left hand side to define a statistical principle to choose δ.

Important: Optimizing δ is equivalent to optimizing regularization!

Blanchet (Columbia U. and Stanford U.) 49 / 60

Page 139: Optimal Transport Methods in Operations Research and

Towards an Optimal Choice of Uncertainty Size

How to choose uncertainty size in a data-driven way?

Once again, consider Lasso as example:

minβ

maxP :Dc (P ,Pn)≤δ

EP

((Y − βTX

)2)= min

β

{E 1/2Pn

[(Y − βTX

)2]+√

δ ‖β‖p}2.

Use left hand side to define a statistical principle to choose δ.

Important: Optimizing δ is equivalent to optimizing regularization!

Blanchet (Columbia U. and Stanford U.) 49 / 60

Page 140: Optimal Transport Methods in Operations Research and

Towards an Optimal Choice of Uncertainty Size

“Standard”way to pick δ (Esfahani and Kuhn (2015)).

Estimate D (Ptrue ,Pn) using concentration of measure results.

Not a good idea: rate of convergence of the form O(1/n1/d ) (d is

the data dimension).

Instead we seek an optimal approach.

Blanchet (Columbia U. and Stanford U.) 50 / 60

Page 141: Optimal Transport Methods in Operations Research and

Towards an Optimal Choice of Uncertainty Size

“Standard”way to pick δ (Esfahani and Kuhn (2015)).

Estimate D (Ptrue ,Pn) using concentration of measure results.

Not a good idea: rate of convergence of the form O(1/n1/d ) (d is

the data dimension).

Instead we seek an optimal approach.

Blanchet (Columbia U. and Stanford U.) 50 / 60

Page 142: Optimal Transport Methods in Operations Research and

Towards an Optimal Choice of Uncertainty Size

“Standard”way to pick δ (Esfahani and Kuhn (2015)).

Estimate D (Ptrue ,Pn) using concentration of measure results.

Not a good idea: rate of convergence of the form O(1/n1/d ) (d is

the data dimension).

Instead we seek an optimal approach.

Blanchet (Columbia U. and Stanford U.) 50 / 60

Page 143: Optimal Transport Methods in Operations Research and

Towards an Optimal Choice of Uncertainty Size

“Standard”way to pick δ (Esfahani and Kuhn (2015)).

Estimate D (Ptrue ,Pn) using concentration of measure results.

Not a good idea: rate of convergence of the form O(1/n1/d ) (d is

the data dimension).

Instead we seek an optimal approach.

Blanchet (Columbia U. and Stanford U.) 50 / 60

Page 144: Optimal Transport Methods in Operations Research and

Towards an Optimal Choice of Uncertainty Size

Keep in mind linear regression problem

Yi = βT∗ Xi + εi .

The plausible model variations of Pn are given by the set

Uδ (n) = {P : Dc (P,Pn) ≤ δ}.

Given P ∈ Uδ (n), define β̄ (P) = argminEP(Y − βTX

).

It is natural to say that

Λδ (n) = {β̄ (P) : P ∈ Uδ (n)}

are plausible estimates of β∗.

Blanchet (Columbia U. and Stanford U.) 51 / 60

Page 145: Optimal Transport Methods in Operations Research and

Towards an Optimal Choice of Uncertainty Size

Keep in mind linear regression problem

Yi = βT∗ Xi + εi .

The plausible model variations of Pn are given by the set

Uδ (n) = {P : Dc (P,Pn) ≤ δ}.

Given P ∈ Uδ (n), define β̄ (P) = argminEP(Y − βTX

).

It is natural to say that

Λδ (n) = {β̄ (P) : P ∈ Uδ (n)}

are plausible estimates of β∗.

Blanchet (Columbia U. and Stanford U.) 51 / 60

Page 146: Optimal Transport Methods in Operations Research and

Towards an Optimal Choice of Uncertainty Size

Keep in mind linear regression problem

Yi = βT∗ Xi + εi .

The plausible model variations of Pn are given by the set

Uδ (n) = {P : Dc (P,Pn) ≤ δ}.

Given P ∈ Uδ (n), define β̄ (P) = argminEP(Y − βTX

).

It is natural to say that

Λδ (n) = {β̄ (P) : P ∈ Uδ (n)}

are plausible estimates of β∗.

Blanchet (Columbia U. and Stanford U.) 51 / 60

Page 147: Optimal Transport Methods in Operations Research and

Towards an Optimal Choice of Uncertainty Size

Keep in mind linear regression problem

Yi = βT∗ Xi + εi .

The plausible model variations of Pn are given by the set

Uδ (n) = {P : Dc (P,Pn) ≤ δ}.

Given P ∈ Uδ (n), define β̄ (P) = argminEP(Y − βTX

).

It is natural to say that

Λδ (n) = {β̄ (P) : P ∈ Uδ (n)}

are plausible estimates of β∗.

Blanchet (Columbia U. and Stanford U.) 51 / 60

Page 148: Optimal Transport Methods in Operations Research and

Optimal Choice of Uncertainty Size

Given a confidence level 1− α we advocate choosing δ via

min δ

s.t. P (β∗ ∈ Λδ (n)) ≥ 1− α .

Equivalently: Find smallest confidence region Λδ (n) at level 1− α.

In simple words: Find the smallest δ so that β∗ is plausible withconfidence level 1− α.

Blanchet (Columbia U. and Stanford U.) 52 / 60

Page 149: Optimal Transport Methods in Operations Research and

Optimal Choice of Uncertainty Size

Given a confidence level 1− α we advocate choosing δ via

min δ

s.t. P (β∗ ∈ Λδ (n)) ≥ 1− α .

Equivalently: Find smallest confidence region Λδ (n) at level 1− α.

In simple words: Find the smallest δ so that β∗ is plausible withconfidence level 1− α.

Blanchet (Columbia U. and Stanford U.) 52 / 60

Page 150: Optimal Transport Methods in Operations Research and

Optimal Choice of Uncertainty Size

Given a confidence level 1− α we advocate choosing δ via

min δ

s.t. P (β∗ ∈ Λδ (n)) ≥ 1− α .

Equivalently: Find smallest confidence region Λδ (n) at level 1− α.

In simple words: Find the smallest δ so that β∗ is plausible withconfidence level 1− α.

Blanchet (Columbia U. and Stanford U.) 52 / 60

Page 151: Optimal Transport Methods in Operations Research and

The Robust Wasserstein Profile Function

The value β̄ (P) is characterized by

EP

(∇β

(Y − βTX

)2)= 2EP

((Y − βTX

)X)= 0.

Define the Robust Wasserstein Profile (RWP) Function:

Rn (β) = min{Dc (P,Pn) : EP((Y − βTX

)X)= 0}.

Note that

Rn (β∗) ≤ δ⇐⇒ β∗ ∈ Λδ (n) = {β̄ (P) : D (P,Pn) ≤ δ}.

So δ is 1− α quantile of Rn (β∗)!

Blanchet (Columbia U. and Stanford U.) 53 / 60

Page 152: Optimal Transport Methods in Operations Research and

The Robust Wasserstein Profile Function

The value β̄ (P) is characterized by

EP

(∇β

(Y − βTX

)2)= 2EP

((Y − βTX

)X)= 0.

Define the Robust Wasserstein Profile (RWP) Function:

Rn (β) = min{Dc (P,Pn) : EP((Y − βTX

)X)= 0}.

Note that

Rn (β∗) ≤ δ⇐⇒ β∗ ∈ Λδ (n) = {β̄ (P) : D (P,Pn) ≤ δ}.

So δ is 1− α quantile of Rn (β∗)!

Blanchet (Columbia U. and Stanford U.) 53 / 60

Page 153: Optimal Transport Methods in Operations Research and

The Robust Wasserstein Profile Function

The value β̄ (P) is characterized by

EP

(∇β

(Y − βTX

)2)= 2EP

((Y − βTX

)X)= 0.

Define the Robust Wasserstein Profile (RWP) Function:

Rn (β) = min{Dc (P,Pn) : EP((Y − βTX

)X)= 0}.

Note that

Rn (β∗) ≤ δ⇐⇒ β∗ ∈ Λδ (n) = {β̄ (P) : D (P,Pn) ≤ δ}.

So δ is 1− α quantile of Rn (β∗)!

Blanchet (Columbia U. and Stanford U.) 53 / 60

Page 154: Optimal Transport Methods in Operations Research and

The Robust Wasserstein Profile Function

The value β̄ (P) is characterized by

EP

(∇β

(Y − βTX

)2)= 2EP

((Y − βTX

)X)= 0.

Define the Robust Wasserstein Profile (RWP) Function:

Rn (β) = min{Dc (P,Pn) : EP((Y − βTX

)X)= 0}.

Note that

Rn (β∗) ≤ δ⇐⇒ β∗ ∈ Λδ (n) = {β̄ (P) : D (P,Pn) ≤ δ}.

So δ is 1− α quantile of Rn (β∗)!

Blanchet (Columbia U. and Stanford U.) 53 / 60

Page 155: Optimal Transport Methods in Operations Research and

The Robust Wasserstein Profile Function

Blanchet (Columbia U. and Stanford U.) 54 / 60

Page 156: Optimal Transport Methods in Operations Research and

Computing Optimal Regularization Parameter

Theorem (B., Murthy, Kang (2016)) Suppose that {(Yi ,Xi )}ni=1 is ani.i.d. sample with finite variance, with

c((x , y) ,

(x ′, y ′

))=

{‖x − x ′‖2q if y = y ′

∞ if y 6= y ′ ,

thennRn(β∗)⇒ L1,

where L1 is explicitly and

L1D≤ L2 :=

E [e2]

E [e2]− (E |e|)2‖N(0,Cov (X ))‖2q .

Remark: We recover same order of regularization (but L1 gives theoptimal constant!)

Blanchet (Columbia U. and Stanford U.) 55 / 60

Page 157: Optimal Transport Methods in Operations Research and

Discussion on Optimal Uncertainty Size

Optimal δ is of order O (1/n) as opposed to O(1/n1/d ) as

advocated in the standard approach.

We characterize the asymptotic constant (not only order) in optimalregularization:

P(L1 ≤ η1−α

)= 1− α.

Rn (β∗) is inspired by Empirical Likelihood —Owen (1988).

Lam & Zhou (2015) use Empirical Likelihood in DRO, but focus ondivergence.

Blanchet (Columbia U. and Stanford U.) 56 / 60

Page 158: Optimal Transport Methods in Operations Research and

Discussion on Optimal Uncertainty Size

Optimal δ is of order O (1/n) as opposed to O(1/n1/d ) as

advocated in the standard approach.

We characterize the asymptotic constant (not only order) in optimalregularization:

P(L1 ≤ η1−α

)= 1− α.

Rn (β∗) is inspired by Empirical Likelihood —Owen (1988).

Lam & Zhou (2015) use Empirical Likelihood in DRO, but focus ondivergence.

Blanchet (Columbia U. and Stanford U.) 56 / 60

Page 159: Optimal Transport Methods in Operations Research and

Discussion on Optimal Uncertainty Size

Optimal δ is of order O (1/n) as opposed to O(1/n1/d ) as

advocated in the standard approach.

We characterize the asymptotic constant (not only order) in optimalregularization:

P(L1 ≤ η1−α

)= 1− α.

Rn (β∗) is inspired by Empirical Likelihood —Owen (1988).

Lam & Zhou (2015) use Empirical Likelihood in DRO, but focus ondivergence.

Blanchet (Columbia U. and Stanford U.) 56 / 60

Page 160: Optimal Transport Methods in Operations Research and

Discussion on Optimal Uncertainty Size

Optimal δ is of order O (1/n) as opposed to O(1/n1/d ) as

advocated in the standard approach.

We characterize the asymptotic constant (not only order) in optimalregularization:

P(L1 ≤ η1−α

)= 1− α.

Rn (β∗) is inspired by Empirical Likelihood —Owen (1988).

Lam & Zhou (2015) use Empirical Likelihood in DRO, but focus ondivergence.

Blanchet (Columbia U. and Stanford U.) 56 / 60

Page 161: Optimal Transport Methods in Operations Research and

A Toy Example Illustrating Proof Techniques

Considermin

βmax

P :Dc (P ,Pn)≤δE[(Y − β)2

]with c (y , y ′) = (y − y ′)ρ and define

Rn (β) = minπ(dy ,du)≥0

∫(y − u)ρ π (dy , du) :∫

u∈Rπ (dy , du) =

1n

δ{Yi } (dy) ∀i ,

2∫ ∫

(u − β)π (dy , du) = 0.

Blanchet (Columbia U. and Stanford U.) 57 / 60

Page 162: Optimal Transport Methods in Operations Research and

A Toy Example Illustrating Proof Techniques

Dual linear programming problem: Plug in β = β∗

Rn (β∗) = supλ∈R

{−1n

n

∑i=1supu∈R

{λ(u − β∗)− |Yi − u|

ρ }}

= supλ∈R

{−λn ∑n

i=1(Yi − β∗)− 1n ∑n

i=1 supu∈R

{λ (u − Yi )− |Yi − u|ρ

} }= sup

λ

{−λ

n

n

∑i=1(Yi − β∗)− (ρ− 1)

∣∣∣∣λρ∣∣∣∣

ρρ−1}

=

∣∣∣∣∣1n n

∑i=1(Yi − β∗)

∣∣∣∣∣ρ

=1n1/2

∣∣N (0, σ2)∣∣ρ .

Blanchet (Columbia U. and Stanford U.) 58 / 60

Page 163: Optimal Transport Methods in Operations Research and

Discussion: Some Open Problems

Extensions: Optimal Transport with constrains, Optimal MartingaleTransport.

Computational methods: Typical approach is entropic regularization(new methods currently developed in the machine learning literature).

Blanchet (Columbia U. and Stanford U.) 59 / 60

Page 164: Optimal Transport Methods in Operations Research and

Discussion: Some Open Problems

Extensions: Optimal Transport with constrains, Optimal MartingaleTransport.

Computational methods: Typical approach is entropic regularization(new methods currently developed in the machine learning literature).

Blanchet (Columbia U. and Stanford U.) 59 / 60

Page 165: Optimal Transport Methods in Operations Research and

Conclusions

Optimal transport (OT) is a powerful tool based on linearprogramming.

OT costs are natural for computing model uncertainty.

OT can be used in path-space to quantify error in diffusionapproximations.

OT can be used for data-driven distributionally robust optimization.

Cost function in OT can be used to improve out-of-sampleperformance.

OT can be used for statistical inference using RWP function.

Blanchet (Columbia U. and Stanford U.) 60 / 60

Page 166: Optimal Transport Methods in Operations Research and

Conclusions

Optimal transport (OT) is a powerful tool based on linearprogramming.

OT costs are natural for computing model uncertainty.

OT can be used in path-space to quantify error in diffusionapproximations.

OT can be used for data-driven distributionally robust optimization.

Cost function in OT can be used to improve out-of-sampleperformance.

OT can be used for statistical inference using RWP function.

Blanchet (Columbia U. and Stanford U.) 60 / 60

Page 167: Optimal Transport Methods in Operations Research and

Conclusions

Optimal transport (OT) is a powerful tool based on linearprogramming.

OT costs are natural for computing model uncertainty.

OT can be used in path-space to quantify error in diffusionapproximations.

OT can be used for data-driven distributionally robust optimization.

Cost function in OT can be used to improve out-of-sampleperformance.

OT can be used for statistical inference using RWP function.

Blanchet (Columbia U. and Stanford U.) 60 / 60

Page 168: Optimal Transport Methods in Operations Research and

Conclusions

Optimal transport (OT) is a powerful tool based on linearprogramming.

OT costs are natural for computing model uncertainty.

OT can be used in path-space to quantify error in diffusionapproximations.

OT can be used for data-driven distributionally robust optimization.

Cost function in OT can be used to improve out-of-sampleperformance.

OT can be used for statistical inference using RWP function.

Blanchet (Columbia U. and Stanford U.) 60 / 60

Page 169: Optimal Transport Methods in Operations Research and

Conclusions

Optimal transport (OT) is a powerful tool based on linearprogramming.

OT costs are natural for computing model uncertainty.

OT can be used in path-space to quantify error in diffusionapproximations.

OT can be used for data-driven distributionally robust optimization.

Cost function in OT can be used to improve out-of-sampleperformance.

OT can be used for statistical inference using RWP function.

Blanchet (Columbia U. and Stanford U.) 60 / 60

Page 170: Optimal Transport Methods in Operations Research and

Conclusions

Optimal transport (OT) is a powerful tool based on linearprogramming.

OT costs are natural for computing model uncertainty.

OT can be used in path-space to quantify error in diffusionapproximations.

OT can be used for data-driven distributionally robust optimization.

Cost function in OT can be used to improve out-of-sampleperformance.

OT can be used for statistical inference using RWP function.

Blanchet (Columbia U. and Stanford U.) 60 / 60