applying mix de-anonymisation techniques for good · 34 how data can help us improve our services...

12
Applying mix de-anonymisation techniques for good Steven Murdoch University College London

Upload: others

Post on 07-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Applying mix de-anonymisation techniques for good · 34 How data can help us improve our services Review of the TfL WiFi pilot 35 Analysing station congestion at this level of detail

Applying mix de-anonymisation

techniques for good

Steven MurdochUniversity College London

Page 2: Applying mix de-anonymisation techniques for good · 34 How data can help us improve our services Review of the TfL WiFi pilot 35 Analysing station congestion at this level of detail

TfL monitor wifi MAC addresses to track mobility

Page 3: Applying mix de-anonymisation techniques for good · 34 How data can help us improve our services Review of the TfL WiFi pilot 35 Analysing station congestion at this level of detail

TfL monitor wifi MAC addresses to track mobility

Page 4: Applying mix de-anonymisation techniques for good · 34 How data can help us improve our services Review of the TfL WiFi pilot 35 Analysing station congestion at this level of detail

Ticketing data doesn’t explain movements in stations

Northern line (southbound)

Victoria line (northbound)

Northern line (northbound)

3-5

minutes

Victoria line (southbound)

1-3

minutes

Figure 9: How customers move from the Northern line (Bank branch) northbound platform to the Victoria line southbound platform at Euston station

of customers use the shortest route at Euston station

68%

When choosing a route through a station, customers have different requirements and preferences. Detailed information about walk times and crowding can help people make the best decision for them. For example, someone with specific accessibility requirements, or passengers travelling with buggies or luggage, would benefit from knowing the least crowded path from one platform to another.

How does disruption affect my journey?Another potential benefit is the responsiveness of the data to real-time events. Currently, a great deal of our understanding of crowding is based on surveys that only show a snapshot of the network when it is operating without disruption. We then make assumptions about likely customer responses, based on their typical behaviour.

Customers choose different routes through stations, depending on their needs

Review of the TfL WiFi pilot 3534 How data can help us improve our services

Analysing station congestion at this level of detail under real circumstances can help us investigate the causes of crowding, understand the effect conditions at one station have on other locations, and identify the knock-on impact of our operational decisions. It means we can improve the way we handle disruptions in the future and minimise the impact on our customers.

Transport planningMuch of the information collected and analysed during our pilot can also help our transport planners improve our network.

Our transport planning teams face a number of challenges, from upgrading the rail network to support a growing London, to looking at what our city will require in the longer term. To do this, they need to understand how customers currently travel and how they are likely to do so in the future.

Using depersonalised ticketing data is a good way for us to understand demand, but it only provides an insight into gate-to-gate movements. For some areas, where stations are served by only one line, this is sufficient. However, much of our network is complex – with multiple interchange points and routing options, and many paths within stations. In these cases, our ticketing data cannot provide the detailed routing patterns required when planning our train services and station capacity.

To understand passengers’ route choices, we have relied on customer surveys at a select number of stations each year. These ask people about their journey including when and where they started, which services they used, the lines they travelled on, and where they transferred between services. The results are then modelled to provide a detailed picture of demand. However, because surveys take time to run and process, we can only use certain stations.

This insight helps us to understand which stations and lines require capacity enhancements. We then run our transport models to predict how the network will operate with the proposed improvements in place. We can also predict the impact any works will have on other stations, which informs customer communications and operational plans. We use this modelling information to make changes to timetables so we can provide additional capacity and a more reliable service.

As part of our pilot, we tested whether WiFi connection data could provide our planners with timely information that would help them understand demand today, and shape planning and modelling in the future.

Waterloo

King’sCross

1.4%Others

Waterloo

32%

King’sCross

OxfordCircus

Green Park

Waterloo

King’sCross

26.7%

Waterloo

King’sCross

LeicesterSquare 14.9%

Waterloo

King’sCross

WarrenStreet 6.2% 3.7%

Waterloo

King’sCross

BakerStreet

2.6%

Waterloo

King’sCross

GreenPark

2%

Waterloo

King’sCross

LondonBridge

King’sCross

Waterloo

Euston

2.4%

1.8%

Waterloo

King’sCross

Bank

Waterloo

King’sCross

Euston

1.7%

1.6%

Waterloo

King’sCross

PiccadillyCircus

1.2%

Waterloo

King’sCross

Green Park

OxfordCircus

BakerStreet

1.2%

Waterloo

King’sCross

HolbornBank

Waterloo

King’sCross

BakerStreet

0.2%

0.2%

Victoria

Westminster

Waterloo

King’sCross

0.1%

Waterloo

King’sCross

OxfordCircus

TottenhamCourt Road

0.1%

Waterloo

King’sCross

Bank

LiverpoolStreet

LondonBridge

Figure 14: Route options between King's Cross St. Pancras and Waterloo, and the proportion of devices on each one

Review of the TfL WiFi pilot 4140 How data can help us improve our services

Page 5: Applying mix de-anonymisation techniques for good · 34 How data can help us improve our services Review of the TfL WiFi pilot 35 Analysing station congestion at this level of detail

Ticketing data doesn’t explain movements in stations

Northern line (southbound)

Victoria line (northbound)

Northern line (northbound)

3-5

minutes

Victoria line (southbound)

1-3

minutes

Figure 9: How customers move from the Northern line (Bank branch) northbound platform to the Victoria line southbound platform at Euston station

of customers use the shortest route at Euston station

68%

When choosing a route through a station, customers have different requirements and preferences. Detailed information about walk times and crowding can help people make the best decision for them. For example, someone with specific accessibility requirements, or passengers travelling with buggies or luggage, would benefit from knowing the least crowded path from one platform to another.

How does disruption affect my journey?Another potential benefit is the responsiveness of the data to real-time events. Currently, a great deal of our understanding of crowding is based on surveys that only show a snapshot of the network when it is operating without disruption. We then make assumptions about likely customer responses, based on their typical behaviour.

Customers choose different routes through stations, depending on their needs

Review of the TfL WiFi pilot 3534 How data can help us improve our services

Analysing station congestion at this level of detail under real circumstances can help us investigate the causes of crowding, understand the effect conditions at one station have on other locations, and identify the knock-on impact of our operational decisions. It means we can improve the way we handle disruptions in the future and minimise the impact on our customers.

Transport planningMuch of the information collected and analysed during our pilot can also help our transport planners improve our network.

Our transport planning teams face a number of challenges, from upgrading the rail network to support a growing London, to looking at what our city will require in the longer term. To do this, they need to understand how customers currently travel and how they are likely to do so in the future.

Using depersonalised ticketing data is a good way for us to understand demand, but it only provides an insight into gate-to-gate movements. For some areas, where stations are served by only one line, this is sufficient. However, much of our network is complex – with multiple interchange points and routing options, and many paths within stations. In these cases, our ticketing data cannot provide the detailed routing patterns required when planning our train services and station capacity.

To understand passengers’ route choices, we have relied on customer surveys at a select number of stations each year. These ask people about their journey including when and where they started, which services they used, the lines they travelled on, and where they transferred between services. The results are then modelled to provide a detailed picture of demand. However, because surveys take time to run and process, we can only use certain stations.

This insight helps us to understand which stations and lines require capacity enhancements. We then run our transport models to predict how the network will operate with the proposed improvements in place. We can also predict the impact any works will have on other stations, which informs customer communications and operational plans. We use this modelling information to make changes to timetables so we can provide additional capacity and a more reliable service.

As part of our pilot, we tested whether WiFi connection data could provide our planners with timely information that would help them understand demand today, and shape planning and modelling in the future.

Waterloo

King’sCross

1.4%Others

Waterloo

32%

King’sCross

OxfordCircus

Green Park

Waterloo

King’sCross

26.7%

Waterloo

King’sCross

LeicesterSquare 14.9%

Waterloo

King’sCross

WarrenStreet 6.2% 3.7%

Waterloo

King’sCross

BakerStreet

2.6%

Waterloo

King’sCross

GreenPark

2%

Waterloo

King’sCross

LondonBridge

King’sCross

Waterloo

Euston

2.4%

1.8%

Waterloo

King’sCross

Bank

Waterloo

King’sCross

Euston

1.7%

1.6%

Waterloo

King’sCross

PiccadillyCircus

1.2%

Waterloo

King’sCross

Green Park

OxfordCircus

BakerStreet

1.2%

Waterloo

King’sCross

HolbornBank

Waterloo

King’sCross

BakerStreet

0.2%

0.2%

Victoria

Westminster

Waterloo

King’sCross

0.1%

Waterloo

King’sCross

OxfordCircus

TottenhamCourt Road

0.1%

Waterloo

King’sCross

Bank

LiverpoolStreet

LondonBridge

Figure 14: Route options between King's Cross St. Pancras and Waterloo, and the proportion of devices on each one

Review of the TfL WiFi pilot 4140 How data can help us improve our services

Page 6: Applying mix de-anonymisation techniques for good · 34 How data can help us improve our services Review of the TfL WiFi pilot 35 Analysing station congestion at this level of detail

We can simulate wifi observations in a station based on user profiles

Northern line (southbound)

Victoria line (northbound)

Northern line (northbound)

3-5

minutes

Victoria line (southbound)

1-3

minutes

Figure 9: How customers move from the Northern line (Bank branch) northbound platform to the Victoria line southbound platform at Euston station

of customers use the shortest route at Euston station

68%

When choosing a route through a station, customers have different requirements and preferences. Detailed information about walk times and crowding can help people make the best decision for them. For example, someone with specific accessibility requirements, or passengers travelling with buggies or luggage, would benefit from knowing the least crowded path from one platform to another.

How does disruption affect my journey?Another potential benefit is the responsiveness of the data to real-time events. Currently, a great deal of our understanding of crowding is based on surveys that only show a snapshot of the network when it is operating without disruption. We then make assumptions about likely customer responses, based on their typical behaviour.

Customers choose different routes through stations, depending on their needs

Review of the TfL WiFi pilot 3534 How data can help us improve our services

A

B

A→B (fast)

A→B (slow)

B→A (fast)

B→A (slow)

A→A

B→B

0 0.15 0.3

Page 7: Applying mix de-anonymisation techniques for good · 34 How data can help us improve our services Review of the TfL WiFi pilot 35 Analysing station congestion at this level of detail

Analysis of 64 bit MAC addresses gives good results

• Poisson arrival

• λ = 1 per s

• Normal distribution for walking speed

• σ = 30 s

• 100 simulations

• Each 1 day

Page 8: Applying mix de-anonymisation techniques for good · 34 How data can help us improve our services Review of the TfL WiFi pilot 35 Analysing station congestion at this level of detail

Truncated 16 bit MACs don’t work as well

• Poisson arrival

• λ = 1 per s

• Normal distribution for walking speed

• σ = 30 s

• 100 simulations

• Each 1 day

Page 9: Applying mix de-anonymisation techniques for good · 34 How data can help us improve our services Review of the TfL WiFi pilot 35 Analysing station congestion at this level of detail

User Profiles User Movements

Wifi network andphysical characteristics Anonymisation

Observations

Northern line (southbound)

Victoria line (northbound)

Northern line (northbound)

3-5

minutes

Victoria line (southbound)

1-3

minutes

Figure 9: How customers move from the Northern line (Bank branch) northbound platform to the Victoria line southbound platform at Euston station

of customers use the shortest route at Euston station

68%

When choosing a route through a station, customers have different requirements and preferences. Detailed information about walk times and crowding can help people make the best decision for them. For example, someone with specific accessibility requirements, or passengers travelling with buggies or luggage, would benefit from knowing the least crowded path from one platform to another.

How does disruption affect my journey?Another potential benefit is the responsiveness of the data to real-time events. Currently, a great deal of our understanding of crowding is based on surveys that only show a snapshot of the network when it is operating without disruption. We then make assumptions about likely customer responses, based on their typical behaviour.

Customers choose different routes through stations, depending on their needs

Review of the TfL WiFi pilot 3534 How data can help us improve our services

F2-75-5D-1D-C2-F8

Ψ1 Ψ2 Ψ3 .. Ψn

A → B (124 s)A → AB → A (240 s)

F2-75: 12:42:01pm A4-5F: 12:42:12pm F2-3C: 12:42:18pm 42-45: 12:42:22pm 52-B3: 12:42:24pm 82-C5: 12:42:59pm

Page 10: Applying mix de-anonymisation techniques for good · 34 How data can help us improve our services Review of the TfL WiFi pilot 35 Analysing station congestion at this level of detail

Vida: Bayesian inference to de-anonymize communications 5

Sen1

Senx

SenN

Rec1

Recx

Sender-Receiver Match

(M)

i1

ix

iN

o1

ox

oN

Anonymity System(A)

i1

ix

iN

o1

ox

oN

Assignment(F)

User Profiles(Y)

Rec1

Recx

RecN

Sen1

Senx

SenN

Observation(O)

Y1, ... , Yx , ... , YN

i1

ix

iN

o1

ox

oN

user

msg msg

msg msg

msg msg

RecNmsgmsgmsg

msg

Fig. 1. The generative model used for Bayesian inference in anonymous communica-tions.

We start by proposing a ‘forward’ generative model describing how messagesare generated and sent through the anonymity system. We then use Bayes ruleto ‘invert’ the problem and perform inference on the unknown quantities. Thebroad outline of the generative model is depicted in Figure 1.

An anonymity system is abstracted as containing Nuser users that send Nmsg

messages to each other. Each user is associated with a sending profile x describ-ing how they select their correspondents when sending a message. We assume,in this work, that those profiles are simple multinomial distributions, that aresampled independently when a message is to be sent to determine the receiver.We denote the collection of all sending profiles by = { x|x = 1 . . . Nuser}.

A given sequence of Nmsg senders out of the Nuser users of the system, de-noted by Sen1, . . . ,SenNmsg , send a message while we observe the system. Usingtheir sending profiles a corresponding sequence of receivers Rec1, . . . ,RecNmsg isselected to receive their messages. The probability of any receiver sequence iseasy to compute. We denote this matching between senders and receivers as M:

Pr[M| ] =Y

x2[1,Nmsg]

Pr[Senx ! Recx| x].

In parallel with the matching process where users choose their communicationpartners, an anonymity system A is used. This anonymity system is abstractedas a bipartite graph linking input messages ix with potential output messagesoy, regardless of the identity of their senders and receivers. We note that com-pleteness of the bipartite graph is not required by the model. The edges of thebipartite graph are weighted with wxy that is simply the probability of the inputmessage ix being output as oy: wxy = Pr[ix ! oy|A].

This anonymity system A is used to determine a particular assignment ofmessages according to the weights wxy. A single perfect matching on the bipartite

Page 11: Applying mix de-anonymisation techniques for good · 34 How data can help us improve our services Review of the TfL WiFi pilot 35 Analysing station congestion at this level of detail

Vida: Bayesian inference to de-anonymize communications 5

Sen1

Senx

SenN

Rec1

Recx

Sender-Receiver Match

(M)

i1

ix

iN

o1

ox

oN

Anonymity System(A)

i1

ix

iN

o1

ox

oN

Assignment(F)

User Profiles(Y)

Rec1

Recx

RecN

Sen1

Senx

SenN

Observation(O)

Y1, ... , Yx , ... , YN

i1

ix

iN

o1

ox

oN

user

msg msg

msg msg

msg msg

RecNmsgmsgmsg

msg

Fig. 1. The generative model used for Bayesian inference in anonymous communica-tions.

We start by proposing a ‘forward’ generative model describing how messagesare generated and sent through the anonymity system. We then use Bayes ruleto ‘invert’ the problem and perform inference on the unknown quantities. Thebroad outline of the generative model is depicted in Figure 1.

An anonymity system is abstracted as containing Nuser users that send Nmsg

messages to each other. Each user is associated with a sending profile x describ-ing how they select their correspondents when sending a message. We assume,in this work, that those profiles are simple multinomial distributions, that aresampled independently when a message is to be sent to determine the receiver.We denote the collection of all sending profiles by = { x|x = 1 . . . Nuser}.

A given sequence of Nmsg senders out of the Nuser users of the system, de-noted by Sen1, . . . ,SenNmsg , send a message while we observe the system. Usingtheir sending profiles a corresponding sequence of receivers Rec1, . . . ,RecNmsg isselected to receive their messages. The probability of any receiver sequence iseasy to compute. We denote this matching between senders and receivers as M:

Pr[M| ] =Y

x2[1,Nmsg]

Pr[Senx ! Recx| x].

In parallel with the matching process where users choose their communicationpartners, an anonymity system A is used. This anonymity system is abstractedas a bipartite graph linking input messages ix with potential output messagesoy, regardless of the identity of their senders and receivers. We note that com-pleteness of the bipartite graph is not required by the model. The edges of thebipartite graph are weighted with wxy that is simply the probability of the inputmessage ix being output as oy: wxy = Pr[ix ! oy|A].

This anonymity system A is used to determine a particular assignment ofmessages according to the weights wxy. A single perfect matching on the bipartite

• De-anonymisation techniques can improve user privacy!

• It is possible to infer customer mobility profiles from observations of anonymised MAC addresses

• Model wifi network and MAC address anonymization as a mix network

• Take into account reasonable prior beliefs of mobility patterns

Page 12: Applying mix de-anonymisation techniques for good · 34 How data can help us improve our services Review of the TfL WiFi pilot 35 Analysing station congestion at this level of detail

UCL InfoSec are hiring!

www.benthamsgaze.org

Post-docs, PhD students, …New UK cybersecurity centre for doctoral training (CDT) – 55+ PhD studentships over the next 8 years between Computer Science, Crime Science and Public Policy