applying mix de-anonymisation techniques for good · 34 how data can help us improve our services...
TRANSCRIPT
Applying mix de-anonymisation
techniques for good
Steven MurdochUniversity College London
TfL monitor wifi MAC addresses to track mobility
TfL monitor wifi MAC addresses to track mobility
Ticketing data doesn’t explain movements in stations
Northern line (southbound)
Victoria line (northbound)
Northern line (northbound)
3-5
minutes
Victoria line (southbound)
1-3
minutes
Figure 9: How customers move from the Northern line (Bank branch) northbound platform to the Victoria line southbound platform at Euston station
of customers use the shortest route at Euston station
68%
When choosing a route through a station, customers have different requirements and preferences. Detailed information about walk times and crowding can help people make the best decision for them. For example, someone with specific accessibility requirements, or passengers travelling with buggies or luggage, would benefit from knowing the least crowded path from one platform to another.
How does disruption affect my journey?Another potential benefit is the responsiveness of the data to real-time events. Currently, a great deal of our understanding of crowding is based on surveys that only show a snapshot of the network when it is operating without disruption. We then make assumptions about likely customer responses, based on their typical behaviour.
Customers choose different routes through stations, depending on their needs
Review of the TfL WiFi pilot 3534 How data can help us improve our services
Analysing station congestion at this level of detail under real circumstances can help us investigate the causes of crowding, understand the effect conditions at one station have on other locations, and identify the knock-on impact of our operational decisions. It means we can improve the way we handle disruptions in the future and minimise the impact on our customers.
Transport planningMuch of the information collected and analysed during our pilot can also help our transport planners improve our network.
Our transport planning teams face a number of challenges, from upgrading the rail network to support a growing London, to looking at what our city will require in the longer term. To do this, they need to understand how customers currently travel and how they are likely to do so in the future.
Using depersonalised ticketing data is a good way for us to understand demand, but it only provides an insight into gate-to-gate movements. For some areas, where stations are served by only one line, this is sufficient. However, much of our network is complex – with multiple interchange points and routing options, and many paths within stations. In these cases, our ticketing data cannot provide the detailed routing patterns required when planning our train services and station capacity.
To understand passengers’ route choices, we have relied on customer surveys at a select number of stations each year. These ask people about their journey including when and where they started, which services they used, the lines they travelled on, and where they transferred between services. The results are then modelled to provide a detailed picture of demand. However, because surveys take time to run and process, we can only use certain stations.
This insight helps us to understand which stations and lines require capacity enhancements. We then run our transport models to predict how the network will operate with the proposed improvements in place. We can also predict the impact any works will have on other stations, which informs customer communications and operational plans. We use this modelling information to make changes to timetables so we can provide additional capacity and a more reliable service.
As part of our pilot, we tested whether WiFi connection data could provide our planners with timely information that would help them understand demand today, and shape planning and modelling in the future.
Waterloo
King’sCross
1.4%Others
Waterloo
32%
King’sCross
OxfordCircus
Green Park
Waterloo
King’sCross
26.7%
Waterloo
King’sCross
LeicesterSquare 14.9%
Waterloo
King’sCross
WarrenStreet 6.2% 3.7%
Waterloo
King’sCross
BakerStreet
2.6%
Waterloo
King’sCross
GreenPark
2%
Waterloo
King’sCross
LondonBridge
King’sCross
Waterloo
Euston
2.4%
1.8%
Waterloo
King’sCross
Bank
Waterloo
King’sCross
Euston
1.7%
1.6%
Waterloo
King’sCross
PiccadillyCircus
1.2%
Waterloo
King’sCross
Green Park
OxfordCircus
BakerStreet
1.2%
Waterloo
King’sCross
HolbornBank
Waterloo
King’sCross
BakerStreet
0.2%
0.2%
Victoria
Westminster
Waterloo
King’sCross
0.1%
Waterloo
King’sCross
OxfordCircus
TottenhamCourt Road
0.1%
Waterloo
King’sCross
Bank
LiverpoolStreet
LondonBridge
Figure 14: Route options between King's Cross St. Pancras and Waterloo, and the proportion of devices on each one
Review of the TfL WiFi pilot 4140 How data can help us improve our services
Ticketing data doesn’t explain movements in stations
Northern line (southbound)
Victoria line (northbound)
Northern line (northbound)
3-5
minutes
Victoria line (southbound)
1-3
minutes
Figure 9: How customers move from the Northern line (Bank branch) northbound platform to the Victoria line southbound platform at Euston station
of customers use the shortest route at Euston station
68%
When choosing a route through a station, customers have different requirements and preferences. Detailed information about walk times and crowding can help people make the best decision for them. For example, someone with specific accessibility requirements, or passengers travelling with buggies or luggage, would benefit from knowing the least crowded path from one platform to another.
How does disruption affect my journey?Another potential benefit is the responsiveness of the data to real-time events. Currently, a great deal of our understanding of crowding is based on surveys that only show a snapshot of the network when it is operating without disruption. We then make assumptions about likely customer responses, based on their typical behaviour.
Customers choose different routes through stations, depending on their needs
Review of the TfL WiFi pilot 3534 How data can help us improve our services
Analysing station congestion at this level of detail under real circumstances can help us investigate the causes of crowding, understand the effect conditions at one station have on other locations, and identify the knock-on impact of our operational decisions. It means we can improve the way we handle disruptions in the future and minimise the impact on our customers.
Transport planningMuch of the information collected and analysed during our pilot can also help our transport planners improve our network.
Our transport planning teams face a number of challenges, from upgrading the rail network to support a growing London, to looking at what our city will require in the longer term. To do this, they need to understand how customers currently travel and how they are likely to do so in the future.
Using depersonalised ticketing data is a good way for us to understand demand, but it only provides an insight into gate-to-gate movements. For some areas, where stations are served by only one line, this is sufficient. However, much of our network is complex – with multiple interchange points and routing options, and many paths within stations. In these cases, our ticketing data cannot provide the detailed routing patterns required when planning our train services and station capacity.
To understand passengers’ route choices, we have relied on customer surveys at a select number of stations each year. These ask people about their journey including when and where they started, which services they used, the lines they travelled on, and where they transferred between services. The results are then modelled to provide a detailed picture of demand. However, because surveys take time to run and process, we can only use certain stations.
This insight helps us to understand which stations and lines require capacity enhancements. We then run our transport models to predict how the network will operate with the proposed improvements in place. We can also predict the impact any works will have on other stations, which informs customer communications and operational plans. We use this modelling information to make changes to timetables so we can provide additional capacity and a more reliable service.
As part of our pilot, we tested whether WiFi connection data could provide our planners with timely information that would help them understand demand today, and shape planning and modelling in the future.
Waterloo
King’sCross
1.4%Others
Waterloo
32%
King’sCross
OxfordCircus
Green Park
Waterloo
King’sCross
26.7%
Waterloo
King’sCross
LeicesterSquare 14.9%
Waterloo
King’sCross
WarrenStreet 6.2% 3.7%
Waterloo
King’sCross
BakerStreet
2.6%
Waterloo
King’sCross
GreenPark
2%
Waterloo
King’sCross
LondonBridge
King’sCross
Waterloo
Euston
2.4%
1.8%
Waterloo
King’sCross
Bank
Waterloo
King’sCross
Euston
1.7%
1.6%
Waterloo
King’sCross
PiccadillyCircus
1.2%
Waterloo
King’sCross
Green Park
OxfordCircus
BakerStreet
1.2%
Waterloo
King’sCross
HolbornBank
Waterloo
King’sCross
BakerStreet
0.2%
0.2%
Victoria
Westminster
Waterloo
King’sCross
0.1%
Waterloo
King’sCross
OxfordCircus
TottenhamCourt Road
0.1%
Waterloo
King’sCross
Bank
LiverpoolStreet
LondonBridge
Figure 14: Route options between King's Cross St. Pancras and Waterloo, and the proportion of devices on each one
Review of the TfL WiFi pilot 4140 How data can help us improve our services
We can simulate wifi observations in a station based on user profiles
Northern line (southbound)
Victoria line (northbound)
Northern line (northbound)
3-5
minutes
Victoria line (southbound)
1-3
minutes
Figure 9: How customers move from the Northern line (Bank branch) northbound platform to the Victoria line southbound platform at Euston station
of customers use the shortest route at Euston station
68%
When choosing a route through a station, customers have different requirements and preferences. Detailed information about walk times and crowding can help people make the best decision for them. For example, someone with specific accessibility requirements, or passengers travelling with buggies or luggage, would benefit from knowing the least crowded path from one platform to another.
How does disruption affect my journey?Another potential benefit is the responsiveness of the data to real-time events. Currently, a great deal of our understanding of crowding is based on surveys that only show a snapshot of the network when it is operating without disruption. We then make assumptions about likely customer responses, based on their typical behaviour.
Customers choose different routes through stations, depending on their needs
Review of the TfL WiFi pilot 3534 How data can help us improve our services
A
B
A→B (fast)
A→B (slow)
B→A (fast)
B→A (slow)
A→A
B→B
0 0.15 0.3
Analysis of 64 bit MAC addresses gives good results
• Poisson arrival
• λ = 1 per s
• Normal distribution for walking speed
• σ = 30 s
• 100 simulations
• Each 1 day
Truncated 16 bit MACs don’t work as well
• Poisson arrival
• λ = 1 per s
• Normal distribution for walking speed
• σ = 30 s
• 100 simulations
• Each 1 day
User Profiles User Movements
Wifi network andphysical characteristics Anonymisation
Observations
Northern line (southbound)
Victoria line (northbound)
Northern line (northbound)
3-5
minutes
Victoria line (southbound)
1-3
minutes
Figure 9: How customers move from the Northern line (Bank branch) northbound platform to the Victoria line southbound platform at Euston station
of customers use the shortest route at Euston station
68%
When choosing a route through a station, customers have different requirements and preferences. Detailed information about walk times and crowding can help people make the best decision for them. For example, someone with specific accessibility requirements, or passengers travelling with buggies or luggage, would benefit from knowing the least crowded path from one platform to another.
How does disruption affect my journey?Another potential benefit is the responsiveness of the data to real-time events. Currently, a great deal of our understanding of crowding is based on surveys that only show a snapshot of the network when it is operating without disruption. We then make assumptions about likely customer responses, based on their typical behaviour.
Customers choose different routes through stations, depending on their needs
Review of the TfL WiFi pilot 3534 How data can help us improve our services
F2-75-5D-1D-C2-F8
Ψ1 Ψ2 Ψ3 .. Ψn
A → B (124 s)A → AB → A (240 s)
F2-75: 12:42:01pm A4-5F: 12:42:12pm F2-3C: 12:42:18pm 42-45: 12:42:22pm 52-B3: 12:42:24pm 82-C5: 12:42:59pm
…
Vida: Bayesian inference to de-anonymize communications 5
Sen1
Senx
SenN
Rec1
Recx
Sender-Receiver Match
(M)
i1
ix
iN
o1
ox
oN
Anonymity System(A)
i1
ix
iN
o1
ox
oN
Assignment(F)
User Profiles(Y)
Rec1
Recx
RecN
Sen1
Senx
SenN
Observation(O)
Y1, ... , Yx , ... , YN
i1
ix
iN
o1
ox
oN
user
msg msg
msg msg
msg msg
RecNmsgmsgmsg
msg
Fig. 1. The generative model used for Bayesian inference in anonymous communica-tions.
We start by proposing a ‘forward’ generative model describing how messagesare generated and sent through the anonymity system. We then use Bayes ruleto ‘invert’ the problem and perform inference on the unknown quantities. Thebroad outline of the generative model is depicted in Figure 1.
An anonymity system is abstracted as containing Nuser users that send Nmsg
messages to each other. Each user is associated with a sending profile x describ-ing how they select their correspondents when sending a message. We assume,in this work, that those profiles are simple multinomial distributions, that aresampled independently when a message is to be sent to determine the receiver.We denote the collection of all sending profiles by = { x|x = 1 . . . Nuser}.
A given sequence of Nmsg senders out of the Nuser users of the system, de-noted by Sen1, . . . ,SenNmsg , send a message while we observe the system. Usingtheir sending profiles a corresponding sequence of receivers Rec1, . . . ,RecNmsg isselected to receive their messages. The probability of any receiver sequence iseasy to compute. We denote this matching between senders and receivers as M:
Pr[M| ] =Y
x2[1,Nmsg]
Pr[Senx ! Recx| x].
In parallel with the matching process where users choose their communicationpartners, an anonymity system A is used. This anonymity system is abstractedas a bipartite graph linking input messages ix with potential output messagesoy, regardless of the identity of their senders and receivers. We note that com-pleteness of the bipartite graph is not required by the model. The edges of thebipartite graph are weighted with wxy that is simply the probability of the inputmessage ix being output as oy: wxy = Pr[ix ! oy|A].
This anonymity system A is used to determine a particular assignment ofmessages according to the weights wxy. A single perfect matching on the bipartite
Vida: Bayesian inference to de-anonymize communications 5
Sen1
Senx
SenN
Rec1
Recx
Sender-Receiver Match
(M)
i1
ix
iN
o1
ox
oN
Anonymity System(A)
i1
ix
iN
o1
ox
oN
Assignment(F)
User Profiles(Y)
Rec1
Recx
RecN
Sen1
Senx
SenN
Observation(O)
Y1, ... , Yx , ... , YN
i1
ix
iN
o1
ox
oN
user
msg msg
msg msg
msg msg
RecNmsgmsgmsg
msg
Fig. 1. The generative model used for Bayesian inference in anonymous communica-tions.
We start by proposing a ‘forward’ generative model describing how messagesare generated and sent through the anonymity system. We then use Bayes ruleto ‘invert’ the problem and perform inference on the unknown quantities. Thebroad outline of the generative model is depicted in Figure 1.
An anonymity system is abstracted as containing Nuser users that send Nmsg
messages to each other. Each user is associated with a sending profile x describ-ing how they select their correspondents when sending a message. We assume,in this work, that those profiles are simple multinomial distributions, that aresampled independently when a message is to be sent to determine the receiver.We denote the collection of all sending profiles by = { x|x = 1 . . . Nuser}.
A given sequence of Nmsg senders out of the Nuser users of the system, de-noted by Sen1, . . . ,SenNmsg , send a message while we observe the system. Usingtheir sending profiles a corresponding sequence of receivers Rec1, . . . ,RecNmsg isselected to receive their messages. The probability of any receiver sequence iseasy to compute. We denote this matching between senders and receivers as M:
Pr[M| ] =Y
x2[1,Nmsg]
Pr[Senx ! Recx| x].
In parallel with the matching process where users choose their communicationpartners, an anonymity system A is used. This anonymity system is abstractedas a bipartite graph linking input messages ix with potential output messagesoy, regardless of the identity of their senders and receivers. We note that com-pleteness of the bipartite graph is not required by the model. The edges of thebipartite graph are weighted with wxy that is simply the probability of the inputmessage ix being output as oy: wxy = Pr[ix ! oy|A].
This anonymity system A is used to determine a particular assignment ofmessages according to the weights wxy. A single perfect matching on the bipartite
• De-anonymisation techniques can improve user privacy!
• It is possible to infer customer mobility profiles from observations of anonymised MAC addresses
• Model wifi network and MAC address anonymization as a mix network
• Take into account reasonable prior beliefs of mobility patterns
UCL InfoSec are hiring!
www.benthamsgaze.org
Post-docs, PhD students, …New UK cybersecurity centre for doctoral training (CDT) – 55+ PhD studentships over the next 8 years between Computer Science, Crime Science and Public Policy