research experiences with publicly available anonymized data

Research experiences with publicly available anonymized data.

John McHugh

RedJack, LLC

and

University of North Carolina

Predict Disclosure Control Workshop

February 2010

A Tale of two datasets

• LBL anonymized packet header data
  – Papers describing data
    • The Devil and Packet Trace Anonymization
    • A First Look at Modern Enterprise Traffic
  – I used it to investigate lossy compression of traces
• Wireless packet header traces from Dartmouth, CRAWDAD
  – Poorly documented, badly anonymized
  – Use agreement precludes attacking anonymization
  – I have used it for numerous class projects

Issues

• To what extent does the anonymization interfere with use for research?
  – Conflicts with collection process
    • Collection problems may be orthogonal to anonymization or not.
    • Anonymization may make resolution of collection problems difficult or impossible
    • Inadvertent attacks on anonymization
  – Conflicts with understanding
    • All addresses are not created equal
    • Over-anonymization may invite “harmless” attacks that lead to potentially harmful results

The LBNL data

• Fragmentary observations (2 of 20+ router ports in rotation)
• Spread over 5 days in Oct. 2004 - Jan. 2005
• Anonymization carefully constructed to counter attacks known at the time.
• Scan data treated separately with different anonymization
  – Scan data is atypical - mostly ping / ping response
  – LBNL uses TRW to block SYN scans at its border

The question

• For NetFlow data, scan records often constitute as many as 90% of the total.

• Can we store these differently without compromising the long term utility of the data?
  – Lossless compression - perfect reconstruction
  – Lossy compression - “similar” reconstruction
• From a perspective of a year later, do the precise details of a scan that targets mostly unoccupied addresses matter?
  – I think not.
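One way to picture the lossy option is to collapse all probes from one scanner on one port into a single summary record that keeps the scanner, the port, the target range, the time span, and the targets that answered. The sketch below is illustrative only: the `Flow` and `ScanSummary` types, their field names, and the sample addresses (taken from documentation ranges) are assumptions made for this example, not the record format or method used in the study.

```python
from dataclasses import dataclass, field
from ipaddress import IPv4Address


@dataclass(frozen=True)
class Flow:
    """Hypothetical minimal flow record (field names are assumptions)."""
    src: str
    dst: str
    dport: int
    start: float      # seconds since some epoch
    responded: bool   # True if the target answered the probe


@dataclass
class ScanSummary:
    """Lossy stand-in for all probes of one scan."""
    scanner: str
    dport: int
    first_seen: float
    last_seen: float
    lowest_target: str
    highest_target: str
    probes: int
    responders: list = field(default_factory=list)


def summarize_scan(flows):
    """Collapse one scanner's probes on one port into a single record.

    Per-probe timing and ordering are discarded. Responders are kept
    because they are the targets a later investigation would follow.
    """
    if not flows:
        raise ValueError("no flows to summarize")
    targets = sorted(IPv4Address(f.dst) for f in flows)
    return ScanSummary(
        scanner=flows[0].src,
        dport=flows[0].dport,
        first_seen=min(f.start for f in flows),
        last_seen=max(f.start for f in flows),
        lowest_target=str(targets[0]),
        highest_target=str(targets[-1]),
        probes=len(flows),
        responders=sorted({f.dst for f in flows if f.responded}),
    )


# Made-up example: one scanner sweeps a /24, one target answers.
flows = [
    Flow("198.51.100.7", f"192.0.2.{i}", 80, 1000.0 + i, i == 163)
    for i in range(256)
]
summary = summarize_scan(flows)
```

Here 256 flow records shrink to one summary. What is lost is exactly the "precise time and pacing" of the individual probes; what is kept is enough to say who scanned what, when, and who answered.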

The role of anonymization

• For this question, anonymization is almost, but not quite, irrelevant.
  – We would like to be sure that we don’t throw away data relevant to a successful attack (next slides)
  – The LBNL anonymization precludes this as we cannot follow a responding victim in other activities
  – More importantly, the blocking of most scans at the border prevents most large SYN scan attacks or makes them much more difficult to recognize.

Scan and infection

• Upper figure
  – Scanner targets /24
  – Density of line is volume
  – 168.192.20.163 responds
• Lower figure
  – Scan was 3106 TCP
  – MySQL password guess
  – Victim very active on this port for several weeks
  – Destination is scanner’s IP address
  – Also active on 139 UDP during this period, again with scanner.

Results / Conclusion

• Compression of scans in LBNL data could reduce the volume of the scan records by 90% to 95%
• Some loss of precise time and pacing information
• Loss of serious scans at border limits benefit in archiving internally collected data
• LBNL data not really suitable for study of typical scans due to preemptive measures. Collection needed outside scan filter for this, and even so, the preemptive measures may bias knowledgeable scanners to search elsewhere.

Dartmouth CRAWDAD wireless data

• 18 sniffers spread over campus, Nov 03 - Feb 04
  – Packet headers cut after ports for TCP/UDP, after IP for other protocols
  – Prefix-preserving anonymization of all addresses
    • All IP addresses, platform portion of MAC addresses
  – 160GB+ of gzipped packet header files
  – Converted to “degenerate” SiLK NetFlow
    • 1 flow / pkt
    • Hourly hierarchy year/month/day/<hourly sensor files>
    • Coded MAC addresses into flow record
• Possibly good set for class projects - late 03 worms - but occasional time reversals - a few at ~3500 seconds

Problems - IP addresses

• Almost no useful documentation of collection
  – Either no records kept or fear of breaching IRB
  – Much learned from detailed examination of packets / flows
  – IP addresses given by DHCP with fairly short leases
    • Question need for IP anonymization w/o DHCP logs
    • Can find 0.0.0.0 and 169.254.x.x ranges easily - others?
• Tracking IP / MAC relationship gives strange results
  – Too many short transitions
  – Wanted to assign constant pseudo IP / platform
    • Is DHCP global or per access point?
    • Could get no answer, so tried to figure it out
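Tracking the IP / MAC relationship amounts to recording every change of IP address per MAC and then looking at how closely the changes follow one another. A minimal sketch, assuming observations are available as `(time, mac, ip)` tuples: that layout, the function names, and the sample values are hypothetical and are not the CRAWDAD or SiLK record format.

```python
from collections import defaultdict


def ip_transitions(observations):
    """Return, per MAC, the list of (time, old_ip, new_ip) address changes.

    `observations` is an iterable of (time, mac, ip) tuples. MACs that
    never change address do not appear in the result.
    """
    last_ip = {}
    changes = defaultdict(list)
    for t, mac, ip in sorted(observations):
        prev = last_ip.get(mac)
        if prev is not None and prev != ip:
            changes[mac].append((t, prev, ip))
        last_ip[mac] = ip
    return dict(changes)


def short_transitions(changes, max_seconds):
    """Count, per MAC, changes that follow the previous change within max_seconds."""
    counts = {}
    for mac, events in changes.items():
        times = [t for t, _, _ in events]
        counts[mac] = sum(
            1 for a, b in zip(times, times[1:]) if b - a <= max_seconds
        )
    return counts


# Made-up observations: "aa" flips address twice within seconds, "bb" is stable.
obs = [
    (0, "aa", "10.0.0.1"),
    (10, "aa", "10.0.0.2"),
    (15, "aa", "10.0.0.1"),
    (5000, "aa", "10.0.0.3"),
    (0, "bb", "10.0.0.9"),
    (20, "bb", "10.0.0.9"),
]
changes = ip_transitions(obs)
suspicious = short_transitions(changes, max_seconds=60)
```

A MAC with many transitions that are much shorter than any plausible DHCP lease is the kind of "strange result" that points at a collection problem rather than at real address churn.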

Problems - Time

• Occasional time reversals / gaps up to almost 1 hour
  – Did not show up in all sniffers
• Thought might find same packet in multiple sniffers
  – Direct case: same MACs & IPs in two+ sniffers
  – Indirect case: same IP, but w->gw gw->w MACs
• Situation much worse than expected (next slide)
  – Sniffers apparently ran w/o ntp for first 1.5 months
• This could explain IP address inconsistencies
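Time reversals and gaps of this kind can be found by walking one sniffer's packet timestamps in capture order and flagging every step that goes backwards or jumps forward by more than some threshold. The sketch below assumes timestamps in seconds; the function name, the 60 second gap threshold, and the sample values are illustrative assumptions, not parameters from the original analysis.

```python
def find_time_anomalies(timestamps, gap_threshold=60.0):
    """Return (reversals, gaps) for one sniffer's timestamps in capture order.

    Each entry is (index, delta), where delta is the difference between the
    timestamp at `index` and the one before it. A negative delta is a
    reversal; a delta above `gap_threshold` is reported as a gap.
    """
    reversals, gaps = [], []
    for i in range(1, len(timestamps)):
        delta = timestamps[i] - timestamps[i - 1]
        if delta < 0:
            reversals.append((i, delta))
        elif delta > gap_threshold:
            gaps.append((i, delta))
    return reversals, gaps


# Made-up trace: the clock jumps forward about 3500 s, then steps back.
times = [0.0, 0.5, 1.0, 3501.0, 2.0, 2.5]
reversals, gaps = find_time_anomalies(times)
```

Running the same check per sniffer shows which sniffers are affected, which matters here because the reversals did not show up in all of them.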

Selected sniffer to sniffer time skew

Attempt 1 - find Dartmouth at LBL

• Dartmouth is 129.170/16, LBNL has 128.3.0.0
  – in 10000001/8 and 10000000/8 (common 1000000x)
• Anon Dartmouth is 190.84/16 in 10111110/8
  – common 1011111x - LBNL in 10111111/8 or 191/8
• Found some SYNs to 191/8, so asked Vern what he had.
  – With {s,d}port and from 129.170/16 could match
  – Only had dport, no sport, 1 pair from same address
  – Unable to find match - Why?
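The inference about 191/8 follows directly from prefix preservation: two real addresses that share exactly k leading bits must map to anonymized addresses that share exactly k leading bits, so the anonymized sibling has the same first k bits and the opposite bit in position k+1. A small sketch of that arithmetic on the first octet (the function names are made up for this example):

```python
def common_prefix_bits(a: int, b: int, width: int = 8) -> int:
    """Number of leading bits shared by two `width`-bit values."""
    diff = a ^ b
    return width if diff == 0 else width - diff.bit_length()


def predicted_prefix(real_known: int, real_other: int, anon_known: int,
                     width: int = 8):
    """Predict the leading bits of real_other's anonymized value.

    Assumes a prefix-preserving mapping and that anon_known is the
    anonymized form of real_known. Returns (value, known_bits): only the
    top `known_bits` of `value` are determined, the rest are unknown (zero).
    """
    k = common_prefix_bits(real_known, real_other, width)
    if k == width:
        return anon_known, width
    known_bits = k + 1
    shift = width - known_bits
    prefix = (anon_known >> shift) ^ 1  # keep the k shared bits, flip the next
    return prefix << shift, known_bits


# Dartmouth 129 -> 190 is known; where must LBNL's 128 land?
shared = common_prefix_bits(129, 128)
octet, bits = predicted_prefix(129, 128, 190)
```

With 129 and 128 differing only in the last bit, all 8 bits of the anonymized LBNL octet are determined, which is why a single known mapping is enough to go looking in 191/8.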

Attempt 2 - backscatter telescope

• Asked kc if the backscatter had 129.170/16 2/11-13/12
• As it happens, there is a limited amount for Nov 8-11
• This yielded one port match with 4 packets
• There were also 4 unmatched packets
  – Dartmouth IP address was not active at the time.
  – Access point drops packets if address is not associated.
  – No outgoing so source is spoofed
• Other data did not match with wireless based on sPort / dPort. Assume either not wireless or inactive
• Match rates 1 sniffer clock for 1 interval

The needle in the haystack

dr  #  sIP             dIP           sPort  dPort  flg  sTime (DTHMS.s)  sensor  dTime
<-     Dart.CAIDA.233  0.28.54.49    55936  62318  RA   08T01:31:31.446  CAIDA
<-     Dart.CAIDA.233  0.28.54.49    55936  62318  RA   09T08:34:58.179  CAIDA
<-     Dart.CAIDA.233  0.28.54.49    55936  62318  RA   09T08:35:30.846  CAIDA
<-     Dart.CAIDA.233  0.28.54.49    55936  62318  RA   10T02:50:33.895  CAIDA
->  1  190.84.159.8    12.28.43.255  55936  62318  RA   11T03:30:10.314  Re94
<-  1  Dart.CAIDA.233  0.28.54.49    55936  62318  RA   11T03:30:40.560  CAIDA   30.248
->  2  190.84.159.8    12.28.43.255  55936  62318  RA   11T04:16:37.534  Re94
<-  2  Dart.CAIDA.233  0.28.54.49    55936  62318  RA   11T04:17:07.857  CAIDA   30.323
->  3  190.84.159.8    12.28.43.255  55936  62318  RA   11T04:55:56.521  Re94
<-  3  Dart.CAIDA.233  0.28.54.49    55936  62318  RA   11T04:56:26.908  CAIDA   30.387
->  4  190.84.159.8    12.28.43.255  55936  62318  RA   11T05:09:39.639  Re94
<-  4  Dart.CAIDA.233  0.28.54.49    55936  62318  RA   11T05:10:10.049  CAIDA   30.410
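The dTime column is the difference between the CAIDA timestamp and the Re94 sniffer timestamp for each matched pair. The sketch below recomputes it from the sTime values of the four numbered pairs; the parser and variable names are assumptions for this example, and the results agree with the dTime column to within a few milliseconds.

```python
def parse_dthms(stamp: str) -> float:
    """Convert 'DDTHH:MM:SS.sss' (day of month, then time of day) to seconds."""
    day, clock = stamp.split("T")
    hours, minutes, seconds = clock.split(":")
    return (int(day) * 86400 + int(hours) * 3600
            + int(minutes) * 60 + float(seconds))


# (Re94 sniffer time, CAIDA time) for the four numbered pairs above.
pairs = [
    ("11T03:30:10.314", "11T03:30:40.560"),
    ("11T04:16:37.534", "11T04:17:07.857"),
    ("11T04:55:56.521", "11T04:56:26.908"),
    ("11T05:09:39.639", "11T05:10:10.049"),
]

# Offset of the CAIDA clock relative to the sniffer clock, per pair.
offsets = [round(parse_dthms(caida) - parse_dthms(sniffer), 3)
           for sniffer, caida in pairs]

# Change in offset between the first and last pair (clock drift over the span).
drift = round(offsets[-1] - offsets[0], 3)
```

The offset is not constant: it grows from pair to pair, so a single correction per sniffer is not enough, and each match calibrates one sniffer clock only for the interval around it.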

So why should we care?

1. The internet is an open universe
  – It is constantly being probed by deliberate and accidental events.
  – It is constantly being observed at many locations
  – With a little luck and a lot of patience it may be possible to unravel the best efforts at anonymization, especially for specific targets.
  – Cryptography provides wholesale, not retail protection.


2. Collecting data is very hard

– Dartmouth did not document the collection well.

– They did not look closely at what they had collected and apparently performed little or no general analyses before releasing the data.

– Admittedly, most of the research done with CRAWDAD data addresses mobility questions, but the clock problem affects that as well.

– Collection problems led to search for external interactions which could breach the anonymization, but ...


3. Presence of scanning worms has potential to completely undo anonymization.
  • Several students have examined data for worm signs as term projects. Most worms active during 03/11 - 04/02 are there.
  • We have tried to honor use agreement and have not looked at scan patterns in detail, but
  • Anyone scanned by a wireless address at Dartmouth during this collection has a piece of the puzzle.

Was the data useful?

• The LBNL data is useful for its intended purpose.

– It was marginally useful for my analysis, but the limitation is the scan blocking, not anonymization.

– The collection mode further limits research targeted at platform characterization over time.

– The way LBNL operates limits the presence of interesting security events. Most do not happen.

– A more complete enterprise data set for a longer duration would be very useful, but would probably endanger the anonymity as more external interactions became traceable.

Was the data useful?

• My hope for CRAWDAD data was to create a clean data set that could be used for pedagogical purposes.
  – There are too many collection related problems remaining to declare victory
  – I have worked at it off and on for about 4 years.
  – Even with its problems, the data has been useful for a network analysis course: 1 MS thesis, several conference pubs, and about 5 gainfully employed former students
  – We have carefully and successfully stepped around the anonymization requirement for the most part.

Conclusions

• Both the CRAWDAD and LBNL data sets have utility beyond the purpose of collection.
  – The anonymization and collection practices limit the utility by closing off whole areas of interest, i.e. scan interaction for LBNL, ICMP for CRAWDAD
  – Documenting collection and ensuring data soundness are orthogonal to anonymization, but they have the ability to interact in interesting ways.
  – These interactions may limit the effectiveness of anonymization.
