sensitive information sweep

Sensitive Information Sweep Using Cornell’s Spider Wyman Miles, Cornell University Kerry Havens, University of Colorado at Boulder Steve Lovaas, Colorado State University


Post on 02-Jan-2016



Page 1: Sensitive Information Sweep

Sensitive Information Sweep

Using Cornell’s Spider

Wyman Miles, Cornell University

Kerry Havens, University of Colorado at Boulder

Steve Lovaas, Colorado State University

Page 2: Sensitive Information Sweep

Overview

• Quick Background

• The Technical Problem (Kerry)

• The Organizational Problem (Steve)

• Spider (Wyman)

• Summary & Questions

Page 3: Sensitive Information Sweep

What is “Sensitive Information”?

• A Growing Concern

• A Moving Target

• SSN, Credit Card, Driver’s License, Medical Records, Student Information, Proprietary Research,…

• Data in Context – Aggregation

Page 4: Sensitive Information Sweep

Why Are We All Here?

• The Front Page!

• CDW-G 2006 Survey – more than 3 million college students may have lost personal information in the last year.

• Identity theft is the fastest growing crime in the U.S.

• By far the biggest culprit? Lost or stolen computers.

Page 5: Sensitive Information Sweep

Regulations, Standards, & Laws

• Federal – HIPAA, FERPA, SarbOx, GLB,… Identity Theft Protection Act?

• State – Many states are passing identity theft protection laws; New York & Colorado have state CISOs

• Industry – PCI DSS

Page 6: Sensitive Information Sweep

The Technical Problem: Finding sensitive information in a haystack

Kerry Havens

University of Colorado at Boulder

Page 7: Sensitive Information Sweep

SSN Remediation

• At CU-Boulder, SSNs were used as a student identifier before 2004

• House Bill 03-1175 was approved in 2003 requiring institutions to change this method to ensure the privacy of a student’s social security number

• CU-Boulder started issuing student IDs to new students in July 2004 and converting SSNs to SIDs in 2005

Page 8: Sensitive Information Sweep

Where the data is not stored

• File type exclusions – fine tuning

– Binary files where the data cannot be read

– Received input from community for fine tuning

• False positives

– International telephone numbers

– Examples for web form validation

• Why is the department webpage asking for SSNs?
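
The file type exclusions above could be sketched roughly as follows. This is a minimal illustration, not Spider's actual implementation: the function name `files_to_scan` and the contents of `SKIP_EXTENSIONS` are assumptions, and a real skip list would be tuned with input from the local community.

```python
import os

# Hypothetical skip list (an assumption, not taken from the slides):
# binary formats where plain-text matching is unlikely to be useful.
SKIP_EXTENSIONS = {".exe", ".dll", ".jpg", ".png", ".zip", ".iso"}


def files_to_scan(root, skip_extensions=SKIP_EXTENSIONS):
    """Yield paths under root whose extension is not on the skip list."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare case-insensitively so PHOTO.JPG is skipped too.
            ext = os.path.splitext(name)[1].lower()
            if ext in skip_extensions:
                continue
            yield os.path.join(dirpath, name)
```

Keeping the skip list as data rather than code makes it easier to adjust as false positives and noisy file types are reported.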

Page 9: Sensitive Information Sweep

OS and File Encoding Problems

• HTML encoding problems

• Representations (pictures) of sensitive data are not found

– Examples include PDF

• Searching a UNIX filesystem

– Preparing the file before searching for private data

– For example, using strings to extract text from text/binary hybrids like .doc or .xls
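
The idea of preparing a file before searching can be sketched in Python. This is only an approximation of what the UNIX strings utility does (it keeps runs of printable ASCII bytes); the function name `extract_strings`, the minimum run length of 4, and the sample bytes are assumptions for illustration.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least min_len printable ASCII bytes found in data."""
    # 0x20-0x7e covers space through tilde, the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]


# Hypothetical stand-in for the raw bytes of a .doc or .xls file.
sample = b"\x00\x01\xd0\xcfGrade book\x00\x02\x03ID 521-12-3456\xff\xfe\x00ab\x00"
print(extract_strings(sample))
```

Text stored in a multi-byte encoding such as UTF-16 would not be recovered by this sketch, which is one reason encoding problems lead to missed data.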

Page 10: Sensitive Information Sweep

Where the data is stored

• Typical file types of discovered data

– Gradebooks

– Course web pages

– Homework assignments

– Travel authorization forms

– Personal financial documents

– Email

Page 11: Sensitive Information Sweep

Regular Expressions

• Returns too much data: /\d{3}-\d{2}-\d{4}/

• Searching for environment specific data in the hope that common data will lead us to more data:

/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b/

• State specific information can be found at http://www.ssa.gov/employer/stateweb.htm
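
A small comparison of the two patterns on this slide, translated to Python's re module. The tuned pattern is written on one line with the whitespace around the alternation removed, on the assumption that the space in the slide was a line-wrap artifact. The sample text and all numbers in it are made up.

```python
import re

# Broad pattern from the slide: any ddd-dd-dddd sequence.
broad = re.compile(r"\d{3}-\d{2}-\d{4}")

# Tuned pattern from the slide, joined onto one line.
tuned = re.compile(
    r"\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b"
)

# Made-up sample: a delimited number, a part number, an undelimited
# number with a Colorado prefix, and a number starting with 9.
sample = "id 521-12-3456, part 1234-56-78901, raw 522123456, phone 912-34-5678"

broad_hits = broad.findall(sample)
tuned_hits = [m.group(0) for m in tuned.finditer(sample)]

print(broad_hits)  # includes a fragment of the part number and the 9xx number
print(tuned_hits)  # only the delimited number and the undelimited 522 number
```

One detail worth noting: inside a character class the | is a literal character, so [-|\s] also accepts a pipe as a delimiter. Whether that was intended is not stated in the slides.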

Page 12: Sensitive Information Sweep

Regular Expressions

• Let’s dissect this…

/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b/

Page 13: Sensitive Information Sweep

Regular Expressions

• Let’s dissect this…

/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b/

Boundary: \b

Page 14: Sensitive Information Sweep

Regular Expressions

• Let’s dissect this…

/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b/

First acceptable digit: [0-7]

Page 15: Sensitive Information Sweep

Regular Expressions

• Let’s dissect this…

/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b/

2, 4, or 6 digits in a row: \d{2}, \d{4}, \d{6}

Page 16: Sensitive Information Sweep

Regular Expressions

• Let’s dissect this…

/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b/

Delimited by dash or space: [-|\s]

Page 17: Sensitive Information Sweep

Regular Expressions

• Let’s dissect this…

/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b/

Colorado specific prefix, not delimited: (52[1-4]|65[0-3])\d{6}
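
The dissection on the preceding slides can be collected into a single commented pattern using Python's verbose regex mode. This is a sketch of the same expression, not a different or improved one; the names `SSN_PATTERN` and `find_candidates` are assumptions introduced for the example.

```python
import re

SSN_PATTERN = re.compile(
    r"""
    \b                      # boundary: match must start at a word edge
    (
        [0-7]\d{2}          # first acceptable digit is 0-7, then two more digits
        [-|\s]              # delimited by dash or whitespace (the pipe is literal)
        \d{2}               # two digits in a row
        [-|\s]              # second delimiter
        \d{4}               # four digits in a row
      |
        (52[1-4]|65[0-3])   # Colorado specific prefix: 521-524 or 650-653
        \d{6}               # six more digits in a row, not delimited
    )
    \b                      # boundary: match must end at a word edge
    """,
    re.VERBOSE,
)


def find_candidates(text):
    """Return every substring of text that the pattern flags as a possible SSN."""
    return [m.group(0) for m in SSN_PATTERN.finditer(text)]
```

Verbose mode ignores whitespace outside character classes, so the layout does not change what the pattern matches.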

Page 18: Sensitive Information Sweep

CU Experiences

• Pitfalls

– Users’ interpretations of the log file

– Fine tuning file extension exceptions and regular expressions

• Recommendations

– Keep current environment in mind

Page 19: Sensitive Information Sweep

The Organizational Problem: a really big haystack

Steve Lovaas

Network Security Manager

Colorado State University

Page 20: Sensitive Information Sweep

Organizational Vision

• Support from the top

– Cabinet-level committee driving the project

– Spurred by headlines and state mandates

– VP for IT who really gets security

• Campus PR campaign

– Web site

– Public meetings

• Tied SSN purge to the rollout of a new CSUID in Fall 2006

Page 21: Sensitive Information Sweep

Using Resources

• Project Constraints

– Tight timeline

– No budget

– Not a trivial programming project

• Buy / Build / Leverage tools?

• Goal: 100% coverage vs. Best Effort

• Spider chosen for Windows, Linux, Mac

• Manual searching on AIX, mainframe

Page 22: Sensitive Information Sweep

Ultimate Responsibility

• Original thought: deans / dept. heads

• Revised edition: individual employees

• Developed a personal attestation form for every employee to sign, submitted in bulk by colleges

• More work for central IT

• Senior VP: Doing the scan and signing the form is a CONDITION OF EMPLOYMENT

Page 23: Sensitive Information Sweep

Individual Attestation Form

• Every employee

• 2 choices:

– I don’t interact with SSNs in the course of my job

– SSNs in all electronic files under my control have been removed or encrypted

• VP for IT must approve exceptions

Page 24: Sensitive Information Sweep

CSU Experiences

• Pitfalls

– Beta tool for a live project requires quick response and careful management of user expectations & acceptance

– Careful of deadlines, it’s a lot of work!

• Recommendations

– Don’t do this kind of project without active support from the very top

– Anticipate the need for analysis/parsing tools

– Have a supported encryption solution for exceptions

Page 25: Sensitive Information Sweep

Cornell Spider

Wyman Miles

Sr. Security Engineer

Cornell University

Page 26: Sensitive Information Sweep

A Brief History of Spider

• Early 2005, scan Web for SSNs

• Later, scan disk images for SSNs/CCNs

• March 2006, debut at BU Security Camp

• April 2006, Educause, demand for a Windows version

• Version 1.0 in May, 2.0 in June

Page 27: Sensitive Information Sweep

A Brief History, II

• June 2006, major feedback from Steve: bug reports, tests, feature requests

• Engine developed that same month: internal incident response

• OSX Spider Sept 2006

• Windows Spider rewrite

• April 2007, GPL release of all Spiders

Page 28: Sensitive Information Sweep

Current Spider

• SSN, SIN, CCN, NINO discovery in many file types

• Various data type validators

• Web scanning, back to its roots

• Scan for data in unallocated space

• Faster. More readable source
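
As one example of what a data type validator can look like, the Luhn checksum is a widely used check for credit card numbers. The slides do not say which validators Spider uses, so this is an illustration under that assumption; the function name `luhn_valid` and the 13-digit minimum length are choices made for the sketch.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in number pass the Luhn checksum."""
    # Ignore spaces and dashes; keep only the digits.
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # assumed minimum length for a card number
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# 4111 1111 1111 1111 is a well-known test number that passes the check.
print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111 1111 1111 1112"))
```

A validator like this reduces false positives from regular expressions alone, since most random 16-digit strings fail the checksum.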

Page 29: Sensitive Information Sweep

Various Spiders

• Windows Spider, aka Spider3

• OSX Spider

• Engine, general UNIX spider

• LinSpider, our oldest version

• Spider Simple: Windows Spider preconfigured to skip noisy files

Page 30: Sensitive Information Sweep

Future Spider

• Feature set convergence between Engine, OSX, Windows

• Community Development

• Possible I2 hosting of distribution and documentation

• More documentation!

• Client-Server model revisited

Page 31: Sensitive Information Sweep

Spider Log

Page 32: Sensitive Information Sweep

Spider at Cornell

• Incident response: a compromise has happened, what was at risk?

• Pre-emptive

– Dan Elswit, CALS Security Officer

Page 33: Sensitive Information Sweep

Spider in CIT

• CIT abandoned SSNs a few years ago, but they remain

• Tech support uses Spider Simple to discover lurking SSNs

• Manual process

Page 34: Sensitive Information Sweep

Athletics

• Spider Simple

• Unique log names to network share

• Centralized analysis
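
One way to get unique log names on a shared location is to build the file name from the host name and a timestamp. This sketch does not describe how Spider or the Athletics deployment actually names its logs; the function `unique_log_path`, the "spider-" prefix, and the example share path are all hypothetical.

```python
import socket
from datetime import datetime
from pathlib import Path


def unique_log_path(share_root, hostname=None, now=None):
    """Build a per-host, per-run log path under share_root."""
    # Fall back to the local host name and the current time.
    hostname = hostname or socket.gethostname()
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d-%H%M%S")
    return Path(share_root) / f"spider-{hostname}-{stamp}.log"


# Hypothetical share path and host name, fixed time for a repeatable example.
example = unique_log_path("/mnt/logs", "lab-pc-01", datetime(2007, 4, 1, 9, 30, 0))
print(example.name)
```

With one file per host and run, a central analysis script can collect results without machines overwriting each other's logs.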

Page 35: Sensitive Information Sweep

Spider Downloads

• http://www.cit.cornell.edu/security/tools

Page 36: Sensitive Information Sweep

Summary

• Purging sensitive information is something we’re going to have to get good at

• Get support from the highest levels

• Tune regular expressions and file/ext skip lists for your environment

• Anticipate parsing needs, exceptions

• New Spider features, more users, broader OS support

• Spider also for ongoing support, forensics

Page 37: Sensitive Information Sweep

Questions?

• Wyman Miles: [email protected]

• Kerry Havens: [email protected]

• Steve Lovaas: [email protected]

• The Spider users’ list: [email protected]