CANTINA: A Content-Based Approach to Detecting Phishing Web Sites
WWW 2007
2008.09.09
Yue Zhang, Jason Hong, and Lorrie Cranor
CS710 | KAIST
Agenda
Phishing Attacks
Motivation & Goal
Related Work
CANTINA
Evaluation
Conclusion
Phishing Attacks (1/2)
The act of stealing personal information via the Internet for the purpose of committing financial fraud
Create a fake site that imitates a legitimate one, such as a bank site
Deliver it to users through various methods
• Spam e-mail, XSS vulnerabilities, malware …
Technical issues
URL obfuscation
• Similar domain, encoded URL …
DNS hijacking
• Modifying the hosts file, DNS server settings …
Malware
• BHO (Browser Helper Object), browser toolbar, key logger …
Phishing Attacks (2/2)
Criminals often create phishing sites by copying and then modifying a legitimate site’s web pages
• Similar to the original web site
Phishing pages often contain brand names and other terms that are common on a given web page
• The owner’s brands
Motivation & Goal
Phishing is a rapidly growing problem, with 9,255 unique phishing sites reported in 2006
84 anti-phishing toolbars
• Low accuracy
There is a strong need for better automated detection algorithms
A novel content-based approach for detecting phishing web sites
• Achieve higher accuracy than existing approaches
Related Work (1/3)
Anti-phishing research falls into four categories
Why people fall for phishing attacks
• Studies examining the reasons that people fall for phishing attacks
Educating people about phishing attacks
• Focused on online training materials, testing, and situated learning
Anti-phishing user interfaces
• Focused on developing better user interfaces for anti-phishing tools
Automated detection of phishing
Related Work (2/3)
Anti-phishing user interfaces
Toolbar-based approaches
Browser extensions
• Dynamic Security Skins
• Web Wallet
Related Work (3/3)
Automated detection of phishing
Use heuristics to judge whether a page has phishing characteristics
• Host name, domain name, URLs, …
Use a blacklist that lists reported phishing URLs
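The two detection styles above can be sketched roughly as follows. This is a minimal illustration, not the logic of any particular toolbar: the three URL checks (`ip_address_host`, `at_sign_in_url`, `many_dots`), the dot threshold, and the in-memory `blacklist` set are all assumptions chosen for the example.

```python
import re
from urllib.parse import urlparse


def url_heuristics(url):
    """Return a dict of simple, illustrative URL-based warning signs."""
    host = urlparse(url).hostname or ""
    return {
        # Host is a raw IPv4 address instead of a domain name.
        "ip_address_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        # An '@' in the URL can hide the real host behind a fake prefix.
        "at_sign_in_url": "@" in url,
        # Many dots may indicate a long chain of misleading subdomains.
        "many_dots": url.count(".") >= 5,  # threshold is an assumption
    }


def on_blacklist(url, blacklist):
    """Blacklist approach: exact match against reported phishing URLs."""
    return url in blacklist
```

A blacklist only catches URLs that have already been reported, while heuristics can flag unseen pages at the cost of possible false positives.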
CANTINA | Basic Concept
Criminals often create phishing sites by copying and then modifying a legitimate site’s web pages
• These pages contain the brand names and terms of the legitimate pages
Robust Hyperlinks
• A way to recover from broken links
• Add a lexical signature to URLs
• If the link doesn’t work, feed the signature to a search engine
• Ex. http://aaa.com/a.html?lexical-signature==“word1+word2+...+word5”
TF-IDF (term frequency / inverse document frequency)
• A frequency-based algorithm; a basic algorithm for search engines
• Used for comparing and classifying documents
• A term has a high TF-IDF weight when it has a high term frequency in a given document and a low document frequency across the whole collection
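A minimal sketch of TF-IDF weighting and of picking a five-term lexical signature is shown below. The tokenizer, the smoothed IDF formula, and the function names are assumptions made for illustration; the slides do not say which TF-IDF variant or reference corpus CANTINA uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters; a deliberately crude tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(doc_tokens, corpus):
    """Return {term: weight} for one document against a reference corpus.

    corpus is a list of token lists. The IDF uses a smoothed variant,
    log((1 + N) / (1 + df)) + 1, which is an assumption of this sketch.
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    doc_sets = [set(d) for d in corpus]
    n_docs = len(doc_sets)
    weights = {}
    for term, count in counts.items():
        tf = count / total                        # frequency in this document
        df = sum(1 for d in doc_sets if term in d)  # documents containing term
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[term] = tf * idf
    return weights


def lexical_signature(doc_tokens, corpus, k=5):
    """Return the k terms with the highest TF-IDF weight (ties: alphabetical)."""
    weights = tf_idf(doc_tokens, corpus)
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]
```

Terms that are frequent on the page but rare in the corpus, such as a brand name, rise to the top, while common words like "the" sink.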
CANTINA | Basic Concept
Web page
Calculate the TF-IDF weight of each term
Take the five terms with the highest TF-IDF weights
Search for the top five terms (term1+term2...) using Google
Compare the page’s domain name with the Google search results
Phishing site: the domain name of the current page does not match the domain name of any of the top N search results (N = 30)
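The domain comparison step could be sketched as below. The `search` callable is a hypothetical stand-in for a search engine query that returns result URLs, and comparing full host names (minus a leading "www.") is a simplification assumed here rather than the exact matching rule of CANTINA.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host name without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def matches_search_results(page_url, signature, search, top_n=30):
    """True if the page's domain appears among the top_n search results.

    search(query) is supplied by the caller and returns a list of URLs.
    """
    results = search(" ".join(signature))[:top_n]
    page_domain = domain_of(page_url)
    return any(domain_of(r) == page_domain for r in results)
```

A page for which `matches_search_results` returns False would be treated as a likely phishing site under the basic scheme.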
CANTINA | Additional Solutions
Basic CANTINA produces a number of false positives
Solutions
Add the current domain name to the lexical signature
ZMP (Zero results Means Phishing)
• Google returns zero search results
– Meaningless domain (e.g., “u-s-j.be”)
Larger set of heuristics based on related work
• From existing approaches (e.g., SpoofGuard, PILFER)
• Age of Domain, Known Images, Suspicious URL, …
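The first two refinements, adding the domain name to the query and ZMP, might look like the sketch below. `search_domains` is a hypothetical callable that returns the domains of the search results, and returning "unknown" when ZMP is switched off is an assumption of this example, since the slides do not say how zero results were handled before ZMP.

```python
def classify_with_zmp(page_domain, signature, search_domains, top_n=30, zmp=True):
    """Classify a page as 'phishing', 'legitimate', or 'unknown'.

    search_domains(query) is supplied by the caller and returns a list of
    result domains. The page's own domain is appended to the query terms.
    """
    query = " ".join(list(signature) + [page_domain])
    result_domains = search_domains(query)[:top_n]
    if not result_domains:
        # ZMP: zero results means phishing. Without ZMP this sketch
        # makes no decision (an assumption, not the original behavior).
        return "phishing" if zmp else "unknown"
    return "legitimate" if page_domain in result_domains else "phishing"
```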
Evaluation | Effectiveness #1 (1/2)
Four conditions
• Basic TF-IDF
• Basic TF-IDF + domain name
• Basic TF-IDF + ZMP
• Basic TF-IDF + domain name + ZMP
100 phishing URLs and 100 legitimate URLs
• Phishing URLs: PhishTank.com
• Legitimate URLs: from a previous study
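Each condition is scored by its true positive rate (phishing pages correctly flagged) and false positive rate (legitimate pages wrongly flagged). A small sketch of that bookkeeping follows; the function name and the toy labels are illustrative only and are not data from the study.

```python
def detection_rates(labels, predictions):
    """Return (true_positive_rate, false_positive_rate).

    labels[i] is True if URL i is really phishing;
    predictions[i] is True if the detector flagged URL i as phishing.
    """
    pairs = list(zip(labels, predictions))
    n_phish = sum(1 for y, _ in pairs if y)
    n_legit = len(pairs) - n_phish
    true_pos = sum(1 for y, p in pairs if y and p)
    false_pos = sum(1 for y, p in pairs if not y and p)
    return true_pos / n_phish, false_pos / n_legit


# Toy example: 4 phishing and 4 legitimate URLs (made-up values).
labels = [True, True, True, True, False, False, False, False]
predictions = [True, True, True, False, True, False, False, False]
tpr, fpr = detection_rates(labels, predictions)
```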
Evaluation | Effectiveness #1 (2/2)
Basic TF-IDF + ZMP + domain name is adopted as Final-TF-IDF
False positives are still a little high
Evaluation | Effectiveness #2 (1/2)
Goal: reduce false positives
• Combine several heuristic methods
Evaluation | Effectiveness #2 (2/2)
Determining the best weights for these heuristics is a typical classification problem
• Use a simple forward linear model
• Used 100 phishing URLs and 100 legitimate URLs to find the weights
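One way to read a linear combination of heuristics is a weighted sum followed by a threshold, as sketched below. The sign convention (+1 for phishing-like, -1 for legitimate-like), the heuristic names, the weights, and the zero threshold are all assumptions for illustration; they are not the values learned in the study.

```python
def combined_score(heuristic_outputs, weights):
    """Weighted sum of heuristic outputs.

    heuristic_outputs maps a heuristic name to +1 (phishing-like) or
    -1 (legitimate-like); weights maps the same names to their weights.
    """
    return sum(weights[name] * value for name, value in heuristic_outputs.items())


def is_phishing(heuristic_outputs, weights, threshold=0.0):
    """Flag the page when the weighted sum exceeds the threshold."""
    return combined_score(heuristic_outputs, weights) > threshold


# Hypothetical weights and heuristic outputs for a single page.
weights = {"tf_idf": 0.5, "age_of_domain": 0.2, "suspicious_url": 0.3}
outputs = {"tf_idf": 1, "age_of_domain": -1, "suspicious_url": 1}
flagged = is_phishing(outputs, weights)
```

Finding good weights from labeled URLs is the classification problem the slide refers to; here they are simply fixed by hand.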
Evaluation | Effectiveness #3 (1/2)
Compare the effectiveness of Final-TF-IDF, Final-TF-IDF + heuristics, SpoofGuard, and Netcraft
SpoofGuard: the highest true positive rate
• Relies entirely on heuristics
Netcraft: one of the best toolbars overall
• Uses a combination of heuristics and an extensive blacklist
100 phishing URLs from PhishTank.com
100 legitimate URLs
• 35 sites that are often attacked (Citibank, PayPal)
• 35 top pages from Alexa (most popular sites)
• 30 random web pages from random.yahoo.com
Evaluation | Effectiveness #3 (2/2)
Reduced false positives from 6% to 1% by combining Final-TF-IDF with simple heuristics
However, the true positive rate also decreased (from 97% to 89%)
Discussion
Limitations
• Does not apply to non-English web sites
System performance
• Depends on the performance of the Google search engine
Attacks by criminals
• Use images instead of words
• Add invisible text
Circumventing TF-IDF and PageRank
• Using “Google Bombs”
Attempt a DoS attack on Google
Conclusion
CANTINA uses TF-IDF + search engines + heuristics to find phishing web sites
• 97% true positives with 6% false positives
• 89% true positives with 1% false positives
Shifts the problem of identifying phishing sites to a search engine problem