Finding More Bilingual Webpages
with High Credibility
via Link Analysis
Chengzhi Zhang , Nanjing University of Science and Technology
Xuchen Yao , Johns Hopkins University
Chunyu Kit , City University of Hong Kong
8 August 2013
BUCC2013, Sofia, Bulgaria
3 ideas
• Bilingual URL Pattern Detection
• Deep Webpage Recovery
• Incremental Bilingual Website Exploration
Bilingual URL Pattern Detection
• a URL pattern: <en, zh> (Kit and Ng, 2007)
  – www.legco.gov.hk/yr99-00/en/fc/esc/e0.htm
  – www.legco.gov.hk/yr99-00/zh/fc/esc/e0.htm
• Improvement:
  – pairing cost drops from O(|U|^2) to O(|U|)
    • U is the set of all URLs within a website
    • approach: inverted index for URLs
  – token-based pair -> char-based pair
    • weak pairs: <1e, 1c>, <2e, 2c>, ...
      – http://.../1e/i.html <-> http://.../1c/i.html
    • enhanced: <e, c>
  – supports multiple languages
    • better mining of multilingual websites such as EU and UN
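A minimal sketch of the inverted-index idea, not the authors' implementation. It assumes URLs are split on `/`, `.`, `_` and `-`, and that two URLs are candidates when they become identical after one token is masked; the names `candidate_pairs` and `char_pattern` are hypothetical. Each URL contributes one index key per token, so building the index takes time linear in |U| for URLs of bounded length.

```python
import os
import re
from collections import defaultdict
from itertools import combinations

DELIMS = set("/._-")  # assumed token delimiters


def candidate_pairs(urls):
    """Map each token pair, e.g. ('en', 'zh'), to the URL pairs that support it."""
    index = defaultdict(list)  # masked URL -> [(token, url), ...]
    for url in urls:
        tokens = re.split(r"([/._-])", url)  # capturing group keeps delimiters
        for i, tok in enumerate(tokens):
            if not tok or tok in DELIMS:
                continue
            masked = "".join(tokens[:i] + ["*"] + tokens[i + 1:])
            index[masked].append((tok, url))
    pairs = defaultdict(list)
    for entries in index.values():
        # URLs sharing a masked key differ only in the masked token.
        for (ta, ua), (tb, ub) in combinations(sorted(entries), 2):
            if ta != tb:
                pairs[(ta, tb)].append((ua, ub))
    return dict(pairs)


def char_pattern(a, b):
    """Reduce a token pair to its differing characters, e.g. <1e, 1c> -> <e, c>."""
    p = len(os.path.commonprefix([a, b]))
    s = len(os.path.commonprefix([a[p:][::-1], b[p:][::-1]]))
    return a[p:len(a) - s], b[p:len(b) - s]


print(char_pattern("1e", "1c"))  # ('e', 'c')
```

Token pairs that are not language markers (for example two page names that differ in one token) also show up in the result, so a real system would still need to filter patterns, for instance by how many URL pairs support each one.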
Deep Webpage Recovery
• deep webpage: a page that is not linked from any static page (not searchable) until it is created dynamically
  – mostly triggered by JavaScript or Flash actions
• http://www.fehd.gov.hk/tc_chi/cagenda 20070904.htm
  – we have discovered patterns <tc_chi, english>, <tc_chi, en>, <tc_chi, eng>, ..., then try:
    – wget http://www.fehd.gov.hk/english/cagenda 20070904.htm
    – wget http://www.fehd.gov.hk/en/cagenda 20070904.htm
    – wget http://www.fehd.gov.hk/eng/cagenda 20070904.htm
    – ...
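The recovery step can be sketched as follows, assuming patterns are applied to whole path segments. The function names and the `example.org` URL are hypothetical, and the probe uses an HTTP HEAD request in place of the `wget` calls on the slide, which is a substitution rather than the authors' setup.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def candidate_urls(url, patterns):
    """Swap a known language segment for each of its discovered counterparts."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    out = []
    for src, dst in patterns:
        for i, seg in enumerate(segments):
            if seg == src:
                swapped = segments[:i] + [dst] + segments[i + 1:]
                out.append(urlunsplit(parts._replace(path="/".join(swapped))))
    return out


def exists(url, timeout=10):
    """Probe a candidate; True only if the server answers 200 to a HEAD request."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):  # URLError and HTTPError are OSError subclasses
        return False


patterns = [("tc_chi", "english"), ("tc_chi", "en"), ("tc_chi", "eng")]
for guess in candidate_urls("http://www.example.org/tc_chi/page.htm", patterns):
    print(guess)
```

Some servers reject HEAD or answer 200 with an error page, so a fuller version would probably fetch the page and check its language as well.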
Incremental Bilingual Website Exploration
• Intuition: bilingual websites tend to link to other bilingual websites.
• Measures:
  – Linkout(w): total number of outgoing links from website w
  – PageRank(w): (Brin and Page, 1998)
  – WeightedPageRank(w): weighted by "how bilingual" w is (the more bilingual URLs, the more "bilingual" w is)
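The slides do not give the WeightedPageRank formula, so the sketch below shows only one plausible reading: personalized PageRank in which the teleport distribution is proportional to a per-site bilingual score (for example, the number of bilingual URL pairs found on the site). The function name, the `bilingual` argument and the damping value 0.85 are assumptions.

```python
def weighted_pagerank(links, bilingual, damping=0.85, iters=50):
    """links: site -> list of linked sites; bilingual: site -> non-negative score."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    total = sum(bilingual.get(n, 0.0) for n in nodes)
    # Teleport distribution: proportional to bilingual score, uniform if all zero.
    tele = {
        n: (bilingual.get(n, 0.0) / total if total else 1.0 / len(nodes))
        for n in nodes
    }
    rank = dict(tele)
    for _ in range(iters):
        nxt = {n: (1 - damping) * tele[n] for n in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:
                # Dangling site: spread its rank according to the teleport vector.
                for n in nodes:
                    nxt[n] += damping * rank[u] * tele[n]
        rank = nxt
    return rank


ranks = weighted_pagerank({"a": ["b"], "b": ["a"]}, {"a": 0.9, "b": 0.1})
print(ranks)
```

In this two-site example the link structure is symmetric, so any difference between the two ranks comes from the bilingual scores alone.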
Discovering related websites from seed websites (select the top K most related websites)
[Linkout, PageRank, WeightedPageRank]
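The top-K selection step can be sketched as below, assuming a score per candidate site has already been computed with one of the three measures. The function name `top_k_related` and the example site names and scores are hypothetical.

```python
import heapq


def top_k_related(scores, seeds, k):
    """Return the k highest-scoring sites that are not already seed websites."""
    candidates = [site for site in scores if site not in seeds]
    return heapq.nlargest(k, candidates, key=scores.get)


scores = {"seed.example": 0.40, "a.example": 0.30, "b.example": 0.20, "c.example": 0.10}
print(top_k_related(scores, seeds={"seed.example"}, k=2))
# ['a.example', 'b.example']
```

In an incremental setting the selected sites would presumably be added to the seed set and the scoring repeated on the enlarged link graph.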