set expansion

Post on 11-Apr-2017


SET EXPANSION - Team 25

ROMIL PUNETHA (201505568), DEEP GREWAL (201364124), SANDEEP KASA (201301145)

OUTLINE

• Introduction
• Related Work
• Approach
• Results
• References

INTRODUCTION

• Set Expansion refers to completing a given set with relevant terms corresponding to the given “seed terms”.
• The goal is to find other entities which could belong to the same set as the given input entities.
• For example: Input = mango, banana
• Output = strawberry, apples, etc.

RELATED WORK

• Google Sets is a well-known example of a web-based set expansion system.
• Language-independent set expansion of named entities using the web.
• Set expansion using web-based crawling.

APPROACH

• Tool used: word2vec
- Finds similarity between words by converting them into feature vectors and calculating the cosine similarity between them.
• similarity = (vector(word1) · vector(word2)) / (|vector(word1)| |vector(word2)|), which reduces to the plain dot product vector(word1) · vector(word2) when the vectors are normalized to unit length.
• The following link explains the working of word2vec - Word2Vec
• Training of the model was done using the dataset from the following link: - Training set for the word2vec model
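The similarity measure above can be sketched in a few lines of Python. This is a minimal illustration, not the team's code: the toy vectors are invented, and train_text8_model is a hypothetical helper that assumes the gensim library (4.x parameter names) and a local text8-style corpus file.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def train_text8_model(corpus_path):
    """Hypothetical helper: train word2vec on a text8-style corpus file.

    Assumes gensim 4.x is installed; it is imported lazily so the rest
    of this sketch runs without it. The hyperparameters are placeholders.
    """
    from gensim.models import Word2Vec
    from gensim.models.word2vec import Text8Corpus

    sentences = Text8Corpus(corpus_path)
    return Word2Vec(sentences=sentences, vector_size=100, window=5,
                    min_count=5, workers=4)


# Toy 3-dimensional vectors, invented for illustration only.
toy = {
    "mango": [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}
print(cosine_similarity(toy["mango"], toy["banana"]))  # close to 1
print(cosine_similarity(toy["mango"], toy["laptop"]))  # close to 0
```

With a trained gensim model, the vector for a word would be read from model.wv[word] and passed to cosine_similarity in the same way.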

Crawler and Indexer

• Indexing the word2vec dataset
- used the word2vec.Text8Corpus function to create the model using the wiki set.
• Web results from Google, Bing, DuckDuckGo, etc. have been used.
• Crawled web pages to obtain patterns containing seed terms (explained in the report).
• Edited the Python parser to parse specific parts of the data from the web pages.
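The actual patterns are described in the report, which is not reproduced here. As one possible reading of the crawling step, the sketch below uses Python's standard html.parser to collect short texts that sit next to a seed term under the same parent and with the same tag. The class name, the sibling heuristic, the max_words cutoff, and the sample page are all assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class SiblingCollector(HTMLParser):
    """Groups element texts by (parent element, tag name)."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # open elements: (tag, id, text parts)
        self.next_id = 0
        self.groups = defaultdict(list)  # (parent id, tag) -> list of texts

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never hold text
        self.next_id += 1
        self.stack.append((tag, self.next_id, []))

    def handle_data(self, data):
        if self.stack and data.strip():
            self.stack[-1][2].append(data.strip())

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                break
        else:
            return
        node_tag, _, parts = self.stack[i]
        del self.stack[i:]  # text of unclosed children is dropped in this sketch
        text = " ".join(parts)
        if not text:
            return
        parent_id = self.stack[-1][1] if self.stack else 0
        self.groups[(parent_id, node_tag)].append(text)
        if self.stack:
            self.stack[-1][2].append(text)  # parent text includes child text


def extract_candidates(html, seeds, max_words=3):
    """Return short sibling texts from any group that contains a seed."""
    seed_set = {s.lower() for s in seeds}
    parser = SiblingCollector()
    parser.feed(html)
    parser.close()
    found = []
    for texts in parser.groups.values():
        lowered = [t.lower() for t in texts]
        if not seed_set.intersection(lowered):
            continue
        for t in lowered:
            if t not in seed_set and len(t.split()) <= max_words and t not in found:
                found.append(t)
    return found


# Invented sample page, for illustration only.
page = """
<html><body>
<div class="fruits"><span>Mango</span><span>Banana</span><span>Papaya</span></div>
<ul><li>Privacy</li><li>Contact us</li></ul>
<table><tr><td>banana</td><td><b>Pineapple</b></td></tr></table>
</body></html>
"""
print(extract_candidates(page, ["mango", "banana"]))
```

On the sample page this picks up papaya (a span next to the seeds) and pineapple (a table cell next to banana), and skips the navigation list because none of its items is a seed. Grouping by parent and tag name rather than by a fixed tag list is what lets div/span layouts be handled alongside ul, li, and table.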

ALGORITHM

• Get web results using the input seeds.
• Crawl the web pages to search for the seed terms within tags (used a heuristics-based approach to identify relevant tags instead of focusing only on table, ul, li, ol).
• For each candidate term:
- if it is not a stopword:
i) find its cosine similarity with each seed term
ii) if the word is also found using pattern matching, push the intersecting terms higher in the output.
• Display the top ‘n’ (10 here) results.
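The ranking steps above can be sketched as follows. The slides do not say how the per-seed similarities are combined or how much the pattern-matched terms are promoted, so the mean over seeds and the additive boost used here are assumptions, as are the tiny stopword list and the toy vectors.

```python
import math

STOPWORDS = {"the", "a", "an", "and", "of", "in"}  # tiny illustrative list


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_set(seeds, vectors, pattern_terms, top_n=10, boost=0.5):
    """Rank vocabulary terms by similarity to the seeds.

    vectors:       dict mapping lowercase word -> embedding
    pattern_terms: words also found by pattern matching on web pages
    boost:         assumed additive bonus for pattern-matched words
    """
    seeds = [s.lower() for s in seeds]
    pattern_terms = {t.lower() for t in pattern_terms}
    seed_vecs = [vectors[s] for s in seeds if s in vectors]
    if not seed_vecs:
        return []
    scores = {}
    for term, vec in vectors.items():
        if term in seeds or term in STOPWORDS:
            continue
        sims = [cosine_similarity(vec, sv) for sv in seed_vecs]
        score = sum(sims) / len(sims)      # assumed: mean over seed terms
        if term in pattern_terms:
            score += boost                 # push intersecting terms higher
        scores[term] = score
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


# Toy 2-dimensional vectors, invented for illustration only.
vectors = {
    "mango": [0.9, 0.1],
    "banana": [0.8, 0.2],
    "papaya": [0.85, 0.15],
    "lemon": [0.7, 0.3],
    "laptop": [0.1, 0.9],
    "the": [0.9, 0.1],
}
print(expand_set(["mango", "banana"], vectors, pattern_terms={"lemon"}, top_n=3))
```

In this toy run lemon is ranked first because it is also a pattern match, papaya follows on similarity alone, and the stopword is never scored.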

RESULTS

• Input: cricket, football, volleyball
• Output: rugby, Soccer, Hockey, Squash, Badminton, Kabaddi, Bowling, Cricketers, tennis

RESULTS

• Input: Samsung, sony, hp
• Output: tdk, Nokia, Microsoft, Video, Motorola, Oppo entertainment, Asus

RESULTS

• Input: java, python, Perl, php
• Output: JavaScript, scripting, mongo dB, linux, tcl, lisp, Cpan, Numpy, Doctest, gnu

RESULTS

• Input: mango, banana, orange
• Output: papaya, Mangoes, Coconut, Pineapple, Tomato, Cashews, Lemon, Zucchini, cinnamon, watermelon

CONCLUSION

• In this project, we have shown how to expand a set using seed terms and the word2vec tool.
• The program has been tested on various seed terms and the results have been found to be acceptable.
• Various web search APIs such as Google and Bing have been used to tune the search results.

REFERENCES

• A Cross-Lingual dictionary for English Wikipedia Concepts.
• https://www.cs.cmu.edu/afs/cs/Web/People/wcohen/postscript/icdm-2007.pdf
• word2vec tool for creating vectors of the words.
• Identifying the Sets of Related words from World Wide Web.
• Entity List Completion using Set Expansion Technique.

PROJECT LINKS

• GitHub
• Dropbox
• Presentation
• Video
• Website
