Restricted Search Engine

Laurent Balat

Christophe Decis

Thomas Forey

Sebastien Leclercq

ESSI2 Project

Supervisor: Johny BOND

June 2002

Introduction (1)

• What is a search engine?

• 3 types:
  – disciplinary
  – global
  – thematic

• Internet users spend more than 50% of their time searching!

Introduction (2)

• Lots of pages can’t be reached.

[Diagram labels: WEB, Indexable WEB, Google]

How does it work?

• The search engine is composed of two parts.

First part: the WEB site spider

[Diagram labels: WEB, Spider, Processing unit (PDF unit, DOC unit, HTML unit), indexing, DATABASE, Constraint]

How does it work?

• User part architecture

[Diagram labels: DATABASE, Query engine, Query Interface, User]

Constraints

• Domain Restriction.

• Search depth.

• Theme: words accepted or not.

• Document type.

• Time delay.
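
The constraints listed on this slide could be grouped into one configuration object that the spider consults before following a link. The following is a minimal Python sketch under stated assumptions: the class name `CrawlConstraints`, its field names, and the default values are hypothetical, and "time delay" is interpreted here as a pause between requests, which the slide does not confirm.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CrawlConstraints:
    """Hypothetical container for the five constraints on the slide."""
    allowed_domains: set = field(default_factory=set)    # domain restriction
    max_depth: int = 3                                    # search depth
    accepted_words: set = field(default_factory=set)      # theme: words accepted
    rejected_words: set = field(default_factory=set)      # theme: words refused
    document_types: set = field(default_factory=lambda: {"text/html"})
    delay_seconds: float = 1.0    # assumed meaning: pause between requests

    def allows_link(self, url: str, depth: int) -> bool:
        """Return True if the link passes the domain and depth constraints."""
        host = urlparse(url).hostname or ""
        in_domain = any(host == d or host.endswith("." + d)
                        for d in self.allowed_domains)
        return in_domain and depth <= self.max_depth

    def allows_type(self, content_type: str) -> bool:
        """Return True if the MIME type (parameters stripped) is accepted."""
        return content_type.split(";")[0].strip().lower() in self.document_types
```

With `allowed_domains={"example.org"}`, a link to `www.example.org` is accepted while a link to any other host is refused; an empty `allowed_domains` set refuses every link in this sketch.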

The Spider Part

[Flowchart labels: link priority queue, link, Check if link already visited, HTTP HEAD, Check type data in constraints, Download, Error download, data page, Push page, Stack]
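
The spider steps named on this slide (priority queue of links, visited check, HTTP HEAD, type check against the constraints, download, push onto a stack) can be sketched as a single loop. This is an illustrative Python sketch, not the project's implementation: the function name `crawl`, the caller-supplied hooks `head`, `download`, and `extract_links`, and the choice of link depth as the queue priority are all assumptions.

```python
import heapq


def crawl(start_url, head, download, extract_links, allowed_types, max_depth):
    """Crawl from start_url; return (stack of downloaded pages, failed links).

    head(url) -> content type string, download(url) -> page data, and
    extract_links(data) -> iterable of URLs are hypothetical hooks supplied
    by the caller; head and download may raise OSError on failure.
    """
    queue = [(0, start_url)]          # priority queue of (depth, link)
    visited = set()
    stack = []                        # downloaded (url, data) pages
    errors = []                       # links whose HEAD or download failed
    while queue:
        depth, url = heapq.heappop(queue)
        if url in visited:            # check if link already visited
            continue
        visited.add(url)
        try:
            if head(url) not in allowed_types:   # HTTP HEAD, then type check
                continue
            data = download(url)
        except OSError:
            errors.append(url)        # download error
            continue
        stack.append((url, data))     # push page
        if depth < max_depth:
            for link in extract_links(data):
                if link not in visited:
                    heapq.heappush(queue, (depth + 1, link))
    return stack, errors
```

Passing the network operations in as functions keeps the loop testable without real HTTP traffic; a real spider would also honour the time-delay constraint between requests.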

Document Processing

• Analyse the document type.
• Send it to the appropriate unit.
• Extract words and links.
• Try to resolve bad links.
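
The processing steps on this slide can be sketched for the HTML case with Python's standard `html.parser`. The names `HtmlUnit` and `process_document` are hypothetical, and "resolving bad links" is interpreted here as turning relative links into absolute ones and dropping fragments, which is only one plausible reading of the slide.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class HtmlUnit(HTMLParser):
    """Hypothetical HTML processing unit: collects words and absolute links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.words = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL, drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                self.links.append(absolute)

    def handle_data(self, data):
        # Lower-case every run of word characters found in the text.
        self.words.extend(w.lower() for w in re.findall(r"\w+", data))


def process_document(url, content_type, text):
    """Analyse the type and send the document to the matching unit."""
    if content_type == "text/html":
        unit = HtmlUnit(url)
        unit.feed(text)
        unit.close()
        return unit.words, unit.links
    # Units for other formats (PDF, DOC, ...) would be dispatched here.
    raise ValueError("no processing unit for " + content_type)
```

Each additional file format would get its own unit behind the same dispatch function.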

Indexing

• Binary Search Tree:
  – quick building
  – efficient use

• Check constraints:
  – start list and stop list.
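
A binary search tree keyed by word, filtered by a start list and a stop list, can be sketched as follows. This Python sketch is illustrative only: the class names `Node` and `WordIndex` are hypothetical, the tree is unbalanced, and the assumed semantics are that stop-list words are always refused and, when a start list is given, only its words are accepted.

```python
class Node:
    """One tree node: a word and the set of URLs where it occurs."""

    def __init__(self, word):
        self.word = word
        self.urls = set()
        self.left = None
        self.right = None


class WordIndex:
    """Unbalanced binary search tree keyed by word (illustrative only)."""

    def __init__(self, start_list=None, stop_list=()):
        self.root = None
        self.start_list = set(start_list) if start_list else None
        self.stop_list = set(stop_list)

    def accepts(self, word):
        """Apply the stop list first, then the optional start list."""
        if word in self.stop_list:
            return False
        return self.start_list is None or word in self.start_list

    def add(self, word, url):
        if not self.accepts(word):
            return
        if self.root is None:
            self.root = Node(word)
        node = self.root
        while node.word != word:
            # Smaller words go left, larger words go right.
            side = "left" if word < node.word else "right"
            if getattr(node, side) is None:
                setattr(node, side, Node(word))
            node = getattr(node, side)
        node.urls.add(url)

    def lookup(self, word):
        node = self.root
        while node is not None and node.word != word:
            node = node.left if word < node.word else node.right
        return node.urls if node else set()
```

Insertion and lookup both walk a single root-to-leaf path, which is logarithmic in the number of distinct words when the tree stays reasonably balanced and linear in the worst case.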

Database

• MySQL database.
• General structure:
  – Keywords
  – Web links
  – Correspondence between keywords and links
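
The three-part structure on this slide (keywords, web links, and the correspondence between them) maps naturally onto three tables. The sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server; the table names, column names, and the helper functions `open_database` and `store` are assumptions, not the project's actual schema.

```python
import sqlite3

# Hypothetical schema: one table per box on the slide.
SCHEMA = """
CREATE TABLE keywords (id INTEGER PRIMARY KEY, word TEXT UNIQUE NOT NULL);
CREATE TABLE links    (id INTEGER PRIMARY KEY, url  TEXT UNIQUE NOT NULL);
CREATE TABLE keyword_links (
    keyword_id INTEGER NOT NULL REFERENCES keywords(id),
    link_id    INTEGER NOT NULL REFERENCES links(id),
    PRIMARY KEY (keyword_id, link_id)
);
"""


def open_database(path=":memory:"):
    """Create the three tables in a new SQLite database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store(conn, word, url):
    """Record that `word` occurs on the page at `url` (idempotent)."""
    conn.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
    conn.execute("INSERT OR IGNORE INTO links (url) VALUES (?)", (url,))
    conn.execute(
        "INSERT OR IGNORE INTO keyword_links (keyword_id, link_id) "
        "SELECT k.id, l.id FROM keywords k, links l "
        "WHERE k.word = ? AND l.url = ?",
        (word, url),
    )
    conn.commit()
```

Keeping the correspondence in its own table means each keyword and each link is stored once, however many pages share a word.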

User interface and query engine

• The web page is generated by a CGI script.

• The query engine queries the database.

• The results are then formatted.
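
The query step and the result formatting can be sketched as two small functions. This Python sketch again uses `sqlite3` in place of MySQL and assumes hypothetical tables `keywords(id, word)`, `links(id, url)`, and `keyword_links(keyword_id, link_id)`; the function names and the choice of AND semantics (a page must contain every query word) are assumptions as well.

```python
import html
import sqlite3


def search(conn, query):
    """Return URLs of pages containing every word of the query.

    `conn` is a sqlite3.Connection holding the assumed three tables.
    """
    words = sorted(set(query.lower().split()))
    if not words:
        return []
    placeholders = ", ".join("?" for _ in words)
    rows = conn.execute(
        "SELECT l.url FROM links l "
        "JOIN keyword_links kl ON kl.link_id = l.id "
        "JOIN keywords k ON k.id = kl.keyword_id "
        "WHERE k.word IN (" + placeholders + ") "
        "GROUP BY l.id HAVING COUNT(DISTINCT k.id) = ? "
        "ORDER BY l.url",
        (*words, len(words)),
    ).fetchall()
    return [url for (url,) in rows]


def format_results(urls):
    """Render the result list as an HTML fragment for the generated page."""
    if not urls:
        return "<p>No results.</p>"
    items = "".join(
        '<li><a href="{0}">{0}</a></li>'.format(html.escape(u, quote=True))
        for u in urls
    )
    return "<ul>" + items + "</ul>"
```

Only the `?` placeholders are interpolated into the SQL text, so the user's words are always passed as bound parameters, and URLs are escaped before being written into the page.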

Demonstration (1)

• Fill the database

Demonstration (2)

• How to search pages?

Conclusion

• Results and perspectives:
  – Original search engine.
  – Easy to improve by adding units to process different file formats (ps, doc, xls, …).

• Teamwork and division of tasks.

• This project showed us how to use the different tools seen this year.
