AfterCollege Self-Service Scrape Configuration & Posting Utility Kai Hu Haiyan Wu May 14, 2009 @ Harney 235


AfterCollege Self-Service Scrape Configuration & Posting Utility

Kai Hu

Haiyan Wu

May 14, 2009 @ Harney 235


Presentation Outline

- Background & Motivation
- Goals
- Design
- Challenges
- Implementation Details
- Project Demonstration
- Future Extensions


AfterCollege Background

Customized career network for colleges & professional organizations across the country

Goal: Create a better way for job-seeking students and alumni to connect with the right employers


What's Already There?

- Manually created configuration files
- Crawler that runs periodically
- Job feed outputs to be posted online


Diagram components: Staff, Black Widow crawler, AfterCollege website; files: config.xml, jobFeed.xml


Limitations

- Scalability
- Expensive to maintain
- Requires technical knowledge
- Supports only GET requests
- Unable to handle dynamic websites


Design Overview

- GUI tool that assists staff through the configuration process
- Web proxy that captures user activities
- New crawler that uses both DOM and string pattern matching
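As a rough illustration of the DOM side of that matching, the sketch below reads one field by its position in the document tree. It is not taken from the project: the class name `DomPatternSketch`, the XPath expressions, and the sample markup are all hypothetical, and it assumes the page is well-formed XML, which real job pages often are not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper (not from the slides): locate a job field by its DOM path.
class DomPatternSketch {

    // Parses well-formed markup and returns the trimmed text at the given XPath.
    // Returns an empty string when the path matches nothing.
    static String extract(String xhtml, String xpathExpr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(xpathExpr, doc).trim();
        } catch (Exception e) {
            // Malformed markup or an invalid expression ends up here.
            throw new IllegalArgumentException("Could not evaluate " + xpathExpr, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page with a single job entry.
        String page = "<html><body>"
                + "<div class=\"job\"><h2>Software Engineer</h2><span>San Francisco, CA</span></div>"
                + "</body></html>";
        System.out.println(extract(page, "/html/body/div[1]/h2"));
        System.out.println(extract(page, "/html/body/div[1]/span"));
    }
}
```

A DOM path like this survives changes to the surrounding text but breaks when the page layout shifts, which is one reason to pair it with string patterns.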


Diagram components: GUI Tool, Web Proxy, New Crawler; files: config file, JSON files, jobFeed.xml


Design Overview

Diagram labels: config file, job feed


Advantages

- Eases the pain: non-technical staff can also configure the system
- Makes the configuration process more straightforward and easier to understand
- Less expensive to maintain: takes less than 10 minutes to reconfigure
- Supports POST
- Can be extended to support more complicated websites
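To make the POST point concrete, here is a minimal sketch of rebuilding a captured form submission as a POST request. The slides name Apache HttpClient for the proxy; this sketch swaps in the JDK's own `java.net.http` (Java 11+) so it runs without extra libraries. The class name, URL, and form fields are hypothetical, and the request is only built, never sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper (not from the slides): rebuild a form POST from captured fields.
class PostReplaySketch {

    // Encodes form fields as an application/x-www-form-urlencoded body.
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Builds (but does not send) a POST request carrying the encoded form body.
    static HttpRequest buildPost(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    public static void main(String[] args) {
        // Made-up search form; a real configuration would supply captured values.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("keyword", "software engineer");
        fields.put("page", "1");
        HttpRequest request = buildPost("http://example.com/jobs/search", fields);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(encodeForm(fields));
    }
}
```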


Challenges

- Come up with an easy-to-follow user interface
- Build a web proxy from scratch
- Distinguish patterns based on selected text
- Develop a crawler algorithm that handles job information residing on different pages
- Deal with tricky JavaScript
- Deal with embedded HTML pages
- Test crawling accuracy


Design Decisions

- Firefox plugin vs. web proxy
  - Integration with back-end
  - Ability to add functionality
- Dojo vs. YUI
  - Fade-in/out, drag & drop
  - Deals with different browsers
  - Documentation
- XML vs. JSON
  - Simplicity & efficiency of parsing
  - Availability of wrapper methods in YUI


Implementation Details

- GUI Tool
  - JS inserted into each page
  - YUI as the JS toolkit for the user interface
  - AJAX for communication with the web proxy
- Web Proxy
  - Java Servlet
  - Jetty as web/app server
  - Apache HttpClient
- Crawler
  - Regular expressions for pattern matching
  - Scrapes jobs on a per-page, per-field basis
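The crawler's per-page, per-field regular-expression matching could be sketched roughly as follows. This is an assumption about the general shape, not the project's code: the class `StringPatternSketch`, the field names, and the prefix/suffix markers are hypothetical. Each field gets one pattern that captures the text between the markers surrounding the value a user selected.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the slides): one regular expression per job field.
class StringPatternSketch {

    // Builds a pattern capturing the shortest text between two literal markers.
    static Pattern between(String prefix, String suffix) {
        return Pattern.compile(
                Pattern.quote(prefix) + "(.*?)" + Pattern.quote(suffix),
                Pattern.DOTALL);
    }

    // Applies each field's pattern to one page; fields without a match are skipped.
    static Map<String, String> scrape(String html, Map<String, Pattern> fieldPatterns) {
        Map<String, String> job = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : fieldPatterns.entrySet()) {
            Matcher matcher = entry.getValue().matcher(html);
            if (matcher.find()) {
                job.put(entry.getKey(), matcher.group(1).trim());
            }
        }
        return job;
    }

    public static void main(String[] args) {
        // Made-up markers; a real configuration would come from the GUI tool.
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        patterns.put("title", between("<h2 class=\"title\">", "</h2>"));
        patterns.put("location", between("<span class=\"loc\">", "</span>"));
        String page = "<h2 class=\"title\"> QA Engineer </h2>"
                + "<span class=\"loc\">San Jose, CA</span>";
        System.out.println(scrape(page, patterns));
    }
}
```

Quoting the markers with `Pattern.quote` keeps HTML punctuation from being read as regex syntax, and the reluctant `(.*?)` stops at the first closing marker instead of the last.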


Implementation Details


Add customized JavaScript to rendered HTML pages
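One way a proxy might add its script to a fetched page is to splice a `<script>` tag in just before the closing `</body>` tag. The sketch below is an assumption about the approach, not the project's servlet code; the class name and the script URL `/tool.js` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the slides): splice a script tag into fetched HTML.
class ScriptInjectionSketch {

    // Matches a closing body tag in any letter case.
    private static final Pattern BODY_CLOSE =
            Pattern.compile("</body\\s*>", Pattern.CASE_INSENSITIVE);

    // Inserts the script tag before the last </body>, or appends it if there is none.
    static String inject(String html, String scriptUrl) {
        String tag = "<script type=\"text/javascript\" src=\"" + scriptUrl + "\"></script>";
        Matcher matcher = BODY_CLOSE.matcher(html);
        int insertAt = -1;
        while (matcher.find()) {
            insertAt = matcher.start(); // remember the last closing body tag
        }
        if (insertAt < 0) {
            return html + tag;
        }
        return html.substring(0, insertAt) + tag + html.substring(insertAt);
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Job listings</p></body></html>";
        System.out.println(inject(page, "/tool.js"));
    }
}
```

Placing the tag at the end of the body lets the page's own content load before the configuration tool's script runs.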


Implementation Details


Rendered HTML source code


Implementation Details


Output content


Implementation Details

DOM Pattern


Implementation Details

String Pattern


Project Demonstration


Future Extensions

- Pagination
  - Add support for crawling multiple pages
- Tricky JavaScript
  - Find a solution to prevent redirection to a different domain
- Embedded pages
  - Add functionality to get the HTML content of embedded pages


Resources

- Course instructor: Dr. Jeff Buckwalter
- Sponsor: Steve Girolami, Perry Lee, & Saan Saeteurn
- Source code control system: Subversion
- Wiki site: knowledge share, work log, resource portal
  - http://cs690.wikispaces.com/
- Google group: discussion and information exchange medium
  - http://groups.google.com/group/desidae


Questions?


Thank you!