AfterCollege Self-Service Scrape Configuration & Posting Utility
Kai Hu
Haiyan Wu
May 14, 2009 @ Harney 235
Presentation Outline
Background & Motivation
Goals
Design Challenges
Implementation Details
Project Demonstration
Future Extensions
04/20/23 AfterCollege Scrape Utility
AfterCollege Background
Customized career network for colleges & professional organizations across the country
Goal: Create a better way for job-seeking students and alumni to connect with the right employers
What's Already There?
Manually created configuration files
Crawler that runs periodically
Job feed outputs to be posted online
(Diagram: Staff, config.xml, Black Widow crawler, jobFeed.xml, AfterCollege website)
Limitations
Scalability
Expensive to maintain
Requires technical knowledge
Supports only GET requests
Unable to handle dynamic websites
Design Overview
GUI Tool that assists staff through the configuration process
Web Proxy that captures user activities
New Crawler that uses both DOM & String Pattern matching
Design Overview
(Diagram: GUI Tool, Web Proxy, New Crawler, JSON files, config file, jobFeed.xml job feed)
Advantages
Eases the pain: non-technical staff can also configure the system
Makes the configuration process more straightforward and easier to understand
Less expensive to maintain: takes less than 10 minutes to reconfigure
Supports POST
Can be extended to support more complicated websites
Challenges
Design an easy-to-follow user interface
Build a web proxy from scratch
Distinguish patterns based on selected text
Develop a crawler algorithm that handles job information residing on different pages
Deal with tricky JavaScript
Deal with embedded HTML pages
Test crawling accuracy
Design Decisions
Firefox Plugin vs. Web Proxy
- Integration with back-end
- Ability to add functionality
Dojo vs. YUI
- Fade-In/Out, Drag & Drop
- Deals with different browsers
- Documentation
XML vs. JSON
- Simplicity & efficiency of parsing
- Availability of wrapper methods in YUI
Implementation Details
GUI Tool
- JS inserted into each page
- YUI for user interface, as JS toolkit
- AJAX for communication with web proxy
Web Proxy
- Java Servlet
- Jetty as web/app server
- Apache HttpClient
Crawler
- Regular expressions for pattern matching
- Scrapes jobs on a per-page, per-field basis
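The crawler's per-page, per-field scraping with regular expressions could look roughly like the sketch below. It is an illustration only, not the project's code: the class name `FieldScraper`, the field names, and the sample patterns are all hypothetical, and it assumes each field's value is captured by group 1 of that field's pattern.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: applies one regular expression per job field to a single page.
final class FieldScraper {
    // Field name -> compiled pattern; group 1 is assumed to capture the field value.
    private final Map<String, Pattern> fieldPatterns = new LinkedHashMap<>();

    void addField(String name, String regex) {
        // DOTALL lets a captured value span line breaks in the HTML source.
        fieldPatterns.put(name, Pattern.compile(regex, Pattern.DOTALL));
    }

    // Returns the first match per field; fields with no match are left out.
    Map<String, String> scrape(String html) {
        Map<String, String> job = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : fieldPatterns.entrySet()) {
            Matcher m = entry.getValue().matcher(html);
            if (m.find()) {
                job.put(entry.getKey(), m.group(1).trim());
            }
        }
        return job;
    }

    public static void main(String[] args) {
        FieldScraper scraper = new FieldScraper();
        // Sample patterns and markup are made up for this example.
        scraper.addField("title", "<h1 class=\"job-title\">(.*?)</h1>");
        scraper.addField("location", "<span id=\"loc\">(.*?)</span>");
        String page = "<html><body><h1 class=\"job-title\"> Software Engineer </h1>"
                + "<span id=\"loc\">San Francisco, CA</span></body></html>";
        System.out.println(scraper.scrape(page));
    }
}
```

Keeping one pattern per field means a broken pattern only loses that field rather than the whole job record.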
Add customized JavaScript to rendered HTML pages
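One way a proxy might add its own script to a fetched page is to splice a script tag in just before the closing body tag. The sketch below assumes that approach; the class name `ScriptInjector` and the script URL are hypothetical, and the slides do not say how the actual proxy performs the insertion.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: adds a script reference to an HTML page passing through a proxy.
final class ScriptInjector {
    private static final Pattern BODY_CLOSE =
            Pattern.compile("</body\\s*>", Pattern.CASE_INSENSITIVE);

    private ScriptInjector() {}

    // Inserts the script tag before the last </body>; appends it if the page has none.
    static String inject(String html, String scriptUrl) {
        String tag = "<script type=\"text/javascript\" src=\"" + scriptUrl + "\"></script>";
        Matcher m = BODY_CLOSE.matcher(html);
        int idx = -1;
        while (m.find()) {
            idx = m.start(); // remember the position of the last closing body tag
        }
        if (idx < 0) {
            return html + tag;
        }
        return html.substring(0, idx) + tag + html.substring(idx);
    }

    public static void main(String[] args) {
        // "/scrape-tool/gui.js" is a made-up path for this example.
        String page = "<html><body><p>Job list</p></body></html>";
        System.out.println(inject(page, "/scrape-tool/gui.js"));
    }
}
```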
Rendered HTML source code
Output content
DOM Pattern
String Pattern
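A string pattern can plausibly be derived from the text a staff member selects by treating the characters immediately before and after the selection as literal anchors. The sketch below shows that idea under stated assumptions: the class name `SelectionPattern`, the fixed `context` width, and the sample markup are hypothetical, and the slides do not describe the tool's actual algorithm.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns a user's text selection into a reusable string pattern.
final class SelectionPattern {
    private SelectionPattern() {}

    // Uses up to `context` characters on each side of the first occurrence as literal anchors.
    static Pattern fromSelection(String html, String selected, int context) {
        int start = html.indexOf(selected);
        if (start < 0) {
            throw new IllegalArgumentException("selection not found in page");
        }
        int end = start + selected.length();
        String prefix = html.substring(Math.max(0, start - context), start);
        String suffix = html.substring(end, Math.min(html.length(), end + context));
        // Pattern.quote keeps HTML punctuation from being read as regex syntax.
        return Pattern.compile(
                Pattern.quote(prefix) + "(.*?)" + Pattern.quote(suffix), Pattern.DOTALL);
    }

    // Returns the captured value from another page, or null if the pattern does not match.
    static String extract(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String configured = "<tr><td class=\"title\">Java Developer</td></tr>";
        Pattern p = fromSelection(configured, "Java Developer", 20);
        String other = "<tr><td class=\"title\">QA Analyst</td></tr>";
        System.out.println(extract(p, other));
    }
}
```

A pattern built this way only matches pages whose surrounding markup is identical, which is one reason to pair it with a DOM-based pattern.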
Project Demonstration
Future Extensions
Pagination: Add support to crawl multiple pages
Tricky JavaScript: Find a solution to prevent redirection to a different domain
Embedded Pages: Add functionality to get the HTML content of embedded pages
Resources
Course Instructor: Dr. Jeff Buckwalter
Sponsor: Steve Girolami, Perry Lee, & Saan Saeteurn
Source Code Control System: Subversion
Wiki Site: Knowledge share, work log, resource portal
- http://cs690.wikispaces.com/
Google Group: Discussion and information exchange medium
- http://groups.google.com/group/desidae
Questions?
Thank you!