semantic framework for web scraping

SEMANTIC SCRAPING MODEL FOR WEB RESOURCES by SHYJAL RAAZI

Upload: shyjal-raazi

Post on 13-May-2015

Category:

Technology


DESCRIPTION

A three-level model for semantic screen scraping of web resources. Best practices, well-known tools, and a case study for web scraping are also discussed.

TRANSCRIPT

Page 1: Semantic framework for web scraping

SEMANTIC SCRAPING MODEL FOR WEB RESOURCES

by

SHYJAL RAAZI

Page 2: Semantic framework for web scraping

AGENDA

What is scraping
Why we scrape
Where it is used
More on XPATH and RDF
Levels of scraping
  1. Scraping service level
  2. Syntactic level
  3. Semantic level
Case study
Tools
Best practices
Challenges

Page 3: Semantic framework for web scraping

Scraping: converting unstructured documents into structured information, or simply web content mining.

Page 4: Semantic framework for web scraping

More: a scraper is any program that retrieves structured data from the web and then transforms it to conform to a different structure.

Isn't that just ETL (extract, transform, load)? Or can't we just use regular expressions?

No, because ETL implies that there are rules and expectations, and neither exists in the world of web data. Site owners can change the structure of their dataset without telling you, or even take the dataset down.

Page 5: Semantic framework for web scraping

Why Scraping?

Data is usually not in the format we expect.

Get only what you are interested in.

Web pages contain a wealth of information in text form, designed mostly for human consumption.

Interfacing with third parties that offer no API access.

A website is often more accurate than its API.

No IP rate limiting

Anonymous access

Page 6: Semantic framework for web scraping

Where it is used

Developers use it as an interface to sites that have no API

Mining Web content

Online adverts

RSS readers

Web browsers

Page 7: Semantic framework for web scraping

Related terms

XML: A markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

RSS: RSS feeds let web publishers provide summaries and updates of their data automatically. They can be used for receiving timely updates from news or blog websites.

RDF: The Resource Description Framework (RDF) is a W3C standard for describing web resources, such as the title, author, modification date, content, and copyright information of a web page.

XPATH: A query language used to navigate through elements and attributes in an XML document.

Page 8: Semantic framework for web scraping

More on Resource Description Framework

• RDF is a framework for describing resources on the web.
• RDF is designed to be read and understood by computers.
• It is similar to the entity-relationship model.
• RDF is commonly written in XML (RDF/XML), although other serializations exist.
• RDF is based upon the idea of making statements about resources (in particular web resources) in the form of subject-predicate-object expressions.
• The notion "The sky has the color blue" is expressed in RDF as the triple: a subject denoting "the sky", a predicate denoting "has the color", and an object denoting "blue".
• A collection of RDF statements intrinsically represents a labeled, directed multi-graph.
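
The subject-predicate-object idea above can be sketched in plain Python. This is only an illustration with tuples, not an RDF library: the names `triples` and `match` are hypothetical, and the two extra statements about "the sea" are invented so that the graph has more than one edge.

```python
# A triple is (subject, predicate, object). A set of triples forms a labeled,
# directed multi-graph: subjects and objects are nodes, predicates label edges.
triples = {
    ("the sky", "has the color", "blue"),
    ("the sea", "has the color", "blue"),   # illustrative extra statement
    ("the sky", "is above", "the sea"),     # illustrative extra statement
}


def match(store, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Which subjects have the color blue?
blue_things = [s for s, _, _ in match(triples, p="has the color", o="blue")]
print(blue_things)
```

Querying by pattern with wildcards is roughly how triple stores are queried in practice, although real systems use a query language such as SPARQL rather than a list scan.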

Page 9: Semantic framework for web scraping

The objects are:
• "Eric Miller" (predicate: "whose name is"),
• [email protected] (predicate: "whose email address is"),
• "Dr." (predicate: "whose title is").

The subject is a URI. The predicates also have URIs. For example, the URI for each predicate:
• "whose name is" is http://www.w3.org/2000/10/swap/pim/contact#fullName,
• "whose email address is" is http://www.w3.org/2000/10/swap/pim/contact#mailbox,
• "whose title is" is http://www.w3.org/2000/10/swap/pim/contact#personalTitle.
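
As a rough sketch, the statements above could be written out in the line-based N-Triples syntax. The predicate URIs come from the slide; the subject URI `http://example.org/people/eric#me` is a made-up placeholder, the helper `to_ntriples` is hypothetical, and the mailbox statement is left out because the address is not reproduced here.

```python
CONTACT = "http://www.w3.org/2000/10/swap/pim/contact#"
SUBJECT = "http://example.org/people/eric#me"  # hypothetical subject URI

statements = [
    (SUBJECT, CONTACT + "fullName", "Eric Miller"),
    (SUBJECT, CONTACT + "personalTitle", "Dr."),
]


def to_ntriples(rows):
    """Serialize (URI, URI, plain string literal) rows as N-Triples lines."""
    lines = []
    for s, p, o in rows:
        # Escape backslashes and double quotes inside the literal.
        literal = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{literal}" .')
    return "\n".join(lines)


print(to_ntriples(statements))
```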

Page 10: Semantic framework for web scraping

More on XPATH

• XPATH uses path expressions to select nodes or node-sets in an XML document.
• XPATH includes over 100 built-in functions. There are functions for string values, numeric values, date manipulation and time comparison, node and name manipulation, sequences, Boolean values, and more.

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
  </book>
</bookstore>

Node types in this document:
• <bookstore> (root element node)
• <author>J K. Rowling</author> (element node)
• lang="en" (attribute node)
• J K. Rowling (atomic value)

Page 11: Semantic framework for web scraping

<bookstore>
  <book category="COOKING">
    <title lang="en">Italian</title>
    <author>Giada</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>

• Select all the titles: "/bookstore/book/title"

• Select price nodes with price > 35: "/bookstore/book[price>35]/price" (no node matches in this sample, since both prices are below 35)

• Select the title of the first book: "/bookstore/book[1]/title"
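
The three queries above can be tried with Python's standard library. This is a sketch under one caveat: `xml.etree.ElementTree` implements only a subset of XPath, paths are written relative to the root element, and numeric comparisons such as `price>35` are not supported, so that filter is done in Python here. A full XPath engine would accept the expressions exactly as written on the slide.

```python
import xml.etree.ElementTree as ET

DOC = """<bookstore>
  <book category="COOKING">
    <title lang="en">Italian</title>
    <author>Giada</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>"""

root = ET.fromstring(DOC)  # root is the <bookstore> element

# /bookstore/book/title, written relative to the root element
all_titles = [t.text for t in root.findall("book/title")]

# /bookstore/book[price>35]/price: the comparison is done in Python because
# ElementTree's XPath subset has no numeric predicates.
expensive = [p.text for p in root.findall("book/price") if float(p.text) > 35]

# /bookstore/book[1]/title: XPath positions are 1-based
first_title = root.find("book[1]/title").text

# Attribute node value of the first title
first_lang = root.find("book[1]/title").get("lang")

print(all_titles, expensive, first_title, first_lang)
```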

Page 12: Semantic framework for web scraping

SCRAPING Framework

The model considers three levels of abstraction, which together form an integrated model for semantic scraping.

Page 13: Semantic framework for web scraping

#1 : Syntactic scraping level.

This level supports the interpretation of the semantic scraping model. It defines the technologies required to extract data from web resources. Wrapping and extraction techniques, such as DOM selectors, are defined at this level for use by the semantic scraping level.

Page 14: Semantic framework for web scraping

Techniques in syntactic level

Cascading Style Sheets (CSS) selectors.

XPATH selectors.

URI patterns.

Visual selectors.
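
Two of the techniques listed above, URI patterns and CSS-style selectors, can be sketched with the standard library alone. Everything specific here is an assumption for illustration: the site `news.example.org`, the `h2.headline` markup, and the `ClassSelector` helper, which handles only the simple `tag.class` case and assumes no unclosed void elements (such as `<br>`) inside a matched element. A real scraper would more likely use a dedicated selector library.

```python
import re
from html.parser import HTMLParser

# URI pattern: decides which pages an extractor applies to (hypothetical site).
ARTICLE_URI = re.compile(r"^https?://news\.example\.org/sports/\d{4}/[\w-]+$")


class ClassSelector(HTMLParser):
    """Collect the text of elements matching tag.class, e.g. h2.headline."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0       # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1  # nested element inside a match
        elif tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


html = '<div><h2 class="headline big">Team wins <em>final</em></h2><h2>Other</h2></div>'
selector = ClassSelector("h2", "headline")
selector.feed(html)
headlines = selector.results
print(headlines)
```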

Page 15: Semantic framework for web scraping

Syntactic cont..

Selectors at the syntactic scraping level make it possible to identify HTML nodes. Either a generic element or an identified element can be selected using these techniques. Their semantics are defined at the next scraping level, which maps data in HTML fragments to RDF resources.

Page 16: Semantic framework for web scraping

#2 : Semantic scraping level.

This level defines a model that maps HTML fragments to semantic web resources. By using this model to define the mapping of a set of web resources, the data from the web is made available as a knowledge base to scraping services.

• The model is applied to the definition of extractors of web resources.

• The proposed vocabulary serves as the link between an HTML document's data and RDF data by defining a model for scraping agents. With this RDF model, it is possible to build an RDF graph of HTML nodes from a given HTML document. The model connects the top and lowest levels of the scraping framework to the semantic scraping level.
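
The mapping idea can be illustrated with a small sketch in which each selector is tied to an RDF predicate and the selected text becomes the object of a triple. The vocabulary namespace `http://example.org/news#`, the markup, and the function `fragment_to_triples` are all hypothetical and are not the vocabulary proposed in the referenced paper. The fragment is also assumed to be well-formed, which real HTML often is not.

```python
import xml.etree.ElementTree as ET

NS = "http://example.org/news#"  # hypothetical vocabulary namespace

# Hypothetical mapping: selector (ElementTree path) -> predicate URI.
MAPPING = {
    ".//h2[@class='headline']": NS + "title",
    ".//span[@class='place']": NS + "location",
}


def fragment_to_triples(fragment, subject, mapping):
    """Turn the text of each selected node into a (subject, predicate, object) triple."""
    root = ET.fromstring(fragment)
    triples = []
    for path, predicate in mapping.items():
        for node in root.findall(path):
            text = "".join(node.itertext()).strip()
            triples.append((subject, predicate, text))
    return triples


FRAGMENT = """<div class="news">
  <h2 class="headline">Team wins final</h2>
  <span class="place">Madrid</span>
</div>"""

news_triples = fragment_to_triples(FRAGMENT, "http://example.org/news/1", MAPPING)
for triple in news_triples:
    print(triple)
```

Keeping the mapping as data rather than code means that a change in a site's markup only requires updating the selectors, not the extractor logic.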

Page 17: Semantic framework for web scraping

Semantic scraping cont..

Page 18: Semantic framework for web scraping

#3 : Scraping service level.

This level comprises services that make use of semantic data extracted from unannotated web resources. Services that can benefit from this kind of data include opinion miners, recommenders, and mashups that index and filter pieces of news.

Scraping technologies allow getting wider access to data from the web for these kinds of services.

Page 19: Semantic framework for web scraping

Steps to make a service

Scraping data identification.

Data modelling.

Extractor generalization.

Page 20: Semantic framework for web scraping

Case study

Scenario: the goal is to show the most commented sports news on a map, placed according to where each story took place.

Page 21: Semantic framework for web scraping

Challenges:
• The lack of semantic annotations in the sports news web sites.
• The potential semantic mismatch among these sites.
• The potential structural mismatch among sites.
• Sites do not provide microformats, and do not include some relevant information in their RSS feeds, such as location, users' comments or ratings.

Approach:
• Defining the data schema to be extracted from selected sports news web sites.
• Defining and implementing these extractors/scrapers. Recursive access is needed for some resources. For instance, a piece of news may show up as a title and a brief summary on a newspaper's homepage, but offer the whole content (including location, authors, and more) at its own URL.
• Defining the mashup by specifying the sources.
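
The recursive access described in the approach can be sketched as a two-step scrape: read the links on the home page, then follow each one for the full details. To stay self-contained the sketch uses a hypothetical in-memory site (`PAGES`) in place of HTTP requests, and the markup, field names, and functions are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory "site": URL -> well-formed markup.
# A real scraper would fetch these pages over HTTP instead.
PAGES = {
    "/": '<ul><li><a href="/news/1">Derby</a></li>'
         '<li><a href="/news/2">Cup final</a></li></ul>',
    "/news/1": '<div><h1>Derby</h1><span class="place">Milan</span>'
               '<span class="comments">45</span></div>',
    "/news/2": '<div><h1>Cup final</h1><span class="place">Madrid</span>'
               '<span class="comments">120</span></div>',
}


def scrape_article(url, fetch):
    """Extract the full record, which only the article's own URL provides."""
    page = ET.fromstring(fetch(url))
    return {
        "url": url,
        "title": page.findtext("h1"),
        "place": page.findtext("span[@class='place']"),
        "comments": int(page.findtext("span[@class='comments']")),
    }


def scrape_site(home_url, fetch):
    """Collect links from the home page, follow each, rank by comment count."""
    home = ET.fromstring(fetch(home_url))
    links = [a.get("href") for a in home.iter("a")]
    articles = [scrape_article(link, fetch) for link in links]
    return sorted(articles, key=lambda a: a["comments"], reverse=True)


ranked = scrape_site("/", PAGES.__getitem__)
for article in ranked:
    print(article["comments"], article["title"], article["place"])
```

Passing `fetch` in as a parameter keeps the extraction logic testable offline; the `place` field is what a mashup would geocode to position each story on the map.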

Page 22: Semantic framework for web scraping

Case study visualization

Page 23: Semantic framework for web scraping

Other scrape tools

Beautiful Soup
Mechanize
Firefinder
http://open.dapper.net by Yahoo

Page 26: Semantic framework for web scraping

Visual scraper: Firefinder

Page 27: Semantic framework for web scraping

Best practices

Page 28: Semantic framework for web scraping

#1: Approximate web behavior

Page 29: Semantic framework for web scraping

#2: Batch jobs in non-peak hours
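
Both practices can be sketched with the standard library. The off-peak window, header values, and delay range below are assumptions chosen for illustration rather than recommended settings, and the request is only built, not sent.

```python
import random
import urllib.request
from datetime import datetime

# Assumed off-peak window for the target site, in its local time (hypothetical).
OFF_PEAK_HOURS = range(1, 6)  # 01:00 to 05:59


def is_off_peak(now):
    """True when a batch job may run, according to the assumed window."""
    return now.hour in OFF_PEAK_HOURS


def browser_like_request(url):
    """Build (not send) a request carrying headers a browser normally includes."""
    return urllib.request.Request(url, headers={
        "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",  # illustrative
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.8",
    })


def polite_delay(rng=random):
    """Seconds to wait between requests, jittered so timing is not machine-regular."""
    return rng.uniform(2.0, 6.0)


request = browser_like_request("http://example.org/")
print(is_off_peak(datetime.now()), request.header_items(), polite_delay())
```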

Page 30: Semantic framework for web scraping

Challenges

External sites can change without warning.

Figuring out how often sites change is difficult, and any change can easily break a scraper

Bad HTTP status codes

Cookie checks and referrer checks

Messy HTML markup

Data Piracy
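
One common way to cope with bad HTTP status codes is to retry transient failures with an increasing delay and to fail loudly on everything else. The sketch below assumes a caller-supplied `fetch(url)` that returns a `(status, body)` pair; that interface, the `RETRYABLE` set, and the `ScrapeError` name are illustrative choices.

```python
import time

# Status codes treated as transient in this sketch (an assumption).
RETRYABLE = {429, 500, 502, 503, 504}


class ScrapeError(Exception):
    """Raised when a page cannot be retrieved."""


def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body); retry transient failures with backoff."""
    for attempt in range(attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status not in RETRYABLE:
            raise ScrapeError(f"{url}: unexpected status {status}")
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise ScrapeError(f"{url}: gave up after {attempts} attempts")


# Usage with a fake fetcher that fails twice and then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    return (503, "") if len(calls) < 3 else (200, "ok")


waits = []
body = fetch_with_retry(flaky_fetch, "/news/1", sleep=waits.append)
print(body, waits)
```

Raising on unexpected codes, instead of silently returning an empty page, makes a broken scraper visible as soon as a site changes.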

Page 31: Semantic framework for web scraping

Conclusion

• With plain text, we give ourselves the ability to manipulate knowledge, both manually and programmatically, using virtually every tool at our disposal.
• The problem behind web information extraction and screen scraping has been outlined, and the main approaches to it have been summarized. The lack of an integrated framework for scraping data from the web has been identified as a problem, and a framework that tries to fill this gap has been presented.
• With such a framework, developers can effectively have an API for each and every website.

Page 32: Semantic framework for web scraping

References

A SEMANTIC SCRAPING MODEL FOR WEB RESOURCES, by José Ignacio Fernández-Villamor, Jacobo Blasco-García, Carlos Á. Iglesias, Garijo

Page 33: Semantic framework for web scraping

THANK YOU
