A First Experience in Archiving the French Web

Page 1: A First Experience in Archiving the French Web

A First Experience in Archiving the French Web

Serge Abiteboul, Grégory Cobéna (INRIA)
Julien Masanès (BnF)
Gérald Sédrati (Xyleme)

Page 2: A First Experience in Archiving the French Web

BDA2002 Evry - 23/10/2002

Grégory Cobéna (INRIA)

Organization

- Web Archiving
  - Dépôt Légal (legal deposit)
  - Goal and scope
  - Similar Projects
- Building the Archive
  - Frontier of the French Web
  - Site vs. Page archiving
  - Data acquisition
- Importance of pages
  - Site-based importance
  - New measures
  - Experiments on ranking
- Representing Changes

Page 3: A First Experience in Archiving the French Web

Web Archiving

Page 4: A First Experience in Archiving the French Web

Dépôt légal (legal deposit)

- Today
  - Books have been archived since 1537, following a decision by King François I
  - The Web is an important and valuable source of information
- What is different?
  - The number of content providers (≈300,000 site publishers vs. 5,000 traditional publishers)
  - The quantity of information (millions of pages, plus video and audio content)
  - The quality of information (much of the information is not meaningful)
  - The relationship with the editors (freedom of publication vs. the traditional "push" model)
  - Updates and changes occur continuously
  - The perimeter "contents published in France" does not apply easily to the Web

Page 5: A First Experience in Archiving the French Web

Goal and Scope

- Provide future generations with a representative archive of cultural production, for cultural, political, and sociological studies, etc.
- The mission is to archive a wide range of material because nobody knows what will be of interest for future research
- In traditional publication, publishers filter the contents. On the Web, the issue of selection arises again

Page 6: A First Experience in Archiving the French Web

Similar Projects

- Human-selection-based approach
  - Select a few hundred sites and choose an archiving periodicity
  - Australia [15] and Canada [11]
- The Nordic experience
  - Use a robot crawler to archive a significant part of the surface Web
  - Sweden, Finland, Norway [2]
  - Problems are:
    - Lack of updates of archived pages between two snapshots
    - The deep or invisible Web [17, 3]

Page 7: A First Experience in Archiving the French Web

Orientation of this experiment

- Goals:
  - Cover a large portion of the Web
    - Automatic content gathering is necessary
  - Adapt robots to provide a continuous archiving facility
    - Have frequent versions of the sites, at least for the most "important" ones
- Research:
  - The notion of "important" sites
  - Building a coherent Web archive
  - Discover and manage important sources of deep Web

Page 8: A First Experience in Archiving the French Web

Building the Archive

Page 9: A First Experience in Archiving the French Web

The frontier of the French Web

- The perimeter of the French Web is "contents edited in France"
- Many criteria may be used:
  - The French language
    - but many French sites use English
    - other French-speaking countries or regions (e.g. Quebec) use French
  - Domain name or resource locators: .fr sites, but also .com or .org
  - Address of the site: physical location of the Web servers or address of the owner
- Other criteria: BnF has little interest in commercial sites
- A purely librarian-driven approach does not scale, and a purely automatic one does not work: the process should involve librarians and their expertise

Page 10: A First Experience in Archiving the French Web

Site vs. Page archiving

- The Web:
  - Physical granularity = HTML pages (+ layout, images, …)
  - The problem is inconsistent data and links
    - Read page P; one week later, read the pages pointed to by P: they may not exist anymore
  - Logical granularity?
- Snapshot view of a Web site
  - What is a site?
    - INRIA is www.inria.fr + www-rocq.inria.fr + …
    - www.free.fr hosts many different sites
  - There are technical issues (rapid firing, …)
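The "What is a site?" question can be made concrete with a small sketch. The heuristic below (grouping hosts by their last two labels) is only an illustrative assumption, not the method used in the experiment, and the function names are hypothetical. It merges the INRIA hosts as intended, but the same key cannot tell apart the many sites hosted at www.free.fr, which is exactly the difficulty raised on this slide.

```python
from urllib.parse import urlsplit

def host_of(url):
    """Return the lowercase host name of a URL (empty string if absent)."""
    return (urlsplit(url).hostname or "").lower()

def naive_site_key(url):
    """Group hosts by their last two labels (a crude, assumed heuristic)."""
    return ".".join(host_of(url).split(".")[-2:])

urls = [
    "http://www.inria.fr/index.html",
    "http://www-rocq.inria.fr/",
    "http://www.free.fr/",
]
keys = [naive_site_key(u) for u in urls]
print(keys)
```

A real archive would need more than a host-name rule here, which is consistent with the plan stated in the conclusion to refine the notion of site.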

Page 11: A First Experience in Archiving the French Web

Data acquisition

- Crawl
  - For these experiments, we used the Xyleme [19] crawler
- Discovery
  - The Web is more than 2 billion pages
  - The French Web is about 20 million URLs
  - First experiments used <*.fr>
- Refresh
  - Based on the change rate of the data: we use a site change rate based on the pages' change rate
  - Important pages are refreshed more often
  - The change rate of pages is unknown

Page 12: A First Experience in Archiving the French Web

Importance of pages

Page 13: A First Experience in Archiving the French Web

What is page importance?

- The "Le Louvre" homepage is more important than an unknown person's homepage
- Important pages are pointed to by:
  - Other important pages
  - Many unimportant pages
- Can be compared to bibliographical references
- This leads to Google's [5] definition of PageRank
  - Based on the graph and link structure
  - Used with remarkable success
- Useful, but not sufficient for Web archiving. We need to use other criteria as well

Page 14: A First Experience in Archiving the French Web

Page Importance Computation

- Importance
  - Link matrix L
  - Page importance is the fixpoint X of the equation L*X = X (i.e. important pages are pointed to by important pages)
  - Storing the link matrix and computing page importance uses lots of resources
- We developed [1] a new efficient technique to compute the fixpoint
  - Without having to store the link matrix
  - The technique adapts automatically to changes
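As a minimal sketch of the fixpoint L*X = X, the following uses plain power iteration on a tiny, made-up graph. It is not the technique of [1], which avoids storing the link matrix; here the links are stored explicitly for clarity. The function name, the uniform weighting of out-links, and the toy graph are all assumptions for illustration.

```python
def page_importance(links, iterations=100):
    """Approximate the fixpoint X of L*X = X by power iteration.

    links maps each page to the list of pages it points to.
    Assumptions: every page has at least one out-link, and the graph is
    strongly connected and aperiodic, so the iteration converges.
    """
    pages = sorted(links)
    x = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for p, targets in links.items():
            share = x[p] / len(targets)  # p splits its importance evenly
            for q in targets:
                nxt[q] += share
        x = nxt
    return x

# Hypothetical three-page graph.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
importance = page_importance(toy_graph)
print(importance)
```

In this toy graph, page "c" ends up as important as "a" because it is pointed to by both other pages, while "b" receives only half of the importance of "a".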

Page 15: A First Experience in Archiving the French Web

Using a stronger link semantics

- Limitations of page importance
  - Traditional page importance works well when links have strong semantics (e.g. the author links to Web pages that he likes)
  - More and more Web pages are automatically generated and most links have little semantics
  - Refresh at the page level presents drawbacks
- So we use the link topology between sites and not only between pages
- We also use the internal structure of a Web site to determine which links are more important

Page 16: A First Experience in Archiving the French Web

Site-based importance

- The "random walk" model is used to determine the internal link structure of a site and assign an importance to each link
- From there we define links between Web sites as follows:

  $$ L'[Y, Z] = \sum_{p \in Y,\; q \in Z} \frac{I[p]}{\sum_{p' \in Y} I[p']} \, L[p, q] $$

  where I[p] is the importance of page p and L[p, q] is the link from page p to page q
- Using the standard importance definition and the "random walk" model, the importance of a Web site is exactly the sum of the importance of its pages
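The claim that a site's importance equals the sum of its pages' importance can be checked numerically. The sketch below takes the site-level link definition to be L'[Y,Z] = (sum over p in Y, q in Z of I[p] * L[p,q]) / (sum over p' in Y of I[p']), and assumes the convention that L[p][q] is the probability of following a link from p to q, so that importance is the stationary vector of the random walk. The four-page graph, the site assignment, and the helper name are hypothetical.

```python
def stationary(matrix, nodes, iterations=200):
    """Power iteration for x = x * M, with M[p][q] the weight of link p -> q."""
    x = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for p in nodes:
            for q, w in matrix[p].items():
                nxt[q] += x[p] * w
        x = nxt
    return x

# Hypothetical page-level link matrix; each row sums to 1.
L = {
    "y1": {"y2": 0.5, "z1": 0.5},
    "y2": {"y1": 1.0},
    "z1": {"y1": 0.5, "z2": 0.5},
    "z2": {"z1": 0.5, "y1": 0.5},
}
site_of = {"y1": "Y", "y2": "Y", "z1": "Z", "z2": "Z"}
pages = sorted(L)
sites = sorted(set(site_of.values()))

I = stationary(L, pages)

# Total page importance per site (the denominator of L').
site_total = {s: sum(I[p] for p in pages if site_of[p] == s) for s in sites}

# L'[Y][Z] = sum_{p in Y, q in Z} I[p] * L[p][q] / sum_{p' in Y} I[p']
L_site = {s: {t: 0.0 for t in sites} for s in sites}
for p in pages:
    for q, w in L[p].items():
        L_site[site_of[p]][site_of[q]] += I[p] * w / site_total[site_of[p]]

I_site = stationary(L_site, sites)
print(I_site, site_total)
```

On this toy graph the importance computed on the site graph matches the per-site sums of page importance, which is the property stated on the slide.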

Page 17: A First Experience in Archiving the French Web

New criteria for importance

- Frequent use of infrequent words (find pages dedicated to a specific topic)

  $$ I_p = \sum_{\text{each word } w} \frac{f_{p,w}}{f_{web,w}}, \quad \text{where } f_{p,w} = \frac{n_{p,w}}{N_p} $$

- Text weight (find pages with text content vs. raw data pages)
- Others
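The "frequent use of infrequent words" criterion can be sketched as follows, reading the measure as the sum, over the words of a page, of the word's frequency in the page divided by its frequency on the Web. Interpreting n as a word count and N as the page length, summing over distinct words, and using a two-page toy corpus as a stand-in for the Web are all assumptions made for illustration.

```python
from collections import Counter

def word_frequencies(words):
    """Return f[w] = n[w] / N for a list of words."""
    counts = Counter(words)
    total = len(words)
    return {w: n / total for w, n in counts.items()}

def infrequent_word_score(page_words, web_freq):
    """Sum over the page's distinct words of f_page[w] / f_web[w]."""
    page_freq = word_frequencies(page_words)
    return sum(f / web_freq[w] for w, f in page_freq.items() if w in web_freq)

# Hypothetical toy corpus standing in for "the Web".
corpus = {
    "topical": "louvre museum louvre paintings louvre".split(),
    "generic": "the the the museum the".split(),
}
web_freq = word_frequencies([w for words in corpus.values() for w in words])
scores = {name: infrequent_word_score(words, web_freq)
          for name, words in corpus.items()}
print(scores)
```

The page that repeats a word which is rare in the corpus scores higher than the page made mostly of a common word, which is the behaviour the criterion is meant to capture.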

Page 18: A First Experience in Archiving the French Web

Validation of the "notoriety" parameter

- Blind experiment with 8 librarians
- A list of 900 sites with the notoriety parameter provided by Xyleme
- 236 sites remained after exclusion of commercial sites and sites no longer existing at the time of the test

Page 19: A First Experience in Archiving the French Web

[Figure: chart titled DISCRIMINATION, with series NOTORIETE (notoriety) and Random choice; axis ticks from -3 to 5]

Does the ranking correlate with the librarians' choices?

Page 20: A First Experience in Archiving the French Web

A model for Representing Changes

Move from a discrete snapshot-type archive to a more continuous one

Page 21: A First Experience in Archiving the French Web

Representing changes

- Goals
  - Provide a historical view of the Web
- Issues
  - Have a persistent identification of Web pages using their URL and date of crawl
  - Support temporal queries and provide means to efficiently access data
  - Handle mirror sites in order to save resources

Page 22: A First Experience in Archiving the French Web

The "Site-Delta" representation of changes

- The "site-delta" is an XML document
- It is used to manage metadata about documents, in particular temporal metadata
- Important aspects are:
  - Storage efficiency
    - Keep crawled information without duplicates
    - Use diff to understand changes when possible
  - Management of versions and updates
  - Support for queries and browsing
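The storage-efficiency point (keep crawled information without duplicates) can be illustrated with a short sketch that builds a site-delta-like document from successive crawls of one page. The element and attribute names follow the example shown at the end of this deck. The hash-based change detection, the file naming scheme, and the function name are assumptions, not the authors' implementation.

```python
import hashlib
import xml.etree.ElementTree as ET

def build_site_delta(site_url, page_url, crawls):
    """Build a site-delta-like XML element from (date, content) crawls.

    A new file is recorded only when the content hash differs from the
    previously stored version; otherwise the earlier file is referenced.
    """
    website = ET.Element("website", url=site_url)
    page = ET.SubElement(website, "page", url=page_url)
    last_hash, last_file = None, None
    for date, content in crawls:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest != last_hash:
            status = "updated"
            last_file = f"/data/{digest[:8]}.htm"  # hypothetical naming scheme
            last_hash = digest
        else:
            status = "unchanged"
        ET.SubElement(page, "document", date=date, status=status, file=last_file)
    return website

# Hypothetical crawl history: the third crawl finds the same content.
crawls = [
    ("2002-jan-04", "<html>v1</html>"),
    ("2002-jan-22", "<html>v2</html>"),
    ("2002-mar-02", "<html>v2</html>"),
]
delta = build_site_delta("www.inria.fr", "/index.html", crawls)
print(ET.tostring(delta, encoding="unicode"))
```

The third crawl is recorded as "unchanged" and points to the file stored for the second crawl, so no duplicate copy is kept.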

Page 23: A First Experience in Archiving the French Web

Browsing the Archive

- The archive must be prepared in several steps:
  - Use local links instead of Internet links (problems occur with JavaScript, with sessions, …)
  - Fix inconsistent data and links
  - Integrate the notion of time in links
- Advanced: summarize several snapshots of data into a single document?
  - Consider for example the news site www.lemonde.fr
  - We want to give access to all news articles in January 2002 (and their versions)

Page 24: A First Experience in Archiving the French Web

Conclusion and Experiments

- A crawl of the Web was performed
  - We used between 2 and 8 PCs for the Xyleme crawlers
  - We looked at more than 1 billion (most interesting) pages
  - We discovered 15 million *.fr pages (about 1.5%)
  - We discovered 150,000 *.fr sites
  - Discovery and refresh are based on page importance; the change rate of pages is also taken into account
- We analyzed the relevance of page importance for librarians
  - Comparison with ranking by librarians
  - Strong correlation with their rankings
- Next we plan to use classification and clustering techniques to refine the notion of site

Page 25: A First Experience in Archiving the French Web

Thank you

Page 26: A First Experience in Archiving the French Web

Example

<website url="www.inria.fr">
  <page url="/index.html">
    <document date="2002-jan-04" status="updated" file="/data/fV453.htm"/>
    <document date="2002-jan-22" status="updated" file="/data/hX678.htm"/>
    <document date="2002-mar-02" status="unchanged" file="/data/hX678.htm"/>
    …
  </page>
  …
</website>
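A temporal query over such a site-delta might look like the sketch below: given a page URL and a date, return the file archived at the latest crawl on or before that date. The query function and the date parser are hypothetical helpers. The sample document repeats the three entries of the example and leaves out the elided ones.

```python
import xml.etree.ElementTree as ET
from datetime import date

MONTHS = {m: i + 1 for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}

def parse_date(text):
    """Parse dates written like '2002-jan-04'."""
    year, month, day = text.split("-")
    return date(int(year), MONTHS[month.lower()], int(day))

def version_at(site_delta, page_url, when):
    """Return the file for page_url at the latest crawl on or before `when`."""
    best = None
    for page in site_delta.findall("page"):
        if page.get("url") != page_url:
            continue
        for doc in page.findall("document"):
            crawled = parse_date(doc.get("date"))
            if crawled <= when and (best is None or crawled > best[0]):
                best = (crawled, doc.get("file"))
    return best[1] if best else None

SITE_DELTA = """
<website url="www.inria.fr">
  <page url="/index.html">
    <document date="2002-jan-04" status="updated" file="/data/fV453.htm"/>
    <document date="2002-jan-22" status="updated" file="/data/hX678.htm"/>
    <document date="2002-mar-02" status="unchanged" file="/data/hX678.htm"/>
  </page>
</website>
"""
root = ET.fromstring(SITE_DELTA)
print(version_at(root, "/index.html", date(2002, 2, 1)))
```

Because an "unchanged" entry points to the same file as the previous version, a query for any date from 22 January 2002 onward resolves to a single stored copy.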