Enabling Webscale Research in Europe

Julien Masanès

European Archive Foundation

julien@europarchive.org

Consultation Workshop, Brussels, 19/1/2010

‘Webtables’

Using over 14 billion pages from the web to extract tables.

[1] M. Cafarella, A. Halevy, D. Wang, E. Wu, and Y. Zhang, “WebTables: exploring the power of tables on the web,” Proc. VLDB Endow., vol. 1, 2008, pp. 538-549.
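
As a rough illustration of the first step in this kind of work, pulling raw tables out of a single HTML page might look like the sketch below. It is not the method of the WebTables paper: the class `TableExtractor` and the function `extract_tables` are hypothetical names, only the Python standard library is used, and nested tables are not handled faithfully (an inner table ends the outer one).

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> of a page as a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # rows of the table currently open, or None
        self._row = None    # cells of the row currently open, or None
        self._cell = None   # text fragments of the cell currently open, or None

    def _close_cell(self):
        if self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self._table.append(self._row)
            self._row = None

    def _finish_table(self):
        self._close_row()
        if self._table is not None:
            self.tables.append(self._table)
            self._table = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._finish_table()  # simplification: no real nesting support
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._close_row()     # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()    # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()
        elif tag == "table":
            self._finish_table()


def extract_tables(html):
    """Return every table found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    parser._finish_table()  # flush a table left open at end of input
    return parser.tables


sample = (
    "<p>intro</p><table><tr><th>City</th><th>Pop</th></tr>"
    "<tr><td>Brussels</td><td> 1.2  M</td></tr></table>"
)
tables = extract_tables(sample)
```

At web scale the hard part is not this parsing step but running it over billions of pages and filtering out tables used only for layout, which this sketch does not attempt.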

Data driven science

•Here is the evidence, now what is the hypothesis?

•«A now-common strategy in post-genomic biology is to measure, quantitatively, the action of all (or as many as possible) of the genes at the level of the transcriptome, proteome, metabolome and phenotype, and to use computerised methods to infer gene function via various kinds of pattern recognition techniques»

D.B. Kell and S.G. Oliver, “Here is the evidence, now what is the hypothesis? The complementary roles of inductive and hypothesis-driven science in the post-genomic era,” BioEssays, vol. 26, 2004, pp. 99-105.

Webscale data

•The web is a unique point of access to media content of all sorts, which a growing number of scientific communities, agencies and industries need to mine at large scale.

•The ability to acquire, process and mine large scale data from the web is becoming a strategic advantage in many domains from business intelligence to epidemiological tracking and monitoring.

Research engine

•Key infrastructure to monitor and analyze the evolution of networked media

•More broadly, will become a key tool for research in more and more domains:

•a low-noise signal of ecological evolution

•economic trends

•the emergence of new research fields

•tracking of reputation on the web

•etc.

An example: Ecological monitoring

Victor Galaz, Beatrice Crona, Tim Daw, Örjan Bodin, Magnus Nyström, and Per Olsson, “Can web crawlers revolutionize ecological monitoring?,” Mar. 2009.

•«mining the internet to detect “early-warning” signs that may signal abrupt ecological changes»

Who can do research on Webscale data?

•Webscale is already proving to be a challenge for many research groups, as the infrastructure, the cost and the skills required represent a significant barrier to entry.

•But when it comes to doing this through time, almost no one (apart from a few large search engines) can do it at all.

•In other words, only large search engines (none of them European) are able to do research at this scale, which reinforces their lead as they develop and test new algorithms for search, ranking, mining, etc.

Research challenges (1)

• Building an open, neutral and sustainable virtual observatory of the web for research in Europe requires:

• large-scale crawling, storage and indexing of web data (10+ petabytes), not limited to text.

• We know how to work at terabyte scale, but not yet at petabyte scale.
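
To give a feel for the gap between terabyte and petabyte scale, the back-of-the-envelope sketch below estimates how long a single sequential pass over the data would take. The 100 MB/s per-disk read rate and the 1,000-disk cluster are illustrative assumptions, not figures from the talk, and `scan_time_seconds` is a hypothetical helper.

```python
TB = 10**12  # bytes in a terabyte (decimal)
PB = 10**15  # bytes in a petabyte (decimal)

# Assumption: sustained sequential read rate of one disk, in bytes per second.
DISK_THROUGHPUT = 100 * 10**6


def scan_time_seconds(n_bytes, bytes_per_second_per_disk, n_disks=1):
    """Time for one full sequential pass, assuming perfect parallelism."""
    return n_bytes / (bytes_per_second_per_disk * n_disks)


scenarios = [
    ("1 TB on 1 disk", 1 * TB, 1),
    ("10 PB on 1 disk", 10 * PB, 1),
    ("10 PB on 1000 disks", 10 * PB, 1000),
]
for label, size, disks in scenarios:
    hours = scan_time_seconds(size, DISK_THROUGHPUT, disks) / 3600
    print(f"{label}: {hours:,.1f} hours")
```

Under these assumptions a terabyte can be read on one machine in an afternoon, whereas ten petabytes only become tractable once the data is spread over a large cluster, which is where placement, failures and cost start to dominate.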

Research challenges (2)

• Create baseline distributed analytics services (large-scale IE, NLP, distributed and efficient processing and storage).

• We need to standardize and define baselines in this domain to create a platform for MMSE, social media research, etc.

• Hadoop-style abstractions over internet-wide repository/processing clouds

• Optimized data placement (partitioning and replication) for analytics

• Distributed indices
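
The last three points can be sketched together in a few lines. The toy below runs in a single process and only mimics the shape of a Hadoop-style job: it is not Hadoop's API, and the names `partition`, `replicas` and `map_reduce`, the ring-style replica placement and the example pages are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def partition(key, n_partitions):
    """Stable hash partitioning: a given key always lands in the same partition."""
    return zlib.crc32(key.encode("utf-8")) % n_partitions


def replicas(part, n_nodes, n_replicas):
    """Place a partition on n_replicas distinct nodes by walking a ring of nodes."""
    if n_replicas > n_nodes:
        raise ValueError("cannot place more replicas than there are nodes")
    return [(part + i) % n_nodes for i in range(n_replicas)]


def map_reduce(records, mapper, reducer, n_partitions=4):
    """Run mapper over records, shuffle by key partition, then reduce per key."""
    # Map and shuffle: each partition would live on a different node in practice.
    shuffled = [defaultdict(list) for _ in range(n_partitions)]
    for record in records:
        for key, value in mapper(record):
            shuffled[partition(key, n_partitions)][key].append(value)
    # Reduce: each partition can be processed independently of the others.
    result = {}
    for part in shuffled:
        for key, values in part.items():
            result[key] = reducer(key, values)
    return result


# Example job: a term-partitioned inverted index over two made-up pages.
pages = [
    ("http://a.example/", "web archive research"),
    ("http://b.example/", "web scale research"),
]


def emit_terms(page):
    url, text = page
    return [(term, url) for term in text.split()]


def unique_sorted(term, urls):
    return sorted(set(urls))


index = map_reduce(pages, emit_terms, unique_sorted)
```

Because the partition of a term is a pure function of the term, a query node can tell which shard of the index to ask without any central lookup, and `replicas` gives the nodes to fall back on if that shard's primary is down.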

Research challenges (3)

• Temporal indexing of significant characteristics of networked content (from distribution to semantics)

• Large spectrum of research in IE/IR, network topology etc.

• Last but not least: make this infrastructure acceptable to society (respect for privacy, transparency, IP rights)
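
One minimal reading of temporal indexing is an index that can answer "what did this characteristic look like at time T?". The sketch below keeps, per URL, a time-ordered list of observations and uses binary search to find the one in force at a given date. The class `TemporalIndex`, its methods and the example values are hypothetical; timestamps are assumed to be ISO 8601 date strings, which sort chronologically.

```python
import bisect
from collections import defaultdict


class TemporalIndex:
    """Per-URL history of one observed characteristic (e.g. a content digest)."""

    def __init__(self):
        self._times = defaultdict(list)   # url -> sorted capture timestamps
        self._values = defaultdict(list)  # url -> values aligned with _times

    def add(self, url, timestamp, value):
        """Record an observation, keeping each URL's history sorted by time."""
        i = bisect.bisect_right(self._times[url], timestamp)
        self._times[url].insert(i, timestamp)
        self._values[url].insert(i, value)

    def as_of(self, url, timestamp):
        """Return the latest value observed at or before timestamp, else None."""
        times = self._times.get(url, [])
        i = bisect.bisect_right(times, timestamp)
        return self._values[url][i - 1] if i else None

    def changes(self, url):
        """Return the timestamps at which the value differs from the previous one."""
        changed, previous = [], object()
        for t, v in zip(self._times.get(url, []), self._values.get(url, [])):
            if v != previous:
                changed.append(t)
                previous = v
        return changed


idx = TemporalIndex()
idx.add("http://a.example/", "2009-06-01", "digest-2")
idx.add("http://a.example/", "2008-01-01", "digest-1")
idx.add("http://a.example/", "2009-09-01", "digest-2")
```

Here `idx.as_of("http://a.example/", "2009-01-01")` looks up the state of the page at the start of 2009, and `idx.changes(...)` lists when it actually changed. The same structure works for any characteristic that can be compared for equality, from link counts to extracted entities.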

Thanks

Julien Masanès

European Archive Foundation

julien@europarchive.org

Consultation Workshop, Brussels, 19/1/2010

• M. Toyoda and M. Kitsuregawa, “A system for visualizing and analyzing the evolution of the web with a time series of graphs,” Salzburg, Austria: ACM Press, New York, NY, USA, 2005.