Post on 29-Dec-2015


1

Enabling Webscale Research in Europe

Julien Masanès

European Archive Foundation

julien@europarchive.org

Consultation Workshop, Brussels, 19/1/2010

2

‘Webtables’

Using over 14 billion web pages to extract tables.

[1] M. Cafarella, A. Halevy, D. Wang, E. Wu, and Y. Zhang, “WebTables: exploring the power of tables on the web,” Proc. VLDB Endow., vol. 1, 2008, pp. 538-549.
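To make the idea of table extraction concrete, here is a minimal sketch of pulling the raw tables out of a single HTML page using only the Python standard library. It is not the method of the WebTables paper: the class and function names are hypothetical, nested tables are not handled, and no attempt is made to separate data tables from tables used purely for page layout.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects each <table> in a page as a list of rows of cell strings.

    Simplifying assumption: tables are not nested.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables
        self._rows = None  # rows of the table being read, or None
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table" and self._rows is None:
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


page = ("<table><tr><th>City</th><th>Pop</th></tr>"
        "<tr><td>Paris</td><td> 2.1M </td></tr></table>")
tables = extract_tables(page)
```

At web scale the same per-page function would be run in parallel over the crawl, which is where the infrastructure questions discussed in the later slides come in.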

3

Data driven science

•Here is the evidence, now what is the hypothesis?

•«A now-common strategy in post-genomic biology is to measure, quantitatively, the action of all (or as many as possible) of the genes at the level of the transcriptome, proteome, metabolome and phenotype, and to use computerised methods to infer gene function via various kinds of pattern recognition techniques»

D.B. Kell and S.G. Oliver, “Here is the evidence, now what is the hypothesis? The complementary roles of inductive and hypothesis-driven science in the post-genomic era,” BioEssays, vol. 26, 2004, pp. 99-105.

4

Webscale data

•The web represents a unique source of access to media content of all sorts, that a growing number of scientific communities, agencies and industries are starting to need to mine at large scale.

•The ability to acquire, process and mine large scale data from the web is becoming a strategic advantage in many domains from business intelligence to epidemiological tracking and monitoring.

5

Research engine

•Key infrastructure to monitor and analyze the evolution of networked media

•More broadly, will become a key tool for research in more and more domains:

•low-noise signals of ecological evolution

•economic trends

•emergence of new research fields

•tracking of reputation on the web

•etc.

6

An example: Ecological monitoring

Victor Galaz, Beatrice Crona, Tim Daw, Örjan Bodin, Magnus Nyström, and Per Olsson, “Can web crawlers revolutionize ecological monitoring?,” Mar. 2009.

•«mining the internet to detect “early-warning” signs that may signal abrupt ecological changes»
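As a rough illustration of what an "early-warning" signal could look like in practice, the sketch below flags time periods in which the number of crawled pages mentioning a topic jumps well above its recent level. This is not the method of the cited paper: the function name, the trailing-window rule, and the threshold values are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def flag_spikes(counts, window=4, threshold=3.0):
    """Return indices whose count exceeds the trailing mean by
    `threshold` standard deviations.

    `counts` is a list of mention counts per period (e.g. per week).
    """
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        # Floor the deviation at 1.0 so a perfectly flat history
        # does not turn every small change into an alert.
        sigma = max(pstdev(history), 1.0)
        if counts[i] > mu + threshold * sigma:
            flagged.append(i)
    return flagged


# Hypothetical weekly counts of pages mentioning a topic.
weekly_mentions = [2, 3, 2, 3, 2, 20, 3]
alerts = flag_spikes(weekly_mentions)
```

A real monitoring system would also need to deal with crawl coverage changing over time, which can produce spikes that have nothing to do with the underlying ecology.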

7

Who can do research on Webscale data?

•Webscale is already proving to be a challenge for many research groups, as the infrastructure, the cost and the skills required represent a significant barrier to entry.

•But when it comes to doing this over time, only a few (mainly large search engines) can do it at all.

•In other words, only large search engines (none of them European) are able to do research at this scale, which reinforces their lead as they develop and test new algorithms for search, ranking, mining etc.

8

Research challenges (1)

• Building an open, neutral and sustainable virtual observatory of the web for research in Europe requires:

• large-scale crawling, storage and indexing of web data (10+ petabytes), not limited to text.

• We know how to handle terabytes, but not yet petabytes.

9

Research challenges (2)

• Create baseline distributed analytics services (large-scale IE, NLP, distributed and efficient processing and storage).

• We need to standardize and define a baseline in this domain to create a platform for MMSE, social media research etc.

• Hadoop-style abstractions over internet-wide repository/processing clouds

• Optimized data placement (partitioning and replication) for analytics

• Distributed indices
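The data placement bullet above can be illustrated with a minimal sketch: each URL is assigned to a primary node by hashing, and replicas go to the next nodes in the list. The function name, node names, and the simple modulo scheme are assumptions for illustration only; a production system would more likely use consistent hashing or placement tuned to the analytics workload, so that adding a node does not move most of the data.

```python
import hashlib


def place(url, nodes, replicas=2):
    """Return the nodes that should store `url`.

    The primary is chosen by hashing the URL; the remaining replicas
    are the nodes that follow it in the list, wrapping around.
    """
    if not 1 <= replicas <= len(nodes):
        raise ValueError("replicas must be between 1 and len(nodes)")
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


# Hypothetical storage nodes.
nodes = ["node-a", "node-b", "node-c", "node-d"]
holders = place("http://example.org/page.html", nodes, replicas=2)
```

Hashing the host name instead of the full URL would keep all pages of a site on the same nodes, which can help analytics that work site by site, at the cost of less even load.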

10

Research challenges (3)

• Temporal indexing of significant characteristics of networked content (from distribution to semantics)

• Large spectrum of research in IE/IR, network topology etc.

• Last but not least: make this infrastructure acceptable to society (respect for privacy, transparency, IP rights)
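To make the temporal indexing bullet above more tangible, here is a minimal in-memory sketch of an index that stores successive observations of a key (for example a URL) and answers "what did we know as of date T?". The class name and the use of ISO date strings as timestamps are assumptions; a web-scale version would have to be distributed and persisted.

```python
import bisect
from collections import defaultdict


class TemporalIndex:
    """Keeps, for each key, its observations sorted by timestamp."""

    def __init__(self):
        self._times = defaultdict(list)   # key -> sorted timestamps
        self._values = defaultdict(list)  # key -> values, same order

    def add(self, key, timestamp, value):
        """Record that `key` had `value` at `timestamp`."""
        i = bisect.bisect_right(self._times[key], timestamp)
        self._times[key].insert(i, timestamp)
        self._values[key].insert(i, value)

    def as_of(self, key, timestamp):
        """Return the latest value recorded at or before `timestamp`,
        or None if nothing was recorded by then."""
        times = self._times.get(key, [])
        i = bisect.bisect_right(times, timestamp)
        return self._values[key][i - 1] if i else None


index = TemporalIndex()
# ISO dates compare correctly as plain strings.
index.add("http://example.org/", "2009-03-01", "version 2")
index.add("http://example.org/", "2008-01-01", "version 1")
```

Calling `index.as_of("http://example.org/", "2008-06-01")` then returns the 2008 observation, since the 2009 one had not yet been recorded at that date.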

11

Thanks


12

• M. Toyoda and M. Kitsuregawa, “A system for visualizing and analyzing the evolution of the web with a time series of graphs,” Salzburg, Austria: ACM Press, New York, NY, USA, 2005.