distilling the web of data drop by drop (with java)

Sourcesense UK “Last Wednesday” - Davide Palmisano @dpalmisano - Wednesday, June 29, 2011

Upload: davide-palmisano

Post on 15-Jan-2015

DESCRIPTION

An introduction to the machine-readable Web, the various ways to implement it with Microformats, RDFa or Microdata, and how to consume it. Any23, a Java open source distiller for the Web of Data, is presented.

TRANSCRIPT

Page 1: distilling the Web of Data drop by drop (with Java)

distilling the Web of Data drop by drop (with Java)

Sourcesense UK “Last Wednesday” - Davide Palmisano @dpalmisano

distilling the Web of Data drop by drop (with Java)

Wednesday, June 29, 2011

Page 2: distilling the Web of Data drop by drop (with Java)

the shortest introduction ever to the Web of Data

Web page markup technologies are intended for human consumption

they let machines present raw data to humans

extracting valuable data may require fancy scraping techniques

scraping: one size doesn’t fit all

Page 3: distilling the Web of Data drop by drop (with Java)

the shortest introduction ever to the Web of Data

<div>
  <div> Canon Rebel T2i (EOS 550D) $899 </div>
  <div> The Rebel T2i EOS 550D is Canon's top-of-the-line consumer digital SLR camera. It can shoot up
  <div> AN_UCC-13: 013803123784 </div>
  <div> price: 899 USD </div>
</div></div>

Page 4: distilling the Web of Data drop by drop (with Java)

the shortest introduction ever to the Web of Data

what does this tag mean?

Page 5: distilling the Web of Data drop by drop (with Java)

the shortest introduction ever to the Web of Data

what does this tag mean?

is this a currency or what?

Page 6: distilling the Web of Data drop by drop (with Java)

“meaning”, Joseph Kosuth, The Panza Collection, MART - Rovereto, Italy

Page 7: distilling the Web of Data drop by drop (with Java)

Microformats

“Microformats are a way of adding simple markup to human-readable data items such as events, contact details or locations, on web pages”

Andy Mabbett

- community-driven initiative
- widely adopted
- quick & dirty
- scarcely extensible

Page 8: distilling the Web of Data drop by drop (with Java)

Microformats

<div class="hlisting item">
  <div> Canon Rebel T2i (EOS 550D) $899 </div>
  <div class="description"> The Rebel T2i EOS 550D is Canon's top-of-the-line consumer digital SLR camera. It can shoot up
  <div> AN_UCC-13: 013803123784 </div>
  <div class="price"> price: 899 USD </div>
</div></div>

Page 9: distilling the Web of Data drop by drop (with Java)

RDFa: RDF in attributes

model your data as if they were Web pages connected by named links and properties, and embed them in your (X)HTML using @attributes

- RDF, graph-based model
- W3C Recommendation
- highly extensible

e.g. GoodRelations [1], a full-featured vocabulary for e-commerce

Page 11: distilling the Web of Data drop by drop (with Java)

RDFa: RDF in attributes

<div about="http://mystore.com/product/5642">
  <div>Canon Rebel T2i (EOS 550D) $899</div>
  <div property="gr:description">The Rebel T2i EOS 550D is Canon's blah blah</div>
  <div rel="gr:hasPriceSpecification">
    <span> price:
      <span property="gr:hasCurrencyValue">899</span>
      <span property="gr:hasCurrency">USD</span>
    </span>
  </div>
</div>

and then embed them in your (X)HTML pages
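Read as RDF, the markup above boils down to a handful of triples. The following is a minimal Java sketch of that graph, not the output of any particular parser. It assumes the `gr:` prefix is bound to the GoodRelations namespace `http://purl.org/goodrelations/v1#` and that the price specification becomes a blank node (labelled `_:price` here purely for illustration). The `Triple` record and the class name are hypothetical helpers, not part of any library.

```java
import java.util.List;

final class RdfaTriplesSketch {
    // GoodRelations namespace (assumed binding for the gr: prefix).
    static final String GR = "http://purl.org/goodrelations/v1#";

    // Minimal stand-in for an RDF statement; terms are already N-Triples encoded.
    record Triple(String subject, String predicate, String object) {
        String toNTriples() {
            return subject + " " + predicate + " " + object + " .";
        }
    }

    static String iri(String value) { return "<" + value + ">"; }

    static String literal(String value) { return "\"" + value + "\""; }

    // The triples a reader would expect from the product markup.
    static List<Triple> productTriples() {
        String product = iri("http://mystore.com/product/5642");
        String price = "_:price"; // blank node label chosen for illustration
        return List.of(
            new Triple(product, iri(GR + "description"),
                       literal("The Rebel T2i EOS 550D is Canon's blah blah")),
            new Triple(product, iri(GR + "hasPriceSpecification"), price),
            new Triple(price, iri(GR + "hasCurrencyValue"), literal("899")),
            new Triple(price, iri(GR + "hasCurrency"), literal("USD")));
    }

    public static void main(String[] args) {
        productTriples().forEach(t -> System.out.println(t.toNTriples()));
    }
}
```

The price and the currency hang off the blank node rather than the product itself, which is what lets a consumer answer the "is this a currency or what?" question without guessing.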

Page 12: distilling the Web of Data drop by drop (with Java)

HTML5: Microdata

Microdata allows nested groups of name-value pairs to be added to HTML documents, in parallel with the existing content

- W3C Working Draft
- native to the HTML5 specification
- serializable as RDF

- Schema.org endorsed by Google, Yahoo! and Bing
- large adoption expected

Page 13: distilling the Web of Data drop by drop (with Java)

HTML5: Microdata

<div itemscope itemtype="http://schema.org/Offer">
  <div itemprop="name"> Canon Rebel T2i (EOS 550D) $899 </div>
  <div itemprop="description"> The Rebel T2i EOS 550D is Canon's blah blah</div>
  <div>
    <span> price:
      <span itemprop="price"> 899 </span>
      <span itemprop="priceCurrency"> USD </span>
    </span>
  </div>
</div>
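The name-value pairs in such markup can be read with nothing more than the JDK. The sketch below is a rough illustration under stated assumptions: it parses a well-formed, XML-parsable variant of the markup (so `itemscope` is given a value), treats the document as a single flat item, and ignores nesting and `itemtype`. The class and method names are hypothetical; a real Microdata processor does considerably more.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class MicrodataPairsSketch {
    // XML-parsable variant of the markup; the JDK parser is not an HTML parser.
    static final String MARKUP =
        "<div itemscope=\"itemscope\" itemtype=\"http://schema.org/Offer\">"
      + "<div itemprop=\"name\"> Canon Rebel T2i (EOS 550D) $899 </div>"
      + "<div><span> price: <span itemprop=\"price\"> 899 </span>"
      + "<span itemprop=\"priceCurrency\"> USD </span></span></div>"
      + "</div>";

    // Collects itemprop name -> trimmed text content, in document order.
    static Map<String, String> itemProperties(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//*[@itemprop]", doc, XPathConstants.NODESET);
            Map<String, String> pairs = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                pairs.put(element.getAttribute("itemprop"),
                          element.getTextContent().trim());
            }
            return pairs;
        } catch (Exception e) {
            throw new IllegalStateException("could not read markup", e);
        }
    }

    public static void main(String[] args) {
        itemProperties(MARKUP).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```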

Page 14: distilling the Web of Data drop by drop (with Java)

[Chart: % of marked up Web pages (axis from 0 to 3.5) for RDFa, hCard, adr, xfn and hReview, at 09/2008, 03/2009 and 10/2010; data from Yahoo! [2]]

Page 15: distilling the Web of Data drop by drop (with Java)

tie ‘em all together

uniform, reconciled and unified RDF representation

Page 16: distilling the Web of Data drop by drop (with Java)

a drop-by-drop distiller

Anything To Triples (any23) is an open source, Apache-licensed:

- Java library,
- Web service and
- command-line tool

able to distill RDF triples from a variety of semantically marked up Web documents

http://developers.any23.org

Page 17: distilling the Web of Data drop by drop (with Java)

live demo: http://any23.org

Web site with ~5000 product descriptions marked up with GoodRelations using RDFa

Page 18: distilling the Web of Data drop by drop (with Java)

use Any23 in your Java programs

// set up the runner and the HTTP client used to fetch the page
Any23 runner = new Any23();
runner.setHTTPUserAgent("test-user-agent");
HTTPClient httpClient = runner.getHTTPClient();
DocumentSource source = new HTTPDocumentSource(
    httpClient,
    "http://test.com/index.html"
);
// collect the extracted triples as N-Triples in memory
ByteArrayOutputStream out = new ByteArrayOutputStream();
TripleHandler handler = new NTriplesWriter(out);
runner.extract(source, handler);
String n3 = out.toString("UTF-8");

Page 19: distilling the Web of Data drop by drop (with Java)

Any23: Command-Line tool

any23-core/bin$ ./any23
usage: any23 [-e <arg>] [-f <arg>] [-l <arg>] [-n] [-o <arg>]
       [-p] [-s] [-t] [-v] {<url>|<file>}
 -e <arg>            comma-separated list of extractors, e.g.
                     rdf-xml,rdf-turtle
 -f,--format <arg>   Output format [turtle (default), ntriples, rdfxml, quad, uris]
 -l,--log <arg>      logging, please specify a file
 -n,--nesting        disable production of nesting triples
 -o,--output <arg>   ouput file (defaults to stdout)
 -p,--pedantic       validates and fixes HTML content detecting commons issues
 -s,--stats          print out statistics of Any23

Page 20: distilling the Web of Data drop by drop (with Java)

Any23: Web Service

blacky:~ davide$ curl "http://any23.org/any23/?format=nquads&url=http://www.bbc.co.uk/programmes/b00kygwh&report=on"

<response>
  <extractors>
    <extractor>rdf-xml</extractor>
  </extractors>
  <report>
    ...
    <validationReport>
      <ruleActivations></ruleActivations>
      ...
    </validationReport>
  </report>
  <data>
    <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/b00kygwh#programme">
      <rdf:type rdf:resource="http://purl.org/ontology/po/Episode"/>
      <po:pid>b00kygwh</po:pid>
      <dc:title>The Terminator</dc:title>
    </rdf:Description>
  </data>
</response>
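When the same request is built from Java, the target document's address is itself a URL passed as a query parameter, so it is safest to percent-encode it. The sketch below only assembles the request URI and performs no network call. The endpoint and the `format`, `url` and `report` parameter names are taken from the curl call above; the class and method names are hypothetical, and whether the service is still reachable at that address is not something this example assumes.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class Any23RequestSketch {
    // Endpoint as shown in the curl example; it may no longer be online.
    static final String ENDPOINT = "http://any23.org/any23/";

    // Builds the GET request URI, percent-encoding each parameter value.
    static URI requestUri(String format, String documentUrl, boolean report) {
        StringBuilder query = new StringBuilder()
            .append("format=").append(encode(format))
            .append("&url=").append(encode(documentUrl));
        if (report) {
            query.append("&report=on"); // only the "on" value appears in the source
        }
        return URI.create(ENDPOINT + "?" + query);
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(
            requestUri("nquads", "http://www.bbc.co.uk/programmes/b00kygwh", true));
    }
}
```

The resulting URI can be handed to any HTTP client; encoding the inner URL keeps its own `&` or `?` characters, if any, from being mistaken for parameters of the service call.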

Page 21: distilling the Web of Data drop by drop (with Java)

[Architecture diagram. Labelled components: mimetype detection, Apache Tika, DOM extraction, Cyber Neko HTML, Validator, Rule, Fix, RDFaExtractor, MicrodataExtractor, Microformat Extractors (hReview, hCalendar, hCard, hListing), ExtractionResult, RDF/XML Writer, JSON Writer, NQuadsWriter, Sesame]

Page 22: distilling the Web of Data drop by drop (with Java)

extractor

public interface Extractor<Input> {

    /**
     * Executes the extractor. Will be invoked only once, extractors are
     * not reusable.
     *
     * @param in The extractor's input
     * @param documentURI The document's URI
     * @param out Sink for extracted data
     * @throws IOException On error while reading from the input stream
     * @throws ExtractionException On other error, such as parse errors
     */
    void run(Input in, URI documentURI, ExtractionResult out)
            throws IOException, ExtractionException;

    /**
     * Returns a {@link org.deri.any23.extractor.ExtractorDescription} of
     * this extractor.
     */
    ExtractorDescription getDescription();

}
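To make the contract concrete (one input, one document URI, one sink, a single invocation), here is a self-contained sketch that mirrors its shape with simplified stand-in types. `SimpleExtractor`, `TripleSink` and `TitleExtractor` are hypothetical and defined only for this example; they are not the Any23 interfaces, and the regular expression is a deliberately naive way to find a `<title>` element.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExtractorSketch {
    /** Stand-in for ExtractionResult: receives triples as plain strings. */
    interface TripleSink {
        void writeTriple(String subject, String predicate, String object);
    }

    /** Simplified stand-in for Extractor<Input>; not the real interface. */
    interface SimpleExtractor<I> {
        void run(I in, String documentUri, TripleSink out);
    }

    /** Emits one title triple for the document's title element, if present. */
    static final class TitleExtractor implements SimpleExtractor<String> {
        private static final Pattern TITLE = Pattern.compile(
            "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        @Override
        public void run(String html, String documentUri, TripleSink out) {
            Matcher matcher = TITLE.matcher(html);
            if (matcher.find()) {
                out.writeTriple(documentUri,
                                "http://purl.org/dc/terms/title",
                                matcher.group(1).trim());
            }
        }
    }

    static List<String> extract(String html, String documentUri) {
        List<String> triples = new ArrayList<>();
        // A fresh extractor per run, mirroring the "not reusable" contract.
        new TitleExtractor().run(html, documentUri,
            (s, p, o) -> triples.add(s + " " + p + " " + o));
        return triples;
    }

    public static void main(String[] args) {
        System.out.println(extract(
            "<html><head><title>Demo</title></head></html>",
            "http://example.org/"));
    }
}
```

Passing the sink in, rather than returning a collection, lets the caller decide whether triples are buffered, streamed to a writer, or filtered on the way.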

Page 23: distilling the Web of Data drop by drop (with Java)

validate and fix

public interface Rule {

    String getHRName();

    boolean applyOn(
        DOMDocument document,
        RuleContext context,
        ValidationReportBuilder validationReportBuilder
    );
}

public interface Fix {
    String getHRName();
    void execute(Rule rule, RuleContext context, DOMDocument document);
}

void addRule(Class<? extends Rule> rule, Class<? extends Fix> fix);
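The rule/fix pairing can be illustrated with a small self-contained sketch. Everything in it is an assumption made for the example: `SimpleRule` and `SimpleFix` are cut-down stand-ins (no `RuleContext`, no report builder, instances registered instead of classes), the DOM is the JDK's `org.w3c.dom.Document`, and the rule itself (a page using `og:` properties without declaring the `og` prefix) is just one plausible kind of markup issue, with `http://ogp.me/ns#` used as the namespace.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ValidateAndFixSketch {
    /** Simplified stand-ins for the Rule and Fix interfaces. */
    interface SimpleRule { String getHRName(); boolean applyOn(Document document); }
    interface SimpleFix { String getHRName(); void execute(SimpleRule rule, Document document); }

    static final String OG_NS = "http://ogp.me/ns#";

    private final Map<SimpleRule, SimpleFix> rules = new LinkedHashMap<>();

    void addRule(SimpleRule rule, SimpleFix fix) { rules.put(rule, fix); }

    /** Runs every rule; when one fires, applies its fix. Returns the number of fixes applied. */
    int validateAndFix(Document document) {
        int applied = 0;
        for (Map.Entry<SimpleRule, SimpleFix> entry : rules.entrySet()) {
            if (entry.getKey().applyOn(document)) {
                entry.getValue().execute(entry.getKey(), document);
                applied++;
            }
        }
        return applied;
    }

    /** Fires when og: properties are used but the og prefix is not declared on the root. */
    static final class MissingOgNamespaceRule implements SimpleRule {
        public String getHRName() { return "missing-og-namespace-rule"; }
        public boolean applyOn(Document document) {
            if (document.getDocumentElement().hasAttribute("xmlns:og")) return false;
            NodeList metas = document.getElementsByTagName("meta");
            for (int i = 0; i < metas.getLength(); i++) {
                Element meta = (Element) metas.item(i);
                if (meta.getAttribute("property").startsWith("og:")) return true;
            }
            return false;
        }
    }

    /** Declares the og prefix on the root element. */
    static final class AddOgNamespaceFix implements SimpleFix {
        public String getHRName() { return "add-og-namespace-fix"; }
        public void execute(SimpleRule rule, Document document) {
            document.getDocumentElement().setAttribute("xmlns:og", OG_NS);
        }
    }

    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        ValidateAndFixSketch validator = new ValidateAndFixSketch();
        validator.addRule(new MissingOgNamespaceRule(), new AddOgNamespaceFix());
        Document doc = parse(
            "<html><head><meta property=\"og:title\" content=\"Demo\"/></head></html>");
        System.out.println("fixes applied: " + validator.validateAndFix(doc));
        System.out.println("xmlns:og = " + doc.getDocumentElement().getAttribute("xmlns:og"));
    }
}
```

Keeping detection and repair in separate objects means a rule can be reported on its own, and the same fix can be reused by more than one rule.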

Page 24: distilling the Web of Data drop by drop (with Java)

plugins

@PluginImplementation
@Author(name="Michele Mostarda ([email protected])")
public class HTMLScraperPlugin implements ExtractorPlugin {

    private static final Logger logger = LoggerFactory.getLogger(HTMLScraperPlugin.class);

    @Init
    public void init() {
        logger.info("Plugin initialization.");
    }

    @Shutdown
    public void shutdown() {
        logger.info("Plugin shutdown.");
    }

    public ExtractorFactory getExtractorFactory() {
        return HTMLScraperExtractor.factory;
    }

}

Page 25: distilling the Web of Data drop by drop (with Java)

roadmap

upcoming 0.6.0 release
- support for Microdata
- support for CSV
- support for RDFa 1.1 prefix mechanism
- improved app configuration
- bug fixes

Apache (pre) Incubation process
- http://wiki.apache.org/incubator/Any23Proposal
- supporters and mentors (thanks guys!): Simone Tripodi (@stripodi), Tommaso Teofili (@tteofili)
- we’re looking for mentors

Page 26: distilling the Web of Data drop by drop (with Java)

closing credits

active committers

Giovanni Tummarello (@jccq)
Michele Mostarda (@micmos)
Davide Palmisano (@dpalmisano)
Richard Cyganiak (@cygri)

thanks to the whole Semantic Web community, especially those who tirelessly challenge us with bugs and feature requests