distilling the Web of Data drop by drop (with Java)

Post on 15-Jan-2015

DESCRIPTION

An introduction to the machine-readable Web, the various ways to implement it with Microformats, RDFa or Microdata, and how to consume it. Any23, an open source Java distiller for the Web of Data, is presented.

TRANSCRIPT

distilling the Web of Data drop by drop (with Java)

Sourcesense UK “Last Wednesday” - Davide Palmisano @dpalmisano


Wednesday, June 29, 2011

the shortest introduction ever to the Web of Data

Web page markup technologies are intended for human consumption

extracting valuable data may require fancy scraping techniques

scraping: one size doesn’t fit all

they let machines present raw data to humans


the shortest introduction ever to the Web of Data

<div>
  <div> Canon Rebel T2i (EOS 550D) $899 </div>
  <div> The Rebel T2i EOS 550D is Canon's top-of-the-line consumer digital SLR camera. It can shoot up
  <div> AN_UCC-13: 013803123784 </div>
  <div> price: 899 USD </div>
</div></div>


what does this tag mean?

is this a currency or what?
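To make the scraping problem concrete, here is a minimal sketch of a hand-written scraper, assuming the page always writes the price as "price: <amount> <currency code>". The class name `PriceScraper` and the regular expression are illustrative assumptions, not something shown in the talk.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical scraper: every name here is illustrative.
class PriceScraper {

    // Assumes prices are written exactly like "price: 899 USD".
    private static final Pattern PRICE =
            Pattern.compile("price:\\s*(\\d+(?:\\.\\d+)?)\\s+([A-Z]{3})");

    /** Returns {amount, currency} when the assumed pattern is found. */
    static Optional<String[]> scrape(String html) {
        Matcher m = PRICE.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(1), m.group(2) });
    }

    public static void main(String[] args) {
        String page = "<div> AN_UCC-13: 013803123784 </div> <div> price: 899 USD </div>";
        scrape(page).ifPresent(p -> System.out.println(p[0] + " " + p[1]));

        // A page that only writes "$899" does not match the assumed pattern.
        System.out.println(scrape("<div> Canon Rebel T2i (EOS 550D) $899 </div>").isPresent());
    }
}
```

The pattern works for this one page layout only; a site that formats the price differently needs its own rule, which is the "one size doesn't fit all" problem.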


“meaning”, Joseph Kosuth, The Panza Collection, MART - Rovereto, Italy

Microformats

“Microformats are a way of adding simple markup to human-readable data items such as events, contact details or locations, on web pages”

Andy Mabbett

- community-driven initiative
- widely adopted
- quick & dirty
- scarcely extensible


Microformats

<div class="hlisting item">
  <div> Canon Rebel T2i (EOS 550D) $899 </div>
  <div class="description"> The Rebel T2i EOS 550D is Canon's top-of-the-line consumer digital SLR camera. It can shoot up
  <div> AN_UCC-13: 013803123784 </div>
  <div class="price"> price: 899 USD </div>
</div></div>
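A consumer of this kind of markup can look elements up by class name instead of guessing from the text. The following is a minimal sketch, assuming the page is well-formed XHTML so that the JDK's XML parser can read it; the class `HListingReader` and its method are hypothetical helpers, and the snippet is an abridged, closed version of the slide's example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: looks up microformat-style fields by class name.
class HListingReader {

    /** Returns the trimmed text of the first element carrying the class, or null. */
    static String textByClass(String xhtml, String className) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input must be well-formed XHTML", e);
        }
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            // The class attribute is a space-separated list of names.
            if (Arrays.asList(e.getAttribute("class").split("\\s+")).contains(className)) {
                return e.getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String page = "<div class=\"hlisting item\">"
                + "<div class=\"description\">The Rebel T2i EOS 550D is Canon's "
                + "top-of-the-line consumer digital SLR camera.</div>"
                + "<div class=\"price\">price: 899 USD</div>"
                + "</div>";
        System.out.println(textByClass(page, "price"));
    }
}
```

The class name says which element holds the price, but the value is still free text ("price: 899 USD"), so amount and currency must be split by convention.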


RDFa: RDF in attributes

model your data as if they were Web pages connected by named links and properties, and embed them in your (X)HTML using @attributes

- RDF, graph-based model
- W3C Recommendation
- highly extensible

e.g. GoodRelations [1], a full-fledged vocabulary for e-commerce


RDFa: RDF in attributes

<div about="http://mystore.com/product/5642">
  <div>Canon Rebel T2i (EOS 550D) $899</div>
  <div property="gr:description">The Rebel T2i EOS 550D is Canon's blah blah</div>
  <div rel="gr:hasPriceSpecification">
    <span> price:
      <span property="gr:hasCurrencyValue">899</span>
      <span property="gr:hasCurrency">USD</span>
    </span>
  </div>
</div>

and then embed them in your (X)HTML pages
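How these attributes turn into a graph can be sketched with a toy reader. This is not a conformant RDFa processor: it assumes well-formed XHTML, understands only @about, @rel and @property, treats a @rel without a target as introducing a blank node, and hard-codes the `gr:` prefix to the GoodRelations namespace. The class `ToyRdfaReader` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Toy reader for illustration only; it skips most of the RDFa rules
// and does not escape literals.
class ToyRdfaReader {

    static final String GR = "http://purl.org/goodrelations/v1#";

    /** Returns N-Triples-like lines for the supported attributes. */
    static List<String> read(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input must be well-formed XHTML", e);
        }
        List<String> triples = new ArrayList<>();
        walk(doc.getDocumentElement(), null, triples, new int[] { 0 });
        return triples;
    }

    private static void walk(Element e, String subject, List<String> out, int[] counter) {
        if (e.hasAttribute("about")) {
            // @about names the subject for this element and its children.
            subject = "<" + e.getAttribute("about") + ">";
        }
        if (subject != null && e.hasAttribute("rel")) {
            // @rel links the current subject to a fresh blank node,
            // which becomes the subject for the nested properties.
            String blank = "_:b" + counter[0]++;
            out.add(subject + " " + expand(e.getAttribute("rel")) + " " + blank + " .");
            subject = blank;
        }
        if (subject != null && e.hasAttribute("property")) {
            // @property attaches the element text as a literal value.
            out.add(subject + " " + expand(e.getAttribute("property"))
                    + " \"" + e.getTextContent().trim() + "\" .");
        }
        NodeList children = e.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                walk((Element) child, subject, out, counter);
            }
        }
    }

    private static String expand(String curie) {
        // Only the "gr:" prefix is known to this sketch.
        return curie.startsWith("gr:") ? "<" + GR + curie.substring(3) + ">" : "<" + curie + ">";
    }

    public static void main(String[] args) {
        String page = "<div about=\"http://mystore.com/product/5642\">"
                + "<div property=\"gr:description\">The Rebel T2i EOS 550D is Canon's blah blah</div>"
                + "<div rel=\"gr:hasPriceSpecification\"><span> price: "
                + "<span property=\"gr:hasCurrencyValue\">899</span> "
                + "<span property=\"gr:hasCurrency\">USD</span></span></div></div>";
        read(page).forEach(System.out::println);
    }
}
```

For the slide's example this yields four statements: the description, the link to a price specification node, and that node's value and currency.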


HTML5: Microdata

Microdata allows nested groups of name-value pairs to be added to HTML documents, in parallel with the existing content

- W3C Working Draft
- native to the HTML5 specification
- serializable in RDF

- Google, Yahoo! and Bing endorsed Schema.org
- large adoption expected


HTML5: Microdata

<div itemscope itemtype="http://schema.org/Offer">
  <div itemprop="name"> Canon Rebel T2i (EOS 550D) $899 </div>
  <div itemprop="description"> The Rebel T2i EOS 550D is Canon's blah blah</div>
  <div>
    <span> price:
      <span itemprop="price"> 899 </span>
      <span itemprop="priceCurrency"> USD </span>
    </span>
  </div>
</div>
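The "name-value pairs" idea can be sketched by collecting every itemprop under an item into a map. This is a simplified illustration, assuming well-formed XHTML (so the boolean attribute is written `itemscope=""`) and a single, non-nested item; the class `ToyMicrodataReader` and the `@type` key are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Toy reader: handles one flat item, no nested itemscope, no itemref.
class ToyMicrodataReader {

    /** Returns the item's type (under "@type") and its itemprop name-value pairs. */
    static Map<String, String> read(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input must be well-formed XHTML", e);
        }
        Map<String, String> item = new LinkedHashMap<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute("itemscope")) {
                item.put("@type", e.getAttribute("itemtype"));
            }
            if (e.hasAttribute("itemprop")) {
                item.put(e.getAttribute("itemprop"), e.getTextContent().trim());
            }
        }
        return item;
    }

    public static void main(String[] args) {
        String page = "<div itemscope=\"\" itemtype=\"http://schema.org/Offer\">"
                + "<div itemprop=\"name\"> Canon Rebel T2i (EOS 550D) $899 </div>"
                + "<div><span> price: <span itemprop=\"price\"> 899 </span>"
                + "<span itemprop=\"priceCurrency\"> USD </span></span></div></div>";
        read(page).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```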


% of marked up Web pages

[chart: share of Web pages (y-axis 0 to 3.5 %) marked up with RDFa, hCard, adr, xfn and hReview, measured at 09/2008, 03/2009 and 10/2010]

data from Yahoo! [2]


tie ‘em all together

uniform, reconciled and unified RDF representation
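One way to picture a unified representation is a single set of statements that every extractor writes into, whatever syntax the data came from. The sketch below is an assumption-laden illustration, not Any23's actual model: `UnifiedGraph` is a hypothetical class that stores literal-valued statements as N-Triples-style lines and drops exact duplicates.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical container: one statement set shared by all extractors.
class UnifiedGraph {

    // Insertion-ordered set, so identical statements are stored once.
    private final Set<String> statements = new LinkedHashSet<>();

    /** Adds a statement whose object is a plain literal. */
    void addLiteral(String subjectIri, String predicateIri, String value) {
        // Escape backslashes and double quotes inside the literal.
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        statements.add("<" + subjectIri + "> <" + predicateIri + "> \"" + escaped + "\" .");
    }

    List<String> toNTriples() {
        return new ArrayList<>(statements);
    }

    public static void main(String[] args) {
        UnifiedGraph graph = new UnifiedGraph();
        String product = "http://mystore.com/product/5642";
        // Imagine this one coming from a Microdata extractor...
        graph.addLiteral(product, "http://schema.org/priceCurrency", "USD");
        // ...and the same fact found again by another extractor.
        graph.addLiteral(product, "http://schema.org/priceCurrency", "USD");
        graph.toNTriples().forEach(System.out::println);
    }
}
```

Reconciling statements that use different vocabularies for the same fact needs vocabulary mappings, which this sketch does not attempt.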


a drop-by-drop distiller

Anything To Triples (any23) is an open source, Apache-licensed:

- Java library,
- Web service and
- command-line tool

able to distill RDF triples from a variety of semantically marked up Web documents

http://developers.any23.org


live demo http://any23.org

a Web site with ~5000 product descriptions marked up with GoodRelations using RDFa


use Any23 in your Java programs

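// create the Any23 facade and set the user agent used for HTTP requests
Any23 runner = new Any23();
runner.setHTTPUserAgent("test-user-agent");

// describe the remote document to be distilled
HTTPClient httpClient = runner.getHTTPClient();
DocumentSource source = new HTTPDocumentSource(
        httpClient,
        "http://test.com/index.html"
);

// collect the extracted triples as N-Triples in an in-memory buffer
ByteArrayOutputStream out = new ByteArrayOutputStream();
TripleHandler handler = new NTriplesWriter(out);

// run the extraction and read the result back as a string
runner.extract(source, handler);
String n3 = out.toString("UTF-8");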


Any23: Command-Line tool

any23-core/bin$ ./any23
usage: any23 [-e <arg>] [-f <arg>] [-l <arg>] [-n] [-o <arg>]
       [-p] [-s] [-t] [-v] {<url>|<file>}
 -e <arg>            comma-separated list of extractors, e.g.
                     rdf-xml,rdf-turtle
 -f,--format <arg>   Output format [turtle (default), ntriples, rdfxml, quad, uris]
 -l,--log <arg>      logging, please specify a file
 -n,--nesting        disable production of nesting triples
 -o,--output <arg>   ouput file (defaults to stdout)
 -p,--pedantic       validates and fixes HTML content detecting commons issues
 -s,--stats          print out statistics of Any23


Any23: Web Service

blacky:~ davide$ curl 'http://any23.org/any23/?format=nquads&url=http://www.bbc.co.uk/programmes/b00kygwh&report=on'

<response>
  <extractors>
    <extractor>rdf-xml</extractor>
  </extractors>
  <report>
    ...
    <validationReport>
      <ruleActivations></ruleActivations>
      ...
    </validationReport>
  </report>
  <data>
    <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/b00kygwh#programme">
      <rdf:type rdf:resource="http://purl.org/ontology/po/Episode"/>
      <po:pid>b00kygwh</po:pid>
      <dc:title>The Terminator</dc:title>
    </rdf:Description>
  </data>
</response>
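When calling the service from Java rather than curl, the document URL is best percent-encoded before it is placed in the query string. The sketch below only builds the request URL from the three parameters visible in the curl call (format, url, report); `Any23ServiceUrl` is a hypothetical helper, and it is an assumption that the service accepts the encoded form.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that assembles the query string shown in the curl call.
class Any23ServiceUrl {

    /** Builds endpoint?format=...&url=...[&report=on] with encoded values. */
    static String build(String endpoint, String format, String documentUrl, boolean report) {
        try {
            return endpoint
                    + "?format=" + URLEncoder.encode(format, "UTF-8")
                    + "&url=" + URLEncoder.encode(documentUrl, "UTF-8")
                    + (report ? "&report=on" : "");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build(
                "http://any23.org/any23/",
                "nquads",
                "http://www.bbc.co.uk/programmes/b00kygwh",
                true));
    }
}
```

Encoding the nested URL keeps its own `&` and `?` characters, if any, from being read as parameters of the outer request.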


[architecture diagram]

- mimetype detection (Apache Tika)
- DOM extraction (Cyber Neko HTML)
- Validator: Rule, Fix
- extractors: RDFaExtractor, MicrodataExtractor, Microformat Extractors (hReview, hCalendar, hCard, hListing)
- ExtractionResult
- writers: RDF/XML Writer, JSON Writer, NQuadsWriter
- Sesame


extractor

public interface Extractor<Input> {

    /**
     * Executes the extractor. Will be invoked only once, extractors are
     * not reusable.
     *
     * @param in The extractor's input
     * @param documentURI The document's URI
     * @param out Sink for extracted data
     * @throws IOException On error while reading from the input stream
     * @throws ExtractionException On other error, such as parse errors
     */
    void run(Input in, URI documentURI, ExtractionResult out)
            throws IOException, ExtractionException;

    /**
     * Returns a {@link org.deri.any23.extractor.ExtractorDescription} of
     * this extractor.
     */
    ExtractorDescription getDescription();

}
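To show the shape of an implementation without depending on the Any23 jars, here is a self-contained analogue. `SimpleExtractor` and `TripleSink` are hypothetical stand-ins for `Extractor` and `ExtractionResult`; unlike the real interface, they declare no checked exceptions. The example extractor emits a Dublin Core title statement for the page's <title>.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for ExtractionResult: receives extracted statements.
interface TripleSink {
    void write(String subject, String predicate, String object);
}

// Hypothetical stand-in for Extractor<Input>, without checked exceptions.
interface SimpleExtractor<I> {
    void run(I in, URI documentURI, TripleSink out);
}

// Emits <document> dc:title "..." when the input has a <title> element.
class TitleExtractor implements SimpleExtractor<String> {

    static final String DC_TITLE = "http://purl.org/dc/elements/1.1/title";

    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    @Override
    public void run(String html, URI documentURI, TripleSink out) {
        Matcher m = TITLE.matcher(html);
        if (m.find()) {
            out.write(documentURI.toString(), DC_TITLE, m.group(1).trim());
        }
    }

    public static void main(String[] args) {
        new TitleExtractor().run(
                "<html><head><title>The Terminator</title></head><body/></html>",
                URI.create("http://www.bbc.co.uk/programmes/b00kygwh"),
                (s, p, o) -> System.out.println(s + " " + p + " " + o));
    }
}
```

The extractor knows nothing about output formats: it only pushes statements into the sink, which is what lets one extraction feed different writers.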


validate and fix

public interface Rule {

    String getHRName();

    boolean applyOn(
            DOMDocument document,
            RuleContext context,
            ValidationReportBuilder validationReportBuilder
    );
}

public interface Fix {

    String getHRName();

    void execute(Rule rule, RuleContext context, DOMDocument document);
}

void addRule(Class<? extends Rule> rule, Class<? extends Fix> fix);
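The rule/fix pairing can be illustrated with a self-contained analogue built on the JDK DOM. Everything here is hypothetical: `SimpleRule`, `SimpleFix` and `SimpleValidator` are simplified stand-ins that take instances instead of classes and drop the context and report arguments, and the Open Graph namespace rule is just an example of a detectable, repairable markup issue.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical, simplified stand-ins for Rule and Fix.
interface SimpleRule { String getHRName(); boolean applyOn(Document document); }
interface SimpleFix { String getHRName(); void execute(Document document); }

class SimpleValidator {
    private final Map<SimpleRule, SimpleFix> fixes = new LinkedHashMap<>();

    void addRule(SimpleRule rule, SimpleFix fix) { fixes.put(rule, fix); }

    /** Runs the fix of every rule that fires; returns the names of applied fixes. */
    List<String> validate(Document document) {
        List<String> applied = new ArrayList<>();
        for (Map.Entry<SimpleRule, SimpleFix> e : fixes.entrySet()) {
            if (e.getKey().applyOn(document)) {
                e.getValue().execute(document);
                applied.add(e.getValue().getHRName());
            }
        }
        return applied;
    }

    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input must be well-formed XHTML", e);
        }
    }
}

// Fires when og: properties are used but the root lacks an xmlns:og declaration.
class MissingOgNamespaceRule implements SimpleRule {
    public String getHRName() { return "missing-og-namespace-rule"; }
    public boolean applyOn(Document document) {
        if (document.getDocumentElement().hasAttribute("xmlns:og")) return false;
        NodeList metas = document.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            if (((Element) metas.item(i)).getAttribute("property").startsWith("og:")) return true;
        }
        return false;
    }
}

// Adds the missing declaration to the root element.
class OgNamespaceFix implements SimpleFix {
    public String getHRName() { return "og-namespace-fix"; }
    public void execute(Document document) {
        document.getDocumentElement().setAttribute("xmlns:og", "http://ogp.me/ns#");
    }
}
```

A usage sketch: parse the page with `SimpleValidator.parse(...)`, register the pair with `addRule(new MissingOgNamespaceRule(), new OgNamespaceFix())`, then call `validate(document)`; a second call on the repaired document applies nothing.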


plugins

@PluginImplementation
@Author(name="Michele Mostarda (mostarda@fbk.eu)")
public class HTMLScraperPlugin implements ExtractorPlugin {

    private static final Logger logger = LoggerFactory.getLogger(HTMLScraperPlugin.class);

    @Init
    public void init() {
        logger.info("Plugin initialization.");
    }

    @Shutdown
    public void shutdown() {
        logger.info("Plugin shutdown.");
    }

    public ExtractorFactory getExtractorFactory() {
        return HTMLScraperExtractor.factory;
    }

}


roadmap

upcoming 0.6.0 release
- support for Microdata
- support for CSV
- support for RDFa 1.1 prefix mechanism
- improved app configuration
- bug fixes

Apache (pre) Incubation process
- http://wiki.apache.org/incubator/Any23Proposal
- supporters and mentors (thanks guys!)
  Simone Tripodi (@stripodi)
  Tommaso Teofili (@tteofili)
- we’re looking for mentors


closing credits

active committers

Giovanni Tummarello ( @jccq )
Michele Mostarda ( @micmos )
Davide Palmisano ( @dpalmisano )
Richard Cyganiak ( @cygri )

thanks to the whole Semantic Web community, especially those who tirelessly challenge us with bugs and feature requests
