distilling the web of data drop by drop (with java)
DESCRIPTION
An introduction to the machine-readable Web, the various ways to implement it with Microformats, RDFa or Microdata, and how to consume it. Any23, an open source Java distiller for the Web of Data, is presented.
TRANSCRIPT
distilling the Web of Data drop by drop (with Java)
Sourcesense UK “Last Wednesday” - Davide Palmisano @dpalmisano
Wednesday, June 29, 2011
the shortest introduction ever to the Web of Data
Web page markup technologies are intended for human consumption
they let machines present raw data to humans
extracting valuable data may require fancy scraping techniques
scraping: one size doesn’t fit all
the shortest introduction ever to the Web of Data
<div>
  <div> Canon Rebel T2i (EOS 550D) $899 </div>
  <div> The Rebel T2i EOS 550D is Canon's top-of-the-line consumer digital SLR camera. It can shoot up
  <div> AN_UCC-13: 013803123784 </div>
  <div> price: 899 USD </div>
</div></div>
what does this tag mean?
is this a currency or what?
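To make the "one size doesn't fit all" point concrete, here is a minimal sketch of an ad-hoc scraper written against the unannotated markup above. The class name `PriceScraper` and the regular expression are illustrative assumptions, not part of the talk; the pattern only understands the exact wording "price: <amount> <code>" used on this one page.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical scraper tied to a single page layout. */
final class PriceScraper {

    // Matches e.g. "price: 899 USD"; other wordings such as "$899" are not recognised.
    private static final Pattern PRICE =
            Pattern.compile("price:\\s*(\\d+(?:\\.\\d+)?)\\s+([A-Z]{3})");

    /** Returns "<amount> <currency>" when the page uses the expected wording. */
    static Optional<String> scrapePrice(String html) {
        Matcher m = PRICE.matcher(html);
        return m.find() ? Optional.of(m.group(1) + " " + m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<div> <div> Canon Rebel T2i (EOS 550D) $899 </div>"
                + " <div> AN_UCC-13: 013803123784 </div>"
                + " <div> price: 899 USD </div> </div>";
        System.out.println(scrapePrice(page));                       // Optional[899 USD]
        System.out.println(scrapePrice("<div>Price - $899</div>"));  // Optional.empty
    }
}
```

The second call fails on a page that states the same fact with different wording, which is the kind of guesswork that explicit markup is meant to remove.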
“meaning”, Joseph Kosuth, The Panza Collection, MART - Rovereto, Italy
Microformats
“Microformats are a way of adding simple markup to human-readable data items such as events, contact details or locations, on web pages”
Andy Mabbett
- community-driven initiative
- largely adopted
- quick & dirty
- scarcely extensible
Microformats
<div class="hlisting item">
  <div> Canon Rebel T2i (EOS 550D) $899 </div>
  <div class="description"> The Rebel T2i EOS 550D is Canon's top-of-the-line consumer digital SLR camera. It can shoot up
  <div> AN_UCC-13: 013803123784 </div>
  <div class="price"> price: 899 USD </div>
</div></div>
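A consumer of this markup can look fields up by class name instead of guessing at the layout. The sketch below is a simplified illustration using only the JDK's XML and XPath APIs: the class `HListingReader`, its `SAMPLE` string (a shortened, well-formed variant of the slide's markup) and the method name are assumptions made for the example, and real-world HTML would need a tolerant parser first.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical reader for class-based (microformat-style) markup. */
final class HListingReader {

    // Well-formed, shortened variant of the hListing example (assumption).
    static final String SAMPLE =
            "<div class=\"hlisting item\">"
          + "<div>Canon Rebel T2i (EOS 550D) $899</div>"
          + "<div class=\"description\">The Rebel T2i EOS 550D is Canon's blah blah</div>"
          + "<div class=\"price\"> price: 899 USD </div>"
          + "</div>";

    /** Returns the trimmed text of the first element carrying the class token, or "". */
    static String textOfClass(String xhtml, String cssClass) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // @class is a space-separated token list, so pad it before matching.
            // cssClass is assumed to be a plain token without quotes.
            String expr = "string(//*[contains(concat(' ', normalize-space(@class), ' '), ' "
                    + cssClass + " ')])";
            return xpath.evaluate(expr, doc).trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOfClass(SAMPLE, "price"));        // price: 899 USD
        System.out.println(textOfClass(SAMPLE, "description"));
    }
}
```

The price still comes back as the free-text string "price: 899 USD": the class name says where the price is, not which part is the amount and which the currency.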
RDFa: RDF in attributes
model your data as if they were Web pages connected by named links and properties, and embed them in your (X)HTML using @attributes
- RDF, graph-based model
- W3C Recommendation
- highly extensible
e.g. GoodRelations [1], a full-fledged vocabulary for e-commerce
RDFa: RDF in attributes
model your data
[graph: the resource http://mystore.com/product/5642 described through the properties ex:producer, ex:description, ex:price, ex:value and ex:currency, with http://canon.co.uk, “The Rebel T2i EOS 550D blah blah”, 899 and USD as values]
RDFa: RDF in attributes
<div about="http://mystore.com/product/5642">
  <div>Canon Rebel T2i (EOS 550D) $899</div>
  <div property="gr:description">The Rebel T2i EOS 550D is Canon's blah blah</div>
  <div rel="gr:hasPriceSpecification">
    <span> price:
      <span property="gr:hasCurrencyValue">899</span>
      <span property="gr:hasCurrency">USD</span>
    </span>
  </div>
</div>
and then embed them in your (X)HTML pages
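To show how such attributes turn into a graph, here is a deliberately tiny walker over `about`, `rel` and `property`. It is a sketch under strong assumptions and not a conformant RDFa processor: it ignores `resource`, `href`, `content`, datatypes and prefix expansion, prints Turtle-like lines with the prefixes left as written, and does not escape quotes in literals. The class name `TinyRdfaWalker` and its `SAMPLE` are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical, heavily simplified RDFa-style walker. */
final class TinyRdfaWalker {

    // Well-formed variant of the RDFa example (assumption).
    static final String SAMPLE =
            "<div about=\"http://mystore.com/product/5642\">"
          + "<div>Canon Rebel T2i (EOS 550D) $899</div>"
          + "<div property=\"gr:description\">The Rebel T2i EOS 550D is Canon's blah blah</div>"
          + "<div rel=\"gr:hasPriceSpecification\"><span> price: "
          + "<span property=\"gr:hasCurrencyValue\">899</span> "
          + "<span property=\"gr:hasCurrency\">USD</span></span></div>"
          + "</div>";

    private final List<String> triples = new ArrayList<>();
    private int blankNodes = 0;

    /** Returns one Turtle-like line per statement found in the markup. */
    static List<String> distill(String xhtml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml))).getDocumentElement();
            TinyRdfaWalker walker = new TinyRdfaWalker();
            walker.visit(root, "<>"); // "<>" stands for the document itself
            return walker.triples;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    private void visit(Element el, String subject) {
        if (el.hasAttribute("about")) {
            subject = "<" + el.getAttribute("about") + ">"; // new subject
        }
        if (el.hasAttribute("property")) {
            // the element's text becomes a plain literal
            triples.add(subject + " " + el.getAttribute("property")
                    + " \"" + el.getTextContent().trim() + "\" .");
        }
        if (el.hasAttribute("rel")) {
            // no explicit target: link to a fresh blank node that nested properties describe
            String object = "_:b" + (blankNodes++);
            triples.add(subject + " " + el.getAttribute("rel") + " " + object + " .");
            subject = object;
        }
        for (Node child = el.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                visit((Element) child, subject);
            }
        }
    }

    public static void main(String[] args) {
        distill(SAMPLE).forEach(System.out::println);
    }
}
```

Unlike the class-based approach, the amount and the currency end up as separate values attached to a price node, which in turn is linked to the product.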
HTML5: Microdata
Microdata allows nested groups of name-value pairs to be added to HTML documents, in parallel with the existing content
- W3C Working Draft
- native to the HTML5 specification
- serializable to RDF
- Google, Yahoo! and Bing endorsed Schema.org
- large adoption expected
HTML5: Microdata
<div itemscope itemtype="http://schema.org/Offer">
  <div itemprop="name"> Canon Rebel T2i (EOS 550D) $899 </div>
  <div itemprop="description"> The Rebel T2i EOS 550D is Canon's blah blah</div>
  <div>
    <span> price:
      <span itemprop="price"> 899 </span>
      <span itemprop="priceCurrency"> USD </span>
    </span>
  </div>
</div>
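The "name-value pairs" idea can be illustrated by collecting every `itemprop` inside an `itemscope` into a map. This is a minimal sketch with invented names (`TinyMicrodataReader`, `readFirstItem`, `SAMPLE`); it assumes well-formed XHTML, so the boolean attribute is written `itemscope=""`, and it ignores nested items, `itemref` and the value rules for elements such as `meta` or `a`.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical, simplified reader for a single flat Microdata item. */
final class TinyMicrodataReader {

    // Well-formed variant of the Microdata example (assumption).
    static final String SAMPLE =
            "<div itemscope=\"\" itemtype=\"http://schema.org/Offer\">"
          + "<div itemprop=\"name\"> Canon Rebel T2i (EOS 550D) $899 </div>"
          + "<div itemprop=\"description\"> The Rebel T2i EOS 550D is Canon's blah blah</div>"
          + "<div><span> price: <span itemprop=\"price\"> 899 </span> "
          + "<span itemprop=\"priceCurrency\"> USD </span></span></div>"
          + "</div>";

    /** Returns the itemtype (under "@type") and the itemprop pairs of the first item. */
    static Map<String, String> readFirstItem(String xhtml) {
        Map<String, String> pairs = new LinkedHashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // find the first element that opens an item
            NodeList all = doc.getElementsByTagName("*");
            Element item = null;
            for (int i = 0; i < all.getLength() && item == null; i++) {
                Element el = (Element) all.item(i);
                if (el.hasAttribute("itemscope")) {
                    item = el;
                }
            }
            if (item == null) {
                return pairs;
            }
            pairs.put("@type", item.getAttribute("itemtype"));
            // every descendant with itemprop contributes one name-value pair
            NodeList inside = item.getElementsByTagName("*");
            for (int i = 0; i < inside.getLength(); i++) {
                Element el = (Element) inside.item(i);
                if (el.hasAttribute("itemprop")) {
                    pairs.put(el.getAttribute("itemprop"), el.getTextContent().trim());
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        return pairs;
    }

    public static void main(String[] args) {
        readFirstItem(SAMPLE).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```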
% of marked up Web pages
[chart: share of Web pages (axis from 0 to 3.5) carrying RDFa, hCard, adr, xfn and hReview markup, sampled in 09/2008, 03/2009 and 10/2010]
data from Yahoo! [2]
tie ‘em all together
uniform, reconciled and unified RDF representation
a drop-by-drop distiller
Anything To Triples (any23) is an open source, Apache-licensed:
- Java library,
- Web service and
- command-line tool
able to distill RDF triples from a variety of semantically marked up Web documents
http://developers.any23.org
live demo http://any23.org
Web site with ~5000 product descriptions marked up with GoodRelations using RDFa
use Any23 in your Java programs
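// create the distiller and set the user agent sent to remote servers
Any23 runner = new Any23();
runner.setHTTPUserAgent("test-user-agent");
HTTPClient httpClient = runner.getHTTPClient();
// the document to distill, fetched over HTTP
DocumentSource source = new HTTPDocumentSource(
    httpClient,
    "http://test.com/index.html"
);
// collect the extracted triples in memory, serialized as N-Triples
ByteArrayOutputStream out = new ByteArrayOutputStream();
TripleHandler handler = new NTriplesWriter(out);
runner.extract(source, handler);
String n3 = out.toString("UTF-8");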
Any23: Command-Line tool
any23-core/bin$ ./any23
usage: any23 [-e <arg>] [-f <arg>] [-l <arg>] [-n] [-o <arg>] [-p] [-s] [-t] [-v] {<url>|<file>}
 -e <arg>            comma-separated list of extractors, e.g. rdf-xml,rdf-turtle
 -f,--format <arg>   Output format [turtle (default), ntriples, rdfxml, quad, uris]
 -l,--log <arg>      logging, please specify a file
 -n,--nesting        disable production of nesting triples
 -o,--output <arg>   ouput file (defaults to stdout)
 -p,--pedantic       validates and fixes HTML content detecting commons issues
 -s,--stats          print out statistics of Any23
Any23: Web Service
blacky:~ davide$ curl 'http://any23.org/any23/?format=nquads&url=http://www.bbc.co.uk/programmes/b00kygwh&report=on'
<response>
  <extractors>
    <extractor>rdf-xml</extractor>
  </extractors>
  <report>
    ...
    <validationReport>
      <ruleActivations></ruleActivations>
      ...
    </validationReport>
  </report>
  <data>
    <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/b00kygwh#programme">
      <rdf:type rdf:resource="http://purl.org/ontology/po/Episode"/>
      <po:pid>b00kygwh</po:pid>
      <dc:title>The Terminator</dc:title>
    </rdf:Description>
  </data>
</response>
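When the same request is built from Java, the target document's URL is safer percent-encoded, since it may itself contain `&` or `?`. The sketch below only assembles the request string; the endpoint and the `format`, `url` and `report` parameters are taken from the curl call above, while the class `Any23ServiceUrl` is invented here and it is an assumption that the service decodes query parameters in the usual way.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that assembles a request URL for the Any23 Web service. */
final class Any23ServiceUrl {

    /** Builds endpoint?format=...&url=...[&report=on] with encoded parameter values. */
    static String build(String endpoint, String format, String documentUrl, boolean report) {
        return endpoint
                + "?format=" + encode(format)
                + "&url=" + encode(documentUrl)
                + (report ? "&report=on" : "");
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build(
                "http://any23.org/any23/",
                "nquads",
                "http://www.bbc.co.uk/programmes/b00kygwh",
                true));
    }
}
```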
[architecture diagram; labels: mimetype detection, Apache Tika, DOM extraction, Validator, Cyber Neko HTML, Rule, Fix, RDFaExtractor, MicrodataExtractor, Microformat Extractors (hReview, hCalendar, hCard, hListing), ExtractionResult, RDF/XML Writer, JSON Writer, NQuadsWriter, Sesame]
extractor
public interface Extractor<Input> {

    /**
     * Executes the extractor. Will be invoked only once, extractors are
     * not reusable.
     *
     * @param in The extractor's input
     * @param documentURI The document's URI
     * @param out Sink for extracted data
     * @throws IOException On error while reading from the input stream
     * @throws ExtractionException On other error, such as parse errors
     */
    void run(Input in, URI documentURI, ExtractionResult out)
            throws IOException, ExtractionException;

    /**
     * Returns a {@link org.deri.any23.extractor.ExtractorDescription} of
     * this extractor.
     */
    ExtractorDescription getDescription();

}
validate and fix
public interface Rule {

    String getHRName();

    boolean applyOn(
        DOMDocument document,
        RuleContext context,
        ValidationReportBuilder validationReportBuilder
    );
}

public interface Fix {
    String getHRName();
    void execute(Rule rule, RuleContext context, DOMDocument document);
}

void addRule(Class<? extends Rule> rule, Class<? extends Fix> fix);
plugins
@PluginImplementation
@Author(name="Michele Mostarda ([email protected])")
public class HTMLScraperPlugin implements ExtractorPlugin {

    private static final Logger logger = LoggerFactory.getLogger(HTMLScraperPlugin.class);

    @Init
    public void init() {
        logger.info("Plugin initialization.");
    }

    @Shutdown
    public void shutdown() {
        logger.info("Plugin shutdown.");
    }

    public ExtractorFactory getExtractorFactory() {
        return HTMLScraperExtractor.factory;
    }

}
roadmap
upcoming 0.6.0 release
- support for Microdata
- support for CSV
- support for RDFa 1.1 prefix mechanism
- improved app configuration
- bug fixes
Apache (pre) Incubation process
- http://wiki.apache.org/incubator/Any23Proposal
- supporters and mentors (thanks guys!)
  Simone Tripodi (@stripodi)
  Tommaso Teofili (@tteofili)
- we’re looking for mentors
closing credits
active committers
Giovanni Tummarello ( @jccq )
Michele Mostarda ( @micmos )
Davide Palmisano ( @dpalmisano )
Richard Cyganiak ( @cygri )
thanks to the whole Semantic Web community, especially those who tirelessly challenge us with bug reports and feature requests
References
[1] http://purl.org/goodrelations/v1
[2] http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/