Rio Info 2013 - Linked Data at globo.com
Linked Data at globo.com
Tatiana Al-Chueyr, @tati_alchueyr
September 18, 2013, Rio Info Symposium
BROADCAST MOVIES PAY TV INTERNET
EVENTS MUSIC
PUBLISHING
NEW VENTURES NEWSPAPER RADIO NETWORK
Semantic Team: Andréia Bustamante, Ícaro Medeiros, Tatiana Al-Chueyr, Rodrigo Senra
Contributors: Franklin Amorim, Diogo Kiss
Motivation: Not only words
São Paulo?
● São Paulo state
● São Paulo city
● São Paulo saint
● São Paulo soccer team
Motivation: Multiple words for the same thing
Female, f, F, female, woman, ...
http://data.globo.com/female
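The two motivation points (one word for many things, many words for one thing) can be sketched in Python as two small lookup tables. This is only an illustration: apart from http://data.globo.com/female, every URI below is a hypothetical placeholder under example.org, and the function names are not taken from the talk.

```python
# Many words for the same thing: each label resolves to one entity URI.
# Only the "female" URI appears in the slides; the labels are from the slide.
SYNONYMS = {
    "female": "http://data.globo.com/female",
    "f": "http://data.globo.com/female",
    "woman": "http://data.globo.com/female",
}

# One word for many things: a label maps to several candidate entities.
# These URIs are hypothetical placeholders, not Globo.com identifiers.
AMBIGUOUS = {
    "são paulo": [
        ("state", "http://example.org/place/State/sao_paulo"),
        ("city", "http://example.org/place/City/sao_paulo"),
        ("saint", "http://example.org/person/Saint/sao_paulo"),
        ("soccer team", "http://example.org/sports/Team/sao_paulo"),
    ],
}


def normalize(label):
    """Trim whitespace and fold case so 'F', 'f' and ' Female ' compare equal."""
    return label.strip().casefold()


def resolve_synonym(label):
    """Return the single entity URI for a known label, or None."""
    return SYNONYMS.get(normalize(label))


def candidates(label):
    """Return every (kind, URI) pair a label could refer to."""
    return AMBIGUOUS.get(normalize(label), [])
```

Picking one candidate for "São Paulo" still needs context (for example, the surrounding text), which this sketch does not attempt.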
Motivation: Cross-link content from different web products
● Soccer player
● Politician
● Celebrity
Isabella Nardoni was killed on March 29, 2008, in the North Zone of São Paulo (Photo: Reproduction)
Isabella de Oliveira Nardoni, aged 5, was killed on the night of March 29, 2008. Forensic experts concluded that the girl was thrown from the sixth floor of the building where her father, Alexandre Nardoni, her stepmother, Anna Carolina Jatobá, and the couple's two young children lived, in Vila Isolina Mazzei, in the North Zone of São Paulo.
Isabella's grave becomes a place of visitation in SP; the Nardoni couple is in prison.
The Isabella Nardoni case
Juliana Cardilli, G1 SP
RDF
FOAF
GEO
Dublin Core
SKOS
Motivation: Semantic markup in web pages
Motivation: Recommend annotations to information producers
Motivation: Suggest related content to information consumers
Changes
● Replacement of words by entities
http://data.globo.com/person/Person/santos_dumont
● Replacement of labels by qualified relationships
● Reorganization of data from tables to graphs
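As a rough illustration of moving data from tables to graphs, the Python sketch below turns one table row into subject-predicate-object triples. The subject URI is the one shown in the slides; the `http://example.org/` predicate namespace, the column names and the helper name are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical predicate namespace, used only for this illustration.
BASE = "http://example.org/"

# Entity URI taken from the slides.
SANTOS_DUMONT = "http://data.globo.com/person/Person/santos_dumont"


def row_to_triples(subject: str, row: Dict[str, str]) -> List[Triple]:
    """Emit one (subject, predicate, object) triple per non-empty cell."""
    return [
        (subject, BASE + column, value)
        for column, value in row.items()
        if value  # empty cells simply produce no triple
    ]


# A table row with one empty cell; column names are illustrative.
row = {"label": "Santos Dumont", "occupation": "aviator", "nickname": ""}
triples = row_to_triples(SANTOS_DUMONT, row)
```

Unlike a table, the graph form needs no placeholder for missing values: the empty `nickname` cell just yields no triple.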
Outcomes
● Replacing words by entities improved the following operations over multiple layers of information:
○ Finding
○ Linking
○ Reconciling
○ Organizing
Outcomes
● Flexible ways to organize content
● Related issues are easier to find
● Explicit relations derived from annotated content
● Up-to-date topic pages with little editorial effort
● Linking content across different web products
● Seamless navigation leading to a flow state
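One way to picture how relations can be derived from annotated content is to link items that share entities. The Python sketch below is a simplified assumption about such a mechanism, not the approach described in the talk; the content ids and entity names are invented.

```python
from collections import defaultdict


def build_index(annotations):
    """Map each entity to the set of content ids annotated with it."""
    index = defaultdict(set)
    for content_id, entities in annotations.items():
        for entity in entities:
            index[entity].add(content_id)
    return index


def related(content_id, annotations, index):
    """Rank other content by the number of entities shared with content_id."""
    scores = defaultdict(int)
    for entity in annotations[content_id]:
        for other in index[entity]:
            if other != content_id:
                scores[other] += 1
    # Highest overlap first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Invented annotations spanning different web products.
annotations = {
    "g1/news-1": {"sao_paulo_city", "politician_x"},
    "g1/news-2": {"sao_paulo_city"},
    "ego/story-3": {"sao_paulo_city", "politician_x"},
    "ge/match-7": {"sao_paulo_team"},
}
index = build_index(annotations)
suggestions = related("g1/news-1", annotations, index)
```

Because the city and the soccer team are distinct entities, the sports item is not suggested for the news item even though both mention "São Paulo".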
Status Quo
Used by the main web products of Globo.com:
○ 18,485 organizations
○ 83,000 people
○ 9,129 places
○ 1,000,000+ annotated news items
In total: 2,500,000+ entities!
from August 2010 to May 2013
Linked data problems
Legacy Architecture
[Diagram: several CMA and CDA applications, the triple store, the search engine and the ontology]
Problems
Poor data management
○ direct (unmanaged) access to the triple store
○ difficulty sharing data (distributed DBs)
○ need to re-sync the triple store and the search engine index
○ scalability of the triple store
○ high entropy in distributed ontology engineering
Problems: Ontology Engineering
[Diagram: product-driven ontologies (past) versus domain-driven ontologies (current). Labels: Base; products G1 (news), GE (sports), EGO (gossip), TVG (tv); Upper; domains Person, Organization, Place, Music, Politics, Programme, Education, Sports]
Possible solution: Upper Ontology
Problems
Semantics as a library
○ many different versions in production
○ programming-language dependent
○ steep learning curve for RDF/OWL/SPARQL
Solution
Create an open semantic data management platform
● Scalable
● Mobile and Web friendly
● Interconnect Globo's data with external data sources
● Automate content extraction (including NER)
Brainiak: linked data RESTful API
Legacy Architecture
[Diagram: CMA and CDA applications, triple store, search engine, ontology]
Brainiak API
[Diagram: CMA and CDA applications, the Brainiak API, triple store, search engine]
Under Development
Requirements
● Indirect usage of SPARQL
● Programming-language independent
● Data management with quality
● Finer-grained authorization and authentication
● Isolate applications from the triple store
● Improve triple store performance
Task: list all sports teams
SPARQL query:
    DEFINE input:inference <http://data.globo.com/ruleset>
    SELECT ?uri ?label
    FROM <http://data.globo.com/sports/>
    WHERE {
        ?uri a <http://data.globo.com/sports/Team> ;
             rdfs:label ?label .
    }
    LIMIT 10 OFFSET 0
Brainiak query:
    GET /sports/Team
[Screenshots: SPARQL response and Brainiak response]
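To make the contrast between the two requests concrete, here is a minimal Python sketch of how a REST layer might expand a path such as /sports/Team into the SPARQL query from the slide. It is not Brainiak's implementation: the function names and the `page` and `per_page` parameters are assumptions, and only the query text itself comes from the slide.

```python
BASE = "http://data.globo.com"
RULESET = f"{BASE}/ruleset"  # inference rule set named in the slide's query


def parse_path(path):
    """Split '/sports/Team' into ('sports', 'Team')."""
    context, class_name = path.strip("/").split("/")
    return context, class_name


def list_instances_query(context, class_name, page=0, per_page=10):
    """Build a query listing instances of a class, one page at a time.

    The paging parameters are hypothetical; the slide only shows
    LIMIT 10 OFFSET 0.
    """
    graph = f"{BASE}/{context}/"
    class_uri = f"{graph}{class_name}"
    return " ".join([
        f"DEFINE input:inference <{RULESET}>",
        "SELECT ?uri ?label",
        f"FROM <{graph}>",
        f"WHERE {{ ?uri a <{class_uri}> ; rdfs:label ?label . }}",
        f"LIMIT {per_page} OFFSET {page * per_page}",
    ])


query = list_instances_query(*parse_path("/sports/Team"))
```

With a layer like this, client applications only deal with paths and never write SPARQL themselves, which is the "indirect usage of SPARQL" requirement.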
Task: retrieve all superclasses of a class
SPARQL query:
    SELECT DISTINCT ?class
    WHERE {
        <http://data.globo.com/place/City> rdfs:subClassOf ?class OPTION (TRANSITIVE, t_distinct, t_step('step_no') as ?n, t_min (0)) .
        ?class a owl:Class .
    }
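The transitive superclass lookup can be pictured as a breadth-first walk over `rdfs:subClassOf` links. The Python sketch below works on an in-memory dictionary rather than a triple store. The class names appear in the slides, but the parent links between them are assumed for illustration, and including the starting class in the result reflects one reading of `t_min (0)`.

```python
from collections import deque

# Assumed subClassOf links; the real ontology may be shaped differently.
SUBCLASS_OF = {
    "place/City": ["place/GeopoliticalDivision"],
    "place/GeopoliticalDivision": ["place/Place"],
    "place/Place": ["upper/Entity"],
}


def superclasses(cls, subclass_of):
    """Return cls followed by all its direct and indirect superclasses."""
    seen = [cls]  # zero steps: the class itself is part of the result
    queue = deque([cls])
    while queue:
        current = queue.popleft()
        for parent in subclass_of.get(current, []):
            if parent not in seen:  # guards against cycles and duplicates
                seen.append(parent)
                queue.append(parent)
    return seen


chain = superclasses("place/City", SUBCLASS_OF)
```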
Task: retrieve all properties of a group of classes
SPARQL query:
    SELECT DISTINCT ?predicate ?predicate_graph ?predicate_comment ?type ?range ?title ?range_graph ?range_label ?super_property
    WHERE {
        {
            GRAPH ?predicate_graph { ?predicate rdfs:domain ?domain_class } .
        } UNION {
            graph ?predicate_graph { ?predicate rdfs:domain ?blank } .
            ?blank a owl:Class .
            ?blank owl:unionOf ?enumeration .
            OPTIONAL { ?enumeration rdf:rest ?list_node OPTION(TRANSITIVE, t_min (0)) } .
            OPTIONAL { ?list_node rdf:first ?domain_class } .
        }
        FILTER (?domain_class IN (
            <http://data.globo.com/place/City>,
            <http://data.globo.com/place/GeopoliticalDivision>,
            <http://data.globo.com/place/Place>,
            <http://data.globo.com/upper/Object>,
            <http://data.globo.com/upper/Substance>,
            <http://data.globo.com/upper/ConcreteEntity>,
            <http://data.globo.com/upper/Entity>))
        { ?predicate rdfs:range ?range . }
        UNION {
            ?predicate rdfs:range ?blank .
            ?blank a owl:Class .
            ?blank owl:unionOf ?enumeration .
            OPTIONAL { ?enumeration rdf:rest ?list_node OPTION(TRANSITIVE, t_min (0)) } .
            OPTIONAL { ?list_node rdf:first ?range } .
        }
        FILTER (!isBlank(?range))
        ?predicate rdfs:label ?title .
        ?predicate rdf:type ?type .
        OPTIONAL { ?predicate rdfs:subPropertyOf ?super_property } .
        FILTER (?type in (owl:ObjectProperty, owl:DatatypeProperty)) .
        FILTER(langMatches(lang(?title), "en") OR langMatches(lang(?title), "")) .
        OPTIONAL { ?predicate rdfs:comment ?predicate_comment }
        FILTER(langMatches(lang(?predicate_comment), "en") OR langMatches(lang(?predicate_comment), "")) .
        OPTIONAL {
            GRAPH ?range_graph {
                ?range rdfs:label ?range_label .
                FILTER(langMatches(lang(?range_label), "en") OR langMatches(lang(?range_label), "")) .
            }
        }
    }
Task: retrieve the cardinalities of all properties of a certain class
SPARQL query:
    SELECT DISTINCT ?predicate ?min ?max ?range ?enumerated_value ?enumerated_value_label
    WHERE {
        <http://data.globo.com/place/City> rdfs:subClassOf ?s OPTION (TRANSITIVE, t_distinct, t_step('step_no') as ?n, t_min (0)) .
        ?s owl:onProperty ?predicate .
        OPTIONAL { ?s owl:minQualifiedCardinality ?min } .
        OPTIONAL { ?s owl:maxQualifiedCardinality ?max } .
        OPTIONAL {
            { ?s owl:onClass ?range }
            UNION { ?s owl:onDataRange ?range }
            UNION { ?s owl:allValuesFrom ?range }
            OPTIONAL { ?range owl:oneOf ?enumeration } .
            OPTIONAL { ?enumeration rdf:rest ?list_node OPTION(TRANSITIVE, t_min (0)) } .
            OPTIONAL { ?list_node rdf:first ?enumerated_value } .
            OPTIONAL { ?enumerated_value rdfs:label ?enumerated_value_label . } .
        }
    }
Brainiak query:
    GET /place/City/_schema
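As an illustration of what a schema endpoint could assemble from the cardinality query results, the Python sketch below folds rows of (predicate, min, max, range) into a single class description. The output layout, the key names and the example predicates are hypothetical; they do not reproduce Brainiak's actual response format.

```python
def build_schema(class_uri, restrictions):
    """Merge restriction rows into one description per predicate."""
    properties = {}
    for row in restrictions:
        spec = properties.setdefault(row["predicate"], {})
        if row.get("min") is not None:
            spec["minItems"] = int(row["min"])
        if row.get("max") is not None:
            spec["maxItems"] = int(row["max"])
        if row.get("range") is not None:
            spec["range"] = row["range"]
    # A predicate is required when its minimum cardinality is at least 1.
    required = sorted(
        predicate
        for predicate, spec in properties.items()
        if spec.get("minItems", 0) >= 1
    )
    return {"id": class_uri, "properties": properties, "required": required}


# Invented rows shaped like the result bindings of the cardinality query.
rows = [
    {"predicate": "http://example.org/name", "min": "1", "max": "1",
     "range": "http://www.w3.org/2001/XMLSchema#string"},
    {"predicate": "http://example.org/partOf", "min": None, "max": "1",
     "range": "http://data.globo.com/place/Place"},
]
schema = build_schema("http://data.globo.com/place/City", rows)
```

A client can then validate or render a form for a City from this one structure instead of issuing several ontology queries.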
Next steps
● Enrich Globo.com search
● SEO (automatic schema.org)
● Improve annotator (DBpedia Spotlight)
● Richer content relationships (inference)
● Link to open data (e.g. DBpedia, dados.gov.br)
Stay tuned: @brainiak_api
... will soon be released as an open source project!
Slides
http://www.slideshare.net/
@semantic_team @alchueyr