Data and text mining: the search for unknown knowns. Geoffrey Bilder, UKSG, 2007. [email protected]
"Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know."
The Mining Metaphor
Gold Mining
Diamond Mining
Data Mining
Data Mining: What it isn’t
≠ Information Retrieval
≠ Information Extraction
≠ Information Analysis
Information Retrieval + Information Extraction + Information Analysis
Data Mining: new, previously unknown information
And so what is text data mining?
Text Mining
Information Retrieval + Information Extraction + Information Analysis
The crucial question for publishers is: “If ‘hiding’ information in unstructured text is a problem, then shouldn’t we be exploring new ways to ‘publish’?”
So how did we get here?
• The word tobacco originates from the Taino indians.
• There is no I in the word Team.
• The book captured the zeitgeist of the time.
• I am sure that I turned the gas off.
The book captured the <foreign_phrase lang="DE">zeitgeist</foreign_phrase> of the time.
I am <emphasis>sure</emphasis> that I turned the gas off.
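The two marked-up sentences above state the role of a word explicitly, so a program can tell a foreign phrase from an emphasised word. The following is a minimal sketch of that idea using Python's standard XML parser. The wrapping `<p>` element and the function name `describe_markup` are assumptions added for illustration; they do not come from the slides.

```python
import xml.etree.ElementTree as ET

# The two marked-up sentences from the slide. Each is wrapped in a
# hypothetical <p> element (an assumption) so it is well-formed XML.
sentences = [
    '<p>The book captured the <foreign_phrase lang="DE">zeitgeist'
    '</foreign_phrase> of the time.</p>',
    '<p>I am <emphasis>sure</emphasis> that I turned the gas off.</p>',
]


def describe_markup(xml_text):
    """Return (tag, attributes, text) for each inline element in a sentence."""
    root = ET.fromstring(xml_text)
    return [(child.tag, dict(child.attrib), child.text) for child in root]


for sentence in sentences:
    print(describe_markup(sentence))
```

With plain text, both words would simply look different from their neighbours; with the markup, the program is told why.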
Semantic Web “Light”
But we can do more...
The web as a database
Title       Author             ISBN-13         Publisher
Labyrinths  Jorge Luis Borges  978-0811200127  New Directions
Hopscotch   Julio Cortazar     978-0394752846  Pantheon
The Aleph   Jorge Luis Borges  978-0140286809  Penguin
...         ...                ...             ...
The Relational Model
Rows represent things
Columns are properties
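The relational view of the book table (rows represent things, columns are properties) can be sketched with an in-memory SQLite database. This is only an illustration: the table name `books`, the column names and the helper `titles_by` are assumptions, and the three rows are the ones shown on the slide.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Each column is a property of a book (names are assumptions).
conn.execute(
    "CREATE TABLE books ("
    "title TEXT, author TEXT, isbn13 TEXT PRIMARY KEY, publisher TEXT)"
)

# Each row is one thing: a book from the slide's table.
rows = [
    ("Labyrinths", "Jorge Luis Borges", "978-0811200127", "New Directions"),
    ("Hopscotch", "Julio Cortazar", "978-0394752846", "Pantheon"),
    ("The Aleph", "Jorge Luis Borges", "978-0140286809", "Penguin"),
]
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?)", rows)


def titles_by(author):
    """Return the titles whose 'author' property equals the given name."""
    cur = conn.execute(
        "SELECT title FROM books WHERE author = ? ORDER BY title", (author,)
    )
    return [row[0] for row in cur.fetchall()]


print(titles_by("Jorge Luis Borges"))  # ['Labyrinths', 'The Aleph']
```

The query only works because the author sits in a named column rather than somewhere in running text.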
The book has an author “Jorge Luis Borges”
The thing’s property
Subject Predicate Object
The book has an author “Jorge Luis Borges”
Subject Predicate Object
URI URI
http://www.amazon.com/isbn/978-0140286809
has an author
http://www.wikipedia.com/borges
RDF: Resource Description Framework
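As a rough sketch, the subject-predicate-object statement can be written out in N-Triples, one of the plain-text RDF syntaxes. The subject and object URIs are the ones on the slide. The slide gives the predicate only as "has an author", so the Dublin Core `creator` and `title` URIs used here are stand-ins chosen for illustration, and `to_ntriples` is a hypothetical helper, not part of any library.

```python
# Subject and object URIs as shown on the slide.
BOOK = "http://www.amazon.com/isbn/978-0140286809"
AUTHOR = "http://www.wikipedia.com/borges"

# Assumption: Dublin Core terms stand in for the slide's "has an author".
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
DC_TITLE = "http://purl.org/dc/elements/1.1/title"

triples = [
    (BOOK, DC_CREATOR, AUTHOR),     # object is another resource (a URI)
    (BOOK, DC_TITLE, "The Aleph"),  # object is a plain literal
]


def to_ntriples(statements):
    """Serialise (subject, predicate, object) tuples as N-Triples lines."""
    lines = []
    for subject, predicate, obj in statements:
        if obj.startswith("http://"):
            # Simplification: treat anything that looks like a URI as one.
            obj_text = "<" + obj + ">"
        else:
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            obj_text = '"' + escaped + '"'
        lines.append("<%s> <%s> %s ." % (subject, predicate, obj_text))
    return "\n".join(lines)


print(to_ntriples(triples))
```

Because subject and predicate are URIs, statements published by different sites about the same book can be merged without agreeing on a shared table layout first.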
Journal A, Journal B, Wiki, Blog, Personal Website, OPAC
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name
WHERE {
  ?x rdf:type foaf:Person .
  ?x foaf:name ?name
}
ORDER BY ?name
SPARQL
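To make the query less abstract, here is a plain-Python stand-in for what it asks: find every ?x typed as a foaf:Person, collect its foaf:name, drop duplicates and sort. This is not a SPARQL engine, only a sketch of the pattern matching. The `example.org` resources and the function `person_names` are made up for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical triples; the example.org URIs are invented.
triples = [
    ("http://example.org/borges", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/borges", FOAF_NAME, "Jorge Luis Borges"),
    ("http://example.org/cortazar", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/cortazar", FOAF_NAME, "Julio Cortazar"),
    # Has a name but is not typed as a Person, so it must not match.
    ("http://example.org/pantheon", FOAF_NAME, "Pantheon"),
]


def person_names(statements):
    """Mimic the query: distinct, sorted names of resources typed as Person."""
    # Pattern 1: ?x rdf:type foaf:Person
    people = {s for s, p, o in statements if p == RDF_TYPE and o == FOAF_PERSON}
    # Pattern 2: ?x foaf:name ?name, joined on the same ?x
    names = {o for s, p, o in statements if p == FOAF_NAME and s in people}
    # SELECT DISTINCT ... ORDER BY ?name
    return sorted(names)


print(person_names(triples))  # ['Jorge Luis Borges', 'Julio Cortazar']
```

The shared variable ?x is what joins the two patterns; in the sketch that join is the `s in people` test.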
http://api.ingentaconnect.com/content/cabi/nrr/latest?format=rss
RSS 1.0
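RSS 1.0 feeds are themselves RDF/XML, so each item is a resource identified by a URI. The sketch below reads a small inline sample instead of fetching the feed above. The sample content and its `example.org` URIs are invented, it is trimmed compared with a full RSS 1.0 channel, and `feed_items` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by RDF and RSS 1.0.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
}

# Made-up, trimmed RSS 1.0 document (not the content of any real feed).
SAMPLE_FEED = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.org/feed">
    <title>Example journal</title>
    <link>http://example.org/</link>
    <description>Invented sample feed</description>
  </channel>
  <item rdf:about="http://example.org/articles/1">
    <title>First example article</title>
    <link>http://example.org/articles/1</link>
  </item>
  <item rdf:about="http://example.org/articles/2">
    <title>Second example article</title>
    <link>http://example.org/articles/2</link>
  </item>
</rdf:RDF>"""


def feed_items(xml_text):
    """Return (item URI, title) pairs for every item in an RSS 1.0 document."""
    root = ET.fromstring(xml_text)
    about = "{%s}about" % NS["rdf"]  # the rdf:about attribute names the item
    return [
        (item.get(about), item.findtext("rss:title", namespaces=NS))
        for item in root.findall("rss:item", NS)
    ]


for uri, title in feed_items(SAMPLE_FEED):
    print(uri, "-", title)
```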
FRBR
Creative Commons
FOAF
Geo
SKOS
The Early Modern Internet
Data Mining = Information retrieval + Information extraction + Information analysis...
With the goal of discovering new, previously unknown information
Data Mining = Information retrieval + Information extraction + Information analysis...
Text Data Mining = Complex data extraction layer + data mining
With the goal of discovering new, previously unknown information
Why do we publish text?
Thank you
[email protected]