© Concept Searching 2017

The Nuts and Bolts of Metadata Tagging and Taxonomies Made Easy

Michael Paye

Chief Technology Officer

Concept Searching

[email protected]

www.conceptsearching.com

[email protected]

Twitter @conceptsearch

Michael Paye, Chief Technology Officer at Concept Searching, has been the driving force behind many of the company's recent innovations, including the SharePoint Add-in and hybrid search products. He has a wealth of experience across the Microsoft platform and related technologies, and oversees all product development.

Agenda

• Who we are and what we do

• What’s the problem?

• What does it impact?

• How do you measure performance?

• Metadata generation

• Auto-classification – What does it do?

• Taxonomies – What kinds are there?

• SharePoint Term Store

• Calculating return on investment

The Global Leader in Managed Metadata Solutions

• Company founded in 2002
• Product launched in 2003
• Focus on management of structured and unstructured information
• Profitable, debt free
• Technology Platform
• Delivered as a web service
• Automatic concept identification, content tagging, auto-classification, taxonomy management
• Only statistical vendor that can extract conceptual metadata
• 8 years KMWorld ‘100 Companies that Matter in Knowledge Management’, 8 years KMWorld ‘Trend Setting Product’
• Authority to Operate enterprise wide US Air Force, NETCON US Army, and Canadian SLSA
• Client base: Fortune 500/1000 organizations in Healthcare, Financial Services, Manufacturing, Energy, Professional Services, Pharmaceutical, Public sector and DoD
• Microsoft Gold Certification in Application Development
• Member of SharePoint PAC and TAP programs
• Deployed as a full trust Add-in for all versions of SharePoint on-premises and SharePoint Online, including the latest vNext dedicated platform and the government cloud

What Do We Do?

Concept Searching’s technology platforms deliver semantic metadata generation, auto-classification and taxonomy/Term Store management, and are fully integrated with all versions of SharePoint on-premises, Microsoft Online/Office 365, and OneDrive for Business.

These infrastructure platforms integrate not only with SharePoint but also other content repositories, search engines and file shares, enabling our clients to add structure and manage their enterprise content, regardless of environment.

The resulting classification metadata is used by clients to deliver ‘intelligent metadata solutions’ in areas such as enhanced search, migration, data privacy, records management, policy enforcement, compliance, text analytics, and business and social collaboration.

What’s the Problem?

“Over 80% of business decisions are made using unstructured data.” IDC

What’s the Problem?

• 91% use manual metadata tagging
• Free-for-all mode
• Drop down lists
• 15% maintain a home-grown manual taxonomy
• 77% have no rhyme or reason for managing content

Information Chaos

• Unstructured data is growing at the rate of 62% per year (IDG)
• By 2022, 93% of all data in the digital universe will be unstructured (IDG)
• Data volume is set to grow 800% over the next five years and 80% of it will reside as unstructured data (Gartner)

What Does it Impact?

It’s not just about search

How do you measure performance?

Precision Versus Recall

• Usually used by academics
• Precision
  • Positive predictive value
  • Fraction of retrieved instances that are relevant
• Recall
  • Sensitivity
  • Correct number of documents that are relevant
  • Fraction of relevant instances that are retrieved
• In a perfect world, they should be balanced
• Commercial evaluation criteria also take into account
  • Order of the returned results
  • Overall ability of a user to find an answer rather than relying on a search being submitted only once
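The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `precision_recall` and the document IDs are made up for the example, and result ordering (one of the commercial criteria listed) is deliberately ignored.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]        # what the search returned
relevant = ["d2", "d4", "d5", "d6", "d7"]   # what it should have returned

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

Two of the four retrieved documents are relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4).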

Precision Versus Recall

• Automated metadata generation is difficult to achieve consistently with high precision and recall
• Many products on the market today require complex rules to be generated, often involving search syntax and complicated Boolean expressions
• Some require a document training set for every term to be processed
• Some of these products employ linguistic techniques that will not perform consistently across different vertical markets

The result is a very high initial cost in terms of time and the level of qualified staff required

Metadata

“The quality of your metadata will impact the quality of auto-classification and ultimately negate your outcomes – increasing organizational risk and noncompliance.”

Metadata

Definition

• Metadata describes other data. It provides information about a certain item's content
• For example, an image may include metadata that describes how large the picture is, the color depth, the image resolution, when the image was created, and other data
• A text document's metadata may contain information about how long the document is, who the author is, when the document was written, and a short summary of the document

TechTerms.com

Types of Classification Metadata

Intrinsic
• Information that can be extracted directly from an object (file name, size)

Administrative/Management
• Information used to manage the document (author, date created, date to be reviewed)

Descriptive
• Information that describes the object (title, subject, audience)

Semantic
• Ability to extract concepts from within content and generate the metadata (intelligent metadata)
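As a rough sketch of how the four types differ in practice, the Python below keeps each type in its own slot and reads the intrinsic values straight from a file. The `DocumentMetadata` class, the `intrinsic_metadata` helper, and all sample values are hypothetical. The semantic entry in particular is hand-written here, whereas in a real system it would come from concept extraction.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class DocumentMetadata:
    """Hypothetical container with one slot per metadata type."""
    intrinsic: dict = field(default_factory=dict)       # read from the object itself
    administrative: dict = field(default_factory=dict)  # used to manage the document
    descriptive: dict = field(default_factory=dict)     # describes the object
    semantic: list = field(default_factory=list)        # concepts found in the content


def intrinsic_metadata(path):
    """Extract intrinsic metadata (file name, size) directly from a file."""
    return {"file_name": os.path.basename(path), "size_bytes": os.path.getsize(path)}


# Demo on a throwaway file; every value below is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "policy.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("Data retention policy")
    meta = DocumentMetadata(
        intrinsic=intrinsic_metadata(path),
        administrative={"author": "J. Smith", "date_created": "2017-01-15"},
        descriptive={"title": "Data retention policy", "audience": "All staff"},
        semantic=["data retention", "records management"],  # stand-in for generated concepts
    )

print(meta.intrinsic)
```

Only the intrinsic slot can be filled without human input or content analysis, which is why the other three are where manual tagging effort usually goes.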

Why is Metadata So Hard to Get Right?

A manual metadata approach will fail 95% of the time

Metadata

Advantages

• Ability to develop a single repository of organizationally relevant metadata to be made available to any application that requires the use of metadata
• Elimination of costs and errors associated with end user tagging
• Normalization of content across functional and geographic boundaries to remove ambiguity in vocabulary
• Metadata managed and changed in one place
• Ability to apply policy consistently across diverse repositories and applications
• Flexibility to rapidly make changes to the repository for regulatory compliance, where changes are immediately available for use by applications

Automatic generation of compound term metadata

Auto-classification

“By itself the search function has limited value. The real value of search and information access technologies is in the ongoing efforts needed to establish effective taxonomies, to index and classify content of all kinds, in order to provide meaningful results.” Tom Eid, Research Vice President, Gartner Group

What is Auto-classification?

• A feature found in some content management systems or records management applications that scans the contents of a document and automatically assigns metadata, categories, and keywords based on the document contents
• Content-based assignment of one or more pre-defined categories to documents (records), usually using machine learning, statistical pattern recognition, or neural network approaches to construct classifiers automatically

Auto-classification Systems – What Do They Do?

Document Preparation
• Split into language blocks (paragraphs, headings), formatting, layout

Parsing
• Entity extraction
• NLP: parts of speech, phrases
• Terms, variants

Weighting
• Frequency
• Location in text, phrase
• Proximity
• Combination
• Format of text

Classification
• If threshold reached
• Can influence search results

This is where rules vs statistics come into play…

Not all classification solutions are created equal
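The four stages above can be sketched end to end in Python. This is a toy illustration under stated assumptions, not how any particular product works: headings are marked with a leading `#`, parsing is plain lowercase word matching (no entity extraction or NLP), weighting uses only frequency and location, and the weights, threshold, taxonomy, and sample document are all invented for the example.

```python
import re

HEADING_WEIGHT = 3.0  # assumption: a clue in a heading counts more than in body text
BODY_WEIGHT = 1.0
THRESHOLD = 4.0       # assumption: arbitrary cut-off chosen for this example


def prepare(text):
    """Document preparation: split text into (kind, block) pairs."""
    blocks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            blocks.append(("heading", line.lstrip("# ").strip()))
        else:
            blocks.append(("body", line))
    return blocks


def parse(block):
    """Parsing: reduce a block to lowercase word tokens."""
    return re.findall(r"[a-z]+", block.lower())


def score(text, clues):
    """Weighting: clue frequency, weighted by where the clue appears."""
    total = 0.0
    for kind, block in prepare(text):
        weight = HEADING_WEIGHT if kind == "heading" else BODY_WEIGHT
        total += weight * sum(1 for token in parse(block) if token in clues)
    return total


def classify(text, taxonomy):
    """Classification: assign every term whose score reaches the threshold."""
    return sorted(term for term, clues in taxonomy.items()
                  if score(text, clues) >= THRESHOLD)


# Hypothetical taxonomy terms with single-word clues.
taxonomy = {
    "Contracts": {"contract", "agreement", "party"},
    "Invoices": {"invoice", "payment"},
}
doc = "# Supply agreement\nThis contract binds each party.\nPayment is due in 30 days."

print(classify(doc, taxonomy))  # ['Contracts']
```

"Contracts" scores 5.0 (one heading clue at weight 3, two body clues at weight 1) and is assigned. "Invoices" scores 1.0 and is not. Proximity, clue combinations, and text format are left out to keep the sketch short.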

Auto-classification Systems

Keyword
• Boolean operators add a degree of sophistication, but also tend to improve precision at the expense of recall, because any document that does not match the Boolean expression is ignored
• The majority of search users are unable to formulate even basic Boolean expressions

Linguistic
• No commitment to a taxonomic tree
• Related to parts of speech, syntactic parses, or semantic interpretations
• Typically not scalable
• Usually delivered as pre-configured for an industry, hard to integrate your unique organizational vocabulary

Auto-classification Systems

Semantic Networks
• Refers to a set of relationships between concepts and words, including parts of speech and real-world relationships
• These can include rules of various types, not just Boolean

Machine Learning
• Subfield of computer science (CS) and artificial intelligence (AI) that deals with the construction and study of systems that can learn from data, rather than follow only explicitly programmed instructions

Auto-classification Systems

Training Sets
• Specify a set of documents that should be classified against each term; this becomes the training set
• If there are errors, add more pre-classified documents to the training set
• Repeat as necessary

Rule-based
• Rule-based classifiers allow the criteria that cause classifications to be explicitly defined
• Two types
  • Exact matching based on keywords, phrases, Boolean
    • Deliver a binary result: the document either matches the term or it does not
  • Fuzzy matching that accumulates evidence that a document matches each term – sort of
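The difference between the two rule-based styles can be shown with a short Python sketch. The function names, clue weights, and sample sentence are assumptions made for illustration. The exact matcher is a simple all-of/none-of Boolean rule, and the fuzzy matcher is one possible way of accumulating evidence, not a description of any specific product.

```python
import re


def tokens(text):
    """Lowercase word tokens of a document, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def exact_match(text, all_of=(), none_of=()):
    """Exact (Boolean) rule: binary result, match or no match."""
    words = tokens(text)
    return all(w in words for w in all_of) and not any(w in words for w in none_of)


def fuzzy_score(text, weighted_clues):
    """Fuzzy rule: accumulate the weight of every clue found, scaled to 0..1."""
    words = tokens(text)
    total = sum(weighted_clues.values())
    found = sum(weight for clue, weight in weighted_clues.items() if clue in words)
    return found / total if total else 0.0


# Hypothetical document and clues for a taxonomy term such as "Leases".
doc = "The lease covers the office premises for five years."

print(exact_match(doc, all_of=("lease", "tenant")))  # False
print(round(fuzzy_score(doc, {"lease": 3, "tenant": 2, "premises": 1}), 2))  # 0.67
```

The Boolean rule rejects the document outright because one required word is missing, which is the precision-over-recall trade-off described for keyword systems. The fuzzy score still reports strong evidence, leaving the accept/reject decision to a threshold.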

Auto-classification in action

Taxonomies

“The metadata infrastructure provides the critical glue that binds the information infrastructure to the underlying IT infrastructure. Sound information governance practices would take advantage of the metadata infrastructure, to ensure that content and data are managed consistently and adhere to written policies, across on-premises and cloud-based environments.” IDC

Taxonomies

Taxonomy
• A taxonomy is an organized set of concepts or definitions, usually labeled keywords
• For search engines, a taxonomy can also be a set of organized searches
• Taxonomies are typically nested in a hierarchical manner, often called a ‘tree’
• Subject-based taxonomy – created by domain experts
• Content-based taxonomy – organizing the data you already have
• Behavior-based taxonomy – driven by search analytics, user tagging, or vocabulary analysis

Types of Taxonomies

List, Picklist, Controlled Vocabulary, Authority Files
List of lead or preferred terms, selected by the end user, may or may not have relationships among the terms, can include a synonym ring

Synonym Lists
The use of synonyms allows one concept to be instantiated as the same as the other, but still allows a term to be preferred over another

Hierarchical
Each content item resides in only one category, referred to as a ‘tree’
• Piano
• Musical instrument
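A synonym list and a strict hierarchy can each be modelled with a plain dictionary, as in the Python sketch below. The synonym entries are invented for the example, and the parent mapping assumes the slide's two terms mean that Piano sits under Musical instrument.

```python
# Synonym list: every variant maps to one preferred term (hypothetical vocabulary).
SYNONYMS = {"piano": "Piano", "pianoforte": "Piano"}

# Strict hierarchy: each term has exactly one parent; None marks the root.
# Assumption: the slide's example places Piano under Musical instrument.
PARENT = {"Piano": "Musical instrument", "Musical instrument": None}


def preferred(term):
    """Map a variant to its preferred term; unknown terms pass through unchanged."""
    return SYNONYMS.get(term.lower(), term)


def path_to_root(term):
    """Walk the single-parent chain from a term up to the root of the tree."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT.get(term)
    return path


print(path_to_root(preferred("Pianoforte")))  # ['Piano', 'Musical instrument']
```

Because every term has a single parent, the path to the root is always unique, which is the property that makes this structure a tree.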

Types of Taxonomies

Polyhierarchical, Faceted, Thesauri
Content items can exist in more than one category, more structured controlled vocabulary, provides information about each term and its relationship to other terms, features of a hierarchical taxonomy plus associative relationships
• Piano
• Musical instrument
• Stringed instrument
• Percussion instrument

Ontology
Multiple taxonomies with additional relationships added to specify concepts within a domain

Marlene Rockmore – The Taxonomy Blog
Heather Hedden – The Accidental Taxonomist
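In a polyhierarchy a term may have several parents, so the structure is a graph rather than a tree. The Python sketch below shows one reading of the Piano example. The exact parent links are an assumption, since the slide lists the four terms without showing how they connect.

```python
# Polyhierarchy: a term may have more than one parent.
# Assumption: Piano is both a stringed and a percussion instrument,
# and both of those are kinds of musical instrument.
PARENTS = {
    "Piano": ["Stringed instrument", "Percussion instrument"],
    "Stringed instrument": ["Musical instrument"],
    "Percussion instrument": ["Musical instrument"],
    "Musical instrument": [],
}


def ancestors(term):
    """Collect every broader term reachable from the given term."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS.get(node, []))
    return seen


print(sorted(ancestors("Piano")))
# ['Musical instrument', 'Percussion instrument', 'Stringed instrument']
```

A document tagged Piano would then be found when browsing either the stringed or the percussion branch, which a strict single-parent tree cannot offer.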

Set up a taxonomy node, suggest clues for class, document feedback

SharePoint Term Store

• Introduced in 2010
• Provides infrastructure for taxonomy management
• Managed metadata properties designed for hierarchical metadata
• Integrated with search via the refinement panel

SharePoint has no automatic generation of metadata
SharePoint has no auto-classification capability
SharePoint has no facility to generate concepts

SharePoint Term Store

Globally Unique Identifiers (GUIDs)
• SharePoint uses GUIDs to identify taxonomies and terms
• GUIDs must be preserved when updating term sets
• GUIDs need to be synchronized between farms
• Concept Searching preserves GUIDs
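Why GUID preservation matters can be illustrated without touching SharePoint at all. The Python sketch below models a term set as a plain dictionary from label to GUID. It does not use the SharePoint API, and `update_term_set` is a hypothetical helper. Matching terms by label is a simplification that would not survive a rename.

```python
import uuid


def update_term_set(existing, new_labels):
    """Rebuild a term set, reusing the GUID of every term that already exists.

    existing:   dict mapping term label -> GUID string
    new_labels: labels the updated term set should contain
    """
    updated = {}
    for label in new_labels:
        # Reuse the stored GUID if there is one; mint a new GUID only for new terms.
        updated[label] = existing.get(label) or str(uuid.uuid4())
    return updated


# Hypothetical term set; the GUID is a made-up value.
current = {"Piano": "6f1c2d3e-0000-4000-8000-000000000001"}
revised = update_term_set(current, ["Piano", "Violin"])

print(revised["Piano"] == current["Piano"])  # True: tagged content still resolves
```

If the update had generated a fresh GUID for Piano, every item already tagged with the old identifier would point at a term that no longer exists. The same reasoning applies when copying a term set to another farm.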

Automatic, real-time update of the SharePoint Term Store

Return On Investment

Return On Investment – Real World Savings

Pique Solutions

The Business Solutions
• Search
• Records Management
• Intelligent Migration
• Data Security/Confidentiality
• eDiscovery/Litigation Support, FOIA
• Information Governance
• Text Analytics
• Business Social Networking
• Collaboration
• Content Lifecycle Management
• Metadata Management
• Research
• Knowledge Management

Next Expert Webinar

Groundbreaking and Game-changing Enterprise Search
Wednesday, March 8th 2017

Concept Searching and strategic partner C/D/H discuss what intelligent enterprise search should be.

This webinar demonstrates a solution unique in the marketplace, which overcomes the limitations of other enterprise search engines.

Read more and register in the Upcoming Webinars area of our website.

Thank You
