Graphical Representations of Knowledge and Its Distribution Cliff Behrens Information Analysis Applied Research Telcordia Technologies, Inc 973.829.5198 cliff@research.telcordia.com Workshop on Statistical Inference, Computing and Visualization for Graphs Stanford University, August 1 - 2, 2003



Graphical Representations of Knowledge and Its Distribution

Cliff Behrens

Information Analysis

Applied Research

Telcordia Technologies, Inc

973.829.5198

cliff@research.telcordia.com

Workshop on Statistical Inference, Computing and Visualization for Graphs

Stanford University, August 1 - 2, 2003


Knowledge, Consensus and Information Sharing

Figure labels: Cultural Knowledge Derived from Consensus; Individual Knowledge; Information Sharing Among Individuals in a Single COI; Consensus Knowledge.


Schemer Knowledge Validation Services

Issues with CSCW technology

– Focus of CSCW research on new tools, less on motivating their use

– Collaborative model building often lacks scientific rigor and quality control

Schemer: Web-based technology that derives knowledge from consensus among Subject Matter Experts (SMEs)

– Knowledge-based collaboration reveals distribution of domain expertise among panelists

– Metrics for qualifying panelists and validating the models they produce

• validates saliency of domain to SMEs

• estimates competency of SMEs

• yields best answers based on responses of SMEs weighted by their respective competencies

Generic service, but first tried on SIAM® influence networks


SIAM® Influence Net Example


Mathematics of Consensus Analysis (Romney et al. 1986)

Formal model consists of a data matrix $X$ containing the responses $X_{ik}$ of SMEs $i = 1, \dots, N$ on items $k = 1, \dots, M$

– from this matrix a symmetric matrix $M^*$ is estimated; its off-diagonal elements hold the empirical point estimates $M^*_{ij}$, the proportion of matching responses on all items between SMEs $i$ and $j$, corrected for guessing (if appropriate)

Obtain approximate solution yielding estimates of the individual SME competencies (the $D^*_i$) by applying Maximum Likelihood Factor Analysis to fit the equation below and solve for the main diagonal values

– $M^* = D^* D^{*\prime}$

– relative magnitude of eigenvalues ($\lambda_1 > 3\lambda_2$) implies single factor solution

The $D^*_i$ are the loadings for SMEs on the first factor

– $D^*_i = v_{1i}\sqrt{\lambda_1}$

Estimated competency values ($D^*_i$) and the profile of responses for item $k$ ($X_{ik,l}$) are used to compute Bayesian a posteriori probabilities for each possible answer. The formula for the probability of the response profile, given that answer $l$ is the best or "correct" one, follows:

$$\Pr\left(\langle X_{ik} \rangle_{i=1}^{N} \mid Z_k = l\right) = \prod_{i=1}^{N} \left[D^*_i + \frac{1 - D^*_i}{L}\right]^{X_{ik,l}} \left[\frac{(1 - D^*_i)(L - 1)}{L}\right]^{1 - X_{ik,l}}$$

where $L$ is the number of possible answers per item, $Z_k$ is the correct answer to item $k$, and $X_{ik,l} = 1$ if SME $i$ gave answer $l$ to item $k$ and 0 otherwise.
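The consensus computation on this slide can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the Schemer implementation: the function names `estimate_competence` and `answer_posteriors` are hypothetical, answers are assumed to be coded as integers 0..L-1, the guessing correction is taken to be (L * match - 1) / (L - 1), an iterated principal-axis fit stands in for Maximum Likelihood Factor Analysis, and a uniform prior over answers is assumed when normalizing the slide's formula.

```python
import numpy as np

def estimate_competence(X, L, n_iter=50):
    """Estimate SME competencies D*_i from an (N, M) matrix of answers in 0..L-1."""
    X = np.asarray(X)
    # Proportion of matching responses for every pair of SMEs.
    match = (X[:, None, :] == X[None, :, :]).mean(axis=2)
    # Correction for guessing (assumed form), so that M*_ij estimates D_i * D_j.
    R = (L * match - 1.0) / (L - 1.0)
    # The main diagonal is unknown; start from a guess and refine it.
    np.fill_diagonal(R, 0.5)
    for _ in range(n_iter):
        eigvals, eigvecs = np.linalg.eigh(R)  # eigenvalues in ascending order
        v1 = eigvecs[:, -1]
        if v1.sum() < 0:                      # fix the arbitrary sign
            v1 = -v1
        D = v1 * np.sqrt(max(eigvals[-1], 0.0))  # D*_i = v_1i * sqrt(lambda_1)
        np.fill_diagonal(R, D ** 2)
    return D, eigvals[-1], eigvals[-2]

def answer_posteriors(X, D, L):
    """Posterior over the L answers for each item, using the slide's formula."""
    X = np.asarray(X)
    D = np.clip(D, 1e-6, 1 - 1e-6)            # keep the logarithms finite
    log_hit = np.log(D + (1 - D) / L)
    log_miss = np.log((1 - D) * (L - 1) / L)
    log_lik = np.empty((X.shape[1], L))
    for l in range(L):
        gave_l = (X == l)                     # X_ik,l as a boolean (N, M) array
        log_lik[:, l] = np.where(gave_l, log_hit[:, None], log_miss[:, None]).sum(axis=0)
    # Uniform prior (assumption): normalize the likelihoods across answers.
    post = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    return post / post.sum(axis=1, keepdims=True)
```

A caller would check that the first returned eigenvalue exceeds three times the second before trusting the single-factor solution, then take `answer_posteriors(X, D, L).argmax(axis=1)` as the consensus answer key.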


Schemer Knowledge Validation Services


Knowledge-Based Communications Interface

Structured Collaboration and Advice Network

• User’s relation to other SMEs

• Most similar point-of-view

• Most different point-of-view

• Someone a bit more knowledgeable

• Gurus

• Novel thinkers

Information Routing

• Supports/challenges one’s point-of-view

• Supports/challenges the consensus point-of-view

SME Contact Data

• Email services

• Meeting services

• Other plug-ins


Latent Semantic Indexing (LSI): What is it?

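Figure, first panel: Standard Vector Space Model (ndims = nterms). Doc 1, Doc 2 and Doc 3 plotted against the term axes "computer", "memory" and "chip".

Figure, second panel: Reduced LSI Vector Space Model (ndims << nterms). Doc 1, Doc 2 and Doc 3 and the terms "chip", "memory" and "computer" plotted against LSI Dimension 1 and LSI Dimension 2.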


LSI: How Does It Work?

Analyze training collection of documents

– throw out stop words and mark-up

– count frequencies of words in each document

Compute term-document matrix

– store word counts as entries in a matrix

– apply appropriate weighting, e.g., log-entropy, to entries

Compute LSI vector space

– reduce term-document matrix with Singular Value Decomposition

Fold new documents into LSI vector space

– document vector computed from weighted sum of its term vectors

Compute vector for query ("pseudo-document")

– query vector computed from weighted sum of its term vectors

Search vector space for semantically close term/document vectors

– compute cosine of angle between query and other vectors
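The steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions rather than the system described in the talk: the stop-word list is a tiny placeholder, the helper names (`tokenize`, `build_lsi`, `fold_in`, `query_similarities`) are hypothetical, the weighting is one common log-entropy form, and folding-in uses the convention q' U_k S_k^{-1}, which is one of several in use.

```python
import re
import numpy as np

STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "for", "to"}  # placeholder list

def tokenize(text):
    """Lower-case, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def build_lsi(docs, k):
    """Build a k-dimensional LSI space from a list of at least two documents."""
    tokens = [tokenize(d) for d in docs]
    vocab = sorted({w for t in tokens for w in t})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))          # term-document matrix
    for j, t in enumerate(tokens):
        for w in t:
            counts[index[w], j] += 1
    # Log-entropy weighting: local log(1 + tf), global 1 + sum(p log p) / log(n).
    p = counts / counts.sum(axis=1, keepdims=True)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    g = 1.0 + plogp.sum(axis=1) / np.log(len(docs))
    A = np.log1p(counts) * g[:, None]
    # Truncated SVD: keep the k largest singular triplets.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return {"index": index, "g": g, "U": U[:, :k], "s": s[:k], "V": Vt[:k].T}

def fold_in(model, text):
    """Map a new document or query into the LSI space (weighted sum of term vectors)."""
    d = np.zeros(len(model["index"]))
    for w in tokenize(text):
        if w in model["index"]:
            d[model["index"][w]] += 1
    d = np.log1p(d) * model["g"]
    return (d @ model["U"]) / model["s"]

def query_similarities(model, text):
    """Cosine between the folded-in query and every training document vector."""
    q = fold_in(model, text)
    V = model["V"]
    denom = np.linalg.norm(q) * np.linalg.norm(V, axis=1)
    return (V @ q) / np.where(denom > 0, denom, 1.0)
```

For example, `query_similarities(build_lsi(docs, k=2), "silicon memory")` returns one cosine per training document, and sorting those values in descending order gives the retrieval ranking.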


Scalability: Large Document Collections and Polysemy

Figure: Many Undifferentiated Conceptual Domains/COIs. The terms "chip" and "wafer" are shown together with potato, corn and sugar and with silicon, valley and copper. In the plotted vector space, Dimension 1 carries valley, silicon and copper, Dimension 2 carries sugar, corn and potato, and wafer and chip are labeled separately.


LSI: Ongoing Work

Distributed LSI

– Needed for LSI to scale to massive document collections

Adopts "divide and conquer" approach

– Sort documents by conceptual domain

• recognizes documents created for different COIs

• create more semantically homogeneous subcollections

• apply cluster analysis, e.g., bisecting K-means

– Compute independent LSI vector spaces for each subcollection

• more parsimonious representations of concept domains or contexts

– Compute similarity measures between spaces

• construct graphs from terms shared by two vector spaces

• compute similarity between these two graphs

– Discover appropriate search vector spaces for a query

• cosine calculations (as before)

• relevance feedback (as before)

• query expansion

Visualizations to explore semantic context for a query in different LSI vector spaces
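The clustering step mentioned above, bisecting K-means, can be sketched as follows. This is an assumption-laden illustration, not the procedure used in the DLSI experiments: documents are taken to be rows of a numeric matrix (for instance weighted term vectors), distance is plain Euclidean, the cluster chosen for splitting is the one with the largest within-cluster sum of squares, and the names `two_means` and `bisecting_kmeans` are hypothetical.

```python
import numpy as np

def two_means(X, rng, n_iter=20):
    """Plain 2-means on the rows of X; returns a 0/1 label per row."""
    centers = X[rng.choice(len(X), size=2, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in (0, 1):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def sse(X):
    """Within-cluster sum of squared distances to the centroid."""
    return ((X - X.mean(axis=0)) ** 2).sum()

def bisecting_kmeans(X, n_clusters, seed=0, n_trials=5):
    """Repeatedly split the loosest cluster in two until n_clusters are reached."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    clusters = [np.arange(len(X))]
    while len(clusters) < n_clusters:
        worst = int(np.argmax([sse(X[c]) for c in clusters]))
        target = clusters.pop(worst)
        best = None
        if len(target) >= 2:
            for _ in range(n_trials):          # keep the best of several bisections
                labels = two_means(X[target], rng)
                if labels.min() == labels.max():
                    continue                   # degenerate split, try again
                parts = [target[labels == 0], target[labels == 1]]
                cost = sum(sse(X[p]) for p in parts)
                if best is None or cost < best[0]:
                    best = (cost, parts)
        if best is None:                       # cannot split further
            clusters.append(target)
            break
        clusters.extend(best[1])
    assignment = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        assignment[members] = c
    return assignment
```

Each resulting subcollection would then get its own, independently computed LSI vector space, as the list above describes.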


DLSI: Experiments with NSF-Movie Review Corpus

  

Vector Space       Dimensions   Non-stop Terms   Documents
NSF-Geology               298           25,963       3,255
NSF-Engineering           229           30,247       3,057
NSF-Biology               224           38,176       3,645
Movie Reviews             239           70,411       3,557
All Documents             282          122,685      13,514


DLSI: The Context of Term Meaning

  

Graph of semantic relationships between top five terms retrieved for the query {travel, center, earth} from the vector space containing only NSF geology abstracts.

Graph of semantic relationships between top five terms retrieved for the query {travel, center, earth} from the vector space containing only Ebert movie reviews.

Graph of semantic relationships between top five terms retrieved for the query {travel, center, earth} from the vector space containing all documents.

Graph node labels: center, research, earth, reports, travel; alien, earth, science-fiction/sci-fi, travel; cooperative, earth, university, center/center's.