Multidimensional analysis model for a document warehouse that includes textual measures
KIM JEONG RAE
UOS.DML. 2015.11.27.
Introduction
Author : Martha Mendoza, Erwin Alegria, Manuel Maca, Carlos Cobos, Elizabeth Leon
Location : Information Technology Research Group (GTI), etc., Colombia
Title : Multidimensional analysis model for a document warehouse that includes textual measures
Document Type : Decision Support Systems 72 (2015) 44-59
Date : February 2015
Contents
Abstract
Analysis Model
Proposed document warehouse model
Multi-dimensional model
Textual measures and aggregation function
OLAP document visualization
Conclusion - Evaluation results
Abstract (1/2)
Motivation : Business systems are increasingly required to handle substantial quantities of unstructured textual information.
Problem : How to manage unstructured text data stored in data warehouses.
Approach : A new multidimensional analysis model is proposed that includes textual measures as well as a topic hierarchy.
The textual measures that associate the topics with the text documents are generated by Probabilistic Latent Semantic Analysis, while the hierarchy is created automatically using a clustering algorithm.
Abstract (2/2)
Result : The model gained increasing acceptance with use, and its visualization was also well received by users.
Contribution : This paper proposes a multidimensional model that incorporates textual measures.
The model allows documents to be queried using OLAP operations.
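Proposed document warehouse model
Four main processes
[Figure: the four main processes: ① topic hierarchy building, ② probabilistic measures calculation, ③ ETL (Extract-Transform-Load), ④ multidimensional cube building]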
Proposed document warehouse model
Topic Hierarchy Building ① : a two-algorithm process
Cosme (step 1)
Modified IGBHSK (Iterative Global-Best Harmony Search K-means algorithm)
Topic Hierarchy Building ① Modified IGBHSK (Iterative Global-Best Harmony Search K-means algorithm) : three levels
Topic Hierarchy Building ① IGBHSK algorithm[Ref.#2] for Topic hierarchy
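The K-means part of the IGBHSK name can be illustrated with a minimal sketch. This is plain k-means only, with a first-k initialization chosen purely for reproducibility; the harmony search component and the construction of the three-level hierarchy are not shown. The function name `kmeans` and the toy points are assumptions for illustration and are not taken from the paper.

```python
def kmeans(points, k, n_iter=20):
    """Plain k-means on numeric tuples (illustrative only, not IGBHSK itself)."""
    # Assumption: use the first k points as starting centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical two-dimensional "document vectors".
toy_points = [(0, 0), (10, 10), (0, 1), (10, 11)]
toy_assignment, toy_centroids = kmeans(toy_points, k=2)
```

In a document setting the points would be term-weight vectors; a global search such as harmony search is typically layered on top because plain k-means depends heavily on its starting centroids.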
Proposed document warehouse model
Probabilistic measures calculation ②
Probabilistic measures calculation ② PLSA(Probabilistic Latent Semantic Analysis) algorithm [Ref.#24]
A probabilistic model of a set of documents and the words they contain
P(d|z) : the probability of a document within a topic
P(w|z) : the probability of a word within a topic
EM(Expectation Maximization) algorithm[Ref.#6,17]
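How EM estimates these probabilities can be sketched as follows. The sketch assumes the symmetric PLSA formulation, P(d,w) = Σ_z P(z) P(d|z) P(w|z), which may differ in detail from the variant used in the paper. The function name `plsa`, its parameters, and the toy count matrix are illustrative assumptions.

```python
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit symmetric PLSA by EM; counts[d][w] is the count of word w in document d."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def random_dist(size):
        # Random strictly positive distribution that sums to 1.
        raw = [rng.random() + 1e-3 for _ in range(size)]
        total = sum(raw)
        return [x / total for x in raw]

    p_z = [1.0 / n_topics] * n_topics
    p_d_z = [random_dist(n_docs) for _ in range(n_topics)]   # p_d_z[z][d] = P(d|z)
    p_w_z = [random_dist(n_words) for _ in range(n_topics)]  # p_w_z[z][w] = P(w|z)

    for _ in range(n_iter):
        new_z = [0.0] * n_topics
        new_d = [[0.0] * n_docs for _ in range(n_topics)]
        new_w = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z) P(d|z) P(w|z).
                joint = [p_z[z] * p_d_z[z][d] * p_w_z[z][w] for z in range(n_topics)]
                norm = sum(joint)
                if norm == 0.0:
                    continue
                for z in range(n_topics):
                    weighted = n_dw * joint[z] / norm
                    new_z[z] += weighted
                    new_d[z][d] += weighted
                    new_w[z][w] += weighted
        # M-step: renormalize the expected counts into probabilities.
        total = sum(new_z)
        for z in range(n_topics):
            if new_z[z] > 0:
                p_d_z[z] = [v / new_z[z] for v in new_d[z]]
                p_w_z[z] = [v / new_z[z] for v in new_w[z]]
            p_z[z] = new_z[z] / total
    return p_z, p_d_z, p_w_z


# Hypothetical corpus: 4 documents over a 4-word vocabulary.
toy_counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 6]]
toy_p_z, toy_p_d_z, toy_p_w_z = plsa(toy_counts, n_topics=2)
```

Each row of `toy_p_d_z` and `toy_p_w_z` is a probability distribution for one topic, which is the kind of value the textual measures in the warehouse would store.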
Proposed document warehouse model
ETL(Extract-Transform-Load) ③
Multi-dimensional model
Relational DB schema
Multi-dimensional model
Standard dimensions
Document dimension : name, document type
Author dimension : name, email
Date dimension : publish date
Location dimension : city, country
Word dimension : all words from the stored document set
Topic dimension : topic hierarchy
M-M relationships : Author-Group Bridge, Topic-Document-Group Bridge, Topic-Word-Group Bridge
Measures of the fact table and the topic and word dimension bridge tables
Topics_Probab_TM : average probability of topics
Documents_TM : probabilities of a document within topics
Word_Probab_TM : probabilities of a word within topics
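A many-to-many bridge between topics and documents that carries a textual measure could look like the sketch below. It is deliberately simplified: the group tables are omitted, only two dimensions are shown, and every table name, column name, and sample row is a hypothetical choice for illustration rather than the schema from the paper.

```python
import sqlite3

# Hypothetical, simplified schema: two dimensions plus one bridge table
# that stores the Documents_TM-style measure per (topic, document) pair.
SCHEMA = """
CREATE TABLE document_dim (
    document_id   INTEGER PRIMARY KEY,
    name          TEXT,
    document_type TEXT
);
CREATE TABLE topic_dim (
    topic_id        INTEGER PRIMARY KEY,
    name            TEXT,
    parent_topic_id INTEGER REFERENCES topic_dim(topic_id)
);
CREATE TABLE topic_document_bridge (
    topic_id     INTEGER REFERENCES topic_dim(topic_id),
    document_id  INTEGER REFERENCES document_dim(document_id),
    documents_tm REAL,
    PRIMARY KEY (topic_id, document_id)
);
"""


def build_demo_warehouse():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO document_dim VALUES (?, ?, ?)",
        [(1, "doc-a", "Journal Article"), (2, "doc-b", "Thesis")],
    )
    conn.executemany(
        "INSERT INTO topic_dim VALUES (?, ?, ?)",
        [(1, "root", None), (2, "topic-x", 1)],
    )
    conn.executemany(
        "INSERT INTO topic_document_bridge VALUES (?, ?, ?)",
        [(2, 1, 0.7), (2, 2, 0.3)],
    )
    return conn


def documents_for_topic(conn, topic_id):
    """Return (document name, measure) pairs for one topic, highest measure first."""
    cursor = conn.execute(
        "SELECT d.name, b.documents_tm "
        "FROM topic_document_bridge AS b "
        "JOIN document_dim AS d ON d.document_id = b.document_id "
        "WHERE b.topic_id = ? "
        "ORDER BY b.documents_tm DESC",
        (topic_id,),
    )
    return cursor.fetchall()
```

The `parent_topic_id` column is one common way to store a hierarchy in a relational table; whether the original schema encodes the topic hierarchy this way is not stated in the slides.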
Proposed document warehouse model
Multidimensional cube building ④
Textual measures and aggregation function
Topics_Probab_TM measure
R : the number of documents recovered by the query
A : the total number of distinct topics in the documents recovered in AM
Textual measures and aggregation function
Documents_TM Measure
: each row in the query
B : the total number of distinct documents recovered in the query
m : the number of topics in the Topic dimension
Textual measures and aggregation function
Word_Probab_TM Measure
: each row in the query
B : the total number of distinct words recovered in the query
m : the number of topics in the Topic dimension
OLAP document visualization
Topics_Probab_TM : Document dimension - type of document
OLAP document visualization
Topics_Probab_TM : Date dimension - year
OLAP document visualization
Topics_Probab_TM : document type (rows) and year attribute (columns)
OLAP document visualization
Topics_Probab_TM : year attribute and document type, slice on “Journal Article”
OLAP document visualization
Topics_Probab_TM : year attribute, document type, and author name, dice operation
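The slice and dice operations named on these slides can be sketched on a tiny in-memory fact table. A slice fixes one dimension to a single value; a dice restricts several dimensions to sets of values. The rows, the field names, and the plain-mean aggregation in `average_by` are assumptions for illustration; in particular, the mean is a stand-in and not the aggregation function defined for Topics_Probab_TM.

```python
# Hypothetical fact rows: one probability per (year, document type, author, topic).
FACTS = [
    {"year": 2013, "doc_type": "Journal Article", "author": "A", "topic": "t1", "prob": 0.5},
    {"year": 2014, "doc_type": "Journal Article", "author": "B", "topic": "t1", "prob": 0.25},
    {"year": 2014, "doc_type": "Conference Paper", "author": "A", "topic": "t2", "prob": 0.75},
    {"year": 2015, "doc_type": "Journal Article", "author": "A", "topic": "t2", "prob": 1.0},
]


def slice_cube(rows, dimension, value):
    """Slice: keep rows where one dimension equals a single value."""
    return [r for r in rows if r[dimension] == value]


def dice_cube(rows, **allowed):
    """Dice: keep rows whose value on every listed dimension is in the allowed set."""
    return [r for r in rows if all(r[dim] in values for dim, values in allowed.items())]


def average_by(rows, dimension, measure="prob"):
    """Group rows by one dimension and take the plain mean of the measure."""
    totals = {}
    for r in rows:
        running_sum, count = totals.get(r[dimension], (0.0, 0))
        totals[r[dimension]] = (running_sum + r[measure], count + 1)
    return {key: s / n for key, (s, n) in totals.items()}


journal_rows = slice_cube(FACTS, "doc_type", "Journal Article")
diced_rows = dice_cube(FACTS, year={2014, 2015}, author={"A"})
topic_means = average_by(journal_rows, "topic")
```

Here `journal_rows` corresponds to the “Journal Article” slice, and `diced_rows` to a dice over year and author name.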
OLAP document visualization
Documents_TM : each topic and document
OLAP document visualization
Documents_TM : each topic, year, and document
Conclusion - Evaluation results
Execution time results
Conclusion - Evaluation results
User satisfaction results : statistical frequency analysis
Conclusion - Evaluation results
User satisfaction results : multivariate analysis
Thank you
Proposed document warehouse model
Cosme results : XML file (metadata)
Proposed document warehouse model
IGBHSK results