design of a multi-dimensional query expression for document warehouses
Information Sciences 174 (2005) 55–79
www.elsevier.com/locate/ins
Design of a multi-dimensional query expression for document warehouses☆
Frank S.C. Tseng
Department of Information Management, National Kaohsiung First University
of Science and Technology, Kaohsiung 811, Taiwan, ROC
Received 16 November 2003; received in revised form 25 August 2004; accepted 26 August 2004
Abstract
During the past decade, data warehousing has been widely adopted in the business
community. It provides multi-dimensional analyses of accumulated historical business
data to support contemporary administrative decision-making. However, most data
warehousing query languages at present provide on-line analytical processing
(OLAP) for numeric data only. For example, MDX (Multi-Dimensional eXpressions) has
been proposed as a query language for describing multi-dimensional queries over
databases with OLAP capabilities. Nevertheless, it is believed that only about 20% of
the information can be extracted from data warehouses that hold numeric data only;
the other 80% is hidden in non-numeric data or even in documents. Therefore,
many researchers now advocate conducting research on document
warehousing to capture complete business intelligence. Document warehouses, unlike
traditional document management systems, include extensive semantic information
about documents, cross-document feature relations, and document grouping or clustering
to provide more accurate and more efficient access to text-oriented business
intelligence. In this paper, we extend the structure of MDX into a new one containing
0020-0255/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.ins.2004.08.010
☆ This research was partially supported by the National Science Council, Taiwan, ROC, under
Contract No. NSC-91-2416-H-327-005.
E-mail address: [email protected]
complete constructs for querying document warehouses. Traditional MDX contains only
SELECT, FROM, and WHERE clauses, which is not rich enough for document
warehousing. We present how to extend the language constructs
to include GROUP BY, HAVING, and ORDER BY, yielding an SQL-like query language
for document warehousing. This work is essential for establishing an infrastructure
that helps combine text processing with numeric OLAP processing technologies.
We hope that the combination of data warehousing and document warehousing will become
one of the most important kernels of knowledge management and customer relationship
management applications.
© 2004 Elsevier Inc. All rights reserved.
Keywords: Data warehousing; Document warehousing; Knowledge management;
Multi-Dimensional eXpressions (MDX); OLAP
1. Introduction
Data warehousing [24] and data mining techniques [14] are gaining in
popularity as organizations realize the benefits of performing multi-dimensional
analyses of accumulated historical business data to support contemporary
administrative decision-making [1,3,14,25,12]. However, much of this
effort has only touched the tip of the information iceberg. It is believed
[20] that, for the business intelligence of an enterprise, only about 20% of the
information can be extracted from formatted data stored in relational databases.
The remaining 80% is hidden in unstructured or semi-structured
documents. For instance, market survey reports, project status
reports, meeting records, customer complaints, e-mails, patent application
sheets, and advertisements of competitors are all recorded in documents.
Moreover, the documents on the Web, in enterprise repositories, and in public document
management systems keep growing. Therefore, knowledge workers,
managers, and executives still have to spend much of their working time
reading dozens, if not hundreds, of electronic documents of various types
spread over the Internet. There is simply too much text to digest in daily life.
The fast-growing and tremendous amount of documents has far exceeded the human
ability to comprehend them without powerful tools. As a result, not
only may some relevant documents be overlooked, but
some irrelevant documents may also be taken into account by intuition when
important decisions are made. We believe that leaving out information
from relevant documents, or keeping information intuitively guessed
from irrelevant documents, may be detrimental, since a strategy
built on incomplete information can lead to disaster.
To alleviate this problem, Sullivan [33] has advocated that documents
should be properly warehoused according to well-defined concepts,
expanding the scope of business intelligence to include textual information.
Hence, one of the next challenges for the information community will be the study
of document warehousing and text mining to help enterprises
obtain complete business intelligence. Although research on
text mining has been conducted widely (for example, readers are
referred to [23,26–28,30,34]), the issues regarding document warehouses
are rarely addressed. We have proposed a multi-dimensional indexing
structure, called the D-tree [35], for constructing document warehouses. In this
work, we focus on the design of a query language for querying
document warehouses.
The importance of designing a good query language for document warehousing
can be seen by observing the history of relational database systems.
The standardization of the relational query language SQL (Structured Query
Language) is widely credited for the success of relational
database systems over the past decades. Since the appearance of data warehousing,
applications that provide users with multi-dimensional analysis of data have
become widespread, and in a few years may even be considered pervasive.
Although SQL can handle basic analytical operations on top of relational
databases, it is cumbersome and lacks important semantics for multi-dimensional
analysis. Based on this observation, MDX (Multi-Dimensional
eXpressions) [31] was proposed for describing multi-dimensional queries
over databases or data warehouses with OLAP (On-Line Analytical Processing)
capabilities. MDX has been widely adopted and is supported by Applix
[15], Microsoft [16], Microstrategy [17], Whitelight [22], SAS [19], and SAP
[18].
Since a document usually involves many diverse concepts, a
document is also multi-dimensional in nature. Document warehouses, unlike
traditional document management systems, include extensive semantic information
about documents, cross-document feature relations, and document
grouping or clustering to provide more accurate and more efficient access
to text-oriented business intelligence. To facilitate flexible and effective
multi-dimensional on-line analytical document processing and browsing, a
multi-dimensional query language for querying document warehouses is
indispensable. A good query language for document warehousing may help
standardize the development of platforms for document warehousing and text
mining systems.
In this paper, we extend the structure of MDX into a new one, called MD2X
(Multi-Dimensional Document Expressions), containing complete SQL-like
constructs for querying document warehouses. Traditional MDX contains only
SELECT, FROM, and WHERE clauses. Although this is rich enough for
data warehouse queries, it is not rich enough for document warehouse
queries. We present how to extend the language constructs into
MD2X to include GROUP BY, HAVING, and ORDER BY clauses, adopting
an SQL-like syntax that reflects the complete language structure of traditional
SQL.
This work is essential for establishing an infrastructure that helps combine
text processing with numeric OLAP processing technologies. We hope that the
combination of data warehousing and document warehousing will become one of
the most important kernels of knowledge management and customer relationship
management applications.
As Web applications proliferate, there will be a great need for rapid text
processing and browsing. Document warehousing not only provides an
infrastructure for developing tools with which business executives
can systematically organize, understand, and properly categorize their documents
to support strategic decision-making, but also integrates all kinds of related
documents so that they can be browsed instantly.
Document warehousing also provides an important platform for text-level on-line
analytical processing (OLAP), supporting the interactive analysis of
multi-dimensional documents at various granularities. It facilitates effective
text mining, integrates documents into the business intelligence infrastructure,
and provides the means to search for and target specific information the
way we now do with numeric data. Furthermore, just as the construction of data
warehouses can be viewed as an important step for data mining, the construction
of document warehouses can be regarded as an indispensable preprocessing
step for text mining. No matter how wonderful the
mechanism a system adopts, it cannot do much without a good content organization
of the domain on which it is to work. Moreover,
once a good content organization is available, many different mechanisms
might be employed equally well to implement effective systems. A well-organized
document warehouse provides exactly such a content organization for various
mechanisms to work on.
This paper is organized as follows. In Section 2, the important concepts of
document warehousing are formally presented. Based on these concepts, we extend
MDX to the MD2X structures for querying document warehouses in Sections
3 and 4. Then, in Section 5, we put all the necessary constructs together to
form an SQL-like multi-dimensional query language and discuss the evaluation
steps for document warehousing. Finally, we conclude and propose some future
work in Section 6.
2. An introduction to document warehousing
In the following, we define dimensions, document tuples,
and document cubes for document warehousing. Based on these
definitions, MD2X will be formally defined in the following
sections.
Definition 1. A dimension $D$ is a tree structure of $m$ levels, $m \ge 1$, which is used
for representing the hierarchical relationships among a set of keywords. A node
in a dimension $D$ is called a member, and each internal node contains a special
child called the summary member, denoted '*', which denotes the total
concept of the other children of that internal node.
When drawing a dimension, we usually leave out the summary members, since
each has the same meaning as its parent node. Moreover, the keywords in a dimension
are not limited to those contained in document contents. Any property
or metadata of a document file (e.g., those defined in the Dublin Core
Metadata Element Set [8]) can also be regarded as a keyword in a dimension
for constructing document cubes. That is, according to the keyword sources,
dimensions can be divided into the following two types:
1. Metadata dimension. A dimension containing keywords used for scanning document
file properties or metadata. For example, the elements of the Dublin Core Metadata
Element Set (title, creator, subject, description, publisher, contributor,
date, type, format, identifier, source, language, relation, coverage, and
rights) can all be regarded as keywords in their corresponding metadata
dimensions.
2. Ordinary dimension. A dimension containing keywords used for scanning the
document contents.
To simplify our discussion, we mainly use ordinary dimensions, together
with the metadata dimension time (i.e., date), in the following examples.
Definition 2. For a dimension $D$, the $i$th-level member set, denoted $D(i)$, is
defined as $D(i) = \{a \mid a \text{ is a member in the } i\text{th level of } D \text{, but } a \text{ is not a summary member}\}$.
In addition, we use $D(0)$ to denote the set of all non-summary
members in $D$, which is the union of all $i$th-level member sets in $D$. That is,
$D(0) = \bigcup_{1 \le i \le h} D(i)$, where $h$ is the height of $D$. In practice, each $D(i)$ has a
specific name, which will be called the $i$th-level name.
In practice, a dimension can be constructed from a relational table, with
each level corresponding to an attribute of the relation; the attribute names
are usually used as the corresponding level names. In addition, any keyword in a
dimension can be implemented as a set of synonyms to encompass more semantics.
To illustrate the above definitions, we give the following example.
Example 1. Suppose there is a relation Region representing the regions of
Taiwan, as shown in Table 1. This relation can be used to construct a
dimension, denoted R and depicted in Fig. 1, where the first level corresponds to
the dimension itself (commonly denoted '(All Region)'), and the second
and third levels are derived from the attributes Location and City, respectively.
All nodes labeled '*' are summary members. That is, the summary member
in the second level has the same meaning as all regions in Taiwan, which
represents {South, North}. Likewise, the summary members under South and
North have the same meanings as South and North, denoting
{Tainan, Kaohsiung, Pingtong} and {Taipei, Taoyun, Hsinchu}, respectively.
Omitting all the summary members, Fig. 1 is redrawn as Fig. 2.
According to the illustration of dimension R, we know that R(1) = {(All
Region)}, R(2) = {South, North}, R(3) = {Tainan, Kaohsiung, Pingtong,
Taipei, Taoyun, Hsinchu}, and R(0) = {(All Region), South, North, Tainan,
Kaohsiung, Pingtong, Taipei, Taoyun, Hsinchu}. In Fig. 3, another dimension,
denoted P, representing the products of a company manufacturing consumer
electronics is concisely depicted. Both dimensions will be used in the following
examples.

Table 1
A relation Region for constructing dimension R

Region
...   Location   City        ...
...   North      Taipei      ...
...   North      Taoyun      ...
...   North      Hsinchu     ...
...   South      Tainan      ...
...   South      Kaohsiung   ...
...   South      Pingtong    ...

Fig. 1. An illustration of dimension R (level 1: All Region; level 2: North, South, and a summary member *; level 3: Taipei, Taoyun, Hsinchu, Tainan, Kaohsiung, Pingtong, and two summary members *).

Fig. 2. A concise illustration of dimension R (levels 1 to 3, with level names All Region, Location, and City; summary members omitted).

Fig. 3. A concise illustration of dimension P (level 1: All Product; level 2: Appliance, Computer, Communication; level 3: TV, Refrigerator, Cellular Phone, Radio, Monitor, Printer).
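The construction in Example 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the function name `level_member_sets` and the constant `REGION_ROWS` are hypothetical, and the relation is reduced to its Location and City columns.

```python
from collections import defaultdict

# Rows of the relation Region from Table 1, as (Location, City) pairs.
REGION_ROWS = [
    ("North", "Taipei"), ("North", "Taoyun"), ("North", "Hsinchu"),
    ("South", "Tainan"), ("South", "Kaohsiung"), ("South", "Pingtong"),
]

def level_member_sets(root, rows):
    """Return {i: D(i)} for a dimension whose level 1 is `root` and whose
    deeper levels are the columns of `rows`; key 0 holds the union D(0)."""
    levels = defaultdict(set)
    levels[1].add(root)
    for row in rows:
        # Column 0 of the relation feeds level 2, column 1 feeds level 3, ...
        for depth, member in enumerate(row, start=2):
            levels[depth].add(member)
    # D(0) is the union of all level member sets (summary members are never stored).
    levels[0] = set().union(*levels.values())
    return dict(levels)

R = level_member_sets("(All Region)", REGION_ROWS)
```

Here `R[2]` plays the role of R(2) and `R[0]` the role of R(0); summary members are left implicit, as in Fig. 2.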
For a dimension D, there are two basic operations called drill-down and
roll-up, which are formally defined as follows.
Definition 3. For a dimension D, expanding an internal node to obtain all of its
children is called drill-down, and shrinking a set of children to obtain their
common parent is called roll-up.
By rolling up and drilling down, users can browse a document cube from
different perspectives, obtaining further insight into the relationships among
documents. This is further clarified by the following definitions.
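As a rough illustration of Definition 3, the two operations can be written against a child-to-parent map of dimension R. The map `PARENT` and the function names are assumptions made for this sketch; the paper does not prescribe a representation.

```python
# Parent of each non-root member of dimension R (summary members omitted).
PARENT = {
    "North": "(All Region)", "South": "(All Region)",
    "Taipei": "North", "Taoyun": "North", "Hsinchu": "North",
    "Tainan": "South", "Kaohsiung": "South", "Pingtong": "South",
}

def drill_down(member):
    """Expand an internal node into the set of its children."""
    return {child for child, parent in PARENT.items() if parent == member}

def roll_up(members):
    """Shrink a set of sibling members to their common parent."""
    parents = {PARENT[m] for m in members}
    if len(parents) != 1:
        raise ValueError("members do not share a common parent")
    return parents.pop()
```

For instance, drilling down on South yields its three cities, and rolling those cities up returns South again.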
Definition 4. For any two n-tuples of keywords $A = (a_1, a_2, \ldots, a_i, \ldots, a_n)$ and
$B = (b_1, b_2, \ldots, b_i, \ldots, b_n)$ defined on $n$ dimensions $(D_1, D_2, \ldots, D_i, \ldots, D_n)$, where
$a_i, b_i \in D_i(0)$, we say that $B$ is a member of drilling down $A$ along dimension $D_i$,
denoted A �i B, if and only if there exists exactly one $i$, $1 \le i \le n$, such that $b_i$ is
a child of $a_i$ in $D_i$, and $b_j = a_j$ for all $j \ne i$.
Definition 5. For a document $T$ with unique identifier $id_T$, a document index of
$T$ defined on $n$ dimensions $(D_1, D_2, \ldots, D_n)$ is denoted $x = (id_T, K_T)$, where
$K_T = (K_1, K_2, \ldots, K_i, \ldots, K_n)$ is an n-tuple of keyword sets such that, for all
$1 \le i \le n$, every keyword $t_{ij} \in K_i$ occurs in $T$ and $t_{ij} \in D_i(0)$.
In the following, the first and second components
of a document index $x = (id_T, K_T)$ will be denoted $x^1$ and $x^2$ (i.e., $x^1 = id_T$ and
$x^2 = K_T$), respectively. When $|K_i| = 1$ for all $i$, the document index is also called a
base document index, and each $K_i$ can be denoted by its only element for
convenience. (That is, in such cases, $K_T = (\{t_1\}, \{t_2\}, \ldots, \{t_i\}, \ldots, \{t_n\})$ can be
abbreviated as $K_T = (t_1, t_2, \ldots, t_i, \ldots, t_n)$.) If there is at least one $K_i$ such that
$|K_i| > 1$, and the sizes of the other $K_j$ all equal 1, then the document index
is also called a composite document index. Finally, if there is some $K_i$ such
that $|K_i| = 0$, then the document index is also called a degenerate document
index.

To whom it may concern:
We have bought a TV from your Kaohsiung branch last weekend. However, we found the screen is severely unstable. Please give us the phone number of your service center. Thank you for your kindly help.
Sincerely,
Frank S.C. Tseng

Fig. 4. A complaint e-mail issued by a customer (A0001).
Example 2. Suppose there is a complaint e-mail issued from a customer as
shown in Fig. 4. Then, a base document index of T defined on the above two
dimensions (R,P) can be obtained as x = (A0001, ({Kaohsiung},{TV})), where
A0001 is the unique identifier of T.
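The indexing step in Example 2 can be sketched as follows. The whole-word, case-sensitive matching used here is an assumption made for illustration (the paper does not say how keyword occurrences are detected), and the names `document_index` and `classify` are hypothetical.

```python
import re

# D(0) of the two dimensions used in Example 2.
R0 = {"(All Region)", "North", "South", "Taipei", "Taoyun", "Hsinchu",
      "Tainan", "Kaohsiung", "Pingtong"}
P0 = {"(All Product)", "Appliance", "Computer", "Communication", "TV",
      "Refrigerator", "Cellular Phone", "Radio", "Monitor", "Printer"}

EMAIL = ("To whom it may concern: We have bought a TV from your Kaohsiung "
         "branch last weekend. However, we found the screen is severely "
         "unstable. Please give us the phone number of your service center. "
         "Thank you for your kindly help. Sincerely, Frank S.C. Tseng")

def document_index(doc_id, text, dimensions):
    """Return (id_T, K_T), where K_T holds one keyword set per dimension."""
    keyword_sets = tuple(
        frozenset(k for k in members
                  if re.search(r"\b" + re.escape(k) + r"\b", text))
        for members in dimensions
    )
    return doc_id, keyword_sets

def classify(index):
    """Name the kind of document index according to the sizes of its K_i."""
    sizes = [len(k) for k in index[1]]
    if any(size == 0 for size in sizes):
        return "degenerate"
    if all(size == 1 for size in sizes):
        return "base"
    return "composite"

INDEX = document_index("A0001", EMAIL, (R0, P0))
```

Under these assumptions `INDEX` corresponds to the base document index (A0001, ({Kaohsiung}, {TV})) of Example 2.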
The basic component of a document cube is called a cell, which is defined as
follows.
Definition 6. A cell defined on $n$ dimensions $(D_1, D_2, \ldots, D_n)$ is denoted
$c = (t_c, X_c)$, where $t_c = (c_1, c_2, \ldots, c_i, \ldots, c_n)$, $c_i \in D_i(0) \cup \{*\}$, $1 \le i \le n$, and
$X_c = \{x_1, x_2, \ldots, x_j, \ldots, x_m\}$ is a set of base document indices of the form
$x_j = (id_{T_j}, (K_1, K_2, \ldots, K_n))$, where $id_{T_j}$ is the unique identifier of some document
$T_j$ and $K_i \cap D_i(0) \ne \emptyset$, $1 \le i \le n$. The set of all such document unique
identifiers $id_{T_j}$ involved in the cell $c = (t_c, X_c)$ is denoted
$ID(c) = \{x_j^1 \mid \forall x_j \in X_c\}$, where $x_j^1$ is the first component of $x_j$.
That is, a document with unique identifier in $ID(c)$ can be directly accessed
from the cell $c$.
Definition 7. A cell $c = (t_c, X_c)$, where $t_c = (c_1, c_2, \ldots, c_i, \ldots, c_n)$, defined on $n$
dimensions $(D_1, D_2, \ldots, D_n)$ is called an $m$-d cell, $0 \le m \le n$, if and only if there
are exactly $m$ non-summary members $c_i$ (i.e., $c_i \ne *$). If $m = n$, then $c$ is also
called a base cell; otherwise, if $m < n$, then $c$ is also called a non-base cell.
Definition 8. An $n$-dimensional $i$-d cell $a = ((a_1, a_2, \ldots, a_n), X_a)$ is a parent of
another $n$-dimensional $j$-d cell $b = ((b_1, b_2, \ldots, b_n), X_b)$ if and only if the
following conditions hold:
1. $j = i + 1$.
2. There exists exactly one $k$ such that $a_k$ is the parent of $b_k$ in $D_k$, and $a_l = b_l$
for all $l \ne k$, $1 \le l \le n$.
3. $ID(b) \subseteq ID(a)$, where $ID(a)$ and $ID(b)$ are the sets of all document unique
identifiers involved in the cells $a$ and $b$, respectively.

Fig. 5. A sample illustration of a document cube (axes: region, product, and time; cells a and d, with their document sets ID(a) and ID(d) pointing to documents in S).
Definition 9. A document cube $DC = (S, (D_1, D_2, \ldots, D_n))$, where $S$ is a set of
documents defined on $n$ dimensions $(D_1, D_2, \ldots, D_n)$, is a cube composed of all
cells $c_i = (t_{c_i}, X_{c_i})$ with $t_{c_i} \in \prod_{1 \le j \le n} D_j(0)$ and $ID(c_i) \subseteq S$.
A sample illustration of a document cube DC = (S, (R, P, T)) is shown in
Fig. 5, where R and P represent the aforementioned dimensions Region and
Product, respectively, and T1, T2, and T3 are documents in S. In addition, we
assume T is a dimension representing time.
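One possible reading of the cell definitions can be sketched in Python. The sketch assumes that a document belongs to a cell when, on every dimension, the cell coordinate is the summary member '*' or an ancestor-or-self of the document's keyword; this roll-up rule is an interpretation, not a statement from the paper. Documents A0002 and A0003 are hypothetical, the hierarchy of P is left out because only exact product members are queried, and all function names are illustrative.

```python
PARENT_R = {
    "North": "(All Region)", "South": "(All Region)",
    "Taipei": "North", "Taoyun": "North", "Hsinchu": "North",
    "Tainan": "South", "Kaohsiung": "South", "Pingtong": "South",
}
PARENT_P = {}  # hierarchy of P not needed for this sketch
PARENTS = (PARENT_R, PARENT_P)

# Base document indices (id, (region keyword, product keyword)).
# A0001 comes from Example 2; A0002 and A0003 are made up for illustration.
INDICES = [
    ("A0001", ("Kaohsiung", "TV")),
    ("A0002", ("Tainan", "TV")),
    ("A0003", ("Taipei", "Printer")),
]

def covers(member, keyword, parent):
    """True if keyword equals member or lies below it in the dimension tree."""
    while keyword is not None:
        if keyword == member:
            return True
        keyword = parent.get(keyword)
    return False

def cell_ids(coords, indices, parents):
    """ID(c): identifiers of the documents that fall under every coordinate."""
    return {
        doc_id
        for doc_id, keywords in indices
        if all(c == "*" or covers(c, k, p)
               for c, k, p in zip(coords, keywords, parents))
    }

def cell_order(coords):
    """m of an m-d cell: the number of non-summary coordinates (Definition 7)."""
    return sum(c != "*" for c in coords)
```

With this reading, the document set of the cell (Kaohsiung, TV) is contained in that of (South, TV), which matches condition 3 of Definition 8.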
3. Basic concepts in multi-dimensional expressions
Designing a comprehensive query language for document warehousing
is challenging because document warehousing covers a wide spectrum of
concepts, as shown in Section 2. Fortunately, Multi-Dimensional
eXpressions (MDX) have already been established for data warehousing.
Starting from the constructs of MDX, we extend them to include more of the
features found in traditional SQL. Before introducing
MDX itself, we first present the concepts of members, tuples, and sets, as well as the
MDX syntax used to construct and refer to these elements.
All items enclosed in angle brackets represent non-terminals.
3.1. Members
A member is an item in a dimension representing one or more occurrences
of keywords in documents. It may be associated with some member properties.
When describing cell data in a cube, members are the lowest level of reference.
A member can be regarded as one or more records in the underlying relation whose
value in the corresponding column falls under a specific category. There are many
different ways to specify member names. The simplest is to write only the
name of the member in square brackets, like [South]. If the name of a member
contains no space or number, the square brackets can be omitted.
In addition, to resolve duplicate member names across dimensions, MDX allows us
to qualify a member name with its dimension name and the ancestor members
along the path from the dimension root to the member itself (this is called a
fully qualified name), such as [Region].[South].[Kaohsiung].
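The naming rules just described can be made concrete with a small sketch. Both helper names are hypothetical, and the bracket rule is taken literally from the text (brackets are needed only when the name contains a space or a digit); real MDX engines may apply further rules.

```python
def member_name(name):
    """Bracket a member name only when it contains a space or a digit."""
    needs_brackets = any(ch.isspace() or ch.isdigit() for ch in name)
    return f"[{name}]" if needs_brackets else name

def qualified_name(path):
    """Join the dimension name and the members on the path to the target."""
    return ".".join(f"[{part}]" for part in path)
```

For example, `qualified_name(["Region", "South", "Kaohsiung"])` produces the fully qualified name used in the text.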
3.2. Tuples
A tuple is used to define a slice of objects from a document cube and can be
regarded as a vector of members. It is an ordered collection of members, one
from each of one or more dimensions, and is used to identify specific sections of
multi-dimensional objects from a document cube. A tuple composed of one
member from each dimension in a cube completely describes a cell value.
The syntax for specifying a tuple in MDX is
(member-of-D1, member-of-D2, ..., member-of-Dn)
If a tuple contains only one member, the parentheses can be
omitted.
3.3. Sets
A set in MDX is an ordered collection of zero, one, or more tuples. It is most
commonly used to define axis and slicer dimensions in an MDX query, and as
such may have only a single tuple or may be, in certain cases, empty.
All members in the same set have to come from the same dimension (even
though they can be from different levels). Therefore, {(North), (Kaohsiung)}
(which can be abbreviated as {North, Kaohsiung}) is valid, but {(North),
(Printer)} is invalid. Note also that sets, like tuples, have dimensionality.
Since a set is composed of tuples, the dimensionality of a set is expressed by
the dimensionality of each tuple within it. Because of this, tuples within a set
must have the same dimensionality. That is, {(North, Printer), (Kaohsiung,
Computer)} is valid, but {(North, Printer), (Computer, Kaohsiung)} is not
valid, because the order of dimensions in the second tuple is reversed.
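The dimensionality rule for sets can be checked mechanically. The sketch below is illustrative only: the lookup table `DIMENSION_OF` lists just the members that appear in the examples above, and the function names are hypothetical.

```python
# Dimension of each member used in the examples (Region or Product).
DIMENSION_OF = {
    "North": "Region", "South": "Region", "Kaohsiung": "Region",
    "Printer": "Product", "Computer": "Product",
}

def dimensionality(tup):
    """The ordered list of dimensions that a tuple's members come from."""
    return tuple(DIMENSION_OF[member] for member in tup)

def is_valid_set(tuples):
    """A set is valid when all of its tuples share one dimensionality."""
    return len({dimensionality(t) for t in tuples}) <= 1
```

An empty set passes the check, in line with the remark that a set may be empty in certain cases.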
3.4. Axes
In traditional relational databases, a relation is often said to be a
two-dimensional table. However, this is not exactly true. A relation of
degree m should rather be regarded as an asymmetric table consisting of m
one-dimensional data. Rows in a relation all have the same structure, which is
defined by the columns, whereas the columns may be of different types and have
different meanings.
In the multi-dimensional world, we can specify any number of dimensions to
form the result of a query. (In practice there are limits, of course; usually
up to 128 dimensions can be specified.) These dimensions in an
MDX query are called axes. An axis is a collection of dimension members or,
more generally, tuples. All axes in an MDX query are perfectly symmetric,
so the result can be pivoted, rotating the data axes in view, to
provide an alternative presentation of the query result. Based on the concept
of axes, we can perform slice and dice operations on a given cube: the former
performs a selection on one dimension, and the latter defines a sub-cube by
performing a selection on two or more dimensions.
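Slice and dice can be illustrated on a tiny cube stored as a dictionary from cell coordinates to document identifiers. Everything in this sketch is assumed for illustration: the cell contents, the year values on the time axis, and the function names are not taken from the paper.

```python
# Cells keyed by (region, product, time); values are sets of document ids.
CELLS = {
    ("Kaohsiung", "TV", "2003"): {"A0001"},
    ("Tainan", "TV", "2004"): {"A0002"},
    ("Taipei", "Printer", "2004"): {"A0003"},
}

def select(cells, criteria):
    """Keep the cells whose coordinate on each listed axis is allowed."""
    return {
        coords: ids
        for coords, ids in cells.items()
        if all(coords[axis] in allowed for axis, allowed in criteria.items())
    }

def slice_cube(cells, axis, member):
    """Slice: a selection on a single dimension."""
    return select(cells, {axis: {member}})

def dice_cube(cells, criteria):
    """Dice: a sub-cube defined by selections on two or more dimensions."""
    if len(criteria) < 2:
        raise ValueError("dice selects on two or more dimensions")
    return select(cells, criteria)
```

Slicing on product TV keeps two cells, while dicing on product TV and time 2004 narrows the result to a single cell.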
There are many ways to define an axis in MDX. The simplest form
presents all members of a certain dimension on an axis, using the following
syntax:
<Dimension_name>.MEMBERS
Similarly, if we want to see all dimension members belonging to a certain
level of a dimension, the syntax is
<Dimension_name>.<Level_name>.MEMBERS
We can also specify a list of members in curly braces as an axis definition
by the following syntax:
{member-of-D1, member-of-D2, ..., member-of-Dn}
Note that, in data warehouses, measures are defined for the values being
aggregated in the data cube. In practice, a default measure is defined for
display when no measures are specified. Although measures are conceptually
different from dimensions when a cube is defined, they are treated
as a dimension with a flat structure. Therefore, all the measures can
also be selected by the following syntax:
MEASURES.MEMBERS
In summary, Table 2 lists the analogies between relational and
multi-dimensional terms for comparing SQL and MDX queries. Note
that in document warehouses, measures will be regarded as a set of document
pointers or as the member count of that set (when using COUNT() to count the
number of documents).
3.5. Basic constructs in MDX queries
To specify a dataset, an MDX query must contain the following
information:
1. The number of axes and the members from each dimension to include on
each axis of the query. This is addressed by the SELECT clause.
2. The name of the cube that sets the context of the query. The FROM clause
is used for such purpose.
3. The members from a slicer dimension on which data is sliced for members
from the axis dimensions. This is optional. If there are slicer dimensions,
then they can be specified explicitly by using the WHERE clause.
Therefore, a basic MDX query is structured by the following three clauses:
SELECT [<axis_specification> [, <axis_specification> ...]]
FROM [<cube_specification>]
[WHERE [<slicer_specification>]]
The simplest MDX query specifies a SELECT clause with no axes. In
that case, the SELECT clause is empty, and the query selects just one cell,
which contains the aggregation with all dimensions acting as slicer dimensions.
For general query statements, we explain these clauses in the following
subsections.

Table 2
A comparison between multi-dimensional and relational terms

Multi-dimensional term   Relational term
Cube                     Relation
Level                    Attribute (string or discrete numeric)
Dimension                Some related attributes in a relation
Measure                  Attribute (discrete or continuous numeric)
Dimension member         The value in the specific row and column
                         corresponding to a given dimension level
Axes                     Projected attributes
3.5.1. The SELECT clause
In MDX, the SELECT clause is used to specify a dataset containing a subset
of multi-dimensional data. Axis dimensions determine the edges of a
multi-dimensional result set. The SELECT clause specifies axis dimensions
by assigning a set to a specific axis. Each <axis_specification> defines one axis
dimension, and thus the number of axes in the dataset equals the number of
<axis_specification> values in the SELECT clause. Each <axis_specification>
can be broken down as follows:
<axis_specification> ::= <set> ON <axis_name>
<axis_name> ::= COLUMNS
              | ROWS
              | PAGES
              | SECTIONS
              | CHAPTERS
              | AXIS(<index>)
The <index> is the axis number. For the first five axes, AXIS(0),
AXIS(1), AXIS(2), AXIS(3), and AXIS(4), the aliases
COLUMNS, ROWS, PAGES, SECTIONS, and CHAPTERS can be used
as alternatives, respectively. It is invalid to skip axes. For example, a query
cannot have an AXIS(2) without an AXIS(1) and an AXIS(0).
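The rule against skipping axes can be checked with a short sketch. The helper names are hypothetical and the parsing is deliberately minimal; it only recognizes the five aliases and the AXIS(<index>) form described above.

```python
# Aliases for AXIS(0) through AXIS(4), in order.
AXIS_ALIASES = ["COLUMNS", "ROWS", "PAGES", "SECTIONS", "CHAPTERS"]

def axis_index(name):
    """Map an axis name such as ROWS or AXIS(3) to its axis number."""
    name = name.strip().upper()
    if name in AXIS_ALIASES:
        return AXIS_ALIASES.index(name)
    if name.startswith("AXIS(") and name.endswith(")"):
        return int(name[5:-1])
    raise ValueError(f"unknown axis name: {name}")

def axes_are_contiguous(names):
    """True when the axes used are exactly AXIS(0), AXIS(1), ..., with no gap."""
    indices = sorted(axis_index(n) for n in names)
    return indices == list(range(len(indices)))
```

A query using COLUMNS and ROWS passes the check, whereas one using COLUMNS and AXIS(2) does not, because AXIS(1) is skipped.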
3.5.2. The FROM clause
The FROM clause determines the cube context: the <cube_specification>
indicates the cube on which the MDX query is to run. Unlike
SQL, the FROM clause in an MDX query usually does not allow joins on
two or more cubes. Some OLAP servers may permit the joining of cubes when
the cubes share some dimensions. However, we will not discuss such cases,
since they are beyond the scope of our work.
3.5.3. The WHERE clause
The slicer dimensions in a WHERE clause are used for filtering
dimensional data. A slicer dimension can accept only expressions that evaluate to
a single tuple. If a set of tuples is supplied as the slicer expression, then the set
will be evaluated to aggregate the result cells in every tuple along the set.
In addition, if two or more measures have been defined, then, since measures
are treated in MDX exactly the same as any other dimension, we may
employ a WHERE clause to indicate a specific measure instead of the default
measure.
4. Basic constructs of multi-dimensional document expressions
In Section 3.5, we have briefly discussed the constructs of MDX used in data
warehouse queries. Since the measures in a data cube are all aggregated into a
single value either by SUM(), MAX(), MIN(), AVG(), or COUNT(), there is
no need to define GROUP BY, HAVING, and ORDER BY clauses in MDX. However, in document warehouses, document context is non-numerical
and cannot be aggregated into measures (except for using COUNT() to count
the number of documents). As we have defined in Definitions 8 and 9, for a
document cell c, ID(c) contains a set of document pointers pointing to a set
of documents. As the measures of a data cube can be viewed as an aggregated
target for data warehousing, the union of all ID(ci) of a document cube can be
regarded as a fetching target for document warehousing.
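To make the notion of a fetching target concrete, the following Python sketch models a document cube as a mapping from cell coordinates to sets of document pointers. It is a toy illustration under our own assumptions: the cell contents are invented, and the names document_cube, fetching_target, and document_count are hypothetical.

```python
# Hypothetical toy document cube: each cell c is keyed by its coordinates,
# and ID(c) is modeled as a set of document identifiers.
document_cube = {
    ("TV", "North"): {"Doc A", "Doc B"},
    ("TV", "South"): {"Doc C"},
    ("Printer", "North"): {"Doc B", "Doc D"},
    ("Printer", "South"): set(),
}


def fetching_target(cube):
    """Union of ID(c) over all cells: the documents the cube can fetch."""
    target = set()
    for doc_ids in cube.values():
        target |= doc_ids
    return target


def document_count(cube, cell):
    """COUNT() is the one aggregate that still applies to a document cell."""
    return len(cube[cell])
```

Note that a document may be pointed to from several cells (Doc B above), which is why the fetching target is a set union rather than a concatenation.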
In this section, we will further extend the syntax of MDX into MD2X for multi-dimensional queries in document warehouses. MD2X is similar in structure
to the Structured Query Language (SQL) syntax for easy integration with
relational query processing. We will define GROUP BY,
HAVING, and ORDER BY clauses as traditional SQL does. That is, a basic MD2X
query is structured by the following clauses:
SELECT [<axis_specification> [, <axis_specification> . . .]]
FROM [<cube_specification>]
[WHERE [<slicer_specification>]]
[GROUP BY <groupby_specification>
   [HAVING <filter_specification>]
   [ORDER BY <orderby_specification>]]
Note that the SELECT and FROM clauses are mandatory, and the other
clauses are optional. Besides, a HAVING clause must be preceded by a GROUP BY clause, as in traditional SQL queries. Moreover, an ORDER
BY clause must also be preceded by a GROUP BY clause, which differs
from traditional SQL.
We present a sample MD2X query consisting of only SELECT and FROM
clauses in Example 3a.
Example 3a. Based on the document cube illustrated in Fig. 5, we may
issue an example MD2X query consisting of SELECT and FROM clauses
only:
SELECT {P.[Appliance].[TV], P.[Appliance].[Refrigerator],
P.[Communication].[Cellular Phone], P.[Communication].[Radio],
P.[Computer].[Monitor], P.[Computer].[Printer]} ON COLUMNS,
{R.[North], R.[South]} ON ROWS
FROM DC
Then, the query result may be visualized as Table 3 illustrates. If we add a
WHERE clause to the above example, as presented in Example 3b, then
documents that do not satisfy the <slicer_specification> are excluded from the
query result.
Example 3b. Suppose there is a WHERE clause added in the query presented
in Example 3a:
SELECT {P.[Appliance].[TV], P.[Appliance].[Refrigerator],
P.[Communication].[Cellular Phone], P.[Communication].[Radio],
P.[Computer].[Monitor], P.[Computer].[Printer]} ON COLUMNS,
{R.[North], R.[South]} ON ROWS
FROM DC
WHERE Time.[2003].[Apr]
Then, only those documents created in April 2003 are included in the query
result.
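The effect of the SELECT axes and the WHERE slicer in Examples 3a and 3b can be imitated with a few lines of Python. This is a minimal sketch under our own assumptions: the document records are invented, documents are stored as flat tuples rather than in a cube, and the names DOCS and select are hypothetical.

```python
from collections import defaultdict

# Hypothetical document records: (doc_id, product, region, month).
DOCS = [
    ("Doc A", "TV", "North", "2003-Apr"),
    ("Doc B", "TV", "North", "2003-May"),
    ("Doc C", "Printer", "South", "2003-Apr"),
    ("Doc D", "Radio", "South", "2003-Apr"),
]


def select(docs, columns, rows, slicer_month=None):
    """Place document pointers on a ROWS x COLUMNS grid, optionally sliced by month."""
    result = defaultdict(set)
    for doc_id, product, region, month in docs:
        if product not in columns or region not in rows:
            continue  # member is not placed on an axis
        if slicer_month is not None and month != slicer_month:
            continue  # excluded by the WHERE slicer
        result[(region, product)].add(doc_id)
    return dict(result)
```

Calling select(DOCS, ["TV", "Printer"], ["North", "South"]) corresponds to a query without a WHERE clause, while passing slicer_month="2003-Apr" additionally drops Doc B.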
Table 3
A sample query result on document cube DC
Appliance Communication Computer
TV Refrigerator Cellular Phone Radio Monitor Printer
North
Doc 024 Doc 001
Doc 021 Doc 008 Doc 017 Doc 012 Doc 016 Doc 010 Doc 020 Doc 002
Doc 007 Doc 006
Doc 018 Doc 023 Doc 022 Doc 019
Doc 001 Doc 010 Doc 023
Doc 002 Doc 006
South
Doc 011 Doc 003 Doc 015
Doc 004 Doc 005
Doc 013 Doc 009 Doc 014
Doc 013 Doc 014
Doc 009 Doc 015
4.1. The GROUP BY clause
The <groupby_specification> in a GROUP BY clause specifies how to
group a set of document pointers according to a tuple consisting of specific
levels in dimensions that do not occur in the SELECT clause. If there is
no GROUP BY clause in an MD2X query, then the situation is the same as using GROUP BY with a <groupby_specification> consisting of all top levels in
the dimensions that do not occur in the SELECT clause. Each <groupby_specification>
can be broken down as follows:
<groupby_specification> ::= (<Dimension_name>[.<Level_name>], . . .);
In Example 3a, the document pointers in each cell are presented without
grouping. We may add a GROUP BY clause as follows to group together
document pointers pointing to documents in the same year-month.
Example 4. We extend the query in Example 3a with a GROUP BY clause as
follows.
SELECT {P.[Appliance].[TV], P.[Appliance].[Refrigerator],
P.[Communication].[Cellular Phone], P.[Communication].[Radio],
P.[Computer].[Monitor], P.[Computer].[Printer]} ON COLUMNS,
{R.[North], R.[South]} ON ROWS
FROM DC
GROUP BY (Time.[2003].[Month])
Then, the query result may be visualized as Table 4 depicts. The year-month
pairs in boldface indicate the document pointers in the same group pointing to documents created in the same year-month.
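The grouping step of Example 4 can be sketched as follows. This is only an illustration under our own assumptions: the cell contents are invented, and the names cell and group_by_month are hypothetical.

```python
from collections import defaultdict

# Hypothetical contents of one result cell: document pointer -> creation month.
cell = {"Doc A": "2003-Apr", "Doc B": "2003-May", "Doc C": "2003-Apr"}


def group_by_month(cell_docs):
    """Partition the document pointers of a cell into groups keyed by year-month."""
    groups = defaultdict(list)
    for doc_id, month in cell_docs.items():
        groups[month].append(doc_id)
    return dict(groups)
```

Each key of the returned dictionary plays the role of a year-month label in the grouped presentation, and each value holds the document pointers of that group.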
4.2. The HAVING clause
Just as the slicer dimensions in a WHERE clause are used for filtering
dimensional data, the <filter_specification> is used for eliminating groups
that do not satisfy the specified condition. In traditional SQL, the
<filter_specification> typically contains an aggregate function such as SUM(), MIN(), MAX(), AVG(), or COUNT(). In MD2X, if there is an aggregate
function, then it must be of the form COUNT(<groupby_specification>), where
<groupby_specification> is the specification that appears in the preceding GROUP BY
clause, since the other aggregate functions make no sense. Besides, the
<filter_specification> can also contain a specification consisting of a set of
Table 4
A sample query result with grouping on document cube DC
Appliance Communication Computer
TV Refrigerator Cellular Phone Radio Monitor Printer
Doc 002 (2003-Mar)
Doc 001 (2003-Jun)
Doc 002 (2003-Aug)
Doc 024 (2003-Mar)
Doc 007 Doc 006 (2003-Jul)
Doc 018 Doc 019 (2003-May) Doc 010
(2003-Jul) Doc 010 Doc 012 (2003-Apr)
Doc 016 Doc 017 (2003-May)
North
Doc 001 (2003-Apr)
Doc 020 Doc 021 (2003-Jun)
Doc 023 Doc 022 (2003-Jul)
Doc 023 (2003-Aug)
Doc 006 (2003-Sep)
Doc 003 (2003-Mar)
Doc 009 (2003-Mar)
Doc 009 (2003-Mar) South
Doc 011 (2003-Mar)
Doc 015 (2003-May)
Doc 004 Doc 005 (2003-May) Doc 013
Doc 014 (2003-Sep)
Doc 013 Doc 014 (2003-Jun) Doc 015
(2003-Aug)
<Dimension_name>.<Member> elements separated by commas, such that each
<Dimension_name>.<Member> is subsumed by the <groupby_specification> in the
preceding GROUP BY clause. That is, the set of elements of the form
<Dimension_name>.<Member> is used to discard those groups that do not appear
in the set. Each <filter_specification> can be broken down as follows:
<filter_specification> ::= <logical_expression>
   | {<Dimension_name>.<Member> [, . . .]};
<logical_expression> ::= <condition>
   | (<condition>)
   | <condition> AND <logical_expression>
   | <condition> OR <logical_expression>
   | NOT <logical_expression>
   ;
<condition> ::= COUNT (<groupby_specification>) <theta_op> <constant>
   ;
<theta_op> ::= > | < | <= | >= | <> | =;
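The <logical_expression> grammar can be read as a small expression tree over COUNT conditions. The following Python sketch evaluates such a tree for a single group. It assumes that the parsed expression is given as nested tuples, a representation chosen here purely for illustration; THETA_OPS and evaluate are hypothetical names.

```python
import operator

# The six <theta_op> comparison operators of the grammar.
THETA_OPS = {
    ">": operator.gt,
    "<": operator.lt,
    "<=": operator.le,
    ">=": operator.ge,
    "<>": operator.ne,
    "=": operator.eq,
}


def evaluate(expr, group_size):
    """Evaluate a <logical_expression>, given as nested tuples, for one group."""
    kind = expr[0]
    if kind == "COUNT":  # ("COUNT", theta_op, constant)
        _, op, constant = expr
        return THETA_OPS[op](group_size, constant)
    if kind == "NOT":  # ("NOT", expression)
        return not evaluate(expr[1], group_size)
    if kind in ("AND", "OR"):  # ("AND" | "OR", left, right)
        left = evaluate(expr[1], group_size)
        right = evaluate(expr[2], group_size)
        return (left and right) if kind == "AND" else (left or right)
    raise ValueError(f"unknown expression node: {kind!r}")
```

For example, the condition COUNT(...) >= 2 is written ("COUNT", ">=", 2) and holds for a group of two or more document pointers.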
Example 5. Continuing the previous example, suppose there is a HAVING clause as follows:
SELECT {P.[Appliance].[TV], P.[Appliance].[Refrigerator],
P.[Communication].[Cellular Phone], P.[Communication].[Radio],
P.[Computer].[Monitor], P.[Computer].[Printer]} ON COLUMNS,
{R.[North], R.[South]} ON ROWS
FROM DC
GROUP BY (Time.[2003].[Month])
HAVING COUNT (Time.[2003].[Month]) >= 2
Then, the query result may be visualized as Table 5 describes. Groups
with fewer than two elements are discarded.
Example 6. Alternatively, if there is a HAVING clause as follows:
SELECT {P.[Appliance].[TV], P.[Appliance].[Refrigerator],
P.[Communication].[Cellular Phone], P.[Communication].[Radio],
P.[Computer].[Monitor], P.[Computer].[Printer]} ON COLUMNS,
{R.[North], R.[South]} ON ROWS
FROM DC
GROUP BY (Time.[2003].[Month])
HAVING {Time.[2003].[Apr], Time.[2003].[Jun], Time.[2003].[Jul]}
Then, the query result may be visualized as Table 6 describes, where only those documents created in April, June, and July 2003 are included.
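The two forms of the HAVING clause used in Examples 5 and 6 can be sketched as two filters over the groups of a cell. This is a minimal illustration under our own assumptions: the groups are invented, and the names groups, having_count_at_least, and having_members are hypothetical.

```python
# Hypothetical groups for one cell, keyed by year-month as in a GROUP BY on Time.
groups = {
    "2003-Mar": ["Doc A"],
    "2003-Apr": ["Doc B", "Doc C"],
    "2003-Jun": ["Doc D"],
}


def having_count_at_least(groups, minimum):
    """HAVING COUNT(...) >= minimum: keep groups with enough documents."""
    return {key: docs for key, docs in groups.items() if len(docs) >= minimum}


def having_members(groups, members):
    """HAVING {member, ...}: keep only groups whose key is in the member set."""
    return {key: docs for key, docs in groups.items() if key in members}
```

The first filter depends only on the size of each group, whereas the second depends only on the group key.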
Table 5
A sample query result with grouping and having on document cube DC
Appliance Communication Computer
TV Refrigerator Cellular Phone Radio Monitor Printer
Doc 002 (2003-Mar)
Doc 001 (2003-Jun)
Doc 002 (2003-Aug)
Doc 024 (2003-Mar)
Doc 007 Doc 006 (2003-Jul)
Doc 018 Doc 019 (2003-May) Doc 010
(2003-Jul) Doc 010 Doc 012 (2003-Apr)
Doc 016 Doc 017 (2003-May)
North
Doc 001 (2003-Apr)
Doc 020 Doc 021 (2003-Jun)
Doc 023 Doc 022 (2003-Jul)
Doc 023 (2003-Aug)
Doc 006 (2003-Sep)
Doc 003 (2003-Mar)
Doc 009 (2003-Mar)
Doc 009 (2003-Mar) South
Doc 011 (2003-Mar)
Doc 015 (2003-May)
Doc 004 Doc 005 (2003-May) Doc 013
Doc 014 (2003-Sep)
Doc 013 Doc 014 (2003-Jun) Doc 015
(2003-Aug)
Table 6
Another query result with grouping and having on document cube DC
Appliance Communication Computer
TV Refrigerator Cellular Phone Radio Monitor Printer
Doc 002 (2003-Mar)
Doc 001 (2003-Jun)
Doc 002 (2003-Aug)
Doc 024 (2003-Mar)
Doc 007 Doc 006 (2003-Jul)
Doc 018 Doc 019 (2003-May) Doc 010
(2003-Jul) Doc 010 Doc 012 (2003-Apr)
Doc 016 Doc 017 (2003-May)
North
Doc 001 (2003-Apr)
Doc 020 Doc 021 (2003-Jun)
Doc 023 Doc 022 (2003-Jul)
Doc 023 (2003-Aug)
Doc 006 (2003-Sep)
Doc 003 (2003-Mar)
Doc 009 (2003-Mar)
Doc 009 (2003-Mar) South
Doc 011 (2003-Mar)
Doc 015 (2003-May)
Doc 004 Doc 005 (2003-May) Doc 013
Doc 014 (2003-Sep)
Doc 013 Doc 014 (2003-Jun) Doc 015
(2003-Aug)
4.3. The ORDER BY clause
The <orderby_specification> in an ORDER BY clause specifies how to
sort a set of document pointers according to some level of a given dimension.
The set of document pointers should be grouped first according to the
<groupby_specification> specified in the preceding GROUP BY clause. Besides,
the <orderby_specification> can also be any member property or dimension
created from document metadata (e.g., those defined in the Dublin Core Metadata Element Set [8]), which can be used to order elements in each group.
Each <orderby_specification> can be broken down as follows (if <order> is
missing, then the default <order> is usually ASC):
<orderby_specification> ::= <Dimension_name>.<Level> [<order>]
   | <Dimension_name>.<Member>.<Property> [<order>]
   ;
<order> ::= ASC | DESC;
Example 7. Suppose we append an ORDER BY clause to the query specified
in Example 5 as follows:
Table 7
A sample query result with grouping, having, and ordering on DC
Appliance Communication Computer
TV Refrigerator Cellular Phone Radio Monitor Printer
Doc 002 (2003-Mar)
Doc 001 (2003-Jun)
Doc 002 (2003-Aug)
Doc 024 (2003-Mar)
Doc 006 Doc 007 (2003-Jul)
Doc 018 Doc 019 (2003-May) Doc 010
(2003-Jul) Doc 010 Doc 012 (2003-Apr)
Doc 016 Doc 017 (2003-May)
North
Doc 001 (2003-Apr)
Doc 020 Doc 021 (2003-Jun)
Doc 022 Doc 023 (2003-Jul)
Doc 023 (2003-Aug)
Doc 006 (2003-Sep)
Doc 003 (2003-Mar)
Doc 009 (2003-Mar)
Doc 009 (2003-Mar) South
Doc 011 (2003-Mar)
Doc 015 (2003-May)
Doc 004 Doc 005 (2003-May) Doc 013
Doc 014 (2003-Sep)
Doc 013 Doc 014 (2003-Jun) Doc 015
(2003-Aug)
SELECT {P.[Appliance].[TV], P.[Appliance].[Refrigerator],
P.[Communication].[Cellular Phone], P.[Communication].[Radio],
P.[Computer].[Monitor], P.[Computer].[Printer]} ON COLUMNS,
{R.[North], R.[South]} ON ROWS
FROM DC
GROUP BY (Time.[2003].[Month])
HAVING COUNT (Time.[2003].[Month]) >= 2
ORDER BY Time.[2003].[Day]
Then, the query result may be visualized as Table 7 describes. Elements in
each group will be ordered by the date.
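Ordering inside each group, as in Example 7, can be sketched as follows. This is only an illustration under our own assumptions: the creation dates are invented, and the names groups and order_groups are hypothetical.

```python
from datetime import date

# Hypothetical groups for one cell: year-month -> list of (doc_id, creation date).
groups = {
    "2003-Jul": [("Doc B", date(2003, 7, 21)), ("Doc A", date(2003, 7, 3))],
    "2003-May": [("Doc C", date(2003, 5, 9))],
}


def order_groups(groups, descending=False):
    """Sort the documents inside each group by creation date (ASC by default)."""
    return {
        key: sorted(docs, key=lambda item: item[1], reverse=descending)
        for key, docs in groups.items()
    }
```

Only the order of the elements within a group changes; the membership of the groups is left untouched, which is why ORDER BY is applied after GROUP BY and HAVING.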
5. Putting it all together
In Sections 3 and 4, we have discussed the necessary constructs, which parallel
those of traditional SQL. Here we put these constructs together to see how to evaluate
an MD2X query. The complete syntax is listed as follows. The grammar has
undefined non-terminal symbols <cube_specification> and <slicer_specification>,
where the former is completed with the name of a single document cube
and the latter is any valid tuple of the form (member-of-D1, member-of-D2,
. . ., member-of-Dn).
<query> ::= SELECT [<axis_specification> [, <axis_specification> . . .]]
   FROM [<cube_specification>]
   [WHERE [<slicer_specification>]]
   [GROUP BY <groupby_specification>
      [HAVING <filter_specification>]
      [ORDER BY <orderby_specification>]];
<axis_specification> ::= <set> ON <axis_name>;
<axis_name> ::= COLUMNS | ROWS | PAGES | SECTIONS | CHAPTERS | AXIS (<index>)
   ;
<groupby_specification> ::= (<Dimension_name>[.<Level_name>], . . .);
<filter_specification> ::= <logical_expression>
   | {<Dimension_name>.<Member> [, . . .]};
<logical_expression> ::= <condition>
   | (<condition>)
   | <condition> AND <logical_expression>
   | <condition> OR <logical_expression>
   | NOT <logical_expression>
   ;
<condition> ::= COUNT (<groupby_specification>) <theta_op> <constant>;
<theta_op> ::= > | < | <= | >= | <> | =;
<orderby_specification> ::= <Dimension_name>.<Level> [<order>]
   | <Dimension_name>.<Member>.<Property> [<order>]
   ;
<order> ::= ASC | DESC;
The evaluation order for a query containing all of the constructs is as
follows:
1. Obtain the cube named by the <cube_specification> in the FROM
clause.
2. Use the <slicer_specification> in the WHERE clause to slice the cube and
eliminate unnecessary cells.
3. Use the <groupby_specification> in the GROUP BY clause
to group documents into a set of groups.
4. For each group obtained, use the <filter_specification> in the HAVING
clause to filter out unnecessary groups.
5. Then, use the <orderby_specification> in the ORDER BY clause to sort
documents contained in each group.
6. Finally, select the dimensions and place them on the right axes according to
the list of <axis_specification>.
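The six evaluation steps above can be sketched as a single Python function. This is a toy illustration under our own assumptions, not an implementation of an MD2X engine: the document records are invented, the slicer is restricted to a year, grouping is fixed to the month level within each cell, and the names DOCS and run_query are hypothetical.

```python
from collections import defaultdict

# Hypothetical document records: (doc_id, product, region, month, day).
DOCS = [
    ("Doc A", "TV", "North", "2003-Apr", 12),
    ("Doc B", "TV", "North", "2003-Apr", 3),
    ("Doc C", "TV", "North", "2003-May", 7),
    ("Doc D", "Printer", "South", "2003-Apr", 20),
    ("Doc E", "Printer", "South", "2002-Dec", 1),
]


def run_query(docs, columns, rows, year, min_count):
    """Evaluate a toy query in the order FROM, WHERE, GROUP BY, HAVING, ORDER BY, SELECT."""
    # 1. FROM: obtain the (toy) cube.
    cube = list(docs)
    # 2. WHERE: slice by year.
    sliced = [doc for doc in cube if doc[3].startswith(year)]
    # 3. GROUP BY: group the documents of each cell by month.
    groups = defaultdict(list)
    for doc in sliced:
        groups[(doc[1], doc[2], doc[3])].append(doc)
    # 4. HAVING: keep groups with at least min_count documents.
    kept = {key: members for key, members in groups.items() if len(members) >= min_count}
    # 5. ORDER BY: sort the documents of each group by day.
    ordered = {key: sorted(members, key=lambda doc: doc[4]) for key, members in kept.items()}
    # 6. SELECT: place the surviving groups on the COLUMNS and ROWS axes.
    result = defaultdict(dict)
    for (product, region, month), members in ordered.items():
        if product in columns and region in rows:
            result[(region, product)][month] = [doc[0] for doc in members]
    return dict(result)
```

With the toy data, run_query(DOCS, ["TV", "Printer"], ["North", "South"], "2003", 2) keeps only the April 2003 group of the (North, TV) cell, ordered by day.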
6. Conclusion and future directions
6.1. Conclusion
While data warehouses and the numeric-centric business intelligence technologies
have served most enterprises well, they do not address
the complete scope of business intelligence. In this paper, we advocate the
importance of constructing document warehouses to support text-centric business
intelligence, and propose a multi-dimensional query language for document
warehousing. When documents are warehoused, users can use MD2X to perform ad hoc on-line analytical processing (OLAP) over text in a document
warehouse, just as they can perform OLAP over summarized
data in a data warehouse.
The applications of document warehousing are versatile. In business, document
warehousing can help administrators organize meeting reports, gazettes,
or even customer complaint e-mails, where the company personnel,
products, and time may be regarded as the dimensions, such that documents
related to particular employees or products at a particular time or place can be retrieved or browsed instantly. In recent years, most data
warehouse applications have been applied in Customer Relationship Management
(CRM), a promising trend in business affairs. However, a data warehouse
supports only the numeric analysis of customer behaviors. To find out
why customers buy (or do not buy) some products, a document
warehouse needs to be established. By data warehousing, users can clearly understand
business phenomena regarding who, what, when, where, and which.
Nevertheless, to discover why the phenomena occur, a document warehouse should be employed [33].
When documents are warehoused, the task of version control becomes
much easier, since users can directly trace the documents based on some criteria
along the time dimension. Besides, document clustering can be achieved
directly via visualizations. Users can also develop some document summarization
Table 8
A comparison between document warehousing and data warehousing

Similarities
(1) Both have the same construction process. We may employ star schema or snowflake [25] to design the modeling process.
(2) Both gather business document/data from heterogeneous resources.
(3) Users can do on-line analytical processing over the established result.

Differences
(1) Document warehousing: Intend to obtain text-oriented business intelligence.
    Data warehousing: Intend to obtain numeric-oriented business intelligence.
(2) Document warehousing: Resources gathered from market survey reports, project status reports, meeting records, customer complaints, e-mails, patent application sheets, and advertisements of competitors.
    Data warehousing: Resources gathered from internal databases of POS (point-of-sale) systems, ERP (enterprise resource planning) systems, accounting systems, or financial management systems.
(3) Document warehousing: It filters out unnecessary documents and intends to help users to address problems regarding why.
    Data warehousing: It aggregates numerical data according to various dimensions, and intends to help users to address problems regarding who, what, when, where, and which.
(4) Document warehousing: Enriched with text mining techniques to summarize documents or categorize documents.
    Data warehousing: Enriched with data mining techniques to summarize, classify, cluster formatted data or find the associations.
(5) Document warehousing: Document sources should be integrated in file systems, or native XML databases [5,6].
    Data warehousing: Data sources can be integrated in relational databases.
tools to summarize a cluster of related documents. To sum up, data warehousing
and document warehousing are not only key infrastructure for
knowledge management, but also the kernel of customer relationship
management.
In summary, document warehousing and data warehousing are used for
organizing documents and formatted data, respectively, on a multi-dimensional basis. We compare their similarities and differences in Table 8.
6.2. Future works
In our future work, we will propose an architecture for document warehousing.
The preliminary components may include the following modules:
1. Employ XML Schema [21] to define document metadata. We advocate using the Extensible Markup Language (XML) as the intermediate medium for
document interchange.
2. Incorporate automatic text summarization [11–13,26], key feature extraction
[9], or even document classification and categorization [2] techniques for document
warehousing. Develop related text summarization techniques to
extract the most important 10–20% of the content so that users can digest the
documents more easily, and propose how to bind a document summary with
its corresponding documents for document warehousing.
3. Automatic document metadata decomposition and the mechanisms for storing
the obtained metadata into native XML or XML-enabled databases [4–6].
This helps users to manage document warehouses more efficiently.
Besides, although the dimension concepts defined in this paper are organized
into hierarchical structures, it is assumed that when scanning a
document, the system will ignore the hierarchical relationships among keywords
in the document. Based on this work, we wish to incorporate some natural language processing technologies to enhance the linguistic analysis and
annotation results of document parsing, and elaborate the work of adopting
domain-specific ontologies [7,10,29,32] with more refined concepts to be built
into the corresponding dimensions of a document cube. Ontological analysis
can help to clarify the structure of knowledge regarding a set of related documents.
Given a set of related documents corresponding to a specific domain,
the ontology forms the semantic heart of any system of knowledge representation,
and their document cube forms the syntactic centroid of any system of concept organization.
Finally, since the construction of a document warehouse has to scan a large
number of documents, which is a time-consuming task, a parallel
architecture for this process will be further investigated in the future.
References
[1] S. Anahory, D. Murray, Data Warehousing in the Real World: A Practical Guide for Building
Decision Support Systems, Addison-Wesley Longman, Harlow, England, 1997.
[2] A. Appiani, F. Cesarini, A. Colla, M. Diligenti, M. Gori, S. Marinai, G. Soda, Automatic
document classification and indexing in high-volume applications, International Journal on
Document Analysis and Recognition 4 (2) (2002) 69–83.
[3] M.J.A. Berry, G. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer
Support, John Wiley & Sons, New York, 1997.
[4] E. Bertino, B. Catania, Integrating XML and databases, IEEE Internet Computing 5 (4)
(2001) 84–88.
[5] E. Bertino, E. Ferrari, XML and database integration, IEEE Internet Computing 5 (6) (2001)
75–76.
[6] M. Champion, Native XML vs. XML-Enabled: the difference makes a difference, Software
AG: The XML Company, http://www.softwareag.com/xml/library/champion_nativexml.htm.
[7] B. Chandrasekaran, J.R. Josephson, V.R. Benjamins, What are ontologies, and why do we
need them?, IEEE Intelligent Systems 14 (1) (1999) 20–26.
[8] Dublin Core Metadata Initiative, http://dublincore.org/.
[9] F.F. Feng, W.B. Croft, Probabilistic techniques for phrase extraction, Information Processing
and Management 37 (2) (2001) 199–220.
[10] N. Fridman, C.D. Hafner, The state of the art in ontology design, AI Magazine 18 (3) (1997)
53–74.
[11] J. Goldstein, M. Kantrowitz, V. Mittal, J. Carbonell, Summarizing text documents: sentence
selection and evaluation metrics, in: Proceedings of SIGIR, 1999, pp. 121–128.
[12] R. Hackathorn, Data warehousing energizes your enterprise, Datamation 1 (February) (1995)
38–42.
[13] U. Hahn, I. Mani, The challenges of automatic summarization, IEEE Computer 33 (11) (2000)
29–36.
[14] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers,
2001.
[15] http://www.applix.com.
[16] http://www.microsoft.com.
[17] http://www.microstrategy.com.
[18] http://www.sap.com.
[19] http://www.sas.com.
[20] Development snapshot: warehouse data of the future, application development trends,
February 2000, http://www.survey.com.
[21] http://www.w3.org/xml/schema.
[22] http://www.whitelight.com.
[23] IBM Corporation, Intelligent Miner for Text: Text Analysis Tools version 2.10.0, http://www-3.ibm.com/software/data/iminer/fortext/.
[24] W.H. Inmon, Building the Data Warehouse, John Wiley & Sons, New York, NY, 1993.
[25] R. Kimball, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional
Data Warehouses, John Wiley & Sons, 1996.
[26] K. Knight, Mining online text, Communications of the ACM 42 (11) (1999).
[27] S.-H. Lin, C.-S. Shih, M.C. Chen, J.-M. Ho, M.-T. Ko, Y.-M. Huang, Extracting classification
knowledge of internet documents with mining term associations: a semantic approach, in:
Proceedings of SIGIR, 1998.
[28] S. Loh, L.K. Wives, J.P. de Oliverira, Concept-based knowledge discovery in texts extracted
from the web, SIGKDD Explorations 2 (1) (2000).
[29] G.A. Miller, Wordnet: an online lexical database, International Journal of Lexicography 3 (4)
(1990) 235–312.
[30] Oracle Corporation, InterMedia Text 8.1.6, http://otn.oracle.com/products/text/x/Tech_Overviews/imt_817.html.
[31] G. Spofford, MDX Solutions—With Microsoft SQL Server Analysis Services, John Wiley &
Sons, 2001.
[32] V. Sugumaran, V.C. Storey, Ontologies for conceptual modeling: their creation, use, and
management, Data and Knowledge Engineering 42 (2002) 251–271.
[33] D. Sullivan, Document Warehousing and Text Mining: Techniques for Improving Business
Operations, Marketing and Sales, John Wiley & Sons, 2001.
[34] A.-H. Tan, Text Mining: the state of the art and the challenges, in: Proceedings of PAKDD
99—Workshop on Knowledge Discovery from Advanced Databases, Beijing, 1999, pp. 50–70.
[35] F.S.C. Tseng, W.P. Lin, A study on indexing structure and its properties for constructing
document warehouses, in: Proceedings of The 20th Workshop on Combinatorial Mathematics
and Computation Theory, Chiayi, Taiwan, August 2003, pp. 18–27.