
Information Sciences 174 (2005) 55–79


Design of a multi-dimensional query expression for document warehouses

Frank S.C. Tseng

Department of Information Management, National Kaohsiung First University

of Science and Technology, Kaohsiung 811, Taiwan, ROC

Received 16 November 2003; received in revised form 25 August 2004; accepted 26 August 2004

Abstract

During the past decade, data warehousing has been widely adopted in the business community. It provides multi-dimensional analyses of accumulated historical business data to support contemporary administrative decision-making. However, most current data warehousing query languages provide on-line analytical processing (OLAP) for numeric data only. For example, MDX (Multi-Dimensional eXpressions) has been proposed as a query language for describing multi-dimensional queries over databases with OLAP capabilities. Nevertheless, it is believed that only about 20% of information can be extracted from data warehouses that hold numeric data only; the other 80% is hidden in non-numeric data or even in documents. Therefore, many researchers now advocate research on document warehousing to capture complete business intelligence. Document warehouses, unlike traditional document management systems, include extensive semantic information about documents, cross-document feature relations, and document grouping or clustering to provide more accurate and more efficient access to text-oriented business intelligence. In this paper, we extend the structure of MDX into a new one containing

0020-0255/$ - see front matter © 2004 Elsevier Inc. All rights reserved.

doi:10.1016/j.ins.2004.08.010

This research was partially supported by the National Science Council, Taiwan, ROC, under Contract No. NSC-91-2416-H-327-005.

E-mail address: [email protected]


complete constructs for querying document warehouses. Traditional MDX contains only SELECT, FROM, and WHERE clauses, which are not rich enough for document warehousing. We show how to extend the language constructs with GROUP BY, HAVING, and ORDER BY clauses to design an SQL-like query language for document warehousing. This work is essential for establishing an infrastructure that helps combine text processing with numeric OLAP processing technologies. We hope that the combination of data warehousing and document warehousing will become one of the most important kernels of knowledge management and customer relationship management applications.

© 2004 Elsevier Inc. All rights reserved.

Keywords: Data warehousing; Document warehousing; Knowledge management; Multi-Dimensional eXpressions (MDX); OLAP

1. Introduction

Data warehousing [24] and data mining techniques [14] are gaining in popularity as organizations realize the benefits of performing multi-dimensional analyses of accumulated historical business data to support contemporary administrative decision-making [1,3,14,25,12]. However, much of this effort has touched only the tip of the information iceberg. It is believed [20] that, for the business intelligence of an enterprise, only about 20% of information can be extracted from formatted data stored in relational databases. The remaining 80% is hidden in unstructured or semi-structured documents. For instance, market survey reports, project status reports, meeting records, customer complaints, e-mails, patent application sheets, and advertisements of competitors are all recorded in documents. Moreover, documents on the Web, in enterprise repositories, and in public document management systems are all growing as well. Therefore, knowledge workers, managers, and executives still have to spend much of their working time reading dozens, if not hundreds, of various types of electronic documents spread over the Internet. There is just too much text to digest in daily life. The fast-growing and tremendous amount of documents has far exceeded the human ability to comprehend it without powerful tools. As a result, some relevant documents may be overlooked, while some irrelevant documents may be taken into account by intuition when important decisions are made. We believe that leaving out information derived from relevant documents, or retaining information guessed intuitively from irrelevant documents, may be detrimental, because a strategy woven from incomplete information can lead to disaster.

To alleviate this problem, Sullivan [33] has advocated that documents should be properly warehoused according to well-defined concepts, expanding the scope of business intelligence to include textual information. Hence, one of the next challenges in the information community will be the study of document warehousing and text mining to help enterprises obtain complete business intelligence. Although research on text mining has been conducted widely (see, for example, [23,26–28,30,34]), issues regarding document warehousing are rarely addressed. We have proposed a multi-dimensional indexing structure, called D-tree [35], for constructing document warehouses. In this work, we focus on the design of a query language for querying document warehouses.

The importance of a good query language for document warehousing can be seen from the history of relational database systems. The standardized relational query language, SQL (Structured Query Language), is widely accepted and credited for the success of relational database systems in the past decades. Since the appearance of data warehousing, applications that provide users with multi-dimensional analysis of data have become widespread, and in a few years may even be considered pervasive. Although SQL can be used to handle basic analytical operations on top of relational databases, it is cumbersome and lacks important semantics for multi-dimensional analysis. Based on this observation, MDX (Multi-Dimensional eXpressions) [31] was proposed for describing multi-dimensional queries over databases or data warehouses with OLAP (On-Line Analytical Processing) capabilities. MDX has been widely adopted and is supported by Applix [15], Microsoft [16], Microstrategy [17], Whitelight [22], SAS [19], and SAP [18].

Since a document usually involves many diverse concepts, a document is also multi-dimensional in nature. Document warehouses, unlike traditional document management systems, include extensive semantic information about documents, cross-document feature relations, and document grouping or clustering to provide more accurate and more efficient access to text-oriented business intelligence. To facilitate flexible and effective multi-dimensional on-line analytical document processing and browsing, a multi-dimensional query language for querying document warehouses is indispensable. A good query language for document warehousing may help standardize the development of platforms for document warehousing and text mining systems.

In this paper, we extend the structure of MDX into a new one, called MD2X (Multi-Dimensional Document Expressions), containing complete SQL-like constructs for querying document warehouses. Traditional MDX contains only SELECT, FROM, and WHERE clauses. Although this is rich enough for data warehouse queries, it is not rich enough for document warehouse queries. We show how to extend the language constructs into MD2X with GROUP BY, HAVING, and ORDER BY clauses, adopting an SQL-like syntax that reflects the complete language structure of traditional SQL.

This work is essential for establishing an infrastructure that helps combine text processing with numeric OLAP processing technologies. We hope that the combination of data warehousing and document warehousing will become one of the most important kernels of knowledge management and customer relationship management applications.

As Web applications proliferate, there will be a great need for rapid text processing and browsing. Document warehousing not only provides an infrastructure for developing tools that help business executives systematically organize, understand, and properly categorize their documents for strategic decision-making, but also integrates all kinds of related documents so that they can be browsed instantly.

Document warehousing also provides an important platform for on-line analytical processing (OLAP) at the text level for the interactive analysis of multi-dimensional documents of various granularities. This facilitates effective text mining, integrates documents into the business intelligence infrastructure, and provides the means to search for and target specific information the way we now do with numeric data. Furthermore, just as the construction of data warehouses can be viewed as an important step for data mining, the construction of document warehouses can be regarded as an indispensable preprocessing step for text mining. No matter how good the mechanism a system adopts, it cannot do much without a good content organization of the domain on which it works. Moreover, once a good content organization is available, many different mechanisms might be employed equally well to implement effective systems. A well-organized document warehouse provides exactly such a content organization for various mechanisms to work on.

Our paper is organized as follows. In Section 2, the important concepts of document warehousing are formally presented. Based on these concepts, we extend MDX to the MD2X structures for querying document warehouses in Sections 3 and 4. Then, in Section 5, we put the necessary constructs together to form an SQL-like multi-dimensional query language and discuss the evaluation steps for document warehousing. Finally, we conclude and propose some future work in Section 6.

2. An introduction to document warehousing

In the following, we give definitions of dimensions, document tuples, and document cubes for document warehousing. Based on these definitions, MD2X will be formally defined in the following sections.


Definition 1. A dimension D is a tree structure of m levels, $m \ge 1$, which is used for representing the hierarchical relationships among a set of keywords. A node in a dimension D is called a member, and each internal node contains a special child called a summary member, denoted '*', which denotes the total concept of the other children of the internal node.

When drawing a dimension, we usually leave out the summary members, since each has the same meaning as its parent node. Besides, the keywords in a dimension are not limited to those contained in document contents. Any property or metadata of a document file (e.g., those defined in the Dublin Core Metadata Element Set [8]) can also be regarded as a keyword in a dimension for constructing document cubes. That is, according to the keyword sources, dimensions can be divided into the following two types:

1. Metadata dimension. A dimension containing keywords used for scanning document file properties or metadata. For example, the Dublin Core Metadata Element Set defines title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, and rights, all of which can be regarded as keywords in their corresponding metadata dimensions.

2. Ordinary dimension. A dimension containing keywords used for scanning the document contents.

To simplify our discussion, we mainly use ordinary dimensions, together

with the metadata dimension time (i.e., date), in the following examples.

Definition 2. For a dimension D, the ith-level member set, denoted $D(i)$, is defined as $D(i) = \{a \mid a$ is a member in the ith level of D, but a is not a summary member$\}$. Besides, we use $D(0)$ to denote the set of all non-summary members in D, which is the union of all ith-level member sets in D. That is, $D(0) = \bigcup_{1 \le i \le h} D(i)$, where h is the height of D. In practice, each $D(i)$ has a specific name, which will be called the ith-level name.

In practice, a dimension can be constructed from a relational table, with each level corresponding to an attribute in the relation; the attribute names are usually used as the corresponding level names. Besides, any keyword in a dimension can be implemented as a set of synonyms to encompass more semantics. To illustrate the above definitions, we give an example as follows.

Example 1. Suppose there is a relation Region representing the regions of Taiwan, as shown in Table 1. This relation can be used to construct a dimension, denoted R, as depicted in Fig. 1, where the first level corresponds to the dimension itself (commonly denoted '(All Region)'), and the second and third levels are derived from the attributes Location and City, respectively. All nodes labeled '*' are summary members. That is, the summary member in the second level has the same meaning as all regions in Taiwan, which represents {South, North}. Likewise, the summary members under South and North have the same meanings as South and North, which denote {Tainan, Kaohsiung, Pingtong} and {Taipei, Taoyun, Hsinchu}, respectively. Omitting all the summary members, Fig. 1 is redrawn as Fig. 2. From the illustration of dimension R, we know that R(1) = {(All Region)}, R(2) = {South, North}, R(3) = {Tainan, Kaohsiung, Pingtong, Taipei, Taoyun, Hsinchu}, and R(0) = {(All Region), South, North, Tainan, Kaohsiung, Pingtong, Taipei, Taoyun, Hsinchu}. Fig. 3 concisely depicts another dimension, denoted P, representing the products of a company manufacturing consumer electronics. Both dimensions will be used in the following examples.

Table 1
A relation Region for constructing dimension R

Region
...    Location    City         ...
...    North       Taipei       ...
...    North       Taoyun       ...
...    North       Hsinchu      ...
...    South       Tainan       ...
...    South       Kaohsiung    ...
...    South       Pingtong     ...

Fig. 1. An illustration of dimension R. (Tree: level 1, All Region; level 2, North, South, and a summary member '*'; level 3, Taipei, Taoyun, Hsinchu, and '*' under North, and Tainan, Kaohsiung, Pingtong, and '*' under South.)

Fig. 2. A concise illustration of dimension R. (The tree of Fig. 1 without summary members; the level names of levels 1, 2, and 3 are All Region, Location, and City.)

Fig. 3. A concise illustration of dimension P. (Tree: level 1, All Product; level 2, Appliance, Communication, and Computer; level 3, TV and Refrigerator under Appliance, Cellular Phone and Radio under Communication, and Monitor and Printer under Computer.)
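The level member sets of Definition 2 can be sketched in Python as follows. This is only a minimal illustration, not part of the original design: the child-to-parent dictionary `PARENT` and the helper names `level_of` and `member_sets` are assumptions introduced here, and summary members are simply left out of the tree.

```python
from collections import defaultdict

# Dimension R from Example 1, stored as child -> parent (summary members omitted).
PARENT = {
    "(All Region)": None,
    "North": "(All Region)",
    "South": "(All Region)",
    "Taipei": "North",
    "Taoyun": "North",
    "Hsinchu": "North",
    "Tainan": "South",
    "Kaohsiung": "South",
    "Pingtong": "South",
}


def level_of(member, parent=PARENT):
    """Return the 1-based level of a member (the root is level 1)."""
    level = 1
    while parent[member] is not None:
        member = parent[member]
        level += 1
    return level


def member_sets(parent=PARENT):
    """Return {i: D(i)} for every level i, plus key 0 for D(0), their union."""
    sets = defaultdict(set)
    for member in parent:
        sets[level_of(member, parent)].add(member)
    sets[0] = set(parent)
    return dict(sets)


R = member_sets()
print(sorted(R[2]))  # ['North', 'South']
print(len(R[0]))     # 9
```

Storing only the parent of each member keeps the sketch close to the relational view of Table 1, where each row names a city together with its location.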

For a dimension D, there are two basic operations called drill-down and

roll-up, which are formally defined as follows.

Definition 3. For a dimension D, expanding an internal node to obtain all of its children is called drill-down, and shrinking a set of children to obtain their common parent is called roll-up.

By rolling up and drilling down, users can browse a document cube from different perspectives and gain further insight into relationships among documents. This is further clarified by the following definitions.

Definition 4. For any two n-tuples of keywords $A = (a_1, a_2, \ldots, a_i, \ldots, a_n)$ and $B = (b_1, b_2, \ldots, b_i, \ldots, b_n)$ defined on n dimensions $(D_1, D_2, \ldots, D_i, \ldots, D_n)$, where $a_i, b_i \in D_i(0)$, we say that B is a member of drilling down A along dimension $D_i$, denoted $A \succ_i B$, if and only if there exists exactly one i, $1 \le i \le n$, such that $b_i$ is a child of $a_i$ in $D_i$, and $b_j = a_j$ for all $j \ne i$.
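Definitions 3 and 4 can be sketched in Python as follows. The dictionaries `R_PARENT` and `P_PARENT` and the function names are hypothetical and introduced only for this illustration; dimensions are indexed from 0 here, whereas the definitions count from 1.

```python
# Hypothetical child -> parent maps for dimensions R and P (summary members omitted).
R_PARENT = {
    "(All Region)": None,
    "North": "(All Region)", "South": "(All Region)",
    "Taipei": "North", "Taoyun": "North", "Hsinchu": "North",
    "Tainan": "South", "Kaohsiung": "South", "Pingtong": "South",
}
P_PARENT = {
    "(All Product)": None,
    "Appliance": "(All Product)", "Communication": "(All Product)",
    "Computer": "(All Product)",
    "TV": "Appliance", "Refrigerator": "Appliance",
    "Cellular Phone": "Communication", "Radio": "Communication",
    "Monitor": "Computer", "Printer": "Computer",
}
DIMENSIONS = (R_PARENT, P_PARENT)


def drill_down(member, parent):
    """Expand a member into the list of its children (empty for a leaf)."""
    return [child for child, p in parent.items() if p == member]


def roll_up(members, parent):
    """Shrink a set of sibling members to their common parent."""
    parents = {parent[m] for m in members}
    if len(parents) != 1:
        raise ValueError("members do not share a common parent")
    return parents.pop()


def drill_down_tuple(a, i, dimensions=DIMENSIONS):
    """Return every tuple B obtained by drilling down A along dimension i (0-based)."""
    return [a[:i] + (child,) + a[i + 1:] for child in drill_down(a[i], dimensions[i])]


print(drill_down_tuple(("South", "TV"), 0))
# [('Tainan', 'TV'), ('Kaohsiung', 'TV'), ('Pingtong', 'TV')]
print(roll_up({"Taipei", "Hsinchu"}, R_PARENT))  # North
```

Each resulting tuple differs from the original in exactly one position, which mirrors the condition that all other components stay equal.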

Definition 5. For a document T with unique identifier $id_T$, a document index of T defined on n dimensions $(D_1, D_2, \ldots, D_n)$ is denoted $x = (id_T, K_T)$, where $K_T = (K_1, K_2, \ldots, K_i, \ldots, K_n)$ is an n-tuple of keyword sets such that, for all $1 \le i \le n$, every keyword $t_{ij} \in K_i$ occurs in T and $t_{ij} \in D_i(0)$. In the following, the first and second components of a document index $x = (id_T, K_T)$ are denoted $x^1$ and $x^2$ (i.e., $x^1 = id_T$ and $x^2 = K_T$), respectively. When $|K_i| = 1$ for all i, the document index is also called a base document index, and each $K_i$ can be denoted by its only element for convenience. (That is, in such cases, $K_T = (\{t_1\}, \{t_2\}, \ldots, \{t_i\}, \ldots, \{t_n\})$ can be abbreviated as $K_T = (t_1, t_2, \ldots, t_i, \ldots, t_n)$.) If there is at least one $K_i$ such that $|K_i| > 1$, and the sizes of the other $K_j$'s all equal 1, then the document index is also called a composite document index. Finally, if there is some $K_i$ such that $|K_i| = 0$, then the document index is also called a degenerate document index.

To whom it may concern:
We have bought a TV from your Kaohsiung branch last weekend. However, we found the screen is severely unstable. Please give us the phone number of your service center. Thank you for your kindly help.
Sincerely,
Frank S.C. Tseng

Fig. 4. A complaint e-mail issued by a customer (A0001).

Example 2. Suppose there is a complaint e-mail T issued by a customer, as shown in Fig. 4. Then a base document index of T defined on the above two dimensions (R, P) can be obtained as x = (A0001, ({Kaohsiung}, {TV})), where A0001 is the unique identifier of T.
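The construction in Example 2 and the three kinds of document index in Definition 5 can be sketched as follows. The function names `build_index` and `classify` are hypothetical, and the keyword matching is a naive case-sensitive substring test assumed only for this sketch; a real system would need proper tokenization and synonym handling.

```python
# D(0) of the two dimensions used in Examples 1 and 2, written out as keyword sets.
R0 = {"(All Region)", "South", "North", "Tainan", "Kaohsiung", "Pingtong",
      "Taipei", "Taoyun", "Hsinchu"}
P0 = {"(All Product)", "Appliance", "Communication", "Computer", "TV",
      "Refrigerator", "Cellular Phone", "Radio", "Monitor", "Printer"}

# The first two sentences of the e-mail body in Fig. 4.
EMAIL = ("We have bought a TV from your Kaohsiung branch last weekend. "
         "However, we found the screen is severely unstable.")


def build_index(doc_id, text, dimensions):
    """Return (id, K) where K[i] holds the keywords of dimension i found in text."""
    keyword_sets = tuple({kw for kw in dim if kw in text} for dim in dimensions)
    return (doc_id, keyword_sets)


def classify(index):
    """Classify a document index as 'base', 'composite', or 'degenerate'."""
    sizes = [len(k) for k in index[1]]
    if any(size == 0 for size in sizes):
        return "degenerate"
    if all(size == 1 for size in sizes):
        return "base"
    return "composite"


index = build_index("A0001", EMAIL, (R0, P0))
print(index)            # ('A0001', ({'Kaohsiung'}, {'TV'}))
print(classify(index))  # base
```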

The basic component of a document cube is called a cell, which is defined as

follows.

Definition 6. A cell defined on n dimensions $(D_1, D_2, \ldots, D_n)$ is denoted $c = (t_c, X_c)$, where $t_c = (c_1, c_2, \ldots, c_i, \ldots, c_n)$, $c_i \in D_i(0) \cup \{*\}$, $1 \le i \le n$, and $X_c = \{x_1, x_2, \ldots, x_j, \ldots, x_m\}$ is a set of base document indices of the form $x_j = (id_{T_j}, (K_1, K_2, \ldots, K_n))$, where $id_{T_j}$ is the unique identifier of some document $T_j$ and $K_i \cap D_i(0) \ne \emptyset$, $1 \le i \le n$. The set of all such document unique identifiers $id_{T_j}$ involved in the cell $c = (t_c, X_c)$ is denoted $ID(c) = \{x_j^1 \mid \forall x_j \in X_c\}$. That is, a document with unique identifier in $ID(c)$ can be directly accessed from the cell c.

Definition 7. A cell $c = (t_c, X_c)$, where $t_c = (c_1, c_2, \ldots, c_i, \ldots, c_n)$, defined on n dimensions $(D_1, D_2, \ldots, D_n)$ is called an m-d cell, $0 \le m \le n$, if and only if exactly m of the $c_i$ are non-summary members (i.e., $c_i \ne *$). If $m = n$, then c is also called a base cell; otherwise, if $m < n$, then c is also called a non-base cell.
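As a small illustration of Definition 7, the following Python sketch counts the non-summary members of a cell coordinate tuple. The constant `SUMMARY` and the function names are assumptions made here for illustration.

```python
SUMMARY = "*"  # marker for a summary member


def cell_dimensionality(tc):
    """Return m, the number of non-summary members in the coordinate tuple tc."""
    return sum(1 for c in tc if c != SUMMARY)


def is_base_cell(tc):
    """A cell is a base cell when none of its coordinates is a summary member."""
    return cell_dimensionality(tc) == len(tc)


print(cell_dimensionality(("South", "*", "TV")))     # 2
print(is_base_cell(("Kaohsiung", "2003-Apr", "TV")))  # True
```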

Definition 8. An n-dimensional i-d cell $a = ((a_1, a_2, \ldots, a_n), X_a)$ is a parent of another n-dimensional j-d cell $b = ((b_1, b_2, \ldots, b_n), X_b)$ if and only if the following conditions hold:

1. $j = i + 1$.

2. There exists exactly one k such that $a_k$ is the parent of $b_k$ in $D_k$, and $a_l = b_l$ for all $l \ne k$, $1 \le l \le n$.

3. $ID(b) \subseteq ID(a)$, where $ID(a)$ and $ID(b)$ are the sets of all document unique identifiers involved in the cells a and b, respectively.

Definition 9. A document cube $DC = (S, (D_1, D_2, \ldots, D_n))$, where S is a set of documents defined on n dimensions $(D_1, D_2, \ldots, D_n)$, is a cube composed of all cells $c_i = (t_{c_i}, X_{c_i})$ with $t_{c_i} \in D_1(0) \times D_2(0) \times \cdots \times D_n(0)$ and $ID(c_i) \subseteq S$.

A sample illustration of a document cube DC = (S, (R, P, T)) is shown in Fig. 5, where R and P represent the aforementioned dimensions Region and Product, respectively, T is assumed to be a dimension representing time, and T1, T2, and T3 are documents in S.

Fig. 5. A sample illustration of a document cube. (Axes: region, product, and time; a non-base cell a and a base cell d are marked, with ID(a) and ID(d) pointing to documents of S, drawn as T1, T2, and T3.)

3. Basic concepts in multi-dimensional expressions

Designing a comprehensive query language for document warehousing is challenging because document warehousing covers a wide spectrum of concepts, as shown in Section 2. Fortunately, Multi-Dimensional eXpressions (MDX) have already been established for data warehousing. Based on the constructs of MDX, we extend the language to include more of the features found in traditional SQL. Before introducing MDX, we first present the concepts of members, tuples, and sets, as well as the MDX syntax used to construct and refer to these elements. All items enclosed in angle brackets represent non-terminals.

3.1. Members

A member is an item in a dimension representing one or more occurrences of keywords in documents. It may be associated with some member properties. When describing cell data in a cube, members are the lowest level of reference. A member can be regarded as one or more records in the underlying relation whose value in the corresponding column falls under a specific category. There are many ways to specify member names. The simplest is to write the name of the member in square brackets, like [South]. If the name of a member contains no space or number, the square brackets can be omitted. Besides, to resolve duplicate member names across dimensions, MDX allows us to qualify a member name with its dimension name and the ancestor members along the path from the dimension root to the member itself (this is called a fully qualified name), such as [Region].[South].[Kaohsiung].
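A fully qualified name can be assembled mechanically from the dimension name and the path of ancestors. The sketch below is only illustrative; the helper names `bracket` and `qualified_name` are assumptions, and the sketch always writes the brackets rather than omitting them where that would be allowed.

```python
def bracket(name):
    """Wrap a name in square brackets."""
    return "[" + name + "]"


def qualified_name(dimension, path):
    """Join the dimension name and the path from below the root to the member."""
    return ".".join(bracket(part) for part in [dimension, *path])


print(qualified_name("Region", ["South", "Kaohsiung"]))
# [Region].[South].[Kaohsiung]
```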

3.2. Tuples

A tuple is used to define a slice of objects from a document cube and can be regarded as a vector of members. It is an ordered collection of members, one from each of one or more dimensions, and is used to identify specific sections of multi-dimensional objects in a document cube. A tuple composed of one member from each dimension in a cube completely describes a cell value. The syntax for specifying a tuple in MDX is

(member-of-D1, member-of-D2, ..., member-of-Dn)

If a tuple contains only one member, the parentheses can be omitted.

3.3. Sets

A set in MDX is an ordered collection of zero, one or more tuples. It is most

commonly used to define axis and slicer dimensions in an MDX query, and as

such may have only a single tuple or may be, in certain cases, empty.


All members in the same one-dimensional set have to come from the same dimension (even though they can be from different levels). Therefore, {(North), (Kaohsiung)} (which can be abbreviated as {North, Kaohsiung}) is valid, but {(North), (Printer)} is invalid. Note also that sets have dimensionality, like tuples. Because a set is composed of tuples, the dimensionality of a set is expressed by the dimensionality of each tuple within it, and tuples within a set must therefore have the same dimensionality. That is, {(North, Printer), (Kaohsiung, Computer)} is valid, but {(North, Printer), (Computer, Kaohsiung)} is not valid, because the order of dimensions in the second tuple is reversed.
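The validity rule for sets can be checked mechanically: every tuple in a set must list its dimensions in the same order. The following Python sketch assumes a hypothetical lookup table `MEMBER_DIMENSION` that maps each member to the dimension it belongs to.

```python
# Hypothetical lookup from member name to its dimension (R = region, P = product).
MEMBER_DIMENSION = {
    "North": "R", "South": "R", "Kaohsiung": "R",
    "Printer": "P", "Computer": "P",
}


def dimensionality(tup, member_dimension=MEMBER_DIMENSION):
    """Return the ordered dimensions of a tuple, e.g. ('R', 'P')."""
    return tuple(member_dimension[m] for m in tup)


def is_valid_set(tuples, member_dimension=MEMBER_DIMENSION):
    """A set is valid when all of its tuples share one dimensionality."""
    signatures = {dimensionality(t, member_dimension) for t in tuples}
    return len(signatures) <= 1


print(is_valid_set([("North",), ("Kaohsiung",)]))                          # True
print(is_valid_set([("North",), ("Printer",)]))                            # False
print(is_valid_set([("North", "Printer"), ("Kaohsiung", "Computer")]))     # True
print(is_valid_set([("North", "Printer"), ("Computer", "Kaohsiung")]))     # False
```

An empty set passes the check, which agrees with the remark that a set may in certain cases be empty.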

3.4. Axes

In traditional relational databases, a relation is often described as a two-dimensional table. However, this is not exactly true. A relation of degree m should rather be regarded as an asymmetric table consisting of m one-dimensional data sets. Rows in a relation all share the same structure, which is defined by the columns, and the columns may be of different types and have different meanings.

In the multi-dimensional world, we can specify any number of dimensions to form the result of a query. (In practice there are limits, of course; usually up to 128 dimensions can be specified.) These dimensions in an MDX query are called axes. An axis is a collection of dimension members or, more generally, tuples. All axes in an MDX query are perfectly symmetric, so the result can be pivoted, rotating the data axes in view to provide an alternative presentation of the query result. Based on the concept of axes, we can perform slice and dice operations on a given cube: the former performs a selection on one dimension, and the latter defines a sub-cube by performing a selection on two or more dimensions.

There are many ways to define an axis in MDX. The simplest form presents all members of a certain dimension on an axis, using the following syntax:

<Dimension_name>.MEMBERS

Similarly, if we want to see all dimension members belonging to a certain level of a dimension, the syntax is

<Dimension_name>.<Level_name>.MEMBERS

Besides, we can also specify a list of members in curly braces as an axis definition, using the following syntax:

{member-of-D1, member-of-D2, ..., member-of-Dn}


Note that, in data warehouses, measures are defined for the values being aggregated in the data cube. In practice, a default measure is displayed when no measures are specified. Although measures are conceptually different from dimensions when a cube is defined, they are treated as a dimension with a flat structure. Therefore, all the measures can also be selected by the following syntax:

MEASURES.MEMBERS

In summary, Table 2 lists the analogies between relational and multi-dimensional terms when SQL and MDX queries are compared. Note that, in document warehouses, measures are regarded as a set of document pointers or as the member count of that set (when COUNT() is used to count the number of documents).

3.5. Basic constructs in MDX queries

To specify a dataset, an MDX query must contain the following

information:

1. The number of axes and the members from each dimension to include on

each axis of the query. This is addressed by the SELECT clause.

2. The name of the cube that sets the context of the query. The FROM clause

is used for such purpose.

3. The members from a slicer dimension on which data is sliced for members

from the axis dimensions. This is optional. If there are slicer dimensions,

then they can be specified explicitly by using the WHERE clause.

Therefore, a basic MDX query is structured by the following three clauses:

SELECT [<axis_specification> [, <axis_specification> ...]]
FROM [<cube_specification>]
[WHERE [<slicer_specification>]]

Table 2
A comparison between multi-dimensional and relational terms

Multi-dimensional term    Relational term
Cube                      Relation
Level                     Attribute (string or discrete numeric)
Dimension                 Some related attributes in a relation
Measure                   Attribute (discrete or continuous numeric)
Dimension member          The value in the specific row and column corresponding to a given dimension level
Axes                      Projected attributes

The simplest MDX query specifies a SELECT clause without axes. In that case the SELECT clause is empty, and the query selects just one cell, which contains the aggregation with all dimensions acting as slicer dimensions. For general query statements, we explain these clauses in the following subsections.

3.5.1. The SELECT clause

In MDX, the SELECT clause is used to specify a dataset containing a subset of multi-dimensional data. Axis dimensions determine the edges of a multi-dimensional result set, and the SELECT clause specifies them by assigning a set to each axis. Each <axis_specification> defines one axis dimension, and thus the number of axes in the dataset equals the number of <axis_specification> values in the SELECT clause. Each <axis_specification> can be broken down as follows:

<axis_specification> ::= <set> ON <axis_name>
<axis_name> ::= COLUMNS | ROWS | PAGES | SECTIONS | CHAPTERS | AXIS(<index>)

The <index> is the axis number. For the first five axes, AXIS(0), AXIS(1), AXIS(2), AXIS(3), and AXIS(4), the aliases COLUMNS, ROWS, PAGES, SECTIONS, and CHAPTERS can be used, respectively, as alternatives. It is invalid to skip axes. For example, a query cannot have an AXIS(2) without an AXIS(1) and an AXIS(0).
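The alias mapping and the rule against skipping axes can be sketched in Python as follows. The names `AXIS_ALIASES`, `axis_index`, and `axes_are_contiguous` are hypothetical and serve only this illustration; the sketch does not parse full MDX.

```python
# Aliases of AXIS(0) through AXIS(4), in order.
AXIS_ALIASES = ["COLUMNS", "ROWS", "PAGES", "SECTIONS", "CHAPTERS"]


def axis_index(name):
    """Translate an axis name such as 'ROWS' or 'AXIS(2)' into its axis number."""
    name = name.strip().upper()
    if name in AXIS_ALIASES:
        return AXIS_ALIASES.index(name)
    if name.startswith("AXIS(") and name.endswith(")"):
        return int(name[5:-1])
    raise ValueError("unknown axis name: " + name)


def axes_are_contiguous(names):
    """Return True when the axes used are exactly 0, 1, ..., k-1 with no gaps."""
    indices = sorted(axis_index(n) for n in names)
    return indices == list(range(len(indices)))


print(axes_are_contiguous(["COLUMNS", "ROWS"]))   # True
print(axes_are_contiguous(["COLUMNS", "PAGES"]))  # False, AXIS(1) is skipped
```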

3.5.2. The FROM clause

The FROM clause determines the cube context: the <cube_specification> indicates the cube on which the MDX query is to run. Unlike SQL, the FROM clause in an MDX query usually does not allow joins on two or more cubes. Some OLAP servers may permit the joining of cubes when the cubes share some dimensions. However, we do not discuss such cases, since they are beyond the scope of our work.

3.5.3. The WHERE clause

The slicer dimensions in a WHERE clause are used for filtering dimensional data. A slicer specification normally accepts only expressions that evaluate to a single tuple. If a set of tuples is supplied as the slicer expression, then the result cells are aggregated over every tuple in the set. Besides, if two or more measures have been defined, then, since measures are treated in MDX exactly like any other dimension, we may employ a WHERE clause to indicate a specific measure instead of the default measure.

4. Basic constructs of multi-dimensional document expressions

In Section 3.5, we briefly discussed the constructs of MDX used in data warehouse queries. Since the measures in a data cube are all aggregated into a single value by SUM(), MAX(), MIN(), AVG(), or COUNT(), there is no need to define GROUP BY, HAVING, and ORDER BY clauses in MDX. However, in document warehouses, document content is non-numerical and cannot be aggregated into measures (except by using COUNT() to count the number of documents). As defined in Definitions 8 and 9, for a document cell c, ID(c) contains a set of document pointers pointing to a set of documents. Just as the measures of a data cube can be viewed as the aggregated target of data warehousing, the union of all $ID(c_i)$ of a document cube can be regarded as the fetching target of document warehousing.

In this section, we further extend the syntax of MDX into MD2X for multi-dimensional queries in document warehouses. MD2X is similar in structure to the Structured Query Language (SQL) for easy integration with relational query processing. We define GROUP BY, HAVING, and ORDER BY clauses as traditional SQL does. That is, a basic MD2X query is structured by the following clauses:

SELECT [<axis_specification> [, <axis_specification> ...]]
FROM [<cube_specification>]
[WHERE [<slicer_specification>]]
[GROUP BY <groupby_specification>
    [HAVING <filter_specification>]
    [ORDER BY <orderby_specification>]]

Note that the SELECT and FROM clauses are mandatory, and the other clauses are optional. Besides, a HAVING clause must be preceded by a GROUP BY clause, as in traditional SQL queries. Moreover, an ORDER BY clause must also be preceded by a GROUP BY clause, which differs from traditional SQL.
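The clause rules above can be expressed as a small check. The following Python sketch is only an illustration under the assumption that a query has already been split into a set of clause names; the function `check_clauses` is hypothetical and does not parse MD2X text.

```python
def check_clauses(clauses):
    """Return a list of rule violations for the clause names present in a query."""
    errors = []
    for required in ("SELECT", "FROM"):
        if required not in clauses:
            errors.append(required + " is mandatory")
    for dependent in ("HAVING", "ORDER BY"):
        if dependent in clauses and "GROUP BY" not in clauses:
            errors.append(dependent + " requires GROUP BY")
    return errors


print(check_clauses({"SELECT", "FROM", "WHERE"}))     # []
print(check_clauses({"SELECT", "FROM", "ORDER BY"}))  # ['ORDER BY requires GROUP BY']
```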

We present a sample MD2X query consisting of only SELECT and FROM

clauses in Example 3a.


Example 3a. Based on the document cube illustrated in Fig. 5, we may issue an example MD2X query consisting of SELECT and FROM clauses only:

SELECT {P.[Appliance].[TV], P.[Appliance].[Refrigerator],
        P.[Communication].[Cellular Phone], P.[Communication].[Radio],
        P.[Computer].[Monitor], P.[Computer].[Printer]} ON COLUMNS,
       {R.[North], R.[South]} ON ROWS
FROM DC

Then, the query result may be visualized as Table 3 illustrates. If we add a WHERE clause to the above example, as presented in Example 3b, then documents that do not satisfy the <slicer_specification> are excluded from the query result.

Example 3b. Suppose there is a WHERE clause added in the query presented in Example 3a:

SELECT {P.[Appliance].[TV], P.[Appliance].[Refrigerator],
        P.[Communication].[Cellular Phone], P.[Communication].[Radio],
        P.[Computer].[Monitor], P.[Computer].[Printer]} ON COLUMNS,
       {R.[North], R.[South]} ON ROWS
FROM DC
WHERE Time.[2003].[Apr]

Then, only those documents created in April 2003 are included in the query result.
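The effect of Examples 3a and 3b can be imitated on a toy data set. The sketch below is an assumption-laden illustration, not the evaluation procedure of MD2X: the documents in `DOCS` are invented for this example (they are not taken from Table 3), each document carries exactly one city, one product, and one creation month, and the names `LOCATION` and `evaluate` are hypothetical.

```python
# Invented base document indices: (doc id, city, product, (year, month)).
DOCS = [
    ("D1", "Taipei", "TV", (2003, "Apr")),
    ("D2", "Kaohsiung", "TV", (2003, "Apr")),
    ("D3", "Tainan", "Printer", (2003, "May")),
    ("D4", "Hsinchu", "Printer", (2003, "Apr")),
]

# Roll-up from the City level to the Location level of dimension R.
LOCATION = {
    "Taipei": "North", "Taoyun": "North", "Hsinchu": "North",
    "Tainan": "South", "Kaohsiung": "South", "Pingtong": "South",
}


def evaluate(columns, rows, where=None):
    """Return {(row, column): [doc ids]}; `where` optionally fixes (year, month)."""
    result = {(r, c): [] for r in rows for c in columns}
    for doc_id, city, product, when in DOCS:
        if where is not None and when != where:
            continue  # the slicer excludes this document
        key = (LOCATION[city], product)
        if key in result:
            result[key].append(doc_id)
    return result


without_slicer = evaluate(["TV", "Printer"], ["North", "South"])
with_slicer = evaluate(["TV", "Printer"], ["North", "South"], where=(2003, "Apr"))
print(without_slicer[("South", "Printer")])  # ['D3']
print(with_slicer[("South", "Printer")])     # []
```

Each cell of the result holds a list of document pointers rather than a number, which is the main difference from a numeric OLAP result.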

Table 3

A sample query result on document cube DC

Appliance Communication Computer

TV Refrigerator Cellular Phone Radio Monitor Printer

North

Doc 024 Doc 001

Doc 021 Doc 008 Doc 017 Doc 012 Doc 016 Doc 010 Doc 020 Doc 002

Doc 007 Doc 006

Doc 018 Doc 023 Doc 022 Doc 019

Doc 001 Doc 010 Doc 023

Doc 002 Doc 006

South

Doc 011 Doc 003 Doc 015

Doc 004 Doc 005

Doc 013 Doc 009 Doc 014

Doc 013 Doc 014

Doc 009 Doc 015


4.1. The GROUP BY clause

The <groupby_specification> in the GROUP BY clause specifies how to group a set of document pointers according to a tuple consisting of specific levels in dimensions that do not occur in the SELECT clause. If there is no GROUP BY clause in an MD2X query, then the situation is the same as using GROUP BY with a <groupby_specification> consisting of all top levels in the dimensions that do not occur in the SELECT clause. Each <groupby_specification> can be broken down as follows:

<groupby_specification> ::= (<Dimension_name>[.<Level_name>], . . .);

In Example 3a, the set of document pointers in each cell is presented without grouping. We may add a GROUP BY clause as follows to group together document pointers pointing to documents in the same year-month.

Example 4. We extend the query in Example 3a with a GROUP BY clause as

follows.

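SELECT {P.[Appliance].[TV], P.[Appliance].[Refrigerator],
P.[Communication].[Cellular Phone], P.[Communication].[Radio],
P.[Computer].[Monitor], P.[Computer].[Printer]} ON COLUMNS,
{R.[North], R.[South]} ON ROWS
FROM DC
GROUP BY (Time.[2003].[Month])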

Then, the query result may be visualized as Table 4 depicts. The year-month pairs in boldface indicate that the document pointers in the same group point to documents created in the same year-month.
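The grouping performed inside a single cell can be sketched in Python as follows. This is an illustrative assumption rather than the paper's implementation: the cell contents, the ISO date strings, and the function `group_by_month` are hypothetical, and grouping is fixed to the year-month level.

```python
from typing import Dict, List, Tuple


def group_by_month(cell: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (document pointer, ISO creation date) pairs by year-month."""
    groups: Dict[str, List[str]] = {}
    for doc, created in cell:
        year_month = created[:7]  # "2003-04-12" -> "2003-04"
        groups.setdefault(year_month, []).append(doc)
    return groups


# Hypothetical contents of one cell of the document cube.
cell = [("Doc A", "2003-04-12"), ("Doc B", "2003-06-03"), ("Doc C", "2003-04-02")]
```

Here `group_by_month(cell)` puts Doc A and Doc C into the same group because both were created in the same year-month, while Doc B forms a group of its own.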

4.2. The HAVING clause

Just as the slicer dimensions in a WHERE clause are used for filtering out dimensional data, the <filter_specification> is used for eliminating groups that do not satisfy the specified condition. In traditional SQL, the <filter_specification> typically contains an aggregate function such as SUM(), MIN(), MAX(), AVG(), or COUNT(). In MD2X, if there is an aggregate function then it must be of the form COUNT(<groupby_specification>), where <groupby_specification> is the specification that appears in the preceding GROUP BY clause, since the other aggregate functions make no sense for document pointers. Besides, the <filter_specification> can also contain a specification consisting of a set of

Table 4

A sample query result with grouping on document cube DC



Doc 002 (2003-Mar)

Doc 001 (2003-Jun)

Doc 002 (2003-Aug)

Doc 024 (2003-Mar)

Doc 007 Doc 006 (2003-Jul)

Doc 018 Doc 019 (2003-May) Doc 010

(2003-Jul) Doc 010 Doc 012 (2003-Apr)

Doc 016 Doc 017 (2003-May)

North

Doc 001 (2003-Apr)

Doc 020 Doc 021 (2003-Jun)

Doc 023 Doc 022 (2003-Jul)

Doc 023 (2003-Aug)

Doc 006 (2003-Sep)

Doc 003 (2003-Mar)

Doc 009 (2003-Mar)

Doc 009 (2003-Mar) South

Doc 011 (2003-Mar)

Doc 015 (2003-May)

Doc 004 Doc 005 (2003-May) Doc 013

Doc 014 (2003-Sep)

Doc 013 Doc 014 (2003-Jun) Doc 015

(2003-Aug)


<Dimension_name>.<Member> elements separated by commas, such that each <Dimension_name>.<Member> is subsumed by the <groupby_specification> in the preceding GROUP BY clause. That is, the set of <Dimension_name>.<Member> elements is used to discard the groups that do not appear in the set. Each <filter_specification> can be broken down as follows:

<filter_specification> ::= <logical_expression>
    | {<Dimension_name>.<Member> [, . . .]};
<logical_expression> ::= <condition>
    | (<condition>)
    | <condition> AND <logical_expression>
    | <condition> OR <logical_expression>
    | NOT <logical_expression>
    ;
<condition> ::= COUNT (<groupby_specification>) <theta_op> <constant>;
<theta_op> ::= > | < | <= | >= | <> | =;

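Example 5. Continuing the previous example, suppose there is a HAVING clause as follows:

SELECT {P.[Appliance].[TV], P.[Appliance].[Refrigerator],
P.[Communication].[Cellular Phone], P.[Communication].[Radio],
P.[Computer].[Monitor], P.[Computer].[Printer]} ON COLUMNS,
{R.[North], R.[South]} ON ROWS
FROM DC
GROUP BY (Time.[2003].[Month])
HAVING COUNT (Time.[2003].[Month]) >= 2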

Then, the query result may be visualized as Table 5 describes. Groups with fewer than two elements are discarded.

Example 6. Alternatively, if there is a HAVING clause as follows:


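SELECT {P.[Appliance].[TV], P.[Appliance].[Refrigerator],
P.[Communication].[Cellular Phone], P.[Communication].[Radio],
P.[Computer].[Monitor], P.[Computer].[Printer]} ON COLUMNS,
{R.[North], R.[South]} ON ROWS
FROM DC
GROUP BY (Time.[2003].[Month])
HAVING {Time.[2003].[Apr], Time.[2003].[Jun], Time.[2003].[Jul]}

Then, the query result may be visualized as Table 6 describes, where only those documents created in April, June, and July 2003 are included.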
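The two forms of <filter_specification> can be sketched in Python as two small filters over already grouped document pointers. The sketch rests on assumptions made only for illustration: groups are held in a dictionary keyed by year-month, and the names `THETA_OPS`, `having_count`, and `having_members` are hypothetical. Only a single <condition> is handled; AND, OR, and NOT combinations are omitted.

```python
import operator
from typing import Callable, Dict, List, Set

# The six <theta_op> symbols of the grammar mapped to Python comparisons.
THETA_OPS: Dict[str, Callable[[int, int], bool]] = {
    ">": operator.gt, "<": operator.lt, "<=": operator.le,
    ">=": operator.ge, "<>": operator.ne, "=": operator.eq,
}

Groups = Dict[str, List[str]]


def having_count(groups: Groups, theta_op: str, constant: int) -> Groups:
    """Keep groups whose size satisfies COUNT(...) <theta_op> <constant>."""
    compare = THETA_OPS[theta_op]
    return {key: docs for key, docs in groups.items()
            if compare(len(docs), constant)}


def having_members(groups: Groups, members: Set[str]) -> Groups:
    """Keep only the groups whose key is one of the listed members."""
    return {key: docs for key, docs in groups.items() if key in members}


# Hypothetical groups of document pointers keyed by year-month.
groups = {"2003-Apr": ["Doc A", "Doc C"], "2003-Jun": ["Doc B"]}
```

With these definitions, `having_count(groups, ">=", 2)` corresponds to the condition of Example 5, and `having_members(groups, {"2003-Apr", "2003-Jun", "2003-Jul"})` corresponds to the member set of Example 6.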

Table 5

A sample query result with grouping and having on document cube DC



Doc 002 (2003-Mar)

Doc 001 (2003-Jun)

Doc 002 (2003-Aug)

Doc 024 (2003-Mar)

Doc 007 Doc 006 (2003-Jul)

Doc 018 Doc 019 (2003-May) Doc 010

(2003-Jul) Doc 010 Doc 012 (2003-Apr)

Doc 016 Doc 017 (2003-May)

North

Doc 001 (2003-Apr)

Doc 020 Doc 021 (2003-Jun)

Doc 023 Doc 022 (2003-Jul)

Doc 023 (2003-Aug)

Doc 006 (2003-Sep)

Doc 003 (2003-Mar)

Doc 009 (2003-Mar)


Doc 011 (2003-Mar)

Doc 015 (2003-May)

Doc 004 Doc 005 (2003-May) Doc 013

Doc 014 (2003-Sep)

Doc 013 Doc 014 (2003-Jun) Doc 015

(2003-Aug)

Table 6

Another query result with grouping and having on document cube DC



Doc 002 (2003-Mar)

Doc 001 (2003-Jun)

Doc 002 (2003-Aug)

Doc 024 (2003-Mar)

Doc 007 Doc 006 (2003-Jul)

Doc 018 Doc 019 (2003-May) Doc 010

(2003-Jul) Doc 010 Doc 012 (2003-Apr)

Doc 016 Doc 017 (2003-May)

North

Doc 001 (2003-Apr)

Doc 020 Doc 021 (2003-Jun)

Doc 023 Doc 022 (2003-Jul)

Doc 023 (2003-Aug)

Doc 006 (2003-Sep)

Doc 003 (2003-Mar)

Doc 009 (2003-Mar)


Doc 011 (2003-Mar)

Doc 015 (2003-May)

Doc 004 Doc 005 (2003-May) Doc 013

Doc 014 (2003-Sep)

Doc 013 Doc 014 (2003-Jun) Doc 015

(2003-Aug)


4.3. The ORDER BY clause

The <orderby_specification> in the ORDER BY clause specifies how to sort a set of document pointers according to some level of a given dimension. The set of document pointers should first be grouped according to the <groupby_specification> specified in the preceding GROUP BY clause. Besides, <orderby_specification> can also be any member property or dimension created from document metadata (e.g., those defined in the Dublin Core Metadata Element Set [8]), which can be used to order elements in each group. Each <orderby_specification> can be broken down as follows (if <order> is missing, then the default <order> is usually ASC):

<orderby_specification> ::= <Dimension_name>.<Level> [<order>]
    | <Dimension_name>.<Member>.<Property> [<order>]
    ;
<order> ::= ASC | DESC;

Example 7. Suppose we append an ORDER BY clause in the query specified

in Example 5 as follows:

Table 7

A sample query result with grouping, having, and ordering on DC



Doc 002 (2003-Mar)

Doc 001 (2003-Jun)

Doc 002 (2003-Aug)

Doc 024 (2003-Mar)

Doc 006 Doc 007 (2003-Jul)

Doc 018 Doc 019 (2003-May) Doc 010

(2003-Jul) Doc 010 Doc 012 (2003-Apr)

Doc 016 Doc 017 (2003-May)

North

Doc 001 (2003-Apr)

Doc 020 Doc 021 (2003-Jun)

Doc 022 Doc 023 (2003-Jul)

Doc 023 (2003-Aug)

Doc 006 (2003-Sep)

Doc 003 (2003-Mar)

Doc 009 (2003-Mar)


Doc 011 (2003-Mar)

Doc 015 (2003-May)

Doc 004 Doc 005 (2003-May) Doc 013

Doc 014 (2003-Sep)

Doc 013 Doc 014 (2003-Jun) Doc 015

(2003-Aug)





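SELECT {P.[Appliance].[TV], P.[Appliance].[Refrigerator],
P.[Communication].[Cellular Phone], P.[Communication].[Radio],
P.[Computer].[Monitor], P.[Computer].[Printer]} ON COLUMNS,
{R.[North], R.[South]} ON ROWS
FROM DC
GROUP BY (Time.[2003].[Month])
HAVING COUNT (Time.[2003].[Month]) >= 2
ORDER BY Time.[2003].[Day]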

Then, the query result may be visualized as Table 7 describes. Elements in

each group will be ordered by the date.
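Ordering within groups can be sketched in Python as a per-group sort. The sketch assumes, purely for illustration, that each pointer carries an ISO creation date (so that string order equals chronological order); the names `Pointer` and `order_groups` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Pointer = Tuple[str, str]  # (document pointer, ISO creation date)


def order_groups(groups: Dict[str, List[Pointer]],
                 descending: bool = False) -> Dict[str, List[Pointer]]:
    """Sort the pointers inside each group by creation date (ASC by default)."""
    return {key: sorted(docs, key=lambda pointer: pointer[1], reverse=descending)
            for key, docs in groups.items()}


# Hypothetical grouped pointers; the dates are invented for illustration.
groups = {"2003-07": [("Doc B", "2003-07-21"), ("Doc A", "2003-07-03")]}
```

Note that the sort is applied inside each group only; the groups themselves are left as produced by the GROUP BY and HAVING steps.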

5. Putting it all together

We have discussed the necessary constructs, as included in traditional SQL, in Sections 3 and 4. Here we put these constructs together to see how an MD2X query is evaluated. The complete syntax is listed as follows. The grammar leaves the non-terminal symbols <cube_specification> and <slicer_specification> undefined: the former is the name of a single document cube, and the latter is any valid tuple of the form (member-of-D1, member-of-D2, . . ., member-of-Dn).


<query> ::= SELECT [<axis_specification> [, <axis_specification> . . .]]
    FROM [<cube_specification>]
    [WHERE [<slicer_specification>]]
    [GROUP BY <groupby_specification>
    [HAVING <filter_specification>]
    [ORDER BY <orderby_specification>]];
<axis_specification> ::= <set> ON <axis_name>;
<axis_name> ::= COLUMNS | ROWS | PAGES | SECTIONS | CHAPTERS | AXIS (<index>);
<groupby_specification> ::= (<Dimension_name>[.<Level_name>], . . .);
<filter_specification> ::= <logical_expression>
    | {<Dimension_name>.<Member> [, . . .]};
<logical_expression> ::= <condition>
    | (<condition>)
    | <condition> AND <logical_expression>
    | <condition> OR <logical_expression>
    | NOT <logical_expression>
    ;
<condition> ::= COUNT (<groupby_specification>) <theta_op> <constant>;
<theta_op> ::= > | < | <= | >= | <> | =;
<orderby_specification> ::= <Dimension_name>.<Level> [<order>]
    | <Dimension_name>.<Member>.<Property> [<order>]
    ;
<order> ::= ASC | DESC;

The evaluation order for a query containing all of the constructs is as

follows:

1. According to the FROM clause, the specified cube is obtained by <cube_specification>.
2. Use the <slicer_specification> in the WHERE clause to slice the cube and eliminate unnecessary cells.
3. Use the <groupby_specification> in the GROUP BY clause to group documents into a set of groups.
4. For each group obtained, use the <filter_specification> in the HAVING clause to filter out unnecessary groups.
5. Use the <orderby_specification> in the ORDER BY clause to sort the documents contained in each group.
6. Finally, select the dimensions and place them on the right axes according to the list of <axis_specification>.
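The six evaluation steps can be sketched as a single Python function whose statements follow the same order. This is a minimal sketch under strong simplifying assumptions, not the evaluation engine of MD2X: the cube is a hypothetical flat list of entries, the slicer is reduced to a year prefix, grouping is fixed to year-month, the filter is reduced to COUNT >= n, and ordering is by creation date. All names (`CUBES`, `evaluate`, the document names) are invented for illustration.

```python
from collections import defaultdict

# Hypothetical cube: (document, product, region, ISO creation date).
CUBES = {
    "DC": [
        ("Doc A", "TV", "North", "2003-04-12"),
        ("Doc B", "TV", "North", "2003-04-02"),
        ("Doc C", "TV", "North", "2003-06-03"),
        ("Doc D", "Radio", "South", "2003-04-20"),
        ("Doc E", "TV", "North", "2002-04-01"),
    ]
}


def evaluate(cube_name, columns, rows, year=None, min_count=None, order=False):
    # Step 1 (FROM): obtain the named cube.
    facts = CUBES[cube_name]
    # Step 2 (WHERE): slice away entries outside the requested year.
    if year is not None:
        facts = [f for f in facts if f[3].startswith(year)]
    # Step 3 (GROUP BY): group the pointers of each cell by year-month.
    groups = defaultdict(list)
    for doc, product, region, created in facts:
        groups[(region, product, created[:7])].append((doc, created))
    # Step 4 (HAVING): drop groups with fewer than min_count elements.
    if min_count is not None:
        groups = {k: v for k, v in groups.items() if len(v) >= min_count}
    # Step 5 (ORDER BY): sort the pointers in each group by creation date.
    if order:
        groups = {k: sorted(v, key=lambda p: p[1]) for k, v in groups.items()}
    # Step 6 (SELECT): place the surviving groups on the requested axes.
    result = defaultdict(dict)
    for (region, product, month), docs in groups.items():
        if region in rows and product in columns:
            result[(region, product)][month] = [doc for doc, _ in docs]
    return dict(result)
```

The point of the sketch is the ordering of the steps: the axes named in the SELECT clause are applied last, after slicing, grouping, filtering, and sorting have already reduced the set of document pointers.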

6. Conclusion and future directions

6.1. Conclusion

While data warehouses and numeric-centric business intelligence technologies have served most enterprises well, they do not address the complete scope of business intelligence. In this paper, we advocate the importance of constructing document warehouses to support text-centric business intelligence, and propose a multi-dimensional query language for document warehousing. When documents are warehoused, users can use MD2X to perform ad hoc on-line analytical processing (OLAP) over text in a document warehouse, just as they can perform OLAP over summarized data in a data warehouse.

The applications of document warehousing are versatile. In business, document warehousing can help administrators organize meeting reports, gazettes, or even customer complaint e-mails, where the company personnel, products, and time may be regarded as the dimensions, such that documents related to particular employees or products, at a particular time and place, can be retrieved or browsed instantly. In recent years, most data warehouse applications have been in Customer Relationship Management (CRM), a promising trend in business affairs. However, a data warehouse only supports numeric analyses of customer behaviors. To learn why customers buy (or do not buy) certain products, a document warehouse needs to be established. With data warehousing, users can clearly understand business phenomena regarding who, what, when, where, and which. Nevertheless, to discover why the phenomena occur, a document warehouse should be employed [33].

When documents are warehoused, the task of version control becomes very easy, since users can directly trace the documents based on some criteria along the time dimension. Besides, document clustering can be achieved directly via visualizations. Users can also develop some document summarization

Table 8
A comparison between document warehousing and data warehousing

Similarities
(1) Both have the same construction process. We may employ star schema or snowflake [25] to design the modeling process.
(2) Both gather business document/data from heterogeneous resources.
(3) Users can do on-line analytical processing over the established result.

Differences
(1) Document warehousing: Intend to obtain text-oriented business intelligence.
    Data warehousing: Intend to obtain numeric-oriented business intelligence.
(2) Document warehousing: Resources gathered from market survey reports, project status reports, meeting records, customer complaints, e-mails, patent application sheets, and advertisements of competitors.
    Data warehousing: Resources gathered from internal databases of POS (point-of-sale) systems, ERP (enterprise resource planning) systems, accounting systems, or financial management systems.
(3) Document warehousing: It filters out unnecessary documents and intends to help users to address problems regarding why.
    Data warehousing: It aggregates numerical data according to various dimensions, and intends to help users to address problems regarding who, what, when, where, and which.
(4) Document warehousing: Enriched with text mining techniques to summarize documents or categorize documents.
    Data warehousing: Enriched with data mining techniques to summarize, classify, cluster formatted data or find the associations.
(5) Document warehousing: Document sources should be integrated in file systems, or native XML databases [5,6].
    Data warehousing: Data sources can be integrated in relational databases.


tools to summarize a cluster of related documents. To sum up, data warehousing and document warehousing are not only among the most important infrastructures of knowledge management, but also the kernel of customer relationship management.

In summary, document warehousing and data warehousing are used for organizing documents and formatted data, respectively, on a multi-dimensional basis. We compare their similarities and differences in Table 8.

6.2. Future works

In our future work, we will propose an architecture for document warehousing. The preliminary components may include the following modules:

1. Employ XML Schema [21] to define document metadata. We advocate using the Extensible Markup Language (XML) as the intermediate medium for document interchange.
2. Incorporate automatic text summarization [11–13,26], key feature extraction [9], or even document classification and categorization [2] techniques for document warehousing. Develop related text summarization techniques to extract the most important 10–20% of the content so that users can digest the documents more easily, and propose how to bind a document summary with its corresponding documents for document warehousing.
3. Automatic document metadata decomposition and the mechanisms for storing the obtained metadata into native XML or XML-enabled databases [4–6]. This helps users to manage document warehouses more efficiently.

Besides, although the dimension concepts defined in this paper are organized into hierarchical structures, it is assumed that when scanning a document, the system will ignore the hierarchical relationships among keywords in the document. Based on this work, we wish to incorporate natural language processing technologies to enhance the linguistic analysis and annotation results of document parsing, and to elaborate the work of adopting domain-specific ontologies [7,10,29,32] with more refined concepts to be built into the corresponding dimensions of a document cube. Ontological analysis can help to clarify the structure of knowledge regarding a set of related documents. Given a set of related documents corresponding to a specific domain, the ontology forms the semantic heart of any system of knowledge representation, and their document cube forms the syntactic centroid of any system of concept organization.

Finally, since constructing a document warehouse requires scanning a large number of documents, which is a time-consuming task, a parallel architecture for this process will be investigated in the future.

References

[1] S. Anahory, D. Murray, Data Warehousing in the Real World: A Practical Guide for Building

Decision Support Systems, Addison-Wesley Longman, Harlow, England, 1997.

[2] A. Appiani, F. Cesarini, A. Colla, M. Diligenti, M. Gori, S. Marinai, G. Soda, Automatic

document classification and indexing in high-volume applications, International Journal on

Document Analysis and Recognition 4 (2) (2002) 69–83.

[3] M.J.A. Berry, G. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer

Support, John Wiley & Sons, New York, 1997.

[4] E. Bertino, B. Catania, Integrating XML and databases, IEEE Internet Computing 5 (4)

(2001) 84–88.

[5] E. Bertino, E. Ferrari, XML and database integration, IEEE Internet Computing 5 (6) (2001)

75–76.

[6] M. Champion, Native XML vs. XML-Enabled: the difference makes a difference, Software

AG: The XML Company, http://www.softwareag.com/xml/library/champion_nativexml.htm.

[7] B. Chandrasekaran, J.R. Josephson, V.R. Benjamins, What are ontologies, and why do we

need them?, IEEE Intelligent Systems 14 (1) (1999) 20–26.



[8] Dublin Core Metadata Initiative, http://dublincore.org/.

[9] F.F. Feng, W.B. Croft, Probabilistic techniques for phrase extraction, Information Processing

and Management 37 (2) (2001) 199–220.

[10] N. Fridman, C.D. Hafner, The state of the art in ontology design, AI Magazine 18 (3) (1997)

53–74.

[11] J. Goldstein, M. Kantrowitz, V. Mittal, J. Carbonell, Summarizing text documents: sentence

selection and evaluation metrics, in: Proceedings of SIGIR, 1999, pp. 121–128.

[12] R. Hackathorn, Data warehousing energizes your enterprise, Datamation 1 (February) (1995)

38–42.

[13] U. Hahn, I. Mani, The challenges of automatic summarization, IEEE Computer 33 (11) (2000)

29–36.

[14] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers,

2001.

[15] http://www.applix.com.

[16] http://www.microsoft.com.

[17] http://www.microstrategy.com.

[18] http://www.sap.com.

[19] http://www.sas.com.

[20] Development snapshot: warehouse data of the future, application development trends,

February 2000, http://www.survey.com.

[21] http://www.w3.org/xml/schema.

[22] http://www.whitelight.com.

[23] IBM Corporation, Intelligent Miner for Text: Text Analysis Tools version 2.10.0, http://www-3.ibm.com/software/data/iminer/fortext/.

[24] W.H. Inmon, Building the Data Warehouse, John Wiley & Sons, New York, NY, 1993.

[25] R. Kimball, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional

Data Warehouses, John Wiley & Sons, 1996.

[26] K. Knight, Mining online text, Communications of the ACM 42 (11) (1999).

[27] S.-H. Lin, C.-S. Shih, M.C. Chen, J.-M. Ho, M.-T. Ko, Y.-M. Huang, Extracting classification

knowledge of internet documents with mining term associations: a semantic approach, in:

Proceedings of SIGIR, 1998.

[28] S. Loh, L.K. Wives, J.P. de Oliverira, Concept-based knowledge discovery in texts extracted

from the web, SIGKDD Explorations 2 (1) (2000).

[29] G.A. Miller, Wordnet: an online lexical database, International Journal of Lexicography 3 (4)

(1990) 235–312.

[30] Oracle Corporation, InterMedia Text 8.1.6, http://otn.oracle.com/products/text/x/Tech_Overviews/imt_817.html.

[31] G. Spofford, MDX Solutions—With Microsoft SQL Server Analysis Services, John Wiley &

Sons, 2001.

[32] V. Sugumaran, V.C. Storey, Ontologies for conceptual modeling: their creation, use, and

management, Data and Knowledge Engineering 42 (2002) 251–271.

[33] D. Sullivan, Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing and Sales, John Wiley & Sons, 2001.

[34] A.-H. Tan, Text Mining: the state of the art and the challenges, in: Proceedings of PAKDD

99—Workshop on Knowledge Discovery from Advanced Databases, Beijing, 1999, pp. 50–70.

[35] F.S.C. Tseng, W.P. Lin, A study on indexing structure and its properties for constructing

document warehouses, in: Proceedings of The 20th Workshop on Combinatorial Mathematics

and Computation Theory, Chiayi, Taiwan, August 2003, pp. 18–27.
