report.doc



Table of Contents

1 INTRODUCTION
2 HISTORY
  2.1 Early systems
3 NATURAL LANGUAGE PARSING
  3.1 Rule-Based Syntactic Parsing
  3.2 Terminal Symbols
  3.3 Non-terminal symbols
  3.4 Production Rules
    3.4.1 Grammar
    3.4.2 Parse tree
      3.4.2.1 Top down
      3.4.2.2 Bottom up
  3.5 Probabilistic Parsing
    3.5.1 Disambiguation
    3.5.2 Training
      3.5.2.1 Treebank
      3.5.2.2 Incremental learning
  3.6 Semantic Parsing
    3.6.1 Semantic Data Models
    3.6.2 Case Based Reasoning
    3.6.3 Semantic Representation
    3.6.4 Actions of the Parser
4 NLIDB ARCHITECTURE
  4.1 Pattern-matching systems
  4.2 Parsing based systems
    4.2.1 Semantic grammar based parsing
    4.2.2 Translation
5 MARKET TEST
  5.1 Goals
  5.2 Tests
  5.3 Results
    5.3.1 Impressions
      5.3.1.1 Microsoft English Query
      5.3.1.2 Elfsoft
    5.3.2 Query results
6 FUTURE
  6.1 Language challenges
  6.2 Portability challenges
  6.3 Competing systems
  6.4 Possible avenues
    6.4.1 Adaptation techniques
    6.4.2 Speech-based techniques
    6.4.3 Learning algorithms
      6.4.3.1 User Dialogue
      6.4.3.2 Neural Networks
      6.4.3.3 Genetic Algorithms
7 CONCLUSIONS
8 BIBLIOGRAPHY
9 CONTRIBUTIONS


1 INTRODUCTION

The ability to use language to convey different thoughts and feelings differentiates human beings from animals. Natural Language Processing is the capability of a machine to understand the full context of human language about a particular topic, so that unstated assumptions and general knowledge can be understood. “Thus if the machine is able to achieve this, it has come close to the notion of artificial intelligence itself” [1].

One may find interacting with a foreigner who speaks no English intricate and frustrating; a translator has to come into the picture to allow the two to communicate. Companies have related this problem to extracting data from a database management system (DBMS) such as MS Access or Oracle. A person with no knowledge of Structured Query Language (SQL) may find himself or herself handicapped in communicating with the database. Therefore, companies like Microsoft and Elfsoft (English Language Frontend Software) have applied Natural Language Processing to develop products that let people interact with a database in simple English: the user simply enters queries in English to the natural language database interface. This kind of application is known as a Natural Language Interface to a DataBase (NLIDB).

The system works by combining syntactic knowledge with the knowledge it has been given about the relevant database [2]. It can therefore relate the natural language input to the structure, scope and contents of the database. The program translates the whole query into the standard query language to extract the relevant information from the database. These products have thus created a revolution in extracting information from databases: they remove the fuss of learning SQL, and the time otherwise spent learning a query language is saved.

This report will look at the performance of each database interface connected to a standard database; the Northwind database has been chosen as the default database to work on. Several companies offer such products in the market. Our group has found several of them, including English Query, Elfsoft, EasyAsk and NLBean (created by Mr Mark Watson). We asked these companies for permission to test their products for our research. We received positive responses from Elfsoft and NLBean, but had to settle for tests on Microsoft English Query

[1] Manas Tungare
[2] Manas Tungare


and Elfsoft only. We also contacted EasyAsk via email, but the company provided minimal assistance in our research.

In order to produce accurate conclusions on how each piece of software interprets input, we drew up over thirty questions with which to test the products. Each product is asked the same questions in the same order. The questions have been carefully planned to probe the strengths and weaknesses of each product.

These questions include:

- Listing specific columns and rows
- Counting
- Calculations
- Cross-referencing from more than one table
- Ordinal positions
- Follow-ups
- Conclusions
- Semantics
- Grammar mistakes
- Spelling mistakes
- Out-of-context questions

There are three components in a natural language dialog system: analysis, evaluation and generation [3]. The analysis component translates the query as entered by the user into a semantic representation expressed in the knowledge representation language. There may be several communication sessions between the natural language access system, the user interface system and the user in order to carry out the action and derive the result. The evaluation component allows information to be absorbed by the dialog system when queries have to be satisfied or the system needs to alert the user about major state changes. The generation component gathers the information that the user asked for in the query, and generates text, graphs, a query or any other response according to the situational context of the query [4].

The knowledge-based database assistant (KDA), as stated, is a practical development of an intelligent database front-end to assist novice users in retrieving the desired information from an unfamiliar database system [5]. This component exists in both Microsoft English Query and Elfsoft. This useful program thus directs the novice user to the relevant results, either by guiding him or her to enter an accurate query or by

[3] Dialog-Oriented Use of Natural Language
[4] Dialog-Oriented Use of Natural Language
[5] Manas Tungare


prompting the user when insufficient information has been entered to get the appropriate answer. This component is demonstrated for both programs later in this report.

In addition, “the KDA's responding functionality, which could change the user's knowledge state, is called query guidance” [6]. It can detect a user’s scope of knowledge about the relevant database by studying the query the user has entered. If it senses that the user has limited awareness of the database and could not retrieve the desired answer, the query guidance jumps into action: it provides similar queries that allow the user to gather the appropriate facts from the database, or presents the most relevant query based on the user’s perceived intention. Such a component lets the novice become familiar with the database quickly, learning its scope from the prompt messages and the queries generated by the KDA, without the expense of studying the massive databases stored in most organisations.

[6] Manas Tungare


2 HISTORY

As the use of databases for data storage spread during the 1970s, the user interface to these systems represented a burden for designers worldwide. At this point, both the relational database model and the SQL query language had yet to be widely adopted, which meant that the task of inserting and querying data was tedious and difficult.

It was therefore a logical step for programmers to attempt to develop more user-friendly and “human” interfaces to the databases. One of these approaches was the use of natural language processing, allowing the user to interrogate the stored information interactively.

2.1 Early systems

The most well-known historical natural language database interface systems are:

LUNAR, interfacing a database with information on rocks collected during American moon expeditions. It was originally published in 1972. When evaluated in 1977, it answered 78% of questions correctly. Based on syntactic parsing, it tended to build several parse trees for the same query, and was deemed inefficient [7] as well as too domain-specific and inflexible.

LADDER, the first semantic grammar-based system, interfacing a database with information on US Navy ships.

CHAT-80, probably the most famous example. It interfaced a database of world geography facts. The entire application (both the database and the user interface) was developed in Prolog. As the source code was freely distributed, it is still used and cited. An online version can be found at [8].

[7] Hafner, C. D. and Gooden, K., pp. 141-164
[8] ECL I Vertiefung: natürlichsprachliche Zugangssysteme: chat80


3 NATURAL LANGUAGE PARSING

3.1 Rule-Based Syntactic Parsing

Syntax describes the ways that words can fit together to form higher-level units such as phrases, clauses, and sentences. Syntactically driven parsing therefore builds the interpretation of larger groups of words out of the interpretations of their syntactic constituents: words or phrases. In a way this is the opposite of pattern matching, where the interpretation of the input is done as a whole.

Syntactic analyses are obtained by applying a grammar that determines which sentences are legal in the language being parsed. Syntactic parsing operates through the translation of the natural language query into a parse tree, which is then converted to a SQL query. There are a number of fundamental concepts in the theory of syntactic parsing.

3.2 Terminal Symbols

A terminal symbol is a basic building block of the language, i.e. a word or delimiter. Together, the set of terminal symbols forms the “dictionary of words” [9] recognised by the system, i.e. the range of vocabulary that it can read and interpret.

3.3 Non-terminal symbols

Non-terminal symbols are higher-level language terms describing concepts and connections in the syntax of the language. Examples of non-terminal symbols include sentence, noun phrase, verb phrase, noun, and verb.

3.4 Production Rules

As the query is analysed, a number of production rules fire to identify and classify the context of each word read. In analogy with a production system (such as the one used in PROLOG), a production rule in a context-free grammar [10] converts a left-hand non-terminal symbol to a sequence of symbols, which can be either terminal or non-terminal. Examples of production rules:

Sentence := Noun phrase Verb phrase
Verb phrase := Verb

These rules are also commonly referred to as rewrite rules.

3.4.1 Grammar

The combination of the set of terminal symbols, the set of non-terminal symbols, the production rules and an assigned start symbol (the

[9] Luger, G.F. and Stubblefield, W.A.
[10] This paper will be restricted to the treatment of context-free grammars and will not deal with the more complex set of syntaxes known as context-sensitive.


highest-level construct in the system, usually sentence) forms the grammar of the syntax. The role of the grammar is to define:

- what category each word belongs to;
- what expressions are legal and syntactically correct;
- how sentences are generated.
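To make the definition concrete, such a grammar can be written down directly as a data structure. The sketch below is a minimal illustration with an invented rule set; the symbol names and rules are assumptions for this example, not taken from the report:

```python
# A toy context-free grammar: each non-terminal maps to its possible
# right-hand sides; any symbol never appearing on a left-hand side is
# a terminal symbol (a word of the language).
GRAMMAR = {
    "Sentence":   [["NounPhrase", "VerbPhrase"]],
    "NounPhrase": [["Article", "Noun"]],
    "VerbPhrase": [["Verb", "NounPhrase"], ["Verb"]],
    "Article":    [["the"]],
    "Noun":       [["girl"], ["boy"]],
    "Verb":       [["forgot"]],
}
START_SYMBOL = "Sentence"

NON_TERMINALS = set(GRAMMAR)
# Terminal symbols: right-hand-side entries that have no rules of their own.
TERMINALS = {sym for rules in GRAMMAR.values() for rhs in rules
             for sym in rhs if sym not in GRAMMAR}
```

Together, TERMINALS, NON_TERMINALS, GRAMMAR and START_SYMBOL are exactly the four ingredients the text lists.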

3.4.2 Parse tree

The system analyses the sentence by reading the words in order and identifying which production rule to fire. As it does so, it gradually builds a representation of the sentence referred to as a parse tree. The term has been coined from the tree-like graph that is produced, where the root is the top-level symbol (e.g. sentence), the children of each node are the symbols on the right-hand side of the rule fired, and the leaves are the terminal symbols (the words). The parse tree can be built in two fundamentally different ways.

3.4.2.1 Top down

A top down parser starts at the root and gradually builds the tree downwards by matching the read terminal symbols with symbols on the right-hand side of possible production rules. Terminal or non-terminal symbols on the right-hand side are added at the level below the current symbol. This is similar to the goal-driven approach of a production system. The basic architecture of a top down parser is illustrated in figure 1.


Figure 1 Top down parsing of the sentence "the girl forgot the boy" [11]

In many situations, the first token alone does not provide enough information to decide which production rule should be fired. There are two basic methods to overcome this.

3.4.2.1.1 Recursive Descent

The system starts by firing the first candidate production rule that the given terminal symbol could fit, and builds the initial subtree from this information. If proceeding further down the tree results in an inconsistency or syntactic error, it reverts to the point where the decision was made, removes all the nodes on the way back up, and selects another of the possible productions. This procedure is very similar to depth-first searching with backtracking in production systems.

3.4.2.1.2 Look Ahead

A look ahead system is not content with reading just one token. Rather, it reads as many tokens as are necessary to identify the given

[11] Dougherty, R.C.


right-hand side beyond any ambiguities before firing any production rules.

Grammars are characterised by the maximum number of terminal symbols that must be read before all possible conflicts in the choice of production rule can be resolved. If this number is k, the grammar is referred to as an LL(k) grammar [12]. The look ahead procedure is more in analogy with a breadth-first search technique.

3.4.2.2 Bottom up

A bottom up parser, on the other hand, works from the leaves upward by “tagging” the tokens, i.e. starting from the right-hand side of the production rules and associating each read word with its category. When a full right-hand side has been identified, the production rule fires and the left-hand side non-terminal symbol is added as a branch in the level above. This methodology corresponds to the data-driven technique of production systems. The bottom up parsing technique is illustrated in figure 2.
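The shift-and-reduce behaviour of a bottom up parser can be sketched in a few lines. The rule set below is again an invented toy grammar, not the report's:

```python
# Naive bottom-up (shift-reduce) parsing: shift one token onto the stack,
# then greedily reduce whenever the top of the stack matches the full
# right-hand side of some rule.
RULES = [                                  # (left-hand side, right-hand side)
    ("Noun", ["girl"]), ("Noun", ["boy"]), ("Verb", ["forgot"]),
    ("NounPhrase", ["the", "Noun"]),
    ("VerbPhrase", ["Verb", "NounPhrase"]),
    ("Sentence", ["NounPhrase", "VerbPhrase"]),
]

def shift_reduce(tokens):
    stack = []
    for token in tokens:
        stack.append(token)                # shift
        reduced = True
        while reduced:                     # reduce as long as possible
            reduced = False
            for lhs, rhs in RULES:
                if stack[-len(rhs):] == rhs:
                    del stack[-len(rhs):]  # pop the matched right-hand side
                    stack.append(lhs)      # push the non-terminal above it
                    reduced = True
                    break
    return stack

result = shift_reduce("the girl forgot the boy".split())
```

A successful parse leaves exactly the start symbol on the stack, mirroring the data-driven direction described above: the words drive the rule firings rather than the goal.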

[12] Eriksson, G.


Figure 2 Bottom up parsing of the sentence "the girl forgot the boy" [13]

In some cases, the sentence is ambiguous in itself and multiple production rules match it, in which case the parser has to choose between the potential interpretations. One strategy for dealing with these situations is referred to as probabilistic parsing.

[13] Dougherty, R.C.


3.5 Probabilistic Parsing

Probabilistic parsing takes an empirical approach to the difficult task of disambiguation, i.e. identifying which of several mutually exclusive alternative syntactic parse trees should be generated.

For example, consider the sentence “One morning I shot an elephant in my pyjamas” [14]. There are two possible syntactic parses for this sentence [15]: one implies that the person was wearing the pyjamas, while the other claims that the elephant was in the garment (hence the joke). Although the choice between these two interpretations is obvious to a human, how is this knowledge automated in a computer?

One option, used in so-called attribute grammars, is to encode information for each verb as a parameter to each production rule. However, as the dictionary grows, this approach may be too selective and require every different case to be specifically added to the production rules.

Probabilistic parsing, on the other hand, works by augmenting the rules with assigned probabilities, representing the chance of the particular expansion (production rule) being the correct one.

For example, a probabilistic grammar would introduce the following enhancements to the possible regular syntactic production rules for the expansion of the non-terminal symbol sentence [Error: Reference source not found]:

Sentence := Nounphrase Verbphrase, P = 0.8
Sentence := Auxiliary Nounphrase Verbphrase, P = 0.15
Sentence := Verbphrase, P = 0.05

Note that the probabilities for the expansions of any given non-terminal symbol always add up to 1.

3.5.1 Disambiguation

How does probabilistic parsing choose between two possible parse trees? In most systems, it simply compares, for each competing parse, the product of the probabilities of all the productions required to build it, and selects the parse with the highest product.
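This selection rule is just a comparison of probability products. A hypothetical sketch follows, with invented rule probabilities standing in for the two elephant-in-pyjamas parses:

```python
from math import prod

# Each competing parse is represented by the list of probabilities of the
# production rules used to build it. The probabilities here are invented
# purely for illustration.
parses = {
    "pyjamas-on-speaker":  [0.8, 0.3, 0.6, 0.7],
    "pyjamas-on-elephant": [0.8, 0.3, 0.6, 0.1],
}

def disambiguate(candidates):
    """Select the parse whose rule probabilities have the largest product."""
    return max(candidates, key=lambda name: prod(candidates[name]))

best = disambiguate(parses)
```

With these numbers the parser picks the "speaker wore the pyjamas" reading, since its product (0.1008) exceeds the alternative's (0.0144).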

[14] Groucho Marx
[15] Jurafsky, D. & Martin, J.


3.5.2 Training

One important task concerns how to set the probabilities. There are two fundamentally different techniques for this task [Error: Reference source not found].

3.5.2.1 Treebank

A large database of sentences with their correct parses (parsed by knowledgeable humans) is entered into the system. The respective probabilities are then calculated as the relative frequencies of each possible parse. For more details, see [Error: Reference source not found].
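The relative-frequency computation can be sketched as simple counting, under the assumption that the treebank has already been reduced to a flat list of observed rule applications (the data below is invented):

```python
from collections import Counter, defaultdict

# Hypothetical rule applications extracted from hand-parsed sentences:
# each entry is (left-hand side, right-hand side as a tuple).
observed_rules = [
    ("Sentence", ("NounPhrase", "VerbPhrase")),
    ("Sentence", ("NounPhrase", "VerbPhrase")),
    ("Sentence", ("NounPhrase", "VerbPhrase")),
    ("Sentence", ("Verbphrase",)),
]

def estimate_probabilities(rules):
    """P(lhs -> rhs) = count(lhs -> rhs) / count(lhs), per non-terminal."""
    counts = defaultdict(Counter)
    for lhs, rhs in rules:
        counts[lhs][rhs] += 1
    return {lhs: {rhs: n / sum(c.values()) for rhs, n in c.items()}
            for lhs, c in counts.items()}

probs = estimate_probabilities(observed_rules)
```

By construction, the estimated probabilities for the expansions of any given non-terminal sum to 1, as the text requires.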

The largest known treebank is the Penn Treebank [16]. The latest version, Treebank 3, contains parses of [17]:

- One million words of 1989 Wall Street Journal material;
- A small sample of ATIS-3 transcripts. The Air Travel Information Service is a joint project of DARPA (Defense Advanced Research Projects Agency) and SRI International, handling voice-based queries and requests about flights; more information can be found at [18];
- A fully parsed, tagged version of the Brown Corpus, consisting of one million words from 500 different sources (novels, academic books, newspapers, non-fiction books etc. [Error: Reference source not found]);
- Parsed and tagged text from a set of 560 transcripts of telephone conversations (a.k.a. the Switchboard-1 corpus).

This is a widely used “training set” (in analogy with an artificial neural network), enabling the parser to learn which classes of speech a given word can belong to and how frequently a particular expression is interpreted in each of its possible ways.

3.5.2.2 Incremental learning

The other technique is a “trial and error” method, in which the parsing system, much like an artificial neural network, learns as it is used.

The initial probabilities can be assigned randomly or by the user. After that, the system adjusts these probabilities according to the following rules [Error: Reference source not found]:

If the sentence was unambiguous, its parse count is increased by 1, i.e. pi := pi + 1;

[16] Penn Treebank Project.
[17] Quoted by the LDC office of the University of Pennsylvania in an email dated 10/7-2001.
[18] Language Reference


If the sentence was ambiguous, each of the possible parses has its count incremented by its respective probability, i.e. pi := pi + P(pi).
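The two update rules above can be sketched as a counter update. This is only the counting step, not the full Inside-Outside Algorithm, and the parse names and initial counts are invented:

```python
# Incremental update of parse counts: an unambiguous sentence adds 1 to its
# parse's count; an ambiguous one adds each candidate parse's current
# probability (its count divided by the total) to that parse's count.
def update_counts(counts, parses_used):
    if len(parses_used) == 1:                     # unambiguous: p_i := p_i + 1
        counts[parses_used[0]] += 1.0
    else:                                         # ambiguous: p_i := p_i + P(p_i)
        total = sum(counts.values())
        for parse in parses_used:
            counts[parse] += counts[parse] / total
    return counts

counts = {"parse_a": 3.0, "parse_b": 1.0}
update_counts(counts, ["parse_a"])                # unambiguous sentence
update_counts(counts, ["parse_a", "parse_b"])     # ambiguous sentence
```

After the two updates, parse_a has gained a full count plus its share of the ambiguous sentence, while parse_b has gained only its (smaller) share, so the probabilities drift toward the more frequent parse.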

The algorithm for this computation is referred to as the Inside-Outside Algorithm. It was originally proposed in [19] and is described in detail in [20].

3.6 Semantic Parsing

The syntactic structure of a sentence is not enough to express its meaning. For instance, the noun phrase the catch can have different meanings depending on whether one is talking about a baseball game or a fishing expedition. To talk about the different possible readings of the phrase the catch, one therefore has to define each specific sense of the phrase. The representation of the context-independent meaning of a sentence is called its logical form [21]. Natural language analysis based on semantic grammar is similar to syntactically driven parsing, except that in a semantic grammar the categories used are defined semantically.

Database items can be ambiguous when the same item is listed under more than one attribute. For example, the term “Mississippi” is ambiguous between being a river name or a state name, in other words, two different logical forms. The two different meanings have to be represented distinctly for an interpretation of a user query.

3.6.1 Semantic Data ModelsSemantic data models (SDM) are widely researched in the database community. They are closely related to semantic networks used in artificial intelligence, which were originally developed to support natural language processing. Hence, as database management systems they are capable of supporting large amounts of information, while still offering the potential of advanced inferencing capabilities including NLP, machine learning, and query processing.

“SDMs can be seen as formalising many of the relationships expressed in an ad hoc manner in conventional hypermedia systems.” [22] SDMs support a variety of formalised links and relationships. An example of a small network on insects is shown in figure 3. The links in this graph express generalisation relationships or "ISA" (beneficial insect IS-A insect), part/whole (Abdomen is part of an Insect),

[19] Baker, J.K., pp. 547-550.
[20] Manning, C.D. and Schutze, H.
[21] Tang, R. L., p. 5
[22] Beck, H., Mobini, A., Kadambari, V.


association (Ladybugs eat Aphids), and class/instance (Ladybug is an instance of Beneficial Insect) [23].

Figure 3 Semantic Data Model describing insects [24]

In figure 3, solid lines are ISA relationships, diamonds are part/whole, circles are associations, and instances are underlined.

Since concepts in SDMs are described by structured graphs expressing the relationships among symbols, rather than by connections between text files as in conventional hypertext, SDMs can be manipulated to provide a number of desirable functions. Foremost is that of search, or query processing. [3] suggests query processing based on graph-matching techniques, by which the query is expressed as a small semantic network. This query graph is then matched against the larger database graph to find connections. This gives a much more precise search capability than is possible with Boolean keyword searches over text files.
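A toy version of this graph matching can be sketched over an insect network like the one in figure 3. The edge encoding and the convention that uppercase names are variables are assumptions made for this illustration:

```python
# A semantic network as a set of labelled edges (relation, from, to).
NETWORK = {
    ("isa", "ladybug", "beneficial insect"),
    ("isa", "beneficial insect", "insect"),
    ("eats", "ladybug", "aphid"),
}

def match(query, bindings=None):
    """Match a small query graph (list of edges, uppercase = variable)
    against NETWORK; return a consistent variable binding or None."""
    if bindings is None:
        bindings = {}
    if not query:
        return bindings
    (rel, a, b), rest = query[0], query[1:]
    for (r, x, y) in NETWORK:
        if r != rel:
            continue
        new, ok = dict(bindings), True
        for var, val in ((a, x), (b, y)):
            if var.isupper():                  # variable: bind or check
                if var in new and new[var] != val:
                    ok = False
                    break
                new[var] = val
            elif var != val:                   # constant: must match exactly
                ok = False
                break
        if ok:
            result = match(rest, new)          # match the remaining edges
            if result is not None:
                return result
    return None

# "What does the ladybug eat, and what is a ladybug?"
bindings = match([("eats", "ladybug", "X"), ("isa", "ladybug", "Y")])
```

The query graph's two edges are matched against the database graph, binding X and Y consistently across both edges; this is the connection-finding the text describes, in miniature.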

3.6.2 Case Based Reasoning

In order to construct an NLP system, one must construct a large dictionary. Much of the recent advance in text understanding systems can be attributed to advances in the design and construction of large lexicons. But that presupposes that word meaning is easily represented; here, a case-based reasoning approach to meaning is used. Words obtain meaning from how they are used. A particular word is used in many different situations and contexts, and each occurrence of the word is treated as one case. Similarities among cases can be observed, and cases with similar usage can be clustered together into categories. When a word is used in a new situation, similar cases are retrieved from the case-based memory in order to apply what happened before to the new context. The meaning of a particular

[23] Beck, H., Mobini, A., Kadambari, V.
[24] http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig1.gif


word is established by a large case base, and thus a single word may be "worth 1,000 cases" [25].

3.6.3 Semantic Representation

The most basic constructs of the representation language are the terms used to describe objects in the database and the basic relations between them. Database objects bear relationships to each other, or can be related to other objects of interest to a user who is requesting information. For instance, in a user query like “What is the capital of Texas?”, the data of interest is a city with a certain relationship, its being the capital, to a state called Texas. The capital/2 relation, or predicate, is therefore defined to handle questions that require it.

Predicate         Description
city(C)           C is a city
capital(S,C)      C is the capital of S
density(S,D)      D is the population density of state S
loc(X,Y)          X is located in Y
len(R,L)          L is the length of river R
next_to(S1,S2)    State S1 borders S2
traverse(R,S)     River R traverses state S

Table 1 Sample of predicates [26]

3.6.4 Actions of the Parser

We will discuss the workings of the parser using the parser actions in CHILL [23], known as shift-reduce parsing. The parser actions are generated from templates given by a logical query; an action template is instantiated to form a specific parsing action. Recall that the parser also requires a lexicon to map the meanings of phrases into specific logical forms. Consider the following example [27]:

Sentence: What is the capital of Texas?
Logical Query: answer(C,(capital(C,S),const(S,stateid(texas)))).

A very simple lexicon will map ‘capital’ to ‘capital(_,_)’ and ‘texas’ to ‘const(_,stateid(texas))’. The parser begins with an initial stack and a buffer holding the input sentence; together these form the initial parse state. Each predicate on the parse stack has an attached buffer to hold the

[25] Beck, H., Mobini, A., Kadambari, V.
[26] Lappoon R. T., p. 6
[27] Tang, R.L.


context in which it was introduced. Words from the input sentence are shifted onto the stack buffers during parsing. The initial parse state is as follows:

Parse Stack: [answer(_,_):[]]
Input Buffer: [what,is,the,capital,of,texas,?]

Since the first three words in the input buffer do not map to any logical forms, the next sequence of steps pushes them from the input buffer onto the parse stack. The process has the following result:

Parse Stack: [answer(_,_):[the,is,what]]
Input Buffer: [capital,of,texas,?]

Now ‘capital’ is at the head of the input buffer and is mapped to ‘capital(_,_)’ in the lexicon. The next action pushes this logical form onto the parse stack. The resulting parse state is as follows:

Parse Stack: [capital(_,_):[],answer(_,_):[the,is,what]]
Input Buffer: [capital,of,texas,?]

The parser then binds two arguments of two different logical forms to the same variable, resulting in the following parse state:

Parse Stack: [capital(C,_):[],answer(C,_):[the,is,what]]
Input Buffer: [capital,of,texas,?]

The sequence repeats itself, producing the parse state:

Parse Stack: [const(S,stateid(texas)):[?,texas],capital(C,S):[of,capital],answer(C,_):[the,is,what]]
Input Buffer: []

The final step is to take the logical forms on the parse stack and put them into the arguments of the meta-predicate, resulting in:

Parse Stack: [answer(C,(capital(C,S),const(S,stateid(texas)))):[?,texas,of,capital,the,is,what]]
Input Buffer: []

As this is the final parse state, the logical query is then constructed from the parse stack.
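The trace above can be mechanised. The sketch below is a greatly simplified replay, not CHILL itself: variable unification is sidestepped by naming the variables consistently in the lexicon entries, and the final fold into the meta-predicate is done with string formatting:

```python
# A simplified replay of the shift-reduce parse above (not CHILL itself).
# Each stack entry is (logical_form, attached_buffer). Words found in the
# lexicon introduce a new stack entry; all other words are shifted onto
# the buffer of the entry at the top of the stack.
LEXICON = {
    "capital": "capital(C,S)",               # variables pre-unified by name
    "texas":   "const(S,stateid(texas))",
}

def parse(words):
    stack = [("answer(C,Q)", [])]            # the meta-predicate starts the stack
    for word in words:
        if word in LEXICON:
            stack.insert(0, (LEXICON[word], [word]))   # introduce logical form
        else:
            stack[0][1].insert(0, word)                # plain shift onto buffer
    # Final step: fold every introduced form into the meta-predicate's body,
    # in the order the words appeared in the sentence.
    body = ",".join(form for form, _ in reversed(stack[:-1]))
    return "answer(C,(%s))" % body

query = parse(["what", "is", "the", "capital", "of", "texas", "?"])
```

Running it on the example sentence reproduces the logical query of the final parse state.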


4 NLIDB ARCHITECTURE

4.1 Pattern-matching systems

The first NLIDBs were based on pattern-matching techniques. As a simple illustration of the pattern matching technique, consider the following database:

Countries_Table

Country   Capital   Language
France    Paris     French
Italy     Rome      Italian
…         …         …

Table 2 Sample Database Table [28]

A primitive pattern-matching system, according to [1], may use rules such as:

Pattern: … “capital” … <country>
Action: Report CAPITAL of row where COUNTRY = <country>

Pattern: … “capital” … “country”
Action: Report CAPITAL and COUNTRY of each row

If the user asked “What is the capital of France?”, using the first pattern rule the system would report “Paris”. The system would also use the same rule to handle questions such as “Print the capital of Italy”, “Could you please tell me what is the capital of France?” etc.

The main advantage of this approach is its simplicity: it requires no complicated parsing or interpretation modules and is easy to implement. However, the shallowness of the approach often leads to bad failures. In one example, a pattern-matching NLIDB asked “TITLES OF EMPLOYEES IN LOS ANGELES.” reported the state where each employee worked, taking the “IN” to denote the post code of Indiana and assuming that the question was about employees and states.29

4.2 Parsing based systems

In general, as [21] suggests, the system architectures of some NLIDBs can be seen as being made of two major modules. The first module handles the natural language: a question is submitted and successively transformed. At the end of this process, one or more

28 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p.14
29 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., pp.14-15


intermediate logical query expressions are obtained. Given the size of the domain and the flexibility of natural language, there usually exist several interpretations of the same question. The second component is in charge of the connection with the database, translating the expressions to structured query language (SQL) expressions (using mapping) and sending them to the database management system (DBMS) to produce the answers.30

For a graphical explanation of the structure, examine Figure 4.

Figure 4 NLIDB Architecture31

As described in the previous section, the source language sentence is first parsed, producing a parse tree. The two parsing methods most often found are syntax-based and semantic-grammar-based parsing.

4.2.1 Semantic grammar based parsing

Using this technique, the grammar’s categories do not necessarily correspond to syntactic concepts. Examine the following figure:

30 Reis, P., Matias, J. and Mamede, N., pp.3-4
31 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p.18


Figure 5 Semantic base parsing tree32

Notice that some categories of the grammar (e.g. Substance, Magnesium, Specimen_question) do not correspond to syntactic constituents (e.g. Noun-Phrase, Noun, Sentence). This is because semantic information about the knowledge domain (e.g. that a question may refer either to specimens or to spacecraft) is hard-wired into the semantic grammar.33

Because the semantic grammar approach contains hard-wired knowledge about a specific knowledge domain, it is very difficult to transfer it to another knowledge domain. A new semantic grammar has to be written whenever the NLIDB is configured for a new knowledge domain.34
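To make the idea concrete, here is a minimal sketch of a semantic grammar in the spirit of Figure 5. The categories and productions are invented for illustration; a real NLIDB grammar would be far larger and would also build logical forms rather than merely recognise questions.

```python
# Hypothetical semantic grammar: categories are domain concepts
# (Specimen_question, Substance), not syntactic ones (NP, N).
GRAMMAR = {
    "Specimen_question": [["which", "Specimen", "contains", "Substance"]],
    "Specimen": [["rock"], ["soil", "sample"]],
    "Substance": [["magnesium"], ["silicon"]],
}

def match(category, tokens):
    """Return the tokens left after matching `category`, or None on failure."""
    if category not in GRAMMAR:  # terminal: must equal the next word
        return tokens[1:] if tokens and tokens[0] == category else None
    for production in GRAMMAR[category]:
        rest = tokens
        for symbol in production:
            rest = match(symbol, rest)
            if rest is None:
                break
        else:  # every symbol of this production matched
            return rest

print(match("Specimen_question", "which rock contains magnesium".split()) == [])  # True
```

The hard-wiring is visible directly: moving to another domain means rewriting GRAMMAR wholesale, which is exactly the portability problem described above.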

4.2.2 Translation

The translation is usually based on several mapping tables. Figure 6 illustrates this process both for the addition of new information based on an input sentence and for the processing of a related query. The query is represented by a small graph, which initiates the mapping to the semantic hierarchy. The small graph is mapped to the semantic network by creating a link from each node in the smaller graph to the corresponding nodes in the network, starting with the most general concept (the root) and ending with the most specific. This creates a unique instance, which is the intersection of all of the nodes involved in the query and may be used to narrow down a neighbourhood based on the requested information.35

The mapping process is bounded by rules and completely based on the information in the parse tree. As an example of mapping rules, consider the previous query “which rock contains magnesium”, taken from [1]:

32 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p.17
33 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p.17
34 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p.17
35 Beck, H., Mobini, A., Kadambari, V. [online]


The mapping of “which” is for_every X.
The mapping of “rock” is (is_rock X).
The mapping of an NP is Det' N', where Det' and N' are the mappings of the determiner and the noun respectively. This results in for_every X (is_rock X).
The mapping of “contains” is contains.
The mapping of “magnesium” is magnesium.
The mapping of a VP is (V' X N'). This results in (contains X magnesium).
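The compositional steps above can be written down directly. The sketch below hard-codes the fixed NP/VP shapes of this one example (an assumption for illustration; the rules in [1] are stated generally):

```python
# Word-level mappings taken from the rules quoted above.
WORD_MAP = {
    "which": "for_every X",
    "rock": "(is_rock X)",
    "contains": "contains",
    "magnesium": "magnesium",
}

def map_np(det, noun):
    # NP -> Det' N'
    return "%s %s" % (WORD_MAP[det], WORD_MAP[noun])

def map_vp(verb, obj):
    # VP -> (V' X N')
    return "(%s X %s)" % (WORD_MAP[verb], WORD_MAP[obj])

def map_sentence(det, noun, verb, obj):
    # sentence meaning = NP mapping followed by VP mapping
    return "%s %s" % (map_np(det, noun), map_vp(verb, obj))

print(map_sentence("which", "rock", "contains", "magnesium"))
# for_every X (is_rock X) (contains X magnesium)
```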


Figure 6 Mapping and Query Processing Model36

Figure 7 demonstrates a user query about how John spent his leisure time, and shows how the answer to the query is produced by exploiting the relationship between "spending leisure time" and "having a chance to go fishing" (both are instances of "doing").

Figure 7 Query processing model37

In many systems the syntax rules linking non-leaf nodes and the semantic rules are domain independent, and can be used in any application domain. The information describing the possible words (leaf nodes) and the logic expressions is domain dependent and has to be declared in the lexicon.38

As an example, consider the lexicon used in MASQUE [1] listing the possible words, “capital”, “capitals”, “border”, “borders”, “bordering”, “bordered”.

The logic expression of “capital”, “capitals” could be capital_of(Capital,Country).

The logic expression of “border”, “borders”, “bordering”, “bordered” could be borders(Country1,Country2).

The logic expression of “country” could be is_country(Country).

36 http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig2.gif
37 http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/fig3.gif
38 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p.19


Then the question, “What is the capital of each country bordering Greece?” would be mapped to this query:

answer([Capital, Country]) :-
    is_country(Country),
    borders(Country, Greece),
    capital_of(Capital, Country).

The meaning of the logic query above is to find all pairs [Capital, Country] such that Country is a country, Country borders Greece, and Capital is the capital of Country.

The interpreter also needs to consult a world model that describes the structure of the surrounding world, as shown by the figure below. Typically, the model contains a hierarchy of classes of world objects, and constraints on the types of arguments each logic predicate may have.39
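Evaluating such a query amounts to joining the three predicates over a fact base, much as a Prolog interpreter would. The facts below are illustrative stand-ins (not MASQUE's actual data):

```python
# Illustrative fact base (not from MASQUE).
IS_COUNTRY = {"Italy", "Bulgaria", "Turkey", "Greece"}
BORDERS = {("Bulgaria", "Greece"), ("Turkey", "Greece")}
CAPITAL_OF = {"Bulgaria": "Sofia", "Turkey": "Ankara", "Italy": "Rome"}

def answer():
    # all pairs [Capital, Country] satisfying the three conjoined predicates
    return sorted(
        (CAPITAL_OF[country], country)
        for country in IS_COUNTRY
        if (country, "Greece") in BORDERS and country in CAPITAL_OF
    )

print(answer())  # [('Ankara', 'Turkey'), ('Sofia', 'Bulgaria')]
```

A type constraint from the world model would appear here as an extra membership test, e.g. rejecting a binding of Country to anything outside the class "country".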

Figure 8 Hierarchy in world model40

39 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., pp.18-19
40 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., p.19


5 MARKET TEST

In order to get a good estimate of the current state of the technology, the applications presented in the previous chapter were subjected to a neutral test.

5.1 Goals

The goals of the tests were:

To get a thorough understanding of contemporary market applications;

To get an estimate of the relevance and importance of these types of systems;

To get some insight into what features are more and less important.

5.2 Tests The tests were carried out on the Northwind database, a sample database with information on a shipping company. The database comes as a demo with all distributed copies of Microsoft Access.

A number of queries of different types were posed to the respective natural language front ends. The questions were classified as simple (S), average (A), or complex (C).

For a more comprehensive explanation of the considerations behind the testing procedures, see Appendix A.

5.3 Results

5.3.1 Impressions

5.3.1.1 Microsoft English Query

English Query is a development environment that enables programmers to produce natural language front ends for SQL 2000 databases. The product is included with SQL 2000. The tests were performed on a demo of English Query, developed by Microsoft to interface with the Northwind database.

The user interface has five fields, with the following functionalities:
Query (user input)
Interpretation of query
Required operations
Produced SQL statement
Results

A screen shot from one of the queries is presented in Figure 9.


Figure 9 Microsoft English Query.

5.3.1.2 Elfsoft

Elfsoft works together with either VB or Access. Queries are entered in a query window (see Figure 10) and can be output either as database tables (see Figure 11) or in a graphical format.

Figure 10 Elfsoft query window.


Figure 11 Elfsoft answer output.

Elfsoft also includes several other options for enhanced portability, including:

Automatic analysis of any Access database
Enabling the user to teach the program the meanings of phrases
Allowing the user to explain why a query failed (what was missing and/or wrong)
Permitting the user to edit the dictionary
Logging of queries for statistics

5.3.2 Query results

The results are summarised in Table 3. A full record of the questions asked is presented in Appendix B.

Table 3 Accuracy percentages.

Type of query | English Query | Elfsoft
Simple        | 71            | 23
Average       | 50            | 40
Complex       | 67            | 100


6 FUTURE

During the mid-eighties it was believed that natural language processing systems would become a universal interface to databases worldwide.41 However, due to the emergence of graphical interfaces to databases, the relative simplicity of SQL and the inherent problems of natural language processing, NLIDBs have never really caught on commercially.42

The current position of NLIDBs is probably best described by “it’s a great idea, but…”. Although their usefulness is appreciated, they are still at a research stage. There are several reasons why their usage is not taking off on a broader scale.

6.1 Language challenges

It is still very hard to encode the vast scope, complexity and ambiguity of a human language into a computer. The formalisms for representing language patterns are still not comprehensive enough to capture all the different ways in which expressions and terms can be constructed and given meaning depending on context.

6.2 Portability challenges

Although several systems for communication with individual databases have been successfully implemented and used, a general technique that would allow the user to specify the database and use the system with any database management system (be it Access, SQL 2000, Oracle or any other) is still rather elusive. This would require the system to recognize the fields and attributes of the new storage source seamlessly.

An even bigger hurdle to portability is the nature and scope of language understanding. Language use in different domains is very dissimilar, which means that any portable system has to have a huge vocabulary with terms from many different application domains and must be able to recognize expressions from users of a wide variety of professions.

6.3 Competing systems

Graphical and form-based interfaces have become the de facto standard for database front ends. Because of the challenges presented above, these other types of systems can generally be developed in less time and at lower cost.

41 Johnson, T.
42 Androutsopoulos, I., Ritchie, G.D., and Thanisch, P., pp.29-81


6.4 Possible avenues

There is still a lot of research going on in this area. Having explored the application of natural language processing as database interfaces, the authors can see a number of different scenarios.

6.4.1 Adaptation techniques

There is a need for methodologies that would enable the user to specify the data source in a general descriptive language and to supply a given set of terms used within the domain. This would make an application portable from database to database.

This need has been recognised in [22], where a solution based on the general Resource Description Framework (RDF) is proposed. The system outlined in [22] automatically learns the pattern and domain vocabulary of any given database, and also contains an interface that allows the user to change the database model (classes, properties, tables etc.).

6.4.2 Speech-based techniques

Certain authors [1] believe that natural language keyboard interfaces will be superseded by speech recognition systems. However, as such systems are of an even more complex nature, some of the linguistic challenges will have to be solved first. Research on NLIDBs can therefore be a base for the development of voice-based systems [1].

6.4.3 Learning algorithms

Every person has his or her own vocabulary and way of using language. There is no way that a program can contain all the words in a language or all the different meanings that a term may take on.

Further, the use of language changes over time, which means that the semantics and vocabulary of a system may become obsolete after a certain time of use.

An important challenge for a natural language database front end (or any natural language processing system in general) is to possess an ability to learn, as it is used, evolve with the user and adapt to new users. This ability is after all one of the definitions of artificial intelligence.

There are several ways in which this could potentially be achieved. Note that these are suggestions and not based on in-depth research.

6.4.3.1 User Dialogue

One way to achieve learning would be to include a lexical editor, where the user could enter language terms and link them to their synonyms. The user should also be able to specify the different forms of a word, e.g. noun plurals, adjective comparative forms, verb tenses etc.


This ability is present in Elfsoft.

6.4.3.2 Neural Networks

By use of probabilistic techniques, a system might be able to adjust the probabilities of different parses based on training and test texts that have been parsed and tagged by the user or obtained from linguists. By continuously retraining the network with parsed texts from the database-specific domain, the neural network would be able to pick up language patterns and learn incrementally.
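As a much-simplified illustration of the incremental idea (plain relative-frequency counts rather than an actual neural network, and with invented rule names), parse probabilities could be re-estimated as corrected parses accumulate:

```python
from collections import Counter

# Rule usage counts from an initial training set (hypothetical rules).
counts = Counter({("NP", "Det N"): 6, ("NP", "N"): 4})

def prob(lhs, rhs):
    # relative frequency of a production among all rules with this left side
    total = sum(c for (l, _), c in counts.items() if l == lhs)
    return counts[(lhs, rhs)] / total

def observe(corrected_rules):
    # user-corrected parses shift the probability estimates
    counts.update(corrected_rules)

before = prob("NP", "Det N")          # 6/10 = 0.6
observe([("NP", "N"), ("NP", "N")])
after = prob("NP", "Det N")           # 6/12 = 0.5
```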

6.4.3.3 Genetic Algorithms

Another way would be for the system to obtain feedback from the user on its accuracy (e.g. ask the user whether queries were answered correctly) and adjust its language processing structure (production rules) by the use of genetic algorithms.
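A speculative sketch of that feedback loop follows: rule weights are randomly mutated, and a variant is kept only when the fitness (standing in for user feedback) improves. This is a minimal (1+1)-style evolutionary loop rather than a full genetic algorithm with a population and crossover, and every name here is invented for illustration.

```python
import random

random.seed(0)  # deterministic for the example

def mutate(weights):
    # perturb one randomly chosen rule weight
    w = dict(weights)
    key = random.choice(sorted(w))
    w[key] += random.uniform(-0.1, 0.1)
    return w

def evolve(weights, fitness, generations=100):
    best, best_fit = weights, fitness(weights)
    for _ in range(generations):
        candidate = mutate(best)
        if fitness(candidate) > best_fit:  # keep only improvements
            best, best_fit = candidate, fitness(candidate)
    return best

# Toy fitness standing in for user feedback: prefer weights near a target.
target = {"np_rule": 0.7, "vp_rule": 0.3}
fitness = lambda w: -sum((w[k] - target[k]) ** 2 for k in w)
tuned = evolve({"np_rule": 0.5, "vp_rule": 0.5}, fitness)
```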


7 CONCLUSIONS

The project has focused on two main topics:

The techniques of translating a question in natural language into a database query, extracting the results that the user is looking for;

The leading contemporary applications on the market.

The underlying methods belong to the general natural language processing area, and any system has to select among several different techniques involving different degrees of syntactic analysis, semantic processing, or a combination of the two. A general feature seems to be the translation of the query in two steps: first to an intermediate language and then to a database query language, e.g. SQL.

The topic integrates approaches from several other facets of artificial intelligence, e.g. production systems, neural networks, expert systems, and machine learning.

Two of the leading commercial software packages were tested with mixed results. Some rather complex queries were handled well, while the systems tended to have problems with rather easy tasks. The sample sizes involved are too small to base any general conclusions on, however; the configuration of the university computers at the authors' disposal did not allow more extensive testing of the programs.

Many companies have overestimated the use of natural language processing in the database interface, interpreting such a system as one that can accurately understand the meaning of a query. However, a system cannot fully comprehend human language and jargon unless it has been given definitions for the terms relating to the relevant database.43 This mainly involves semantic analysis: a syntactically well-formed sentence may lead to various meanings, which may not even be similar to one another, and as a result produce undesirable database queries. This is one main reason why many systems tend to fail, and it explains why most companies would still rather rely on SQL programmers for their database processing.

Although these kinds of applications are still rather unpopular, the authors enjoyed using them and encourage their further development. From the experiences of the performed tests, such systems have the potential to make the task of searching for information a lot less tedious and time-consuming.

43 Timo Honkela


The eventual success for natural language front ends will depend on how well they can adapt to new environments, both regarding databases and users’ way of using language. Two proposed benchmarks for these types of systems could be:

It has to be able to learn and understand the database faster than the user;

It has to learn natural language faster and easier than the user can learn a programming language.


ACKNOWLEDGEMENTS

The authors wish to extend their appreciation to the following people for their support during the course of the project:

Jon Greenblatt, President of English Language Frontend Software Co.

Girish Mohata, Teaching Fellow, IT School, Bond University


8 BIBLIOGRAPHY

1. Androutsopoulos, I., Ritchie, G.D., and Thanisch, P.: Natural Language Interfaces to Databases - An Introduction. Journal of Natural Language Engineering, vol. 1, no. 1. Cambridge University Press 1995.

2. Baker, J.K.: Trainable grammars for speech recognition, Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, Acoustical Society of America 1979.

3. Beck, H., Mobini, A., Kadambari, V. A Word is Worth 1000 Pictures: Natural Language Access to Digital Libraries. University of Florida. http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/beck/beckmain.html

4. Dialog-Oriented Use of Natural Language http://www.dfki.uni-sb.de/vitra/papers/ro-man94/node5.html. Accessed on 310701

5. Dougherty, R.C.: Natural Language Computing: An English Generative Grammar in Prolog. Lawrence Erlbaum Associates 1994.

6. EasyAsk - Applications Overview http://www.englishwizard.com/applications/index.cfm -. Accessed 19/7-2001

7. ECL I Vertiefung: natürlichsprachliche Zugangssysteme: chat80. http://www.ifi.unizh.ch/cl/broder/chat/chat80.htm. Accessed 12/7-2001

8. Eriksson, G.: Översättarteknik. KFS AB 1984.

9. Groucho Marx in the movie Animal Crackers.

10. Hafner, C. D. and Gooden, K.: Portability of Syntax and Semantics in Datalog. ACM Transactions on Information Systems, vol. 3. Association for Computing Machinery 1985.

11. Honkela, T., The Www Version Of Self-Organizing Maps In Natural Language Processing of Helsinki University of Technology – viewed on 22/07/01http://www.cis.hut.fi/~tho/thesis/


12. Johnson, T.: Natural Language Computing: The Commercial Applications. Ovum 1985.

13. Jurafsky, D. and Martin J. H.: Speech and Language Processing, An Introduction to Natural Language Processing, Computational Linguistic, and Speech Recognition. Prentice-Hall 2000

14. Language Reference http://www.darpa.mil/ito/psum2000/h165-0.html. Accessed 14/7-2001.

15. Luger, G.F. and Stubblefield, W.A.: Artificial Intelligence. Structures and Strategies for Complex Problem Solving. Third Edition. Addison-Wesley 1999.

16. Manas Tungare – Natural Language Processing http://www.manastungare.com/articles/nlp/natural-language-processing.asp. Accessed 30/07/01

17. Manning, C.D. and Schutze, H.: Foundations of Statistical Natural Language Processing. MIT Press 1999.

18. Natural-Language Database Interfaces from ELF Software Co http://www.elfsoft.com/ns/FAQ.htm -. Accessed 19/7 – 2001.

19. Palmer, M. and Finin, T.: Workshop on the Evaluation of Natural Language Processing Systems. Computational Linguistics, vol. 16, pp. 175-181. MIT Press 1990.

20. Penn Treebank Project http://www.cis.upenn.edu/~treebank/. Accessed 10/7 – 2001.

21. Reis, P., Matias, J., Mamede, N.: Edite – A Natural Language Interface to Databases, A new dimension for an old approach. http://digitais.ist.utl.pt/cstc/le/Papers/CSTCLE-12.PDF

22. Sharoff, S. and Zhigalov, V.: Register-domain separation as a Methodology for Development of Natural Language Interfaces to Databases. Proceedings of the IFIP TC.13 International Conference on Human-Computer Interaction. International Federation for Information Processing 1999.

23. Tang, R. L.: Integrating Statistical and Relational Learning for Semantic Parsing: Applications to Learning Natural Language Interfaces for Databases. University of Texas, May 2000.


9 CONTRIBUTIONS

The respective chapters were produced by the following group members:

Chapter 1: Jun
Chapter 2: Hakan
Chapter 3: Aris and Hakan
Chapter 4: Aris
Chapter 5: All
Chapter 6: Hakan
Chapter 7: Hakan and Jun
Bibliography and report compilation: Aris
Appendices: Hakan


APPENDIX A

Evaluating Systems

Introduction

How good is a natural language database interface? The answer to this question is hard to define. A survey conducted during the course of this project revealed no formal evaluation techniques. As long as this situation remains, an unambiguous answer to the question will elude all stakeholders in this area.

Why is there a need?

The need for formal evaluation schemes in this field, as in any other, arises out of several stakeholders' desires:

Users want a guide for choosing between systems;
Companies want benchmarks for product development and improvement;
Companies need metrics for proving the capabilities of their products.

Current Marketing

The companies behind contemporary techniques market their products with some of the following arguments:

Ease of set-up and integration with new databases. It is often mentioned [6,18] that end users will be relieved of the task of having to learn and understand the internal workings of the DataBase Management System (DBMS);
Money saved on searching;
Price;
Ease of integration across different DBMSs (Access, SQL Server, Oracle etc.);
Accuracy;
The possibility to perform searches on several data stores simultaneously.

Problems

There have been some attempts to define general formal metrics for natural language processing systems [19]. In [19], it was concluded that this is a difficult task for a number of reasons:

Systems are built using a variety of techniques;
They are used in many different domains, where users' needs vary;
There is a lack of funding for research in this area.

However, it is also concluded that database front ends constitute one of the types of systems for which metrics could potentially be developed and adopted.


Black box metrics

In [19], a strong distinction is made between black box and glass box metrics. A black box approach only looks at the output generated from a certain input and does not take into account the architecture of the system or the efficiency of individual components.

Advantages:
It takes the user's view;
It can be applied across platforms, on systems with different implementation details;
It is not tied to a specific implementation technique;
It can be used over time, regardless of trends in database and programming methodologies.

Disadvantages:
It doesn't give programmers a good indication of what is actually wrong;
It is badly suited for testing individual components of a system.

Proposed black box evaluation scheme

The proposed evaluation scheme takes into account several different aspects of the program in question.

Evaluation can be based on the following characteristics:

Overall Characteristics

User Friendliness: Is the application easy to understand and use? Are help files accessible and explanatory? Are error messages clear?
Portability: Can it be used in conjunction with only a specific database? If not, how easy is it to integrate it with other databases?
Speed: How fast are answers extracted?
Fault Tolerance: Can the system recognize off-topic questions (queries on information that is not in the database) and give an informative response within a reasonable time frame?
Accessibility: Can it be used over the web?

Vocabulary

Can the system accurately understand the following expressions?44
What?
Which?
How many?
How much?
Show
List
Tell

44 This list is arbitrary and may have to be expanded or contracted.

Count

Ease of Interaction

Linguistic Flexibility: How many spelling errors in a word can the system tolerate and still understand? Can it suggest alternative spellings?45
Probing questions: Are "follow-up" questions (questions referring to the previous answer) allowed?
Can the system adjust for bad grammar and still understand the question?

Accuracy based on input complexity

The system is asked a number of different questions, ranked as simple, average or complex. The accuracy (percentage of questions answered correctly) in each of the three categories is noted.
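Computing the metric is mechanical once outcomes are recorded. The data below is purely illustrative, not the actual protocol of Appendix B:

```python
# Illustrative test outcomes: (complexity class, answered correctly?).
results = [
    ("S", True), ("S", True), ("S", False), ("S", True),
    ("A", True), ("A", False),
    ("C", True), ("C", True),
]

def accuracy(cls):
    # percentage of questions in this class answered correctly
    outcomes = [ok for c, ok in results if c == cls]
    return round(100 * sum(outcomes) / len(outcomes))

print(accuracy("S"), accuracy("A"), accuracy("C"))  # 75 50 100
```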

The evaluation scheme formed the basis of the market tests of chapter 5. However, because of the small sample size of tested applications, no attempt was made to formalize the scheme or to develop a metric based on it.

45 For an example of this capability, try a search on http://www.google.com with a word containing a slight spelling error, e.g. elpheants.


APPENDIX B

Test Protocol

The questions asked, their respective classifications, and the outcomes for the tested programs are presented in Table 4. In the classification column, S stands for Simple, A for Average, and C for Complex.

Table 4. Test Protocol.

Question | Class | Microsoft English Query outcome | Elfsoft outcome | Comments
Who is the oldest employee? | S | Correct | Correct | English Query gave the oldest person, Elfsoft the one who had worked the longest at Northwind.
Which supplier (currently) supplies the most products (which are not discontinued)? | C | Correct | Correct |
Which employee has handled the most orders? | A | No answer | Correct | Elfsoft gave too much information
What product is the most frequently ordered? | S | Correct | No answer |
List the country that has a supplier that ships tofu. | A | No answer | Correct |
Name the third most ordered product. | S | No answer | No answer |
What is the least ordered product? | S | Wrong | No answer |
How much is 1kg of Queso Cabrales? | S | Correct | No answer |
How much tofu have been ordered? | A | No answer | Correct | Elfsoft gave too much information
Show the phone number of united package. | S | Correct | Correct |
Tell me the names of the sales representatives | S | Correct | No answer |
Tell me the age of these people. | A | Correct | No answer |
And their phone numbers? | A | Correct | Correct |
Count the customers in Germany. | S | Correct | Correct |
What is the average age of the employees? | A | Correct | Wrong |
Name the employees that are older than average | A | Correct | No answer |
Give the name of the sales manager. | S | Correct | No answer |
Where is Around the Horn from? | S | Correct | No answer |
What is the median of the age of the employees? | A | No answer | Wrong |
List the names of the people working currently in the company. | S | No answer | Wrong |
Who is older than Janet? | S | Correct | No answer |
What can you tell me about Ernst Handel? | S | Too little information | No answer |
Which supplier supplies tofu but not longlife tofu? | C | Correct | Correct |
What are the contact names and phone numbers of customers that have received products sent with Federal Shipping? | C | No answer | Wrong |
What are the products that federal shipping ships | A | Correct | Correct | Microsoft English Query had the wrong interpretation.
What customers received these shipments? | A | No answer | Wrong |