
1

Schema-based Natural Language Semantic Parsing

Niculae Stratica and Bipin C. Desai

Department of Computer Science, Concordia University

1455 de Maisonneuve Blvd. West, Montreal, H3G 1M8, Canada


2

Introduction 1/2

This paper addresses the mapping of Natural Language to SQL queries.

It details a methodology to build the SQL query based on:

The input sentence

A dictionary

A set of production rules

3

Introduction 2/2

[Diagram labels: Input Language (English), Context-Free Grammar, Dictionary, Production Rules, Semantic Sets, Index Files, Target Language (SQL)]

4

Summary

Early work

Template-based parsing

Token-matching parsing

Implementation

Capabilities and Limitations

Future work

Ref. [3] Minker, J., Information storage and retrieval - a survey and functional description, SIGIR, 12, pp.1-108, 1997

5

Early work

Natural Language Processing through learning algorithms and statistical methods

[Diagram labels: Input Language (English), Syntax Analysis, Semantic analysis, Internal representation, Learning Algorithms, Statistical methods, Target Language (SQL)]

Ref. [2] N. Stratica, L. Kosseim and B.C. Desai, NLIDB Templates for Semantic Parsing. Proceedings of Applications of Natural Language to Data Bases, NLDB’2003, pp. 235-241, June 2003, Burg, Germany.

6

Template-based parsing (NLIDB)

[Diagram. Processing stages: User input, Token parsing and tagging, Syntactic analysis, Semantic analysis, Build set of queries, Extract answers. Resources: Link Parser, Data, Schema, Pre-processor, Domain-specific interpretation rules, User-defined vocabulary rules, WordNet, Semantic templates.]

Ref. N. Stratica and B.C. Desai, Schema-based Natural Language Semantic Mapping. Proceedings of Applications of Natural Language to Data Bases, NLDB’2004, June 2004, Manchester, England.

7

Token-matching parsing

[Diagram. Processing stages: User input, Token parsing, Token matching, Build SQL query, Extract answers. Resources: Data, Schema, Pre-processor, Semantic Sets, Index Files, Production Rules, WordNet.]

8

Template versus Token-matching parsing

[Diagram, token-matching parsing. Processing stages: User input, Token parsing, Token matching, Build SQL query, Extract answers. Resources: Data, Schema, Pre-processor, Semantic Sets, Index Files, Production Rules, WordNet.]

[Diagram, template-based parsing. Processing stages: User input, Token parsing and tagging, Syntactic analysis, Semantic analysis, Build set of queries, Extract answers. Resources: Link Parser, Data, Schema, Pre-processor, Domain-specific interpretation rules, User-defined vocabulary rules, WordNet, Semantic templates.]

9

From the NL input to the SQL query

[Diagram labels: Input Language (English), Tokenizer, Semantic analysis, Internal representation, Target Language (SQL), Semantic Sets, Index Files, Production Rules, Database records, Database schema, WordNet derivationally related forms]

10

The Database Schema

Table AUTHORS (LONG id PRIMARY_KEY, CHAR name, DATE DOB)

Table BOOKS (LONG id PRIMARY_KEY, CHAR title, LONG isbn, LONG pages, DATE date_published, ENUM type)

Table BOOKAUTHORS (LONG aid references AUTHORS, LONG bid references BOOKS)

Table STUDENTS (LONG id PRIMARY_KEY, CHAR name, DATE DOB)

Table BORROWEDBOOKS (LONG bid references BOOKS, LONG sid references STUDENTS)
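The schema above can be tried out with a small in-memory database. The following is only a sketch: the mapping of the slide's types to SQLite types (LONG to INTEGER, CHAR and DATE to TEXT, ENUM to TEXT with a CHECK) is an assumption, the ENUM values NOVEL and POEM are taken from the ENUM indexing slide, and the helper names `create_library_db` and `foreign_keys` are hypothetical.

```python
import sqlite3

# Assumed type mapping (not from the slides): LONG -> INTEGER, CHAR -> TEXT,
# DATE -> TEXT (ISO 8601 strings), ENUM -> TEXT with a CHECK constraint.
SCHEMA = """
CREATE TABLE AUTHORS (id INTEGER PRIMARY KEY, name TEXT, DOB TEXT);
CREATE TABLE BOOKS (
    id INTEGER PRIMARY KEY, title TEXT, isbn INTEGER, pages INTEGER,
    date_published TEXT, type TEXT CHECK (type IN ('NOVEL', 'POEM'))
);
CREATE TABLE BOOKAUTHORS (
    aid INTEGER REFERENCES AUTHORS(id), bid INTEGER REFERENCES BOOKS(id)
);
CREATE TABLE STUDENTS (id INTEGER PRIMARY KEY, name TEXT, DOB TEXT);
CREATE TABLE BORROWEDBOOKS (
    bid INTEGER REFERENCES BOOKS(id), sid INTEGER REFERENCES STUDENTS(id)
);
"""


def create_library_db():
    """Create an in-memory SQLite database holding the five tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def foreign_keys(conn, table):
    """Return sorted (column, referenced_table) pairs read back from the schema."""
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    # Row layout: id, seq, referenced table, from column, to column, ...
    return sorted((row[3], row[2]) for row in rows)
```

Reading the foreign keys back from the database is one way a pre-processor could discover that BOOKAUTHORS links AUTHORS to BOOKS and that BORROWEDBOOKS links BOOKS to STUDENTS.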

Ref. [1] Miller, G., WordNet: A Lexical Database for English, Communications of the ACM, 38 (1), pp. 39-41, November 1995

11

WordNet and the Semantic Sets 1/2

For table AUTHORS WordNet returns the following list:

WordNet 2.0 Search

Overview for ‘author’

The noun ‘author’ has 2 senses in WordNet.

1. writer, author -- (writes (books or stories or articles or the like) professionally (for pay))

2. generator, source, author -- (someone who originates or causes or initiates something; ‘he was the generator of several complaints’)

12

WordNet and the Semantic Sets 2/2

Results for ‘Synonyms, hypernyms and hyponyms ordered by estimated frequency’ search of noun ‘authors’

2 senses of author

Sense 1: writer, author => communicator

Sense 2: generator, source, author => maker, shaper

The semantic set for AUTHORS becomes:

{writer, author, generator, source, maker, communicator, shaper}
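The construction of this set can be sketched as a union over the senses WordNet returns. In the sketch below the two senses are hard-coded from the slide so that it runs without a WordNet installation; the data layout and the function names `build_semantic_set` and `tables_for_token` are assumptions made for illustration.

```python
# Senses of the noun 'author' as listed on the slide: each sense carries its
# synonyms and the hypernyms reported for it. Hard-coded here (an assumption)
# so that the sketch does not need WordNet itself.
AUTHOR_SENSES = [
    {"synonyms": ["writer", "author"], "hypernyms": ["communicator"]},
    {"synonyms": ["generator", "source", "author"], "hypernyms": ["maker", "shaper"]},
]


def build_semantic_set(senses):
    """Union of the synonyms and hypernyms over all senses of a table name."""
    semantic_set = set()
    for sense in senses:
        semantic_set.update(sense["synonyms"])
        semantic_set.update(sense["hypernyms"])
    return semantic_set


SEMANTIC_SETS = {"AUTHORS": build_semantic_set(AUTHOR_SENSES)}


def tables_for_token(token, semantic_sets):
    """Return the tables whose semantic set contains the lower-cased token."""
    word = token.lower()
    return sorted(table for table, words in semantic_sets.items() if word in words)
```

With this layout, a token such as "writer" in the input sentence would point to table AUTHORS, while a token outside every semantic set returns an empty list.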

13

Indexing the ENUM Types

TABLE=BOOKS

ATTRIBUTE TYPE ENUM={NOVEL, POEM}

If any of the TYPE values occurs in the input sentence, a new production rule relating BOOKS.TYPE to that value is added to the SQL query, as in the example below:

Sentence: “List all novels”

NOVEL is a valid value for BOOKS.TYPE

The production rule is:

WHERE BOOKS.TYPE='NOVEL'

The resulting SQL query is:

SELECT BOOKS.* FROM BOOKS WHERE BOOKS.TYPE='NOVEL'
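A minimal sketch of this lookup follows, assuming the ENUM index is a plain mapping from each value to its table and attribute. The index layout, the naive plural stripping in `normalize`, and the function name `enum_query` are assumptions, not details given on the slides.

```python
# Hypothetical ENUM index as a pre-processor might store it:
# ENUM value -> (table, attribute).
ENUM_INDEX = {"NOVEL": ("BOOKS", "TYPE"), "POEM": ("BOOKS", "TYPE")}


def normalize(token):
    """Upper-case a token and strip a trailing plural 's' (deliberately naive)."""
    word = token.upper()
    return word[:-1] if word.endswith("S") else word


def enum_query(sentence):
    """Return a SELECT for the first ENUM value found in the sentence, else None."""
    for token in sentence.split():
        value = normalize(token.strip(".,?!"))
        if value in ENUM_INDEX:
            table, attribute = ENUM_INDEX[value]
            # The value comes from the index, not directly from user input.
            return f"SELECT {table}.* FROM {table} WHERE {table}.{attribute}='{value}'"
    return None
```

For the sentence "List all novels" the sketch yields the query shown on the slide; a sentence with no ENUM value yields `None`.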

14

The Production Rules

Based on the Database schema, the pre-processor builds the following production rule:

IF AUTHORS in Table List AND BOOKS in Table List Then BOOKAUTHORS is in Table List

and the following SQL template:

SELECT Attribute List FROM Table List

WHERE BOOKAUTHORS.AID=AUTHORS.ID AND BOOKAUTHORS.BID=BOOKS.ID
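The rule and template above can be sketched as a lookup over the linking tables of the schema. The `LINKS` structure and the function names are assumptions made for illustration; the slides do not say how the pre-processor stores its rules.

```python
# Linking tables derived from the schema's foreign keys (hypothetical layout):
# linking table -> {referenced table: (foreign-key column, referenced column)}.
LINKS = {
    "BOOKAUTHORS": {"AUTHORS": ("AID", "ID"), "BOOKS": ("BID", "ID")},
    "BORROWEDBOOKS": {"BOOKS": ("BID", "ID"), "STUDENTS": ("SID", "ID")},
}


def apply_production_rules(tables):
    """Add each linking table whose referenced tables are all in the list.

    Returns the extended table list and the join constraints it implies.
    """
    table_list = list(tables)
    joins = []
    for link, refs in LINKS.items():
        if all(table in table_list for table in refs):
            if link not in table_list:
                table_list.append(link)
            joins += [f"{link}.{fk}={table}.{pk}" for table, (fk, pk) in refs.items()]
    return table_list, joins


def sql_template(tables, attributes="*"):
    """Fill the SELECT ... FROM ... WHERE template for the given tables."""
    table_list, joins = apply_production_rules(tables)
    sql = f"SELECT {attributes} FROM " + ", ".join(table_list)
    return sql + (" WHERE " + " AND ".join(joins) if joins else "")
```

Calling `sql_template(["AUTHORS", "BOOKS"])` adds BOOKAUTHORS to the table list and emits the two join constraints of the template; a single table produces no WHERE clause.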

15

The Workflow

NL input: Show all books written by Mark Twain

Tokens: Show, all, books, written, by, Mark, Twain

[Workflow diagram. Components: Preprocessor, Dictionary, Run time engine, Database records, Database Schema (Table BOOKS, Table AUTHORS, Table BOOKAUTHORS, Table STUDENTS), WordNet, BOOKS Semantic Set, AUTHORS Semantic Set, index files (ENUM, TEXT, LONG, DATE), SQL Production Rules. The diagram marks each lookup as Match or No match.]

SQL fragments produced:

FROM BOOKS, AUTHORS, BOOKAUTHORS

SELECT .. FROM .. WHERE BOOKAUTHORS.AID=AUTHORS.ID AND BOOKAUTHORS.BID=BOOKS.ID

SELECT AUTHORS.*, BOOKS.* FROM .. WHERE AUTHORS.NAME="Mark Twain"
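The token-matching step of this workflow can be sketched as follows. The contents of the semantic sets and of the TEXT index are illustrative assumptions (in particular, 'written' stands in for a derivationally related form of 'writer'), and the function name `match_tokens` is hypothetical.

```python
# Hypothetical pre-processor output; all contents are illustrative assumptions.
SEMANTIC_SETS = {
    "BOOKS": {"book", "books"},
    "AUTHORS": {"author", "writer", "written"},
}
# TEXT index: lower-cased value -> (table, attribute, stored value).
TEXT_INDEX = {"mark twain": ("AUTHORS", "NAME", "Mark Twain")}


def match_tokens(sentence):
    """Return (tables, value constraints, unmatched tokens) for a sentence."""
    tokens = sentence.split()
    tables, constraints, unmatched = [], [], []
    i = 0
    while i < len(tokens):
        # Try a two-word value from the TEXT index first, then a single word.
        pair = " ".join(tokens[i:i + 2]).lower()
        word = tokens[i].lower()
        if pair in TEXT_INDEX:
            table, attribute, value = TEXT_INDEX[pair]
            constraints.append(f'{table}.{attribute}="{value}"')
            hit, step = table, 2
        else:
            hit = next((t for t, s in SEMANTIC_SETS.items() if word in s), None)
            step = 1
        if hit is None:
            unmatched.append(tokens[i])  # discarded by the method
        elif hit not in tables:
            tables.append(hit)
        i += step
    return tables, constraints, unmatched
```

Under these assumptions, the example sentence yields the tables BOOKS and AUTHORS and one constraint on AUTHORS.NAME, and the remaining tokens are discarded. The production rules would then add BOOKAUTHORS and its join constraints.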

16

Capabilities and Limitations 1/5

‘Show all novels written by Mark Twain and William Shakespeare that have been borrowed by John Markus’

The method retains the following tokens: ‘... ... novels ... ... Mark Twain ... William Shakespeare ... ... ... borrowed ... John Markus’

The token ‘novels’ points to table BOOKS: ‘novels’ is found in the INDEX files for the ENUM values of the attribute BOOKS.TYPE.

17

‘novels’ matches the ENUM value BOOKS.TYPE='NOVEL'

‘Mark Twain’ matches AUTHORS.NAME='Mark Twain'

‘William Shakespeare’ matches AUTHORS.NAME='William Shakespeare'

‘borrowed’ is disambiguated at run time through BORROWEDBOOKS

‘John Markus’ matches STUDENTS.NAME='John Markus'

The schema correlates AUTHORS, BOOKS and BOOKAUTHORS

The schema correlates STUDENTS, BOOKS and BORROWEDBOOKS

The table list is: AUTHORS, BOOKS, BOOKAUTHORS, STUDENTS, BORROWEDBOOKS


18

The SQL constraints are:

BOOKAUTHORS.AID=AUTHORS.ID AND BOOKAUTHORS.BID=BOOKS.ID
AND BOOKS.TYPE='NOVEL'
AND (AUTHORS.NAME='Mark Twain' OR AUTHORS.NAME='William Shakespeare')
AND STUDENTS.NAME='John Markus'
AND BOOKS.ID=BORROWEDBOOKS.BID AND STUDENTS.ID=BORROWEDBOOKS.SID

The two constraints on AUTHORS.NAME have been OR-ed because they point to the same attribute.

The method makes it possible to construct the correct SQL query.

The method can address context and value ambiguities.
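The way the constraints are combined can be sketched as follows: join constraints are AND-ed, and value constraints that target the same attribute are OR-ed inside parentheses. The function name `combine_constraints` and its input layout are assumptions made for illustration.

```python
def combine_constraints(joins, values):
    """Build a WHERE condition from join constraints and (attribute, value) pairs.

    Values that target the same attribute are OR-ed together; everything
    else is AND-ed. A real system would pass values as query parameters
    instead of formatting them into the SQL text.
    """
    by_attribute = {}
    for attribute, value in values:
        by_attribute.setdefault(attribute, []).append(f"{attribute}='{value}'")
    clauses = list(joins)
    for alternatives in by_attribute.values():
        if len(alternatives) > 1:
            clauses.append("(" + " OR ".join(alternatives) + ")")
        else:
            clauses.append(alternatives[0])
    return " AND ".join(clauses)


# The example sentence from the slides, expressed as inputs to the sketch.
JOINS = [
    "BOOKAUTHORS.AID=AUTHORS.ID",
    "BOOKAUTHORS.BID=BOOKS.ID",
    "BOOKS.ID=BORROWEDBOOKS.BID",
    "STUDENTS.ID=BORROWEDBOOKS.SID",
]
VALUES = [
    ("BOOKS.TYPE", "NOVEL"),
    ("AUTHORS.NAME", "Mark Twain"),
    ("AUTHORS.NAME", "William Shakespeare"),
    ("STUDENTS.NAME", "John Markus"),
]
WHERE_CLAUSE = combine_constraints(JOINS, VALUES)
```

The resulting condition contains the same constraints as the slide, with the joins listed first.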


19

1. The current architecture does not support operators such as greater than, less than, count, average and sum.

2. It does not resolve date expressions such as before, after and between.

3. The generated SQL does not support nested queries.

4. The proposed method eliminates all tokens that cannot be matched with either the semantic sets or with the index files and it works for semantically stable databases.

5. The preprocessor must be used after each semantic update of the database in order to modify the index files.


20

6. The context disambiguation is limited to the semantic sets related to a given schema.

7. Errors related to tokenizing, WordNet and human intervention propagate into the SQL query.

8. The method completely disregards the unmatched tokens and thus cannot correct the input query if it contains errors.

9. However, the method correctly interprets the tokens that are found in the semantic sets or among the derivationally related terms at run time.


21

Future Work

Future work will focus on operator resolution. We believe that the approach presented in this paper can give good results with minimal implementation effort and that it avoids specific problems related to the various existing semantic analysis approaches. This is partly made possible by the highly organized data in the RDBMS.

The method will be implemented and the results will be measured against complex sentences involving more than 4 tables from the database. A study will be done to show the performance dependency on the size of the database records and on the database schema.

22

References

[1] Miller, G., WordNet: A Lexical Database for English, Communications of the ACM, 38 (1), pp. 39-41, November 1995

[2] N. Stratica, L. Kosseim and B.C. Desai, NLIDB Templates for Semantic Parsing. Proceedings of Applications of Natural Language to Data Bases, NLDB’2003, pp. 235-241, June 2003, Burg, Germany.

[3] Minker, J., Information storage and retrieval - a survey and functional description, SIGIR, 12, pp.1-108, 1997

[4] Stuart H. Rubin, Shu-Ching Chen, and Mei-Ling Shyu, Field-Effect Natural Language Semantic Mapping, Proceedings of the 2003 IEEE International Conference on Systems, Man & Cybernetics, pp. 2483-2487, October 5-8, 2003, Washington, D.C., USA.

[5] Lawrence J. Mazlack, Richard A. Feinauer, Establishing a Basis for Mapping Natural-Language Statements Onto a Database Query Language, SIGIR 1980: 192-202

[6] Allen, James, Natural Language Understanding, University of Rochester 1995 The Benjamin Cummings Publishing Company, Inc. ISBN: 0-8053-0334-0

[7] Kathryn Baker, Alexander Franz, and Pamela Jordan, Coping with Ambiguity in Knowledge-based Natural Language Analysis, Florida AI Research Symposium, pages 155-159, 1994 Pensacola, Florida

[8] Hirst G., Semantic Interpretation and the Resolution of Ambiguity, Cambridge University Press 1986, Cambridge

[9] Latent Semantic Analysis Laboratory at the Colorado University, http://lsa.colorado.edu/ site visited in March 2004

[10] Sleator, D. and Temperley, D., Parsing English with a Link Grammar, Proceedings of the Third Annual Workshop on Parsing Technologies, 1993

http://www.cs.concordia.ca/Publications/nldb2003.pdf

http://www.eng.miami.edu/~shyu/Paper/2003/SMC03.pdf

http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Feinauer:Richard_A=.html


http://www.informatik.uni-trier.de/~ley/db/conf/sigir/sigir80.html#MazlackF80

http://lsa.colorado.edu/






