testing sql-compliance of current dbmss

Testing SQL-compliance of

current DBMSs

Elias Spanos

Master of Science

School of Informatics

University of Edinburgh

2017

Abstract

Database Management Systems (DBMSs) are widely used in various fields ranging

from online banking, web systems and other online services and are very crucial in

the day to day management of data. The correct and efficient operation of such sys-

tems is an important factor when choosing the proper DBMS software. In the past few

decades, the amount of data has increased exponentially, causing a rapid increase in

the demand of such systems that can store, organise and manipulate data. With these

requirements in mind, different vendors have implemented their own DBMS. However,

since these DBMSs have been implemented with a significant amount of differences,

a Standard was proposed in order to provide a common way of using such systems.

Even though, a Standard has been introduced and adopted by most DBMS vendors,

there are some differences that still exist, probably due to the difficulty of interpreting

and implementing all the parts of the Standard, and also due to other issues regarding

the performance of their systems. Hence, there is a need for alleviating such a prob-

lem by exposing such issues between DBMSs in order to evaluate their conformance

to the common Standard. In particular, this project investigates five popular DBMSs

implementations and evaluate their conformance to the SQL Standard. A crucial ques-

tion that will be investigated by conducting this project is whether DBMSs have been

implemented the SQL Standard in the same way. The SQL language should pledge

that identical SQL code should always return identical answers when it is evaluated

on the same database independently of which DBMS is running on. The aim of this

project is the implementation of a random query generator and a comparison tool for

investigating and highlighting the differences that may exist among current DBMSs.

Further, it aims to provide a detailed explanation in regards to the SQL Standard of

potential differences and explain how they might affect the transition between current

DBMSs.

i

Acknowledgements

I would like to thank my supervisors Paolo Guagliardo and Leonid Libkin who

were always willing to advise me and help me in order to overcome any difficulty. In

addition, I want to thank my family who is always by my side.

ii

Declaration

I declare that this thesis was composed by myself, that the work contained herein is

my own except where explicitly stated otherwise in the text, and that this work has not

been submitted for any other degree or professional qualification except as specified.

(Elias Spanos)

iii

Table of Contents

1 Introduction 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Background 52.1 The SQL Standard . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 The SQL Language . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.3 Commands of SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.3.1 Missing values . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.4 SQL Standard issues . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.5 The Database Management Systems . . . . . . . . . . . . . . . . . . 15

3 Methodology 173.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4 Implementation 194.1 Internal Implementation . . . . . . . . . . . . . . . . . . . . . . . . 19

4.1.1 Random Query Generator Tool . . . . . . . . . . . . . . . . . 20

4.1.1.1 Configuration file . . . . . . . . . . . . . . . . . . 21

4.1.2 Comparison Tool . . . . . . . . . . . . . . . . . . . . . . . . 23

4.1.3 Random data generator tool . . . . . . . . . . . . . . . . . . 24

5 Experimental Evaluation 265.1 The experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5.1.1 Performance Evaluation . . . . . . . . . . . . . . . . . . . . 27

5.2 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

iv

6 Conclusion 466.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

6.2 Summary of the findings . . . . . . . . . . . . . . . . . . . . . . . . 47

6.3 Suggestions for future work . . . . . . . . . . . . . . . . . . . . . . . 48

A Source Code 50

B Compilation & Execution Instructions 51

Bibliography 54

v

List of Figures

1.1 Abstract DBMS architecture . . . . . . . . . . . . . . . . . . . . . . 2

2.1 Popularity of modern DBMSs . . . . . . . . . . . . . . . . . . . . . 15

3.1 High level overview of the framework . . . . . . . . . . . . . . . . . 18

4.1 Complete architecture of the framework . . . . . . . . . . . . . . . . 19

4.2 Abstract Architecture of Comparison Tool . . . . . . . . . . . . . . . 23

4.3 Method of importing csv files . . . . . . . . . . . . . . . . . . . . . . 25

5.1 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 27

vi

List of Tables

2.1 Aggregation Commands . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 Complex Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 SET Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.4 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.5 String Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.6 OR Truth Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.7 AND Truth Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.8 NOT Truth Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

5.1 Statistics of Generated queries . . . . . . . . . . . . . . . . . . . . . 29

5.2 Difference #1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5.3 Difference #2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5.4 Difference #3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5.5 Difference #4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.6 Difference #5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.7 Difference #6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.8 Difference #7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.9 Difference #8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.10 Difference #9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.11 Difference #10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.12 Difference #11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.13 Difference #12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.14 Difference #13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.15 Difference #14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.16 Difference #15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.17 Difference #16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.18 Difference #17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

vii

5.19 Difference #18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5.20 Difference #19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5.21 Difference #20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

6.1 Summarised Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6.2 Summarised Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

viii

Chapter 1

Introduction

Database Management Systems (DBMSs) have thrived since their first appearance,

and they remain the dominant manner for storing various kinds of information. Specif-

ically, DBMSs have been extensively used in many fields and has been widely used

by almost all companies as they provide a relatively easy way of performing various

operations on data, such as insertion, deletion and modification [15, 21, 22]. For this

reason, applications can be implemented efficiently and reliably without the need for

handling low-level issues such as concurrent and efficient access of data which it is

taken cared by DBMS. Hence, the primary role of a DBMS is not only to store data

but also to provide a common interface for manipulating it. Figure 1.1 shows from

a high-level point of view the structure of a modern DBMS. Moreover, the relational

model (RM) is the mainstream model for the DBMSs for managing data and it is by

far the most popular model for current DBMSs. It was proposed by Edgar F.Codd in

1969 [6, 7] and it has brought a revolution in the area of data management due to its

simplicity. A database that uses the relational model is known as a relational database.

According to the relational model, data is stored in a database as tables, and each table

is composed of rows and columns. Also, each row of a table is known as a tuple based

on RM and each column as characteristic or attribute.

In fact, in early stages, each database system had its own interface, and migrat-

ing applications from one system to another was a slow and difficult task, since an

almost complete rewriting of the code would be required. However, as these systems

were promising from their first appearance, a standardised language was unavoidable.

Structured Query language (SQL) has become a standardised language for querying

and managing data and has rapidly become the most widely used DBMS language [4].

Since then, comparable languages have been emerged over the years, however, SQL

1

Chapter 1. Introduction 2

Figure 1.1: Abstract DBMS architecture

has persisted to be the dominant language since it has been easy to learn. SQL users

and programmers can take advantage of SQL language in order to learn a new lan-

guage which is used by essentially all modern DBMSs, and they can write SQL code

that with minor changes it can be run on any database system. On the other hand, a sig-

nificant disadvantage of programming languages is that each language has its benefits

and usage and they cannot always run on any system.

In particular, such systems have been extensively studied and tested for their cor-

rectness and efficiency. Software testing is an essential approach for testing and eval-

uating the quality of such systems [5]. However, implementing a software that can be

used to assess complex systems, it usually requires sophisticated and large-scale im-

plementation. On the contrary, by automating the process of testing is the only way to

have a systematic and effective way of testing such complex systems. Since DBMSs

are the most popular systems for storing and manipulating data, many studies have

conducted focusing mainly for evaluating their performance. For example, TPC-H is

a popular benchmark for evaluating vendors DBMSs implementations [11]. However,

this benchmark is designed for analyzing performance and it generates a relatively

small number of SQL queries.


1.1 Motivation

The SQL Standard was established a long time ago, and it aims to provide a common

interface among modern DBMSs [2, 12]. As each vendor provides its own DBMS

implementation, the extend of which these systems implementations comply with the

SQL Standard is essential and needs to be studied.

Database Systems have evolved rapidly, and new features have been continuously

added. As a result, users, programmers, and companies that make use of such systems

are facing unprecedented problems when it comes to migrating their existing SQL

code to another DBMS as most of the time the SQL code is partially portable. Mainly

three major reasons are involved in the aforesaid problem. Firstly, DBMS vendors do

not always comply immediately with the latest Standard, since they have to cope with

many considerations such as performance issues and better system management, or

even sometimes it is not practical to implement the entire Standard. Therefore vendors

usually adopted new changes of the updated Standard into their next product releases.

Secondly, DBMS vendors offer their own feature in order to be distinctive among

other DBMSs which make SQL code less portable. Lastly, the Standard is written in a

natural language which makes the process of interpreting all the aspect in exactly the

same way almost impossible.

Currently, there is no well-known framework that can be used to test the SQL-

Compliance among these systems and be capable of detecting any incompatibilities or

semantic issues. Hence, this project aims to build a complete framework for system-

atically discovering and highlighting both existing and new differences that may arise

between popular DBMSs. Afterwards, experimental evidence with a comprehensive

explanation with respects to the SQL standard is provided. Unfortunately, DBMSs

do not always provide a meaningful message whenever an SQL code has a syntax er-

ror. As a consequence, it can take considerable time to detect and resolve any issue.

By conducting this project, it is also intended to disclose all the incompatibilities that

may exist and highlight all the differences, with such a way that it will make users,

programmers and vendors aware of the current problems.

1.2 Related Work

As DBMSs are necessary for many fields some work has been conducted for evaluating

whether different vendors have implemented various points of the Standard similarly.


In fact, some parts of the Standard is interpreted differently, and the SQL code is not

portable among these systems. Albeit some differences are demonstrated [4, 17, 20],

there is no well-known tool for evaluating these systems and automatically identify

any issues and incompatibilities. Queries that execute without raising any error but are

interpreted differently by DBMSs can lead to incoherent results. Some differences in

DBMSs behaviours have been detected by users from experience with various systems

and are reasonably well documented. But more subtle differences are harder to expose

and may go unnoticed. In this work, we present a tool that may assist users in this task.

1.3 Thesis Structure

The structure of the remainder of this thesis report is organized as follows:

Chapter 2 introduces SQL language, SQL Standards, the core commands of SQL

language and briefly describes the usage of the most important commands, and af-

terwards the main issues are introduced by illustrating problematic SQL queries. In

addition, an overview of the most popular DBMSs is provided.

Chapter 3 describes the main methodology for this thesis project and briefly ex-

plains the architecture which is implemented for systematically checking the SQL-

compliance of current DBMSs.

Chapter 4 provides a detailed explanation about the complete framework and an

explanation for each individual tool that the framework is composed of. Also, it briefly

discusses the main issues that are addressed.

Chapter 5 illustrates the main findings and provides an explanation about the dif-

ferences between current DBMSs with appropriate reference to the SQL standard.

Chapter 6 concludes this project with explanation of what is been achieved, and

a summarised table of all the differences is provided. Afterwards, futures ideas and

extensions are provided.

Chapter 2

Background

2.1 The SQL Standard

DBMS from its first appearance shows that it will be the dominant trend for storing and

managing data. Consequently, different implementations have emerged from various

vendors and inevitably, a standardised language should be implemented in order to

provide portability among current systems. If applications were implemented using

only SQL commands which are defined in that standard and vendors implemented

these commands in exactly the same way, then SQL code could be migrated on any

DBMS without the need to be adopted.

The first appearance of SQL language was in 1970 where IBM developed its first

prototype of Relational Database Management System (RDBMS) [7, 10]. Subse-

quently, the first SQL standard arose in 1986 by the American National Standard In-

stitute (ANSI) with the name of SQL-86 for bearing conformity among vendors im-

plementations. Since then, different flavors of the Standard have emerged for revising

previous versions or for adding new features such as SQL-87, SQL-89, ANSI/ISO

SQL-92 and ANSI/ISO SQL: 1999 which has also been approved by the International

Standards Organization (ISO) [2, 12, 14, 23, 25]. The SQL Standard has been con-

tinuously developed with the current version being SQL:2016 or ISO/IEC 9075:2016.

The ANSI SQL Standard is divided into several parts, and this project focuses on the

SQL/Foundation part. In particular, this part contains central elements of SQL. Ex-

planations about the findings are given according to SQL:2016 since it is the newest

version of the Standard, though this part remains the same in comparison with earlier

flavors.

5

Chapter 2. Background 6

2.2 The SQL Language

SQL operations are in the form of commands known as SQL statements. In particular,

SQL is composed of primarily two sub-languages, the data definition language (DDL)

and the data manipulation language (DML) [10, 13]. DML is a part of the SQL lan-

guage that is composed of a family of commands like any programming language and

it is used for the creation of queries for inserting, modifying and deleting rows in a

database. This sub-language is consisted of ’SELECT-FROM-WHERE’ commands as

to be fundamental for any query. In this project, we particularly focus on DML since

we want to generate a diversity of random SQL queries. Also, the SQL Standard sup-

ports more complex rather than just simple commands for performing various tasks on

data. For example, aggregation functions such as ’SUM’, ’MAX’, ’MIN’, and ’AVG’

are used in combination with ’GROUP BY’ and ’HAVING’ SQL statements. The pur-

pose of having such commands is to perform a calculation on specific columns in order

to return a value. For example the calculation of average salary of a department by per-

forming a ’GROUP BY’ based on all the salaries of the employees of that department.

DDL is also part of SQL language and can be used to create, modify, delete tables

and views, and usually DDL statements start with keywords ’CREATE’, ’DROP’ and

’ALTER’. Also, it supports a command that provides the capability of defining new

domains. Moreover, in general, tables and rows are denoted as relations and tuples.

Hence, SQL is extremely popular as it offers two capabilities. Firstly, it can access

many tuples using just one command and secondly it does not need to specify how to

reach a tuple, for example by using an index or not.

2.3 Commands of SQL

In this section, a high-level description of the basic SQL commands is given, as defined

in the Standard. Due to the fact that these commands are used to generate random

queries, it is imperative to give a short presentation of each command and its purpose.


SQL Basic Structure

SELECT [DISTINCT] columns_list

FROM tables_list

WHERE condition

condition := comparison1 AND comparison2

comparison1 OR comparison2

NOT comparison

term := attribute

value

comparison := term1 op term2(op ε {=, <>, <, > <=, >=})

term IS NULL

term IS NOT NULL

The SQL basic structure represents the basic SQL commands which are used to re-

trieve data from a database, based on some conditions that are specified in the ‘WHERE’

clause. Each SQL query should have at least ‘SELECT’ and ‘FROM’ clause. The

‘WHERE’ clause is optional. The basic SQL query is executed as follows: all the rows

of the tables list (‘FROM’ clause) are evaluated. Each row that satisfies the search

conditions is selected. Then, only the specified columns of the selected rows appear in

the output results. In addition, the ‘DISTINCT’ keyword is optional and can be used to

eliminate duplicate rows from the output results. Also, instead of having columns list

in the SELECT clause, the “*” keyword can be specified which indicates that all the

columns of the tables list will appear in the results. The purpose of the ‘WHERE’

clause is to filter the results, and normally SQL queries contain a ‘WHERE’ clause.

Conditions compare expressions using comparison operators. A comparison operator

is used to compare two expressions, and logical connectivities, such as AND, NOT and

OR are used to connect the comparisons.


Example of basic query:

SELECT St.Student_Name

FROM Students AS St

WHERE St.age >= 20 AND St.age <= 24

The above example is built based on the basic SQL structure. Thus, it retrieves all

students’ name who are between 20 and 24 years old.

SQL Basic aggregation Syntax:

SELECT [DISTINCT] Columns_list1

FROM Tables_list

WHERE Condition1

GROUP BY Columns_list2

HAVING Condition2

SQL query with aggregation is used to perform a calculation on specific columns

in order to return a value. GROUP BY and HAVING commands are used to perform

an aggregation. HAVING command is optional.

Table 2.1: Aggregation Commands

Operator Use

MIN() Finds the minimum value of a column

COUNT() Counts the total number of rows

MAX() Finds the maximum value of a column

SUM() Calculates the sum of all values of a column

AVG() Calculates the average of all values of a column

Aggregate commands can be used both in SELECT and HAVING clause, and in

combination with GROUP BY clause. The example below illustrates the proper use of

aggregation commands.


Example:

SELECT COUNT(St.student_id), St.Country

FROM Students AS St

GROUP BY St.Country

HAVING COUNT(St.student_id) > 3

The above query makes proper use of the aggregation commands. In particular, it

lists the number of students in every country where there are more than three students

in this Country.

Table 2.2: Complex Conditions

Operator Use

EXISTS

query

Returns true, if there is at least one row in the query

Row Op

ALL

Returns true, if all the comparisons with the operator OP return true

Row Op

ANY

Returns true, if at least one comparison with the operator returns true

Row Op IN

(=ANY)

Returns true, if at least an element exist in a given set

Example:

SELECT *

FROM Students AS St

WHERE St.Country IN ( U K , N e t h e r l a n d )

The query above retrieves all the students who come from UK and Netherland.


Table 2.3: SET Commands

Command Use

UNION [ALL] Returns the combination of the results of two SQL queries

INTERSECT

[ALL]

Return the rows that appear exactly the same in both SQL queries

EXCEPT [ALL] Return each row that appears to the first query but does not appear to

the second query

By default, SET commands remove duplicates in an SQL query result. However,

we can keep duplicates in the result, by adding the optional ALL keyword for any of

the commands above.

Example:

SELECT Country FROM Students

UNION

SELECT Country FROM Professor

The above query retrieves the Countries from where both Students and professors come

from.

Data Types:Since a variety of data is stored in a database, data types are introduced by the Standard,

and some of them should be implemented by all DBMS vendors. Data types indicate

the type of data that a column can hold. We provide below a detailed explanation of

the most important data types for our project; however, there are numerous data types

rather just only these.


Table 2.4: Data Types

Types Description

SMALLINT Can store a number from a range between -32768 and 32767

INT Can store a number from a range between -2147483648 and

2147483647

BIGINT Can store a number from a range between -9223372036854775808

and 9223372036854775807

VARCHAR Takes as input the length of a variable string which can contains up

to 255 characters

CHAR Takes as input the length of a fixed size string which can contains up

to 255 characters

Table 2.5: String Commands

Operator Use

TRIM() Returns the string without leading/trailing characters

SUBSTRING() Retrieves a subset of the given string

CONCAT() Concatenate two or more strings

REPLACE() Replaces a subset of a string with another string

LIKE Returns true, if an attribute matches with a pattern

Example:

SELECT

SUBSTRING ( S Q L S T A N D A R D , 1, 3 ) AS SQLExtraction

The SQL above query extracts a part of the initial string which is given as pa-

rameter to the SUBSTRING. The prototype of the substring function is as follows:

SUBSTRING(initial string, start position, end position). Thus, the output result is:

SQL

Operator PrecedenceOperator precedence is an important concept for understanding how an SQL query is

evaluated and it is also significant for determining if major DBMSs follow the same


operator precedence. There are cases where the expressions in the WHERE clause

are quite complicated, and the operator precedence defines in which order the expres-

sion should be evaluated. The sequence of operators is provided as follows; classified

according to their priority, starting from those with the highest priority:

• ()

• +, - ,

• *, /,

• = , >, <, >= , <= , <>

• NOT , AND

• ALL , ANY , IN , LIKE

• = - variable assignment.

Operators that have the same priority are evaluated from left to the right. In addi-

tion, parenthesis abrogate the priority of the rest operators and as a result expressions

that are enclosed by a parenthesis are evaluated first.

2.3.1 Missing values

This section aims to provide a basic background regarding null values and present the

problems that can be arisen from the presence of nulls in a database. In the chapter of

experiments, it is illustrated that many problems can appear. SQL uses null marker for

missing or unknown values in a database, and for that reason null is a reserved word

[24]. It is worth mentioning that null should not be confused either with zero value

or an empty string. However, Oracle’s database treats the empty string as null [3, 19],

and for this reason new issues are arisen. An important consideration is that it cannot

be tested if a value of a field is null using usual common comparison operators such

as <>, =, < and > but instead IS NOT NULL and IS NULL commands are used.

In general, the existence of nulls in a database is the fundamental source of issues and

incompatibilities among current DBMS [16, 18]. Also, the evaluation of each com-

parison with the existence of nulls is performed based on a three-valued logic (3VL)

which is an extension of the common Boolean logic. In Boolean logic, an expression

can be evaluated only to two values such as true and false, where the negation evaluate

to the opposite values. In contrast with 3VL, where there is an addition value called

unknown, and the opposite of it remains the same. In addition, all comparisons in-

volving null should be resulted to be unknown according to SQL Standard. Below it is

illustrated a truth table for the different comparisons with the suitable outcomes.


Table 2.6: OR Truth Table

OR True False Unknown

True True True True

False True False Unknown

Unknown True Unknown Unknown

Table 2.7: AND Truth Table

AND True False Unknown

True True False Unknown

False False False False

Unknown Unknown False Unknown

Table 2.8: NOT Truth Table

NOT

True False

False True

Unknown Unknown

In an SQL query, each tuple which is evaluated to true is returned in the output

result, and tuples that are evaluated to either false or unknown are not returned to the

output results. Comparisons involving nulls are considered as false in the WHERE

Clause. Hence, if we take into consideration that only Oracle’s database treats an

empty string as null then it can be realized that many problems can arise.

Examples

NULL = 10 is evaluated to Unknown

NULL = NULL is evaluated to Unknown

NULL <= 3 is evaluated to Unknown


FROM Students AS St

WHERE NULL <= 20


The above query will result in an empty set since for each row the WHERE clause

will be evaluated to unknown and none of the rows will appear in the output result.


FROM Students AS St

WHERE NULL = NULL

Logically it is expected that the NULL = NULL should be evaluated to true. However,

this expression is evaluated to unknown for every row, and the output result of this

query is an empty set.

2.4 SQL Standard issues

An example is provided below which demonstrates that indeed some aspects of the

Standard is implemented differently by different vendors.

Example:The following SQL query does not return identical output results on both Post-

greSQL and Oracle’s database [17].

Query Q1:

SELECT *

FROM ( SELECT S.A, S.A FROM S ) R

While query Q1 is expected to return identical results, independently of which

system is executed on, this is not the truth since this query will output a table with two

columns named A in PostgreSQL. On the other hand, in Oracle’s database, the query

will return a compile-time error. Indisputably, there are indeed differences between

current DBMSs [17].

It can supposed that in most of the cases these differences are minor but if we take

into account that these systems are used in many different fields, then small differences

might be critical.


2.5 The Database Management Systems

Microsoft SQL Server: is a relational database management systems (RDBMS) de-

veloped by Microsoft. Microsoft offers different edition of its product such as Stan-

dard, Enterprise and Express edition. Express edition is free. All the editions are

mainly available on Windows platform.

MySQL: is an open-source RDBMS which is freely available, and Oracle owns

it. MySQL is used by huge companies such as Google, Facebook and Youtube. It is

available on MacOS, Linux and Windows platforms.

PostgreSQL: is an open-source RDBMS which is trying to comply with the SQL

Standard, and it primarily focuses on portability. It is freely available on both Windows

and Linux platforms.

IBM DB2: is an RDBMS developed by IBM, and it is available on Linux and

Windows platforms.

Oracle Database: is an RDBMS developed by Oracle Corporation which is avail-

able on MacOS, Linux and Windows platforms.

Figure 2.1: Popularity of modern DBMSs

Figure 2.1 illustrates the most popular DBMSs between 2013 and 2017 using a

calculated score based on Google Trend, the number of mentions of DBMSs on the

web, and discussion of the systems in StackOverflow and DBA Stack Exchange [1]. It

can be seen that the popularity of Oracle database, MySQL, and Microsoft SQL Server

remain constant over the years with a relatively high score. Also, there is a steady rise


for PostgreSQL in the last three years. Lastly, IBM DB2 popularity has decreased

slightly in the last year.

Chapter 3

Methodology

3.1 Methodology

As mentioned earlier, the fundamental idea of the current project is to investigate

whether current DBMSs comply according to the Standard, and highlight the key is-

sues and incompatibilities among these systems.

The insight of our approach is to generate a massive number of random queries in

order to execute the same queries against numerous vendor’s DBMS, and then to com-

pare and verify the output results. The output result of each DBMS is expected to be

identical since all the systems will contain the same databases. In this way, the results

can be checked for compilation errors, mismatches, and issues/incompatibilities.

Thus, our main contribution is as follows:

• The implementation of a fully-configurable random query tool that can be used

to generate a diversity of syntactically correct SQL queries based on a database

schema.

• The implementation of a comparison tool that can be used for executing each

randomly generated query on all DBMSs and compare the output results.

• The use an open-source tool called Datafiller for generating realistic data, which

is finally imported into the databases.

• The tools, as mentioned earlier, can be used as the core framework to run multi-

ple experiments in order to detect issues/incompatibilities and compilation-time

errors.

• In the end, all the issues/incompatibilities and errors are presented with an ap-

propriate explanation to the Standard, together with a summarised table.

17

Chapter 3. Methodology 18

Figure 3.1: High level overview of the framework

In particular, we examine five popular DBMSs such as MySQL, IBM DB2, Mi-

crosoft SQL Server, PostgreSQL and Oracle database based on their SQL-Compliance.

A detailed explanation of the internal implementations and structures of the aforemen-

tioned tools is given in the following Chapter.

Chapter 4

Implementation

This chapter describes the implementation details of the framework by explaining the

different tools which were implemented.

4.1 Internal Implementation

Figure 4.1: Complete architecture of the framework

The implemented framework is composed of three different tools, as introduced

in the previous chapter. The random query generator tool and the comparison tool

are implemented in Java programming language, whereas the Datafiller tool is imple-

19

Chapter 4. Implementation 20

mented in the Python programming language which is an open-source project that can

generate random realistic data. This project uses Java programming language since it

is prevalent. It can also be used to build cross-platform systems and our implemented

framework consists of several Java classes.

4.1.1 Random Query Generator Tool

The random query generator tool (RQGT) is used to generate valid and syntactically

correct SQL queries based on a database schema. In particular, RQGT can automat-

ically retrieve the database schema either from PostgreSQL or MySQL. Even though

the RQGT runs in a stand-alone manner, it can also run in conjunction with the Com-

parison Tool (CT).

The phases of which the RQGT proceeds can be defined as follows:

• During the initialization phase, both the database schema and configuration file

are retrieved.

• During the run-time phase, the tool generates random SQL queries continuously,

based on database schema and parameters which are specified in the configura-

tion file.

• Each query is pipelined to the Comparison Tool, which is then executed against

the DBMSs.

Generating a very large number of syntactically correct SQL queries is a complex

task since many considerations must be taken into account. Since the SQL language

consists of numerous SQL commands and functions, some of them are simple in terms

of their usage such as SELECT...FROM...WHERE, where some others are more trick-

ier such as GROUP BY, HAVING or aggregation functions such as MAX, COUNT or

AVG. These SQL commands and functions need to be properly generated and syntacti-

cally correct. As a result, for generating meaningful and syntactically valid queries, the

implementation of the generator tool must take into consideration many factors (i.e.,

attribute naming, data type compatibility and selected columns).

In addition, the tool has been designed in a modular way for re-usability, and there-

fore new systems can be easily supported without the need of changing the whole

structure of the tools. Hence, an internal representation is implemented with each of

its classes responsible for generating a different SQL clause that contributes to the

overall query. For example, one of the java classes generates the SELECT clause.


Having different classes for each SQL statement makes it easier to extend the tool in

order to add new functionality and at the same time there is no need to change other

parts of the tool.

The final SQL query is converted to an SQL string which then is executed to the

current DBMSs with the contribution of the comparison tool. An important note is that

it is not feasible to generate SQL strings directly because the attribute names and data

types need to be tracked first. If SQL strings were to be used, it would not be possible

to check if a variable mentioned in the WHERE clause is also mentioned in the FROM

clause or if that variable comes as a parameter from an external query.

4.1.1.1 Configuration file

The RQGT is a fully-configurable tool in order to make it feasible to test various sce-

narios. A detailed explanation of all the parameters is given subsequently. Another

important decision that should be taken into account is the provision of relations and

attributes to the tool. An initial approach was to use them as parameters in the configu-

ration file. Although this approach works acceptably, the resulting tool lacks portabil-

ity, especially when a DBMS has a large number of tables and columns. Hence, It will

be time-consuming to give these parameters via the configuration file. Thus, an effi-

cient approach is to retrieve the whole schema from DBMSs automatically. As a result,

the implemented tool has the capability to automatically retrieve the whole schema for

either MySQL or PostgreSQL just by providing the credential for connecting to the

database either in the configuration file or as input parameters to the tool.

All the parameters are described as follows:

• relations and attributes = parameters are used to provide to the query generator

tool the tables and columns for generating SQL queries according to the database

schema. The current architecture has the capability to automatically retrieve the

whole database schema from any DBMS by just providing the credentials in the

configuration file such as user, pass and dbName parameters.

•MaxTablesFrom = parameter is used to set an upper bound of the number of tables

that a FROM clause can have. If the upper bound is greater than the total number

of tables in the schema, then the upper bound is automatically set to the total

number of tables.

•MaxAttrSel = parameter indicates an upper bound of projections in the SELECT


clause. In other words, it is the total number of attributes that can be selected in

the SELECT clause.

•MaxCondWhere = parameter represents the total number of comparison that the

WHERE clause can have.

• ProbWhrConst = parameter represents the probability of having comparisons with

constant or Null in the WHERE clause. Therefore, a number between zero and

one can be given for this parameter.

• RepAlias = parameter indicates the probability of having repetition of alias in the

SELECT clause.

• NestLevel = parameter represents the maximum level of nesting that a query can

have. For generating such a query many considerations should be taken into ac-

count. For example, we should track attributes for outer queries, because inner

query can access outer attributes or attributes from its FROM clause. The oppo-

site is not true, meaning that we cannot access attributes from an inner query.

• ArithCompSel = parameter represents the probability of having arithmetic compar-

isons in the SELECT clause.

• Dinstinct = parameter represents the probability of the DISTINCT command ap-

pearing in the SELECT clause.

• StringInSel = parameter indicates the probability of projecting an attribute of type

String or having String functions in the SELECT clause.

• StringInWhere = parameter represents the probability of having String compar-

isons in WHERE clause.

• Rowcompar = parameter represents the probability of having row comparisons (com-

parison with more than one attributes) in the WHERE clause.

• isNULL = parameter indicates the probability of having ISNULL or IS NOT NULL

conditions in the WHERE clause.

• isSelectAll = parameter indicates the probability of having * command in the SE-

LECT clause.

• compTool = parameter indicates the total number of queries that will be generated

and executed against the DBMSs. In particular, if the value given to this parame-

ter, is equal to one, then only one query is generated without to executing on any

DBMS. Otherwise, x (where x the value given to this parameter) random queries

will be generated and executed on DBMS.

It can be seen from the configuration file that many parameters of the random query

generator can be controlled but, on the other hand, it does not imply that the diversity of


SQL queries that can be generated is restricted. For example, an upper bound of tables

that appear in the FROM clause can be set. In that way, having an enormous table size

from Cartesian product can be avoided and in addition we can generate random queries

under various scenarios.

SQL Query generated by the random query generator

SELECT r51.c1 AS A1, (482/AVG(r51.c2) ) AS AGGR0

FROM r5 AS r51

WHERE (r51.c3 = ’Schmedeman Park’)

AND ’Heath Avenue’ < r51.c3

GROUP BY r51.c1

HAVING MAX(r51.c1) <= MIN(r51.c1)

4.1.2 Comparison Tool

The generated SQL queries are evaluated on five different DBMSs, as they were men-

tioned in the methodology chapter. The comparison tool (CT) takes into consideration

all the ways of accessing different databases so that the process of executing and com-

paring identical queries among all databases can be automated.

The CT is illustrated below and a detailed explanation of its internal implementa-

tion is provided.

Figure 4.2: Abstract Architecture of Comparison Tool


The tool is implemented in Java programming language and is one of the main

components of the complete framework. Specifically, the CT is used to compare the

results of each randomly generated query. As each DBMS uses various algorithms to

evaluate each query, sometimes the results are identical but the order or format differs.

As a consequence, for efficient comparison of the query results between DBMSs, the

following approach is used: Initially, the query results are stored in a LinkedList, so

that an efficient in-memory sorting algorithm can be performed, called Quicksort. In

that way, the results among the DBMSs should be the same and a row by row compar-

ison is performed to verify that they are identical. If a difference is found or a query

raises an error, on one of the tested DBMSs, then an explanation of the difference or

the error message is recorded in the log file.

An important consideration for implementing this tool is its ease of reuse and ex-

tensibility by other systems or software. As each system needs its own credentials

such as username, password and database name, a configuration file called compar-

isonTool.properties is created for providing these information to the CT. By doing so,

the CT becomes portable and adaptable and a new system can be easily added. As

shown in the experiments chapter, our approach can detect many important differences

and issues.

4.1.3 Random data generator tool

Generating random databases for testing DBMSs is a complex task since meaningful

data should be produced in order to reveal corner cases. However, if we take into

consideration that different interpretations or missing features can be still revealed,

then by generating a small size of data, we can increase the efficiency of our evaluation.

Datafiller is a well-known open-source project that provides the capability of gen-

erating random data [8, 9]. For our project, Datafiller will be used for generating a

diversity of database instances in order to evaluate all major DBMSs. In particular, the

Datafiller script generates random data, based on an SQL schema which is provided as

a parameter to the tool, and it takes into account constraints of the schema in order to

generate valid data. For example, it takes into account the domain of each field and if

each field should be unique, foreign key or primary key. Another important parameter

is the df: null=rate% which indicates the nullable rates. It is necessary to test the be-

haviour of current DBMSs containing databases with nulls in order to evaluate whether

all DBMSs behave in a similar fashion regarding null values.


Figure 4.3: Method of importing csv files

Additionally, more complex parameters can be provided as well, such as the num-

ber of tuples per table using –size SIZE parameter. It is worth mentioning that these

parameters should be defined within the schema script and should start with – df. Fur-

thermore, we can generate more realistic data by providing some information in the

schema SQL script. For instance, if there is a field which represents a date, then we can

provide a range in order for the tool to generate dates only within that range. This can

be achieved by specifying the following parameter: range – df: start=year-month-day

end=year-month-day next to the Date field. Subsequently, we need to add the –filter

parameter while running the script. These are only some of the important parameters

that the Datafiller provides but apart from these; it also provides more sophisticated

properties which are out of importance for our project.

The current version of Datafiller tool supports data import only for PostgreSQL and

a beta version for MySQL. Since for our project the data should also be imported into

Microsoft SQL Server, Oracle Database and IBM DB2, we transform it into a CSV

format and then, we use the bulk-loading command which is supported by all system

to important the csv file into the DBMSs.

Chapter 5

Experimental Evaluation

In this chapter is presented the procedure that was followed for providing the experi-

mental evidence. Subsequently, the main findings are presented with appropriate ex-

planation according to the SQL Standard.

5.1 The experiment Setup

The experiments are conducted on a server with the following specifications:

– Intel core i7-3632QM 2.20GHz

– 12GB Main memory Ram

– Solid state disk (SSD)

– Windows 8 operating system

– The following versions of DBMSs were installed: PostgreSQL Version 9.6,

Microsoft SQL Server Express Edition 2016, IBM DB2 Express-C, OracleDatabase 12c and MySQL Community Edition 5.7.

Our implemented framework (Random query generator tool and Comparison tool)

was used to conduct the experiments and in this way the current implementation is ver-

ified and checked whether is capable of detecting any differences and issues. A very

large number of SQL queries is generated and executed against the five aforementioned

DBMSs. In particular, more than 100,000 queries are generated and executed. Also,

different schemas are used varies from 2 to 5 relations and 2 to 5 attributes and the fol-

lowing data types were checked: TEXT, CHAR, VARCHAR, INTEGER, SMALL-INT, BIGINT. Furthermore, the generated data contained different null rates such as

5%, 10%, 20% for identifying possible differences that may arise due to the number of

26

Chapter 5. Experimental Evaluation 27

nulls. For each schema, a database instance is generated using the random data gen-

erator called Datafiller. As an important consideration that needs to be checked with

these experiments, is whether various features of the Standard are implemented in the

same way. The size of the instances was 100 rows per table since it will be aimless to

use a larger size, since issues and differences can be still disclosed with a smaller size.

In this way, the time that needs for the queries to be executed is increased, and more

queries can be evaluated in less time. We additionally generated more than a million

randomly queries to evaluate the performance of the random query generator tool.

5.1.1 Performance Evaluation

1 2 4 8 16 32

Conf1 100138 201123 401239 802521 1600133 3205674

Conf2 50069 100580 200630 401277 800069 1602725

Conf3 25034 50270 100313 200630 400033 801418

Conf4 16689 33520 66873 133753 266688 534279

0

500000

1000000

1500000

2000000

2500000

3000000

3500000

NU

MB

ER

OF

GE

NE

RA

TE

D Q

UE

RIE

S

EXECUTION TIME ( MINUTES)

PERFORMANCE EVALUATION

Conf1 Conf2 Conf3 Conf4

Figure 5.1: Performance Evaluation

In Figure 5.1, we present the number of random queries that are generated over

time under different configuration settings such as Conf1, Conf2, Conf3, and Conf4.

The purpose of the performance evaluation is to provide an insight about the total

number of queries that can be generated over a specific time without being executed.

In particular, the performance evaluation test is conducted under different configura-

tion settings (Different values are assigned to the parameters in the configuration file).


Conf1 and Conf2 were configured in such a way that a low probability is given for

most of the parameters and zero probability is given for having string comparisons in

the WHERE clause and String functions in the SELECT clause. The only difference

between Conf1 and Conf2, is that Conf2 has a greater number of the total number of

comparisons parameter. Conf3 and Conf4 differ from Conf1 and Conf2, in such a way

that a high probability is given for having String comparisons in the WHERE clause

and String functions in the SELECT clause. As expected, fewer queries are generated

within a specific time under conf2 settings, as the number of comparisons is greater

compared to Conf1, which takes more time for the queries to be generated. Addition-

ally, under Conf3 and Conf4 settings, fewer queries are generated compared to Conf1

and Conf3, since by having String comparisons and String functions takes, even more

time for the queries to be generated. Overall, it can be seen that a very large num-

ber of random queries can be still generated faster, even under different configuration

settings.

5.2 Experiment Results

Below we present our findings, which have arisen using our framework and by gener-

ating more than 100,000 random queries and execute them against five DBMSs. The

findings have the following format:

• Firstly, the generated SQL query that causes an error or a semantic issue is pre-

sented ( Since some features of the Standard are not implemented at all, or im-

plemented differently can cause an error).

• Then, a table for each query is provided which demonstrates if any DBMS raises

an error, otherwise if it is executed as expected or not.

• Afterwards, a comprehensive explanation is given for each query according to

the SQL Standard.


Table 5.1: Statistics of Generated queries

Conf1 Conf2 Conf3 Conf4

Total Queries 49163 51254 43251 32568

Problematic Queries 984 1538 3027 2931

Percentage ofproblematic queries

2% 3% 7% 9%

Table 5.1 illustrates the collected statistics of the randomly generated queries which

are executed against five modern DBMSs. As described earlier, in the Conf1 and

Conf2, a low probability is given for most of the parameters (Zero probability is given

for string comparisons in the WHERE clause parameter and String functions in the

SELECT clause parameter). In Conf3 and Conf4 a high probability is given for string

comparisons and functions, in particular in Conf4 a very high probability is given for

strings. As it can be seen from the above table, fewer incompatibilities/issues arise if

the generated queries do not contain strings comparisons and functions. Furthermore,

a larger number of problematic queries arose in particular under Conf3 and Conf4,

since queries that contained string comparisons and functions cause more issues due

to the fact that these features of the Standard are implemented differently by various

vendors.

Difference 1:Q1:

SELECT r41.A AS A0

FROM r4 AS r41

WHERE r41.A > 5


Table 5.2: Difference #1

DBMS DBMS Message

Mysql Executed (Works as expected)

PostgreSQL Executed (Works as expected)

Microsoft

SQL Server

Executed (Works as expected)

Oracle [42000][933] ORA-00933: SQL command not properly ended

IBM DB2 Executed (Works as expected)

With respect to the Standard, alias feature is optional and it is not compulsory to be

implemented by the vendors of DBMSs. The purpose of this feature is to rename tables

or columns and can be used to both rename attributes in the SELECT clause and tables

in the FROM clause. Aliases are given using ’AS’ keyword. The Oracle database

supports the ’AS’ when defining column aliases, but it does not allow to use ’AS’

when defining table aliases (in the FROM clause), in contrast with the rest DBMSs,

which support the alias feature both to rename tables or columns. An important note

is that the rest SQL queries that are presented below are executed without the keyword

AS only on the Oracle Database.

Difference 2:Q2:

SELECT r21.A AS A0

FROM r2 AS r21

WHERE true

According to the Standard, the Boolean data type is composed of the values true

and false. It also comprises the value unknown, unless a NOT NULL constraint is

defined. Although Boolean data type is essential, it is defined as optional in the Stan-

dard, and as a result only a few DBMS implement this feature. PostgreSQL and IBM

DB2 have implemented this feature, in contrast with Microsoft SQL Server, Oracle

Database and MySQL which do not natively support the Boolean data type. If the

query Q2 is executed on Oracle database and MySQL, it will raise an error since true

keyword cannot be recognised.



DBMS DBMS Message



Microsoft

SQL Server

[S0001][4145] An expression of non-boolean type specified in a con-

text where a condition is expected, near ’true’

Oracle [42000][920] ORA-00920: invalid relational operator


Difference 3:Q3:

SELECT r41.A AS A0

FROM r4 AS r41

WHERE 1*1


DBMS DBMS Message


PostgreSQL [42804] ERROR: argument of WHERE must be type boolean, not

type integer

Microsoft

SQL Server


text where a condition is expected, near ’1’.



According to the Standard, each expression in the ’WHERE’ clause is evaluated to

a Boolean type, taking the values True, False and Unknown. Since the Boolean type is

not supported by all DBMSs, there are some variations. The most common variation

is with a bit, where 0 value represents false and 1 value represents true. The query Q3

performs a multiplication between two constants in the ’WHERE’ clause, and since it

cannot be evaluated to a Boolean type, it returns an integer type instead. It is expected


that this query will raise an error if it is executed on any DBMSs. However, MySql

and IBM DB2 execute this query without to raise any error. The reason is based on the

fact that if the number is positive, then, they implicitly interpret it as true, otherwise as

false. On the contrary, the rest three DBMSs will raise an error.

Difference 4:Q4:

SELECT NULL/NULL, 1/2, NULL-NULL

FROM r2 AS r21, r4 AS r41

WHERE r41.B > r21.B


DBMS DBMS Message


PostgreSQL [42725] ERROR: operator is not unique: unknown / unknown Hint:

Could not choose a best candidate operator. You might need to add

explicit type casts.

Microsoft

SQL Server


Oracle Executed (Works as expected)


According to the Standard, any comparison/operation involving null or an unknown

truth value should return a null output result. Hence, a division with null should return

a null. The evaluation of the NULL/NULL as can be seen from the query Q4 should

lead to a null value. However, PostgreSQL raised an error when executed this query,

in contrast with the rest DBMSs which are executed this query without any error. The

problem still exists in PostgreSQL with other arithmetic operators such as +, - and *.


Difference 5:Q5:

SELECT r21.B/3


WHERE NOT(NOT(r41.A <> 18 ) )


DBMS DBMS Message

Mysql Executed (Does not work as expected)


Microsoft

SQL Server


Oracle Executed (Does not work as expected)


The query Q5 is executed on a database that contained smallint data type for the

attribute r21.B. This query is executed without causing an error on any of the five

DBMSs. However, the return type of the output results differs slightly. In IBM DB2,

PostgreSQL and Microsoft SQL Server the output results for the column r21.B/3 are

returned as an integer type, in the contrast with MySQL and Oracle database, where

the results are returned as a decimal data type.

Difference 6:Q6:

SELECT (MIN(r41.B) % AVG(r41.A) )

FROM r4 AS r41

WHERE (10 >= 19 )

GROUP BY r41.A, r41.B



DBMS DBMS Message



Microsoft

SQL Server


Oracle [22019][911] ORA-00911: invalid % character


An important consideration is that arithmetic operations such as addition, multi-

plication and subtraction are supported in the SELECT clause. Another important

arithmetic operation which is also supported in the SELECT clause is the % (modulo).

However, the syntax of % operator differs in Oracles database. In particular, Oracle

uses a function called MOD, instead of using % which is supported by the rest DBMSs.

Thus, the query Q5 needs to replace the operator ’%’ with the function mod in order to

be syntactically correct on Oracles database, that is, MOD( MIN(r41.B), AVG(r41.A)).

Difference 7:Q7:

(SELECT r41.A AS A0

FROM r4 AS r41 )

EXCEPT ALL

(SELECT r21.A AS A0

FROM r2 AS r21, r4 AS r42 )



DBMS DBMS Message

Mysql [42000][1064] You have an error in your SQL syntax; check the

manual that,corresponds to your MySQL server version for the right

syntax to use near ’EXCEPT ALL


Microsoft

SQL Server

S0002][324] The ’ALL’ version of the EXCEPT operator is not sup-

ported.



EXCEPT ALL is an optional feature according to the Standard. The purpose of

this feature is to return all the rows from the outer query which are not presented in the

inner query without removing duplicates. In particular, MySql and Oracle database do

not support this operator, in contrast with PostgreSQL and IBM DB2 which support

the EXCEPT ALL operator.

Difference 8:Q8:

(SELECT r41.A AS A0

FROM r4 AS r41, r2 AS r21, r3 AS r31

WHERE NOT(r31.B <> r41.A ) )

EXCEPT

(SELECT r21.A AS A0


WHERE r41.A <> r41.B )



DBMS DBMS Message


manual that corresponds to your MySQL server version for the right

syntax to use near ’EXCEPT


Microsoft

SQL Server




EXCEPT is a mandatory feature according to the Standard and all DBMSs must

support this feature in order to comply with the Standard. The purpose of this feature

is to return all rows from the outer query which are not presented in the inner query and

removing duplicates. Only MySQL does not support the EXCEPT even though it is a

compulsory feature according to the Standard. However, Oracle database uses MINUS

instead of EXCEPT which has an identical behaviour. The rest DBMSs support this

operator.

Difference 9:Q9:

(SELECT r41.A AS A0


WHERE (NULL <= 6 OR NOT(r31.B <> r41.A ) ) )

INTERSECT ALL

(SELECT r21.A AS A0


WHERE r41.A <> r41.B OR ( 0 <> 14) AND r41.A > r21.B)



DBMS DBMS Message



syntax to use near ’ALL


Microsoft

SQL Server

[S0001][324] The ’ALL’ version of the INTERSECT operator is not

supported.

Oracle [42000][928] ORA-00928: missing SELECT keyword


With respect to the Standard INTERSECT ALL is an optional feature. The pur-

pose of this feature is to return all the rows which are presented in both inner and outer

queries result, without removing duplicates. MySQL, Microsoft SQL Server and Or-

acle database do not support this feature at all, in contrast with PostgreSQL and IBM

DB2 which support this feature.

Difference 10:Q10:

SELECT 7/0 AS ART0 , 1%NULL AS ART1

FROM r1 AS r11

WHERE (NULL = 19 AND r11.b <> 5)


DBMS DBMS Message


PostgreSQL [22012] ERROR: division by zero. (Works as expected)

Microsoft

SQL Server

[S0001][8134] Divide by zero error encountered. (Works as ex-

pected)

Oracle ORA-01476: divisor is equal to zero.(Works as expected)

IBM DB2 [22012][-801] Division by zero was attempted.(Works as expected)

A division by zero should not be allowed according to the Standard, and it should


always raise an error. However, MySQL does not raise an error and the out results of

the division by zero is NULL.

Difference 11:Q11:

SELECT r11.a AS A1, r11.b AS A2

FROM r1 AS r11

WHERE (r11.b, r11.a) IN (SELECT r12.a AS A4, r12.b AS A3

FROM r1 AS r12)


DBMS DBMS Message



Microsoft

SQL Server


text where a condition is expected, near ’,’.



The usage of IN predicate is to compare one or more attributes values with a set of

values. However, the query Q11 will raise an error on Microsoft SQL Server as more

than one value is specified in the IN predicate. The rest DBMSs execute the query

without raising any error.


Difference 12:Q12:

SELECT r31.b AS A1

FROM r3 AS r31

WHERE r31.a >= r31.a

GROUP BY r31.a


DBMS DBMS Message


PostgreSQL [42803] ERROR: column ”r31.b” must appear in the GROUP BY

clause or be used in an aggregate function Position: 9

Microsoft

SQL Server

[S0001][8120] Column ’r3.B’ is invalid in the select list because it

is not contained in either an aggregate function or the GROUP BY

clause.

Oracle [42000][979] ORA-00979: not a GROUP BY expression

IBM DB2 [42803][-119] An expression starting with ”B” specified in a SE-

LECT clause, HAVING clause, or ORDER BY clause is not spec-

ified in the GROUP BY clause or it is in a SELECT clause, HAV-

ING clause, or ORDER BY clause with a column function and no

GROUP BY clause is specified

”According to the Standard, an SQL query that includes a GROUP BY clause

cannot refer to non-aggregated columns in the select list that are not defined in the

GROUP BY clause.” The intuition behind this issue is based on the fact that if non-

grouped and non-aggregate fields are selected, then the DBMSs cannot know which

row to return since there can be more than one row for each corresponding group. The

query Q12 selects the attribute r31.b which does not appear in the GROUP BY clause.

PostgreSQL, Microsoft SQL Server, Oracle database and IBM DB2 correctly raise an

error as the attribute does not appear in the GROUP BY clause. However, MySQL

executes this query without raising an error. In particular, MySQL simply returns the

first row that exists for each group.


Difference 13:Q13:

SELECT ’SQL’ || ’STANDARD’

FROM R1


DBMS DBMS Message



Microsoft

SQL Server

[S0001][102] Incorrect syntax near ′|′.



According to the Standard, concatenation is a mandatory feature and it should be

supported by all DBMSs with a double-pipe mark ||. The purpose of this feature is

the concatenation of two or more strings into one. Thus, by executing the query Q13

is expected the output result to be SQLSTANDARD. However, MySQL supports the

double pipe operator, but interprets it as a logical OR and thus the query returns 0.

MySQL uses the CONCAT() function as concatenation and it takes as parameters one

or more strings. Microsoft SQL Server uses the + operator in order to perform the

concatenation. Only Oracle and PostgreSQL support the double-pipe || concatenation

operator.

Difference 14:Q14:

SELECT *

FROM r1 AS R1 , r1 AS R1



DBMS DBMS Message

Mysql [42000][1066] Not unique table/alias: ’R1’

PostgreSQL [42712] ERROR: table name ”r1” specified more than once

Microsoft

SQL Server

[S0001][1011] The correlation name ’R1’ is specified multiple times

in a FROM clause.


IBM DB2 Executed (Does not work as expected)

Query Q14 should raise an error since the same alias is assigned to two tables.

Thus, it is unambiguous to which table will refer to using the specific alias. However,

the query is executed on IBM DB2 database without to raise an error.

Difference 15:Q15:

SELECT TRIM(’ SQLSTANDARD ’)

FROM R1;


DBMS DBMS Message



Microsoft

SQL Server

[S00010][195] ’TRIM’ is not a recognized built-in function name.



According to Standard, TRIM function is a mandatory feature and it should be

implemented by all DBMSs. The use of TRIM feature is to return a string which

is given as argument with remove leading and trailing space. This function is not

supported by Microsoft SQL Server.


Difference 16:Q16:

SELECT 2 * 5 AS ART

WHERE ( 1 = 1 )


DBMS DBMS Message



syntax to use near ’WHERE (1 = 1 )


Microsoft

SQL Server


Oracle [42000][923] ORA-00923: FROM keyword not found where ex-

pected

IBM DB2 [42601][-104] Expected tokens may include: ”FROM”

The basic structure of an SQL query is of the form SELECT...FROM...WHERE.

The query, but the query Q16 omits the FROM clause. SQL query without FROM

clause can be executed only on PostgreSQL and Microsoft SQL Server, in contrast

with Mysql, Oracle and IBM DB2 which raise an error.

Difference 17:Q17:

SELECT SUBSTRING (’Standard’, 1, 4)

FROM R1



DBMS DBMS Message



Microsoft

SQL Server


Oracle [42000][904] ORA-00904: ”SUBSTRING”: invalid identifier


The substring function is a mandatory feature according to the Standard and as a re-

sult all DBMSs should implement this feature. The use of substring feature is to return

a part of the string which is given as argument. Only Oracle database does not support

this feature and it uses Substr instead of Substring which has a similar behaviour. The

prototype of oracle’s function is: substr(column name,start pos,noo f characters).

Difference 18:Q18:

SELECT "SQLSTANDARD"

FROM R1


DBMS DBMS Message


PostgreSQL [42703] ERROR: column ”SQLSTANDARD” does not exist Posi-

tion: 8

Microsoft

SQL Server

[S0001][207] Invalid column name ’SQLSTANDARD’.

Oracle [42000][904] ORA-00904: ”SQLSTANDARD”: invalid identifier

IBM DB2 [56098][-727] An error occurred during implicit system action type

”2”

According to the Standard a string should be encompassed by the ’ ’ symbol. It is

demonstrated in query Q19 that MySQL allows a string to be encompassed by both


’ ’ and ”” symbols which make the SQL code less portable as the rest DBMSs raise an

error if it is used ”” instead of ’ ’ symbol.

Difference 19:Q19:

SELECT *

FROM r1 AS R1

WHERE R1.c3 = ’’


DBMS DBMS Message



Microsoft

SQL Server


Oracle Executed (Does not work as expected)


The query Q19 is executed on all DBMSs without to raise an error, though the se-

mantic of this query differs. As mentioned in the background chapter, and in particular

in the missing value, Oracle database treats the empty string as NULL, in the contrast

with the rest DBMSs, which treats it as a normal String. Thus, all the comparisons in

the WHERE clause involving NULL are evaluated to Unknown which implies that the

output result will be empty since none of the rows will be satisfied. On the other hand,

if there is at least a row which contains an empty string, then by executing the above

query will be included in the output results. All the DBMSs except Oracle Database

return in the output results all the rows that contains an empty string in the column c3.

In Oracle database is returned an empty set as output result.


Difference 20:Q20:

SELECT *

FROM R1

WHERE r1.c3 LIKE ’standard%’


DBMS DBMS Message

Mysql Executed (Case-Sensitive)

PostgreSQL Executed (It is not Case-Sensitive)

Microsoft

SQL Server

Executed (Case-Sensitive)

Oracle Executed (It is not Case-Sensitive)

IBM DB2 Executed (It is not Case-Sensitive)

Query Q20 is executed on all DBMSs without to raise an error but it differs seman-

tically, since this query will not return the same results on all DBMSs. In particular, all

the tested systems contained a database that stored a field with the word STANDARD

(in capital letters) in the attribute c3 of the relation r1. By executing this query, it can

be observed that some of the DBMSs are case sensitive with the LIKE operator, such

as PostgreSQL and Oracle and as a result return an empty set as output. In the contrast

with the rest systems, which returned one row that is the row that contains the word

STANDARD in the attribute c3.

Chapter 6

Conclusion

6.1 Conclusion

In this project an entire framework is implemented and composed of the random query

generator tool and the comparison tool which are used to evaluate the SQL-compliance

of five DBMSs. Moreover, we verified the correctness of the implemented tools by

conducting the experiments. We demonstrated from the experimental evaluation chap-

ter that the implemented tools are competent to reveal crucial differences and issues

among current systems. Without a similar framework, it would be almost impossi-

ble to detect some of the differences and issues by generating queries manually, or

by testing all DBMSs empirically. As described in the related work, there is not any

similar framework, apart from some documentations provided by the vendors of re-

spective systems. There are also some studies presenting some issues according to the

Standard, but without integrating an automated tool.

In addition, a summarised table is provided exposing all the issues and incompati-

bilities between the most popular DBMSs. These issues/incompatibilities are of major

importance for vendors, users, programmers and researchers of such systems. We be-

lieve that this project has contributed in the research area of the database management

by introducing a new way of testing such systems according to their SQL-Compliance.

In addition, the implemented tools could be of major importance for future vendors or

researchers.

Furthermore, observing and analysing these incompatibilities makes users, pro-

grammers and researchers aware of these issues for avoiding crucial errors in the fu-

ture. We verified our assumptions that some parts of the Standard are implemented

differently, but it was somewhat unexpected that so many differences have emerged.

46

Chapter 6. Conclusion 47

Lastly, the implemented framework is portable and can be extended efficiently. Hence,

although experimental evidence is provided for both numeric and alphanumeric data

types, the random generator tool is implemented in such a way that can track any data

types such as Date. In this way, it could be extended efficiently to generate queries

with attributes of Date data type.

6.2 Summary of the findings

The below tables summarize the main features of the SQL language which are not

supported by all popular DBMSs.

Table 6.1: Summarised Results

Operation MysqlPostgreSQL

MicrosoftSQLServer

OracleIBMDB2

INTERSECTALL

X 4 X X 4

INTERSECT 4 4 4 X 4

AS in FROM

Clause4 4 4 X 4

EXCEPT ALL X 4 X

X

(MINUS ALL is

not supported)

4

EXCEPT X 4 4

4

(It uses MINUS

instead of EXCEPT)

4

GROUP BY con-

tains columns not

in SELECT list

4 X X X X

Arithmetic opera-

tions in WHERE

Clause

4 X X X 4

Support keyword

True in WHERE

clause

4 4 X X 4

Support of % op-

erator4 4 4

4

(It uses

mod function

instead of %)

4

Division by zero 4 X X X X

Row comparisons 4 4 X 4 4

Identical names

in the FROM

Clause

X X X X 4

Query without

FROM ClauseX 4 4 X X


Table 6.2: Summarised Results

Operation Mysql PostgreSQLMicrosoftSQLServer

OracleIBMDB2

|| operator in SE-

LECT Clause

4

(But It usage is

|| as logical OR),

(Uses CONCAT function

instead of ||)

4X

( But it uses + )4 4

TRIM function in

SELECT Clause4 4 X 4 4

&& operator in

SELECT Clause

4

(But it usage is

as logical AND)

4 4 4 4

SUBSTRING

function in

SELECT Clause

4 4 4

X

(It uses SUBSTR

function instead)

4

Enclose Strings

with “ ” instead

of ‘ ’

4 X X X X

LIKE Operator 44

(Case- sensitive)4

4

(Case- sensitive)4

The following additional conclusions are drawn by conducting our experiments.

Despite the fact that the Boolean data type is essential and used by the majority of

the developers, three major database systems (Microsoft SQL Server, Oracle database,

MySQL) do not provide native support for this data type. Thus, a migration of SQL

code into these systems may cause incompatibilities.

It was also observed that the behaviour of MySQL might be “weird” under certain

situations. In particular, it is the only system that allows a division by zero. Also, it

allows non-aggregated columns in the select list, which was something unexpected.

Lastly, the operators INTERSECT ALL and EXCEPT ALL are not supported by Mi-

crosoft SQL Server and MySQL.

6.3 Suggestions for future work

Several issues arose when DBMSs are tested. The experiments are conducted in var-

ious databases that contained integers and strings. We expect that more issues can


arise by also generating databases containing dates but by doing so, the generator tool

should be extended in order to support this new data type. This should be an easy ex-

tension as there is the provision for supporting any data type. Another future extension

could be to include a new DBMS for evaluation of its SQL-compliance. This extension

also should not need a lot of effort as the framework is implemented in such a way that

a new system can be easily added.

Appendix A

Source Code

The complete source code was submitted with the current dissertation report. The

source code can be also found in the following github repository:

https://github.com/spanoselias/TestingSQLCompliance

50

https://github.com/spanoselias/TestingSQLCompliance

Appendix B

Compilation & Execution Instructions

Compilation Instructions:Note: The Postgresql JDBC library is passed as an argument since the program

needs to retrieve the schema from PostgreSQL database. (Credentials need to be given

via configuration file or as arguments). In addition, the schema can be retrieved from

MySQL as well using ‘mysql-connector-java-5.1.41-bin.jar’

The following commands should be executed in the ‘src’ directory in conjunction with

PostgreSQL in order to retrieve the schema (The below commands should be copiedfrom readme TXT which was submitted along with this thesis report or fromgithub repository since by coping these commands straight from the pdf somesymbols like ′ might change)Linux:

javac -cp ’.:postgresql-42.1.1.jar’ SQLEngine.java

Windows:

javac -cp ’.;postgresql-42.1.1.jar’ SQLEngine.java

Execution Instructions:Linux:

javac -cp ’.:postgresql-42.1.1.jar’ SQLEngine.java

Windows:

javac -cp ’.;postgresql-42.1.1.jar’ SQLEngine.java

51

Appendix B. Compilation & Execution Instructions 52

The following parameters can be passed as arguments to the random query tool:-maxTablesFrom, -maxAttrSel, -maxCondWhere, -maxAttrGrpBy, -probWhrConst,

-repAlias, -nestLevel, -arithCompSel, -distinct, -stringInSel,-stringInWhere, -rowcompar,

-DBMS, -user, -pass, -dbName, -isNULL,-isSelectAll

Example1:

javac -cp ’.;postgresql-42.1.1.jar’ SQLEngine.java -maxTablesFrom

↪→ 5 -maxCondWhere 3 -maxAttrSel 4

Example2:

javac -cp ’.;postgresql-42.1.1.jar’ SQLEngine.java -maxtablesfrom

↪→ 5 -maxCondWhere 3 -maxAttrSel 4 -arithCompSel 0.5 -distinct

↪→ 0.5

Note: Any number of arguments can be passed to the program, and they are not

case-sensitive as it can be seen from the Example2.

The following steps should be followed in order to run the complete framework:

• Firstly, the environment should be set up. Hence, it should run on an environment

with the following DBMSs installed: PostgreSQL Version 9.6, Microsoft SQLServer Express Edition 2016, IBM DB2 Express-C, Oracle Database 12c and

MySQL Community Edition 5.7. (All the appropriate libraries are provided

with the source code under lib directory). And the appropriate credential should

be specified in the comparisonTool.properties.

• Both DBMS’ schemas and random generated data are provided under Script di-

rectory. Also, the open-source Datafiller can be used for generating realistic ran-

dom data (https://www.cri.ensmp.fr/people/coelho/datafiller.html#

downloade)

• Compilation instructions:

javac -cp ’All_libraries_names’ SQLEngine.java -compTool

↪→ noOfQueries

• Execution instructions:

https://www.cri.ensmp.fr/people/coelho/datafiller.html#downloade

https://www.cri.ensmp.fr/people/coelho/datafiller.html#downloade

Appendix B. Compilation & Execution Instructions 53

javac -cp ’All_libraries_names’ SQLEngine.java -compTool

↪→ noOfQueries

• A log file with all the differences will be generated once the program execution

is finished.

Bibliography

[1] DB Engines. https://db-engines.com/en/ranking_trend, 2017. Accessed:

2017-08-03.

[2] ISO/IEC9075-2. https://www.iso.org/standard/63556.html, 2017.

[3] Oracle. https://docs.oracle.com/cd/B19306_01/server.102/b14200/

sql_elements005.htm, 2017. Accessed: 2017-07-24.

[4] T. Arvin. Comparison of different sql implementations. Troels Arvin’s home

page, 2006.

[5] T. Y. Chen, F.-C. Kuo, R. G. Merkel, and T. Tse. Adaptive random testing: The

art of test case diversity. Journal of Systems and Software, 83(1):60–66, 2010.

[6] E. F. Codd. Derivability, redundancy, and consistency of relations stored in large

data banks. Technical report, IBM Research Report, 1969.

[7] E. F. Codd. A relational model of data for large shared data banks. Communica-

tions of the ACM, 13(6):377–387, 1970.

[8] F. Coelho. Datafiller–generate random data from database schema. https://

www.cri.ensmp.fr/people/coelho/datafiller.html. Accessed: 2017-06-

10.

[9] F. Coelho. DataFillerTutorial. http://blog.coelho.net/database/2013/

12/01/datafiller-tutorial.html, 2017. Accessed: 2017-06-01.

[10] T. M. Connolly and C. E. Begg. Database systems: a practical approach to

design, implementation, and management. Pearson Education, 2005.

[11] T. P. P. Council. Tpc benchmark c (standard specification, revision 5.11), 2010.

URL: http://www. tpc. org/tpcc, 2010.

54

https://db-engines.com/en/ranking_trend

https://www.iso.org/standard/63556.html

https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements005.htm

https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements005.htm

https://www.cri.ensmp.fr/people/coelho/datafiller.html

https://www.cri.ensmp.fr/people/coelho/datafiller.html

http://blog.coelho.net/database/2013/12/01/datafiller-tutorial.html

http://blog.coelho.net/database/2013/12/01/datafiller-tutorial.html

Bibliography 55

[12] C. J. Date and H. Darwen. A Guide to the SQL Standard, volume 3. Addison-

Wesley New York, 1987.

[13] S. Deconinck. Database systems: models, languages, design, and application

programming. 2011.

[14] A. Eisenberg, J. Melton, K. Kulkarni, J.-E. Michels, and F. Zemke. Sql: 2003 has

been published. ACM SIGMoD Record, 33(1):119–126, 2004.

[15] R. Elmasri. Fundamentals of database systems. Pearson Education India, 2008.

[16] P. Guagliardo and L. Libkin. Making sql queries correct on incomplete databases:

a feasibility study. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI

Symposium on Principles of Database Systems, pages 211–223. ACM, 2016.

[17] P. Guagliardo and L. Libkin. A formal semantics of sql queries, its validation,

and applications. Proceedings of the VLDB Endowment, 10(8), 2017.

[18] T. Imielinski and W. Lipski Jr. Incomplete information in relational databases.

Journal of the ACM (JACM), 31(4):761–791, 1984.

[19] H.-J. Klein. How to modify sql queries in order to guarantee sure answers. ACM

SIGMOD Record, 23(3):14–20, 1994.

[20] K. Kline, B. Hunt, and D. Kline. SQL in a nutshell: a desktop quick reference. ”

O’Reilly Media, Inc.”, 2004.

[21] R. Ramakrishnan. Database management systems . pdf. 2000.

[22] J. D. Ullman, H. Garcia-Molina, and J. Widom. Database systems: The complete

book, 2002.

[23] R. F. Van Der Lans. The SQL standard: a complete guide reference. Prentice

Hall International (UK) Ltd., 1989.

[24] R. van der Meyden. Logical approaches to incomplete information: A survey. In

Logics for databases and information systems, pages 307–356. Springer, 1998.

[25] F. Zemke. What. s new in sql: 2011. ACM SIGMOD Record, 41(1):67–73, 2012.

testing sql-compliance of current dbmss

Documents