understanding databases and querying

Understanding Databases and Querying USMAN SHARIF

Upload: usman-sharif

Post on 07-Dec-2014


Page 1: Understanding databases and querying

Understanding Databases and Querying USMAN SHARIF

Page 2: Understanding databases and querying

History of databases?

Need to structurally organize data.

Various different models to fulfill this need.

The most common technique is called Relational Modelling.

Databases that support the relational model are called Relational Database Management Systems (RDBMS).

Page 3: Understanding databases and querying

Relational Model

All data is represented in terms of tuples.

A tuple generalizes a pair: a pair holds two items, and a tuple holds N items, where N is a non-negative integer.

Tuples are grouped into relations.

In mathematical terms, the relational model is based on first-order predicate logic.

Page 4: Understanding databases and querying

Example of Tuples and Relations

Assume a road repair company wants to track their activities on different roads.

Let's restrict their activities to ‘Patching’, ‘Overlay’ and ‘Crack Sealing’.

The company had overlaid I-95 on 1/12/01 and I-66 on 2/8/01.

How can we represent this information in a relational model using tuples?

First, we see that there are two distinct things here:

Activities

Work

Next, we define tuples for both of these items as follows:

Activities = {activityName}

Works = {activity, date, routeNumber}

Page 5: Understanding databases and querying

Example of Tuples and Relations

We see a relation between Activities and Work – the activity that is to be performed.

In the relational model we use the concept of ‘keys’ to describe the relationship between different tuples.

In our example, activityName can act as a ‘key’ to describe the relation that can be named as ‘ActivityWorks’.

For optimization reasons, keys are generally of a numeric type. Therefore, we modify Activities and Works to add a numeric ID:

Activities = {activityId, activityName}

Works = {activityId, date, routeNumber}

Page 6: Understanding databases and querying

Describing the example graphically

Page 7: Understanding databases and querying

Relational Databases

Relational Modelling is a mathematical concept.

When we translate this mathematical concept into an RDBMS, we describe tuples as rows, items in tuples as columns and a group of tuples as a table. Relations between tables are still called relations (or relationships) in RDBMS terminology.

The example of our road repair company when translated into RDBMS would have two tables as follows:

Table Name: Activities. Columns:

activityId (Primary Key, number type)

activityName (string type)

Table Name: Works. Columns:

activityId (Foreign Key, number type)

date (date type)

routeNumber (string type)

It would have the following relation:

Relation Name: ActivityWorks. Participating Columns:

Primary Table: Activities. Key Column: activityId

Secondary Table: Works. Key Column: activityId
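The two tables and the ActivityWorks relation described above can be sketched in runnable form. This is only an illustration: it uses Python's built-in sqlite3 module with an in-memory database, the ID values (23, 24, 25), the ISO date strings (the slide's dates read as month/day/year) and the rejected row for route 'I-81' are assumptions made for the example, and other RDBMSs spell column types slightly differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE Activities (
    activityId   INTEGER PRIMARY KEY,   -- primary key, number type
    activityName TEXT NOT NULL          -- string type
);
CREATE TABLE Works (
    -- foreign key: this column is the 'ActivityWorks' relation
    activityId  INTEGER NOT NULL REFERENCES Activities(activityId),
    date        TEXT NOT NULL,          -- ISO date stored as text
    routeNumber TEXT NOT NULL
);
""")

# Assumed sample data; the ID values are illustrative only.
conn.executemany("INSERT INTO Activities VALUES (?, ?)",
                 [(23, "Patching"), (24, "Overlay"), (25, "Crack Sealing")])
conn.executemany("INSERT INTO Works VALUES (?, ?, ?)",
                 [(24, "2001-01-12", "I-95"), (24, "2001-02-08", "I-66")])

# A Works row that points at a non-existent activity is rejected.
try:
    conn.execute("INSERT INTO Works VALUES (99, '2001-03-01', 'I-81')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

work_count = conn.execute("SELECT COUNT(*) FROM Works").fetchone()[0]
print(fk_rejected, work_count)
```

The foreign key is what makes Activities the primary table and Works the secondary one: a Works row cannot exist without a matching Activities row.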

Page 8: Understanding databases and querying

More on Relations

The relation in the previous example is commonly called a ‘one-to-many’ or a ‘Master-Child’ relationship.

There are three kinds of relationships:

One-to-One: For a row in the primary table there can be at most one row in the secondary table. Commonly used to spread a single tuple across two tables based on logical reasoning.

One-to-Many: For a row in the primary table there can be multiple rows in the secondary table. Commonly used to reduce redundancy or duplication of the same data.

Many-to-Many: A row in either table can be related to multiple rows in the other table. Used to describe complex relationships.

Relations are always directional.

Page 9: Understanding databases and querying

Querying databases

Databases provide an interface to define and manipulate data: queries.

There are two types of queries:

Data Definition Language (DDL) queries. They are used to create and modify the database structure. The DB structure is called a schema definition.

Data Manipulation Language (DML) queries. They are used to read and modify the data stored in the database.

There are four major DML queries:

SELECT

INSERT

UPDATE

DELETE

Page 10: Understanding databases and querying

SELECT Query

A SELECT query is the way to fetch data from a database.

At a minimum, it has two parts (called clauses):

The SELECT clause

The FROM clause

For example:

SELECT activityId, activityName

FROM Activities;

This query would return all rows in the Activities table.

Apart from the SELECT and FROM clauses, there are a number of other clauses that are optional. These include (but are not limited to):

WHERE

ORDER BY

GROUP BY

Page 11: Understanding databases and querying

SELECT Query – The SELECT Clause

It enables you to define the columns you want.

Sometimes you want all columns; in those cases you can use the wildcard operator (*). For example, the previous query can be modified as:

SELECT *

FROM Activities;

A good practice is to name the columns rather than using *.

The primary use of SELECT clause is to define a projection – a subset of columns, so that the result can be restricted to such columns only.
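As a small illustration of projection, the sketch below runs the wildcard query and a single-column query against an in-memory SQLite database through Python's sqlite3 module. The ID values are assumed sample data, not part of the original example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Activities (activityId INTEGER PRIMARY KEY, activityName TEXT);
-- Assumed sample rows for illustration.
INSERT INTO Activities VALUES (23, 'Patching'), (24, 'Overlay'), (25, 'Crack Sealing');
""")

# SELECT * returns every column of every row.
all_columns = conn.execute("SELECT * FROM Activities").fetchall()

# Naming a column restricts (projects) the result to that column only.
names_only = conn.execute("SELECT activityName FROM Activities").fetchall()

print(all_columns)
print(names_only)
```

Both queries return the same number of rows; only the width of each row changes.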

Page 12: Understanding databases and querying

SELECT Query – FROM Clause

This is where you tell the database the name of table(s) where it should look for the columns you named in the SELECT clause.

When fetching data from multiple tables, list all tables and describe the relation between them. For example, let us try to fetch data for all the activities that have been performed on various routes along with dates.

SELECT Activities.activityName, Works.date, Works.routeNumber

FROM Activities INNER JOIN

Works ON Activities.activityId = Works.activityId

Notice the keywords ‘INNER JOIN’ and the part ‘ON Activities.activityId = Works.activityId’.

The ON … part tells the database which columns to match rows on. It is also called the join condition. There can be more than one join condition, depending on the underlying database schema.

Page 13: Understanding databases and querying

SELECT Query – FROM Clause - JOINs

JOIN is a keyword that allows you to let the database know that there are multiple tables you intend to fetch data from.

There is a table mentioned before JOIN and another after it.

The one before is called the left table and the one after is called the right table.

Three types of joins are covered here:

INNER JOIN

LEFT OUTER JOIN

RIGHT OUTER JOIN

Page 14: Understanding databases and querying

INNER JOIN

INNER JOIN is also sometimes called a ‘strict’ join.

Most RDBMSs allow dropping the ‘INNER’ keyword; a bare JOIN is then treated as an INNER JOIN.

With this type of join, for each row in the left table the database finds the matching rows in the right table; left-table rows with no match are skipped.

This type of join helps eliminate empty records.

For example, in our road repair example, it would omit all Activities rows that don't have records in the Works table.

Page 15: Understanding databases and querying

OUTER JOINs

In case we don’t want to omit empty records, we can use OUTER JOINs.

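A LEFT OUTER JOIN means that every row in the left table is kept and paired with all matching rows in the right table; where there is no match, the right table's columns are filled with NULL.

A RIGHT OUTER JOIN does the same in the other direction: every row in the right table is kept and paired with all matching rows in the left table.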

For example, let us find all Activities and related Works. We can do this by:

SELECT Activities.activityName, Works.date, Works.routeNumber

FROM Activities LEFT OUTER JOIN

Works ON Activities.activityId = Works.activityId

This query would return all Activities along with their associated Works. For the Activities that don’t have corresponding Works it would put ‘NULL’ under date and routeNumber columns.
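The difference between the two join types can be seen by running both queries on the same data. The sketch below does this with Python's sqlite3 module and an in-memory database; the ID values and ISO date strings are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Activities (activityId INTEGER PRIMARY KEY, activityName TEXT);
CREATE TABLE Works (activityId INTEGER, date TEXT, routeNumber TEXT);
-- Assumed sample data: only 'Overlay' has any Works rows.
INSERT INTO Activities VALUES (23, 'Patching'), (24, 'Overlay'), (25, 'Crack Sealing');
INSERT INTO Works VALUES (24, '2001-01-12', 'I-95'), (24, '2001-02-08', 'I-66');
""")

# INNER JOIN: activities without Works rows are dropped.
inner_rows = conn.execute("""
    SELECT Activities.activityName, Works.date, Works.routeNumber
    FROM Activities INNER JOIN
         Works ON Activities.activityId = Works.activityId
""").fetchall()

# LEFT OUTER JOIN: every activity is kept; missing Works columns are NULL,
# which Python's sqlite3 module returns as None.
left_rows = conn.execute("""
    SELECT Activities.activityName, Works.date, Works.routeNumber
    FROM Activities LEFT OUTER JOIN
         Works ON Activities.activityId = Works.activityId
""").fetchall()

print(len(inner_rows), len(left_rows))
for row in left_rows:
    print(row)
```

With this data the inner join yields only the two Overlay rows, while the left outer join also yields one row each for Patching and Crack Sealing with empty date and routeNumber.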

Page 16: Understanding databases and querying

The JOIN Conditions

The ON … part is called the joining condition.

It is essentially a condition that names columns from the left and right tables and states how they are to be compared.

In most circumstances, there are columns (from left and right tables) that are matched with an = operator, however, in some cases that might not be true.

Other conditional operators such as not equal, greater than, less than, etc. are also supported.

There can be more than one join condition.

Page 17: Understanding databases and querying

QUESTION: What would happen if you skip the ON … part?

Page 18: Understanding databases and querying

SELECT Query – WHERE clause

The WHERE clause allows you to describe conditions on the data you want fetched.

For example, if we are interested in all Overlay Works (assuming Overlay has activityId 24), we'll write a query:

SELECT *

FROM Works

WHERE activityId = 24

Another way to do the same without using an ID is:

SELECT Works.*

FROM Works INNER JOIN

Activities ON Works.activityId = Activities.activityId

WHERE Activities.activityName = 'Overlay'

However, the second query is likely to be somewhat slower, because it adds the overhead of a join and of matching on a string column.
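To check that the two forms select the same rows, here is a sketch using Python's sqlite3 module. The sample data is assumed, including the extra Patching job on route 'I-81', which is there only so that the filter has something to exclude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Activities (activityId INTEGER PRIMARY KEY, activityName TEXT);
CREATE TABLE Works (activityId INTEGER, date TEXT, routeNumber TEXT);
-- Assumed sample data; 24 is taken to be the Overlay activity.
INSERT INTO Activities VALUES (23, 'Patching'), (24, 'Overlay'), (25, 'Crack Sealing');
INSERT INTO Works VALUES
    (24, '2001-01-12', 'I-95'),
    (24, '2001-02-08', 'I-66'),
    (23, '2001-03-05', 'I-81');
""")

# Filter directly on the numeric key.
by_id = conn.execute("""
    SELECT *
    FROM Works
    WHERE activityId = 24
""").fetchall()

# Filter on the activity name, which needs a join to Activities.
by_name = conn.execute("""
    SELECT Works.*
    FROM Works INNER JOIN
         Activities ON Works.activityId = Activities.activityId
    WHERE Activities.activityName = 'Overlay'
""").fetchall()

print(sorted(by_id) == sorted(by_name))
```

The results are compared after sorting because neither query specifies an order.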

Page 19: Understanding databases and querying

SELECT Query - ORDER BY Clause

Theoretically speaking, the records in a table are unordered. In practice, most RDBMSs store them in some kind of order (usually the order of the primary key column), but a query without ORDER BY does not guarantee any particular order.

In any case, there might be a requirement to order the results in a particular way.

The ORDER BY clause allows you to describe data ordering and the direction of ordering.

For example, if we want all Activities along with their associated Works ordered alphabetically by activity name and then by date in descending order, we can do that by:

SELECT Activities.activityName, Works.date, Works.routeNumber

FROM Activities INNER JOIN

Works ON Activities.activityId = Works.activityId

ORDER BY Activities.activityName ASC, Works.date DESC

ASC is the default direction, so the keyword can be skipped.
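The effect of the two sort keys can be seen in a short sketch using Python's sqlite3 module. The sample rows, including the Patching job on 'I-81', are assumed; dates are stored as ISO strings so that text ordering matches chronological ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Activities (activityId INTEGER PRIMARY KEY, activityName TEXT);
CREATE TABLE Works (activityId INTEGER, date TEXT, routeNumber TEXT);
-- Assumed sample data for illustration.
INSERT INTO Activities VALUES (23, 'Patching'), (24, 'Overlay'), (25, 'Crack Sealing');
INSERT INTO Works VALUES
    (24, '2001-01-12', 'I-95'),
    (24, '2001-02-08', 'I-66'),
    (23, '2001-03-05', 'I-81');
""")

# First key: activity name ascending. Second key: date descending,
# which only matters among rows with the same activity name.
ordered = conn.execute("""
    SELECT Activities.activityName, Works.date, Works.routeNumber
    FROM Activities INNER JOIN
         Works ON Activities.activityId = Works.activityId
    ORDER BY Activities.activityName ASC, Works.date DESC
""").fetchall()

for row in ordered:
    print(row)
```

The Overlay rows come first, newest first, followed by the Patching row.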

Page 20: Understanding databases and querying

Aggregating Results

Sometimes we want to fetch aggregated results. For example, we want to find out the number of times each Activity has been carried out from the road repair example.

The GROUP BY clause provides this functionality.

SELECT Activities.activityName, COUNT(Works.routeNumber) AS countActivity

FROM Activities INNER JOIN

Works ON Activities.activityId = Works.activityId

GROUP BY Activities.activityName

COUNT is an aggregate function. Other commonly used aggregate functions are SUM, AVG, MIN and MAX.
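The aggregation query above can be tried out with the following sketch, which uses Python's sqlite3 module and assumed sample data (two Overlay jobs and one Patching job).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Activities (activityId INTEGER PRIMARY KEY, activityName TEXT);
CREATE TABLE Works (activityId INTEGER, date TEXT, routeNumber TEXT);
-- Assumed sample data for illustration.
INSERT INTO Activities VALUES (23, 'Patching'), (24, 'Overlay'), (25, 'Crack Sealing');
INSERT INTO Works VALUES
    (24, '2001-01-12', 'I-95'),
    (24, '2001-02-08', 'I-66'),
    (23, '2001-03-05', 'I-81');
""")

# One output row per distinct activityName, with the number of Works rows in it.
rows = conn.execute("""
    SELECT Activities.activityName, COUNT(Works.routeNumber) AS countActivity
    FROM Activities INNER JOIN
         Works ON Activities.activityId = Works.activityId
    GROUP BY Activities.activityName
""").fetchall()

counts = dict(rows)
print(counts)
```

Because the query uses an INNER JOIN, Crack Sealing (which has no Works rows) does not appear in the result at all; a LEFT OUTER JOIN would list it with a count of 0.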

Page 21: Understanding databases and querying

SELECT Query – GROUP BY Clause

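When a GROUP BY clause is defined, every column in the SELECT and ORDER BY clauses needs to either be part of an aggregate function or be mentioned in the GROUP BY clause.

For example, the following query is invalid in standard SQL, because Works.date is neither aggregated nor listed in the GROUP BY clause (some RDBMSs accept it anyway and return an arbitrary date per group):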

SELECT Activities.activityName, Works.date,

COUNT(Works.routeNumber) AS countActivity

FROM Activities INNER JOIN

Works ON Activities.activityId = Works.activityId

GROUP BY Activities.activityName

Page 22: Understanding databases and querying

Sub-queries

A SELECT query works on a table or a group of tables, meaning tables are the operands for a SELECT operation.

The output of a SELECT query is (a kind of) a table.

Therefore, an output of a SELECT query can act as an input/operand for another SELECT query.

Page 23: Understanding databases and querying

Why use sub-queries?

Query optimization by breaking a large/complex query into smaller queries that use WHERE clauses to reduce the data size.

Retrieving single-valued records for related tables based on the values of some columns in another query, such as retrieving the most recent (or oldest) record in a table that holds updates to a single record over a period of time.

The above point refers to a common data warehousing use case: storing data that changes over time while preserving the history of those changes.

This is sometimes also referred to as a Slowly Changing Dimension (SCD).

Using a sub-query in a WHERE clause to specify a match on a range of values.

Page 24: Understanding databases and querying

Sub-queries for optimization

Assume that we have a service with one million users.

There are only about 100,000 users that have spent money on our service.

Of the 100,000 users, only about 1,000 users have ever spent 100 dollars or more in one go.

We would most likely have a database with the tables as shown in the diagram

Page 25: Understanding databases and querying

Sub-queries for optimization

You are required to analyze transactions with an amount greater than 100 dollars.

Write down the query that fetches users (userId, name, gender, country) and their transactions (transactionDate, amount).

A sub-optimal query follows on the next slide, but don't peek ahead. Write down one yourself and compare with it later.

Page 26: Understanding databases and querying

Sub-queries for optimization

SELECT users.userId, users.username, users.gender, users.country,

transactions.transactionDate, transactions.transactionAmount

FROM users INNER JOIN

transactions ON users.userId = transactions.userId

WHERE transactions.transactionAmount > 100;

Problems:

There were 100,000 users that had spent money. Of those, there were only 1,000 instances where the amount spent was greater than 100.

Assume that on average there are 2 transactions per user.

If the join is evaluated before the filter, the query above retrieves 200,000 records, and only then is the WHERE clause applied to pick out the 1,000 records where the amount was greater than 100.

This means that 99.5% of the data fetched initially (199,000 of 200,000 records) was of no use and wasted server resources (time and memory).

Page 27: Understanding databases and querying

Sub-queries for optimization

First, we know that we are only interested in transactions worth more than 100 dollars. The following query gets us only these transactions:

SELECT transactions.userId, transactions.transactionDate, transactions.transactionAmount

FROM transactions

WHERE transactions.transactionAmount > 100

Since the output of the above query is a table, we can JOIN it with the users table. The resulting query would be:

SELECT users.userId, users.username, users.gender, users.country,

t1.transactionDate, t1.transactionAmount

FROM users INNER JOIN

(SELECT transactions.userId, transactions.transactionDate,

transactions.transactionAmount

FROM transactions

WHERE transactions.transactionAmount > 100) AS t1 ON users.userId = t1.userId
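The two formulations (join first and filter, or filter in a sub-query and then join) should return the same rows. A small sketch to confirm this, using Python's sqlite3 module: the three users and five transactions are made-up sample data, and the example says nothing about relative speed, which depends on the RDBMS and its query planner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (userId INTEGER PRIMARY KEY, username TEXT, gender TEXT, country TEXT);
CREATE TABLE transactions (userId INTEGER, transactionDate TEXT, transactionAmount REAL);
-- Made-up sample data; user 3 has never spent money.
INSERT INTO users VALUES (1, 'alice', 'F', 'PK'), (2, 'bob', 'M', 'US'), (3, 'carol', 'F', 'UK');
INSERT INTO transactions VALUES
    (1, '2014-01-05', 20), (1, '2014-02-10', 150), (1, '2014-03-15', 20),
    (2, '2014-03-01', 99), (2, '2014-03-02', 101);
""")

# Join users to all transactions, then filter with WHERE.
join_then_filter = conn.execute("""
    SELECT users.userId, users.username, users.gender, users.country,
           transactions.transactionDate, transactions.transactionAmount
    FROM users INNER JOIN
         transactions ON users.userId = transactions.userId
    WHERE transactions.transactionAmount > 100
""").fetchall()

# Filter transactions in a sub-query, then join the smaller result to users.
filter_then_join = conn.execute("""
    SELECT users.userId, users.username, users.gender, users.country,
           t1.transactionDate, t1.transactionAmount
    FROM users INNER JOIN
         (SELECT transactions.userId, transactions.transactionDate,
                 transactions.transactionAmount
          FROM transactions
          WHERE transactions.transactionAmount > 100) AS t1 ON users.userId = t1.userId
""").fetchall()

print(sorted(join_then_filter) == sorted(filter_then_join))
```

Only the two transactions above 100 dollars survive in either form.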

Page 28: Understanding databases and querying

Sub-queries for Retrieving SCD

From the previous example, assume that now we’re interested in knowing when was the last time each of our users spent money along with their gender and country.

How can we go about doing this?

The query that does that is on the next slide, but first try thinking out how you can do that.

Page 29: Understanding databases and querying

Sub-queries for Retrieving SCD

First, let's write a query that retrieves the latest transaction.

SELECT MAX(transactions.transactionDate) AS lastTransactionDate

FROM transactions

OR

SELECT transactions.transactionDate

FROM transactions

ORDER BY transactions.transactionDate DESC

LIMIT 1

But we want to know the last transaction for each user. We can modify the first example as:

SELECT transactions.userId, MAX(transactions.transactionDate) AS lastTransactionDate

FROM transactions

GROUP BY transactions.userId

The second one cannot be modified in a way that would give us the desired result. Why?

SELECT transactions.userId, transactions.transactionDate

FROM transactions

ORDER BY transactions.transactionDate DESC

LIMIT 1

Page 30: Understanding databases and querying

Sub-queries for Retrieving SCD

Now, we need to combine the result with user’s gender and country.

SELECT users.userId, users.gender, users.country, MAX(transactions.transactionDate) AS lastTransactionDate

FROM users LEFT OUTER JOIN

transactions ON users.userId = transactions.userId

GROUP BY users.userId, users.gender, users.country

The query above gives us the desired result, but it has one problem. What?

Page 31: Understanding databases and querying

Sub-queries for Retrieving SCD

We can use the discarded query two slides back if we can parameterize it somehow so that it evaluates for each user and gives us the last date. The following query does that:

SELECT users.userId, users.gender, users.country,

(SELECT transactions.transactionDate

FROM transactions

WHERE transactions.userId = users.userId

ORDER BY transactionDate DESC

LIMIT 1) AS lastTransactionDate

FROM users

The query above does not have a join.

It does not use an aggregate function in the main query and enables us to easily add more columns without worrying about the GROUP BY clause.

Modify the query above (or the one on previous slide) so that we now get the last transaction dates for transactions worth more than 50 dollars for each user. (Answer on next slide)

Page 32: Understanding databases and querying

Sub-queries for Retrieving SCD

SELECT users.userId, users.gender, users.country,

(SELECT transactions.transactionDate

FROM transactions

WHERE transactions.userId = users.userId

AND transactions.transactionAmount > 50

ORDER BY transactions.transactionDate DESC

LIMIT 1) AS lastTransactionDate

FROM users
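A runnable sketch of this correlated sub-query, using Python's sqlite3 module and made-up sample data. Note that the sub-query in the sketch sorts by transactionDate in descending order before applying LIMIT 1; without that ordering, LIMIT 1 would return an arbitrary qualifying transaction rather than the latest one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (userId INTEGER PRIMARY KEY, username TEXT, gender TEXT, country TEXT);
CREATE TABLE transactions (userId INTEGER, transactionDate TEXT, transactionAmount REAL);
-- Made-up sample data; user 3 has no transactions at all.
INSERT INTO users VALUES (1, 'alice', 'F', 'PK'), (2, 'bob', 'M', 'US'), (3, 'carol', 'F', 'UK');
INSERT INTO transactions VALUES
    (1, '2014-01-05', 20), (1, '2014-02-10', 150), (1, '2014-03-15', 20),
    (2, '2014-03-01', 99), (2, '2014-03-02', 101);
""")

# The inner SELECT is re-evaluated for each row of users (it refers to users.userId).
rows = conn.execute("""
    SELECT users.userId, users.gender, users.country,
           (SELECT transactions.transactionDate
            FROM transactions
            WHERE transactions.userId = users.userId
              AND transactions.transactionAmount > 50
            ORDER BY transactions.transactionDate DESC
            LIMIT 1) AS lastTransactionDate
    FROM users
""").fetchall()

last_date_by_user = {user_id: last_date for user_id, _, _, last_date in rows}
print(last_date_by_user)
```

User 1's most recent transaction overall is on 2014-03-15, but it is worth only 20 dollars, so the last qualifying date is 2014-02-10. User 3 has no qualifying transaction and gets NULL (None in Python).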

Page 33: Understanding databases and querying

Handling NULL Values

The query on previous slide would return rows for all one million users with most of them having lastTransactionDate as NULL.

NULLs don’t look good on a result set and are of no value for further analysis. We can resolve this situation in two ways.

Assume that we do need to see all one million users and would like to put a default value for the users that don’t have a transaction (such as 1.Jan.1900). Such values are called ‘sentinels’.

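To replace a NULL with a sentinel value, we can use a function such as ISNULL. The function name varies by RDBMS: ISNULL(value, default) is SQL Server syntax, IFNULL is the MySQL and SQLite equivalent, and COALESCE is the standard SQL form.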

Page 34: Understanding databases and querying

Handling NULL Values

SELECT users.userId, users.gender, users.country,

ISNULL((SELECT transactions.transactionDate

FROM transactions

WHERE transactions.userId = users.userId

AND transactions.transactionAmount > 50

ORDER BY transactions.transactionDate DESC

LIMIT 1), '1.Jan.1900') AS lastTransactionDate

FROM users
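The sentinel idea can be sketched with Python's sqlite3 module. Two things differ from the slide and are assumptions of this example: SQLite has no two-argument ISNULL, so the sketch uses the standard COALESCE function instead, and the sentinel is written as the ISO string '1900-01-01' to match the way the made-up sample dates are stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (userId INTEGER PRIMARY KEY, username TEXT, gender TEXT, country TEXT);
CREATE TABLE transactions (userId INTEGER, transactionDate TEXT, transactionAmount REAL);
-- Made-up sample data; user 3 has no transactions at all.
INSERT INTO users VALUES (1, 'alice', 'F', 'PK'), (2, 'bob', 'M', 'US'), (3, 'carol', 'F', 'UK');
INSERT INTO transactions VALUES
    (1, '2014-01-05', 20), (1, '2014-02-10', 150),
    (2, '2014-03-01', 99), (2, '2014-03-02', 101);
""")

# COALESCE returns its first non-NULL argument, so a NULL sub-query
# result is replaced by the sentinel date.
rows = conn.execute("""
    SELECT users.userId, users.gender, users.country,
           COALESCE((SELECT transactions.transactionDate
                     FROM transactions
                     WHERE transactions.userId = users.userId
                       AND transactions.transactionAmount > 50
                     ORDER BY transactions.transactionDate DESC
                     LIMIT 1), '1900-01-01') AS lastTransactionDate
    FROM users
""").fetchall()

last_date_by_user = {user_id: last_date for user_id, _, _, last_date in rows}
print(last_date_by_user)
```

Every user still appears in the result; user 3, who has no qualifying transaction, now carries the sentinel date instead of NULL.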

Page 35: Understanding databases and querying

Sub-queries in WHERE clause

Or, we can modify the same query as:

SELECT users.userId, users.gender, users.country,

(SELECT transactions.transactionDate

FROM transactions

WHERE transactions.userId = users.userId

AND transactions.transactionAmount > 50

ORDER BY transactions.transactionDate DESC

LIMIT 1) AS lastTransactionDate

FROM users

WHERE users.userId IN (SELECT transactions.userId

FROM transactions

WHERE transactions.transactionAmount > 50)

However, this query is (and, in general, queries that use a sub-query in the WHERE clause are) sub-optimal, to the point of being quite a bad query.
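For completeness, here is a sketch of the IN form using Python's sqlite3 module and made-up sample data. It only demonstrates the result (users without a qualifying transaction are left out entirely); it does not measure performance, which varies by RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (userId INTEGER PRIMARY KEY, username TEXT, gender TEXT, country TEXT);
CREATE TABLE transactions (userId INTEGER, transactionDate TEXT, transactionAmount REAL);
-- Made-up sample data; user 3 has no transactions at all.
INSERT INTO users VALUES (1, 'alice', 'F', 'PK'), (2, 'bob', 'M', 'US'), (3, 'carol', 'F', 'UK');
INSERT INTO transactions VALUES
    (1, '2014-01-05', 20), (1, '2014-02-10', 150),
    (2, '2014-03-01', 99), (2, '2014-03-02', 101);
""")

# The WHERE ... IN sub-query keeps only users that have at least one
# transaction above 50, so no NULL dates can appear in the result.
rows = conn.execute("""
    SELECT users.userId, users.gender, users.country,
           (SELECT transactions.transactionDate
            FROM transactions
            WHERE transactions.userId = users.userId
              AND transactions.transactionAmount > 50
            ORDER BY transactions.transactionDate DESC
            LIMIT 1) AS lastTransactionDate
    FROM users
    WHERE users.userId IN (SELECT transactions.userId
                           FROM transactions
                           WHERE transactions.transactionAmount > 50)
""").fetchall()

print(sorted(rows))
```

Note that the condition on transactionAmount has to be repeated in both sub-queries, which is one reason this form is harder to maintain.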

Page 36: Understanding databases and querying

Many-to-Many Relation Example

We are tasked to design a system for a college.

There are students and there are courses.

We need to provide a basic model that can store data for students, courses and enrollment of students in courses over years and semesters.

A student may have enrolled in multiple courses.

A course may have enrollment of multiple students.

A student may enroll in a course only once in a given semester of a year.

Try modelling the above scenario. The slide following this shows a common way to go about doing this.

Page 37: Understanding databases and querying

Many-to-Many Relation Example
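The diagram for this slide is not reproduced in the transcript. One common way to model a many-to-many relation is a third "junction" table that holds one row per enrollment. The sketch below, using Python's sqlite3 module, is an assumption about what such a model could look like: the table names (Students, Courses, Enrollments), the column names and the sample rows are all hypothetical and are not taken from the original diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Hypothetical schema; names are assumptions for illustration.
CREATE TABLE Students (studentId INTEGER PRIMARY KEY, studentName TEXT NOT NULL);
CREATE TABLE Courses  (courseId  INTEGER PRIMARY KEY, courseName  TEXT NOT NULL);

-- Junction table: each row links one student to one course for one term.
CREATE TABLE Enrollments (
    studentId INTEGER NOT NULL REFERENCES Students(studentId),
    courseId  INTEGER NOT NULL REFERENCES Courses(courseId),
    year      INTEGER NOT NULL,
    semester  TEXT    NOT NULL,
    -- A student may enroll in a course only once per semester of a year.
    UNIQUE (studentId, courseId, year, semester)
);

INSERT INTO Students VALUES (1, 'Student A'), (2, 'Student B');
INSERT INTO Courses  VALUES (10, 'Databases');
INSERT INTO Enrollments VALUES
    (1, 10, 2014, 'Spring'),
    (2, 10, 2014, 'Spring'),
    (1, 10, 2014, 'Fall');   -- same student and course, different semester
""")

# Enrolling the same student in the same course and semester again is rejected.
try:
    conn.execute("INSERT INTO Enrollments VALUES (1, 10, 2014, 'Spring')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

enrollment_count = conn.execute("SELECT COUNT(*) FROM Enrollments").fetchone()[0]
print(duplicate_rejected, enrollment_count)
```

The many-to-many relation is thereby split into two one-to-many relations: Students to Enrollments and Courses to Enrollments.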

Page 38: Understanding databases and querying

Many-to-Many Relation Example

Write a query that retrieves records of enrollment for all students ordered chronologically.

Write a query that retrieves semester-wise enrollment count for all courses.

Write a query that displays students that have enrolled in the same course more than once along with the number of times they had enrolled.

Write a query to display last enrollment for all students.

Page 39: Understanding databases and querying

Questions