Relational Databases and SQL

Session 1

An Introduction

Chris Smith, BRC, April 2004

Outline: Whole Course

1. The Relational Model.

2. Introduction to SQL.

3. Relational Database Systems.

4. Example Database Systems.

5. Database Design and Programming.

6. Database Programming Examples.


Outline: Relational Model and SQL

1. The Relational Model
  • History
  • The Relational Model Summarized
  • Tables and Keys
  • Relational Algebra

2. SQL
  • History
  • Data Manipulation Language
  • Data Definition Language

3. Relational Databases
  • What are they?
  • Why use one?


The Relational Model: History

• Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks. E. F. Codd, IBM Research Report RJ599 (August 1969)

• A Relational Model of Data for Large Shared Data Banks. E. F. Codd, CACM 13 No. 6 (June 1970)

• Research and systems developed in the 1970s (e.g. Ingres, Oracle).


The Relational Model

• Summary of Codd’s work: Data should be represented as relations (tables).

item_table

item_no  description   cost   price  on_hand
011654   Mug            3.50   9.75  150
011665   Cup            2.75   6.54  225
011776   Bowl           5.98  12.34  112
011887   Serving bowl  10.59  27.00   40


Properties of Tables

• A table has a unique name (in some scope).

• Each cell of the table can contain an “atomic” value only.
  – First normal form (“no repeating groups”).

• Each column has a unique name (within the table).

• Values in a column all come from the same domain.

• Each row in the table is distinct.
  – Part of the model, but not actually enforced by SQL databases unless you declare a key!


Relational Model: Jargon

Relational Model (Formal)   Alternative 1   Alternative 2
Relation                    Table           File (not common)
Tuple                       Row             Record
Attribute                   Column          Field

We will generally use Alternative 1.


Defining a Table

• A table is defined by giving a set of attribute and domain name pairs.

• This is called a Table Schema (or Relation Schema).

• A Relational Database Schema is a named set of relation schemas.

• We’ll just say “schema”, or “database schema” if needed.


Keys

• For practical purposes we want to be able to identify rows in our tables.
  – We use keys for this.

• A key is just a set of columns in the table.

• Quite frequently just one column is enough, and quite often it is obvious what it should be.

• There are rules of thumb regarding choosing keys which we will see later.


Keys: Jargon

Superkey: A set of columns that uniquely identifies a row.

Candidate Key: An irreducible superkey (no proper subset of its columns uniquely identifies the table rows).

Primary Key: A selected candidate key.

Foreign Key: A set of columns within one table whose values match a candidate key of some other table.


NULL Values

• A special value “NULL” is provided to allow for cells in a table that have an unspecified value.

• NULL is not the same as zero or the empty string, but represents complete absence of a value.

• Incorporation of NULL in the relational model is contentious, but it’s here to stay.

• No part of a primary key may be NULL.
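The behaviour of NULL in comparisons can be checked directly. This is a minimal sketch using Python's built-in sqlite3 module with an in-memory SQLite database; SQLite is an assumption made for convenience (the course systems are MySQL and PostgreSQL), but the comparison rules shown are standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The sqlite3 module returns SQL NULL as Python's None.
# Comparing NULL with anything (even NULL) yields NULL, not true or false.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]
null_equals_zero = conn.execute("SELECT NULL = 0").fetchone()[0]
null_equals_empty = conn.execute("SELECT NULL = ''").fetchone()[0]

# IS NULL is the only reliable way to test for a missing value.
null_is_null = conn.execute("SELECT NULL IS NULL").fetchone()[0]

print(null_equals_null, null_equals_zero, null_equals_empty, null_is_null)
```

This is why a condition such as WHERE col = NULL never matches any row: use WHERE col IS NULL instead.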


Example Schema

[Figure: example schema diagram, drawn with DeZign for Databases v2.5.2, http://www.datanamic.com]


Hierarchical Data

• The restriction to one atomic piece of data per cell precludes adding hierarchical data directly to a table.

• Use a separate table and a foreign key instead.

• All “spots” are gathered into one table and connected to their owner by the foreign key.

• Using multiple tables helps reduce redundancy, e.g. gene annotation text is not duplicated for every spot with that gene.
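A small sketch of the "separate table plus foreign key" idea. The gene and spot tables, their columns and the sample rows are all hypothetical (the schema diagram from the slide is not reproduced here), and SQLite via Python's sqlite3 module is assumed only so that the example runs anywhere.

```python
import sqlite3

# Hypothetical schema: table names, column names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE gene (
        gene_id    INTEGER PRIMARY KEY,
        annotation TEXT NOT NULL
    );
    CREATE TABLE spot (
        spot_id INTEGER PRIMARY KEY,
        gene_id INTEGER NOT NULL REFERENCES gene (gene_id)
    );
    INSERT INTO gene VALUES (1, 'putative kinase');
    INSERT INTO spot VALUES (10, 1), (11, 1), (12, 1);
""")

# The annotation text is stored once, however many spots refer to it.
annotation_copies = conn.execute("SELECT COUNT(*) FROM gene").fetchone()[0]

# A join reconnects each spot with its owner's annotation.
rows = conn.execute(
    "SELECT s.spot_id, g.annotation "
    "FROM spot s JOIN gene g ON s.gene_id = g.gene_id "
    "ORDER BY s.spot_id"
).fetchall()

# A spot that points at a non-existent gene is rejected by the foreign key.
try:
    conn.execute("INSERT INTO spot VALUES (13, 99)")
    dangling_rejected = False
except sqlite3.IntegrityError:
    dangling_rejected = True

print(annotation_copies, rows, dangling_rejected)
```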


Relational Algebra

• We have seen how to define tables (relations). We want to be able to manipulate them too.

• “The relational algebra is a theoretical language with operations that work on one or more relations to define another relation without changing the original relation(s).” (“Database Systems”, Connolly and Begg.)


Relational Algebra: Unary Operations

• Selection
  – Take a subset of rows from a table (on some criterion).

• Projection
  – Take a subset of columns from a table.


Relational Algebra: Binary Operations 1

• Union
  – Return all rows from two tables.
  – The two tables must have columns with the same domains (union compatibility).

• Intersection
  – Return all rows that appear in both tables.

• Difference
  – Return all rows from one table that are not in the other.
  – The two tables must be union compatible.


Relational Algebra: Binary Operations 2

• Cartesian Product
  – Concatenate every row from one table with every row from another.

• Join
  – Not really a separate operation: can be defined in terms of Cartesian product and selection.
  – Is very important.
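The algebra operations map onto SQL quite directly. The sketch below uses two tiny made-up relations, r and s (hypothetical names and data), in an in-memory SQLite database, and shows that a join gives the same rows as a Cartesian product followed by a selection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r (a INTEGER, b TEXT);
    CREATE TABLE s (a INTEGER, c TEXT);
    INSERT INTO r VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO s VALUES (1, 'p'), (3, 'q'), (4, 'w');
""")

# Selection: a subset of rows.
selection = conn.execute("SELECT * FROM r WHERE a > 1 ORDER BY a").fetchall()

# Projection: a subset of columns.
projection = conn.execute("SELECT b FROM r ORDER BY b").fetchall()

# Cartesian product: every row of r paired with every row of s (3 x 3 rows).
product = conn.execute("SELECT * FROM r CROSS JOIN s").fetchall()

# Join written as product + selection ...
product_then_select = conn.execute(
    "SELECT r.a, r.b, s.c FROM r CROSS JOIN s WHERE r.a = s.a ORDER BY r.a"
).fetchall()

# ... and the same join written with JOIN syntax.
join = conn.execute(
    "SELECT r.a, r.b, s.c FROM r JOIN s ON r.a = s.a ORDER BY r.a"
).fetchall()

print(selection, projection, len(product), join)
```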


Relational Database Management System (RDBMS)

• Implements the relational model and relational algebra (under the covers).

• Provides a language for managing relations.

• Provides a language for accessing and updating data.

• Provides other services:
  – Security.
  – Indexing for efficiency.
  – Backup services (maybe).
  – Distribution services (maybe).


RDBMS Implementation

• An RDBMS is usually implemented as a server program.

• Client programs communicate with the server (typically using TCP/IP).
  – In Unix-based systems the server will run as a daemon.
  – In Windows it will run as a service.


SQL History

• Structured Query Language.

• Officially pronounced S-Q-L, but many people say “sequel”.

• Has its roots in the mid-1970s.

• Standardized in 1986 (ANSI) and 1987 (ISO).

• Further standards in 1992 (ISO SQL2 or SQL-92) and 1999 (ISO SQL3).


SQL Today

• SQL is the only database language to have gained broad acceptance.

• Nearly every database system supports it.

• The ISO SQL standard uses the “Table, Row, Column” terminology rather than “Relation, Tuple, Attribute”.

• Some debate about how closely SQL adheres to the relational model.

• Many different dialects from different vendors.


SQL

• SQL is divided into two parts:
  – Data Manipulation Language
  – Data Definition Language

• Originally designed to be used from another language and not intended to be a complete programming language in its own right.

• Non-procedural. Define what you want, not how to get it.

• Supposed to be “English Like”!


SQL: Syntax

• Can be a little arcane.

• String literals are surrounded by single quotes. Numeric literals are not enclosed in quotes.

• SELECT price FROM item_table WHERE description = 'Mug'


SQL: Data Manipulation Language

• Statements

  – SELECT
  – INSERT
  – UPDATE
  – DELETE


SQL: SELECT

• SELECT is the real workhorse of SQL.
  – It can perform the selection, projection and join operations of the relational algebra.
  – And gets quite complicated.

• “Selects” rows from a table.
  – A database “query”.

• SELECT [DISTINCT] {* | [column_expression [AS name]] [, …]}
  FROM table_name [alias] [, …]
  [WHERE condition]
  [GROUP BY column_list] [HAVING condition]
  [ORDER BY column_list [ASC|DESC]]

• “Condition” is an expression composed of column names (as variables) and comparison operators.
  – The values of the variables range over all entries in the table.


SQL Operators

• =, <>
• IS NULL, IS NOT NULL
• IN (value_list)
• LIKE
  – For string comparison with % and _ wildcards.
  – Standard SQL LIKE is case sensitive.
  – PostgreSQL has ILIKE for case insensitivity.
  – MySQL’s LIKE is case insensitive (but you can use the BINARY keyword to force case sensitivity).
• Regular expression pattern matching (not standard)
  – PostgreSQL: ~
  – MySQL: REGEXP
• Arithmetic operators (+, -, *, /, %)
• Boolean operators (AND, OR, NOT)
• And more…
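A few of these operators in action, as a sketch against the item_table shown earlier. SQLite (through Python's sqlite3 module) is assumed here; the column types are guesses, and the LIKE patterns are chosen so that the result does not depend on whether a given system treats LIKE as case sensitive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_table (
        item_no TEXT PRIMARY KEY, description TEXT,
        cost NUMERIC, price NUMERIC, on_hand INTEGER
    );
    INSERT INTO item_table VALUES
        ('011654', 'Mug',           3.50,  9.75, 150),
        ('011665', 'Cup',           2.75,  6.54, 225),
        ('011776', 'Bowl',          5.98, 12.34, 112),
        ('011887', 'Serving bowl', 10.59, 27.00,  40);
""")

def names(where):
    """Return the descriptions of rows matching a WHERE condition, sorted."""
    sql = "SELECT description FROM item_table WHERE " + where + " ORDER BY description"
    return [row[0] for row in conn.execute(sql)]

ends_in_owl = names("description LIKE '%owl'")    # % matches any run of characters
three_letters = names("description LIKE '_u_'")   # _ matches exactly one character
in_list = names("item_no IN ('011654', '011776')")
cheap_or_scarce = names("price < 7 OR on_hand < 100")

# Arithmetic on column values: the margin on a mug.
margin = conn.execute(
    "SELECT price - cost FROM item_table WHERE description = 'Mug'"
).fetchone()[0]

print(ends_in_owl, three_letters, in_list, cheap_or_scarce, margin)
```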


SQL: SELECT

• SELECT specifies which columns to return (columns can be renamed with “AS”).

• FROM specifies the table(s) being considered.

• WHERE restricts the rows being considered using some criterion.
  – WHERE works strictly on a row-by-row basis.

• GROUP BY essentially executes the query once for each distinct value of the columns named in the GROUP BY clause, and returns one row for each such value.

• HAVING allows you to restrict the groups being considered.

• ORDER BY sorts the results.


SQL: SELECT

• SELECT 1
• SELECT 1+SQRT(2)
• SELECT USER
• SELECT * FROM ath1_results
• SELECT * FROM ath1_results WHERE experiment = 'G'
• SELECT COUNT(*) FROM ath1_results
• SELECT * FROM ath1_results WHERE value > 50
• SELECT clone_norm, COUNT(DISTINCT function)
  FROM quant_genes_temp
  GROUP BY clone_norm
  HAVING COUNT(DISTINCT function) > 1
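To see how WHERE, GROUP BY, HAVING and ORDER BY interact, here is a sketch on a made-up ath1_results table. Only the table name and the experiment and value columns come from the examples above; the column types and the six rows are invented for illustration, and SQLite is assumed as the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ath1_results (experiment TEXT, value REAL);
    INSERT INTO ath1_results VALUES
        ('G', 10), ('G', 60), ('G', 80),
        ('H', 55),
        ('K', 20), ('K', 30);
""")

per_experiment = conn.execute("""
    SELECT experiment, COUNT(*) AS n   -- one output row per group
    FROM ath1_results
    WHERE value > 15                   -- applied row by row, before grouping
    GROUP BY experiment
    HAVING COUNT(*) > 1                -- applied to whole groups, after grouping
    ORDER BY experiment
""").fetchall()

print(per_experiment)
```

Experiment G loses one row to the WHERE clause before it is counted, and experiment H is dropped by HAVING because only one of its rows survives.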


SQL: Aggregate Functions

• Can only appear in the SELECT clause, or a HAVING clause (not in a WHERE clause: WHERE applies to single rows).

• SUM, AVG, COUNT, MIN, MAX
  – Different systems provide others, e.g. PostgreSQL has STDDEV and VARIANCE, MySQL has STDDEV.

• Can use DISTINCT inside the parentheses: COUNT(DISTINCT name)

• Can use COUNT(*) to count number of rows.

• Apart from COUNT(*), NULLs are ignored.
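The treatment of NULLs by aggregate functions is easy to get wrong, so here is a short sketch. The reading table and its three rows are hypothetical, and SQLite is assumed as the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reading (name TEXT, value REAL);
    INSERT INTO reading VALUES ('a', 10), ('b', NULL), ('a', 30);
""")

all_rows, with_value, distinct_names, total, mean = conn.execute("""
    SELECT COUNT(*),              -- counts rows, NULLs included
           COUNT(value),          -- counts non-NULL values only
           COUNT(DISTINCT name),  -- counts different names
           SUM(value),            -- NULL is skipped
           AVG(value)             -- average over the non-NULL values
    FROM reading
""").fetchone()

print(all_rows, with_value, distinct_names, total, mean)
```

Note that AVG divides by the number of non-NULL values (2 here), not by the number of rows.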


SQL: JOIN

• Cartesian product and selection from relational algebra.

• Joining large tables can be very, very slow (because of the product step): make sure you limit the results as much as possible.

• Different types of join determine behavior on mismatches.
  – LEFT JOIN includes rows with values on the left but no matching value on the right, etc.

• Joins recreate the spreadsheet view from a hierarchical view of the data.


SQL: JOIN Examples

• SELECT COUNT(*) FROM trait_measurement m

• SELECT COUNT(*) FROM trait_measurement m, technician c WHERE m.technician_id = c.technician_id

• SELECT COUNT(*) FROM trait_measurement m LEFT JOIN technician c ON m.technician_id = c.technician_id

• SELECT COUNT(*) FROM trait_measurement m FULL JOIN technician c ON m.technician_id = c.technician_id
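The row counts returned by these queries differ because of mismatches. The sketch below assumes a tiny version of the two tables: the table names and technician_id come from the queries above, while measurement_id, name and all the rows are invented. SQLite is assumed as the engine; because older SQLite releases lack FULL JOIN, the full-join count is derived here from two LEFT JOINs rather than run directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE technician (technician_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE trait_measurement (
        measurement_id INTEGER PRIMARY KEY,
        technician_id  INTEGER
    );
    INSERT INTO technician VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cat');
    -- Measurement 4 has no technician; technician 3 has no measurements.
    INSERT INTO trait_measurement VALUES (1, 1), (2, 1), (3, 2), (4, NULL);
""")

def count(sql):
    """Run a COUNT(*) query and return the single number."""
    return conn.execute(sql).fetchone()[0]

plain = count("SELECT COUNT(*) FROM trait_measurement m")

inner = count(
    "SELECT COUNT(*) FROM trait_measurement m, technician c "
    "WHERE m.technician_id = c.technician_id"
)

left = count(
    "SELECT COUNT(*) FROM trait_measurement m "
    "LEFT JOIN technician c ON m.technician_id = c.technician_id"
)

# Technicians with no measurement at all: the extra rows a FULL JOIN would add.
unmatched_technicians = count(
    "SELECT COUNT(*) FROM technician c "
    "LEFT JOIN trait_measurement m ON m.technician_id = c.technician_id "
    "WHERE m.measurement_id IS NULL"
)
full = left + unmatched_technicians

print(plain, inner, left, full)
```

The inner join drops the measurement with no technician, the left join keeps it, and a full join would also keep the technician with no measurements.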


SQL: JOIN Types and Syntax

• JOIN Types
  – INNER JOIN
    • Only exact matches.
  – CROSS JOIN
    • Every pair of rows.
  – OUTER JOIN
    • LEFT or RIGHT.
  – FULL JOIN
    • Mismatches on both sides.

• JOIN conditions
  – ON condition
  – USING (columnName, …)
  – NATURAL
    • Short for “USING all columns with matching names”.


SQL: UNION, EXCEPT, INTERSECT

• (SELECT …) UNION [ALL] (SELECT …)
• (SELECT …) EXCEPT [ALL] (SELECT …)
• (SELECT …) INTERSECT [ALL] (SELECT …)
  – INTERSECT is supported by PostgreSQL, but not MySQL (no big deal).

• The results of the two SELECTs must be union compatible (same number of columns, with matching types).

• Returns a table of distinct rows, unless ALL is specified.
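A sketch of the set operations on two hypothetical one-column tables, a and b. SQLite is assumed as the engine; note that SQLite does not accept the parentheses around each SELECT and supports ALL only with UNION, so the statements are written without them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2), (2), (3);
    INSERT INTO b VALUES (2), (3), (4);
""")

def values(sql):
    """Return the single column of a query as a Python list."""
    return [row[0] for row in conn.execute(sql)]

union = values("SELECT x FROM a UNION SELECT x FROM b ORDER BY x")
union_all = values("SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x")
difference = values("SELECT x FROM a EXCEPT SELECT x FROM b ORDER BY x")
intersection = values("SELECT x FROM a INTERSECT SELECT x FROM b ORDER BY x")

print(union, union_all, difference, intersection)
```

UNION removes the duplicate 2 and 3; UNION ALL keeps every row from both inputs.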


SQL: LIMIT and OFFSET

• Sometimes we want to limit the number of results returned by a query.

• Especially useful on web sites for dividing many result rows between pages.

• SELECT … LIMIT n OFFSET m

• Not always supported, but both MySQL and PostgreSQL have it.
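Paging through results with LIMIT and OFFSET might look like the sketch below. The result table, the page() helper and the page size of 3 are all hypothetical; SQLite is assumed, which accepts the same LIMIT n OFFSET m form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (n INTEGER)")
conn.executemany("INSERT INTO result VALUES (?)", [(n,) for n in range(1, 11)])

def page(page_number, page_size=3):
    """Return one page of results; pages are numbered from 1."""
    offset = (page_number - 1) * page_size
    rows = conn.execute(
        # ORDER BY matters: without it the row order, and so the paging, is undefined.
        "SELECT n FROM result ORDER BY n LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return [row[0] for row in rows]

first, second, last = page(1), page(2), page(4)
print(first, second, last)
```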


Other Functions

• SQL allows other functions in SELECT statements.

• Highly dependent on the particular RDBMS being used.

• Some standard ones:
  – CURRENT_DATE
  – CURRENT_TIME
  – SUBSTRING
  – || (string concatenation)
  – LOWER, UPPER


SQL: INSERT

• INSERT INTO table_name [(column_list)] VALUES (value_list)

• INSERT INTO table_name [(column_list)] SELECT …

• Column_list is optional, but if not provided you must give values for all columns.
  – Defaults can be specified when the table is created.

• Second form allows moving data from table to table.
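Both INSERT forms, sketched with SQLite. The item_table columns follow the earlier slide, but the column types, the DEFAULT 0 on on_hand and the low_stock table are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_table (
        item_no TEXT PRIMARY KEY, description TEXT,
        cost NUMERIC, price NUMERIC,
        on_hand INTEGER DEFAULT 0          -- assumed default, for illustration
    );
    CREATE TABLE low_stock (item_no TEXT, description TEXT);
""")

# No column list: a value for every column, in table order.
conn.execute("INSERT INTO item_table VALUES ('011654', 'Mug', 3.50, 9.75, 150)")

# With a column list: on_hand is omitted, so its DEFAULT is used.
conn.execute(
    "INSERT INTO item_table (item_no, description, cost, price) "
    "VALUES ('011887', 'Serving bowl', 10.59, 27.00)"
)

# Second form: copy rows selected from one table into another.
conn.execute(
    "INSERT INTO low_stock (item_no, description) "
    "SELECT item_no, description FROM item_table WHERE on_hand < 100"
)

default_on_hand = conn.execute(
    "SELECT on_hand FROM item_table WHERE item_no = '011887'"
).fetchone()[0]
copied = conn.execute("SELECT item_no, description FROM low_stock").fetchall()

print(default_on_hand, copied)
```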


SQL: UPDATE

• UPDATE table_name SET col1 = val1[, col2 = val2 …] [WHERE condition]

• In general cannot UPDATE based on data in other tables.
  – Both MySQL and PostgreSQL provide an extension to allow this.


SQL: DELETE

• DELETE FROM table_name [WHERE condition]

• (Too) easy to delete everything from a table.

• In general cannot DELETE based on data in other tables.
  – MySQL provides an extension to allow this.
  – PostgreSQL does not.
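A sketch of UPDATE and DELETE with and without a WHERE clause, run against a copy of item_table in SQLite (an assumption; the statements themselves are plain SQL). It also shows how easy it is to empty a table by leaving the WHERE clause off.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_table (
        item_no TEXT PRIMARY KEY, description TEXT,
        cost NUMERIC, price NUMERIC, on_hand INTEGER
    );
    INSERT INTO item_table VALUES
        ('011654', 'Mug',           3.50,  9.75, 150),
        ('011665', 'Cup',           2.75,  6.54, 225),
        ('011776', 'Bowl',          5.98, 12.34, 112),
        ('011887', 'Serving bowl', 10.59, 27.00,  40);
""")

def row_count():
    return conn.execute("SELECT COUNT(*) FROM item_table").fetchone()[0]

# UPDATE several columns of the rows matching the condition.
cur = conn.execute(
    "UPDATE item_table SET price = 10.00, on_hand = on_hand - 1 "
    "WHERE description = 'Mug'"
)
updated = cur.rowcount
mug = conn.execute(
    "SELECT price, on_hand FROM item_table WHERE description = 'Mug'"
).fetchone()

# DELETE only the rows matching the condition.
cur = conn.execute("DELETE FROM item_table WHERE on_hand < 100")
deleted = cur.rowcount
remaining = row_count()

# No WHERE clause: every row goes.
conn.execute("DELETE FROM item_table")
after_unconditional_delete = row_count()

print(updated, mug, deleted, remaining, after_unconditional_delete)
```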


SQL: Data Definition Language

• Statements:
  – CREATE
    • TABLE, VIEW, INDEX
  – ALTER
    • TABLE
  – DROP
    • TABLE, VIEW, INDEX


SQL: CREATE TABLE

• CREATE TABLE table_name ({column_name data_type [DEFAULT value] [, …]} [, PRIMARY KEY (column_list)])

• Simplified! – Integrity mechanisms not shown.


SQL Numeric Data Types

• Exact numeric types
  – NUMERIC [(precision[,scale])]
  – DECIMAL [(precision[,scale])]
    • DECIMAL(5,2) means five digits in total, two of them after the decimal point (up to 999.99).
  – INTEGER (INT)
  – SMALLINT
  – BIGINT

• Approximate numeric types
  – FLOAT [(precision)]
  – REAL
  – DOUBLE PRECISION


SQL Character Types

• CHAR(length)
  – Short form of CHARACTER

• VARCHAR(length)
  – Short form of CHARACTER VARYING

• PostgreSQL allows TEXT type for “long” character data fields.

• MySQL has TINYTEXT, TEXT, MEDIUMTEXT and LONGTEXT types!


SQL Date and Time Types

• DATE

• TIME [WITH TIME ZONE]

• TIMESTAMP [WITH TIME ZONE]

• INTERVAL
  – Not available in MySQL.


SQL: CREATE VIEW

• A view is a “virtual table”.

• Not available in MySQL 4.
  – Supposed to be coming in version 5.

• Created as needed from a SELECT statement given when the view is defined.

• CREATE VIEW view_name AS SELECT …
  – Simplified!

• Often used to restrict access to a table (by hiding some columns or rows).

• Also used to “hide” complex queries in the database (rather than repeating them in code).
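A sketch of a view that hides a column. The price_list view name is hypothetical, the column types are guesses, and SQLite is assumed as the engine. The view exposes item_table without the cost column, and because it is computed when queried it reflects rows added to the base table later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_table (
        item_no TEXT PRIMARY KEY, description TEXT,
        cost NUMERIC, price NUMERIC, on_hand INTEGER
    );
    INSERT INTO item_table VALUES
        ('011654', 'Mug', 3.50, 9.75, 150),
        ('011665', 'Cup', 2.75, 6.54, 225);

    -- The view leaves out cost and on_hand.
    CREATE VIEW price_list AS
        SELECT item_no, description, price FROM item_table;
""")

cur = conn.execute("SELECT * FROM price_list ORDER BY item_no")
view_columns = [column[0] for column in cur.description]
before = cur.fetchall()

# A row added to the base table shows up in the view with no further work.
conn.execute("INSERT INTO item_table VALUES ('011776', 'Bowl', 5.98, 12.34, 112)")
after = conn.execute(
    "SELECT description FROM price_list ORDER BY item_no"
).fetchall()

print(view_columns, before, after)
```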


SQL: CREATE INDEX

• Used to enhance performance of SELECTs (may slow down INSERTs, since the index must be updated).

• Index columns that are used frequently for lookups.

• Primary key columns are usually automatically indexed.

• CREATE INDEX index_name ON table_name (col1 [, …])
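Creating an index and checking that it exists, as a sketch in SQLite. The index name idx_item_description is hypothetical. The sqlite_master catalog and EXPLAIN QUERY PLAN are SQLite-specific; MySQL and PostgreSQL have their own EXPLAIN commands with different output, and the exact plan text varies between SQLite versions, so it is only printed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item_table (item_no TEXT PRIMARY KEY, description TEXT, price NUMERIC)"
)
conn.execute("CREATE INDEX idx_item_description ON item_table (description)")

# SQLite lists indexes in its sqlite_master catalog table.
index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'item_table'"
    )
]

# Ask the planner how it would run a lookup on the indexed column.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM item_table WHERE description = 'Mug'"
).fetchall()

print(index_names)
for step in plan:
    print(step)
```

The list of indexes also contains one that SQLite created automatically for the primary key, which matches the point above about primary key columns usually being indexed.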


SQL: DROP

• DROP is used to remove tables, views and indices from the system.
  – DROP TABLE table_name
  – DROP INDEX index_name
  – DROP VIEW view_name

• For a table: all data in the table will be lost.


Creating a Database

• Creation of an entire database tends to depend on the RDBMS being used.

• An RDBMS usually allows multiple named databases to be accessed through a single instance of the database server.


When to Use an RDBMS?

• Good for large amounts of data.
  – Indexing capabilities.

• Frequent updates:
  – Insertions of new values.

• Many different views of the data wanted.

• Associations between different entities (foreign keys).

• Data integrity.
  – Constraints.
  – Transactions.
    • ACID = Atomicity, Consistency, Isolation, Durability.

• Integration with other systems, e.g. web pages.

• Sharing data between users.
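To make "constraints" and "transactions" a little more concrete, here is a sketch of atomicity. The account table and the transfer are invented for illustration, and SQLite with Python's sqlite3 module is assumed; a CHECK constraint rejects the second half of a two-step update, and the transaction rollback undoes the first half too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # constraint: no overdrafts
)
conn.execute("INSERT INTO account VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    # Used as a context manager, the connection commits if the block succeeds
    # and rolls back if an exception escapes from it.
    with conn:
        conn.execute("UPDATE account SET balance = balance + 150 WHERE name = 'b'")
        # This would leave 'a' at -50, so the CHECK constraint rejects it.
        conn.execute("UPDATE account SET balance = balance - 150 WHERE name = 'a'")
    transfer_ok = True
except sqlite3.IntegrityError:
    transfer_ok = False

balances = dict(conn.execute("SELECT name, balance FROM account"))
print(transfer_ok, balances)
```

Neither account changes: the credit to 'b' is undone along with the failed debit from 'a'.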


Plain Old Text Files

• Can be perfect (even for largish amounts of data).

• Easier to hand over to someone else.
  – Don’t have to say “first install database X”.

• Not great for updates to existing values.

• No integrity checks (can be made in code).


SAS Datasets

• SAS allows SQL queries on its datasets.

• Datasets can be merged (= joined).

• Probably not indexed (speed).

• Very good for personal analysis of data, less good for shared data.


What We Have Not Covered

• Transactions and referential integrity
  – Very important in systems that are frequently updated.
  – Less important in “read only” or infrequently updated databases.
  – Add greatly to the complexity of RDBMSs.

• SQL stored procedures.

• Security.

• How to get data into the database from external sources.


References

• “Database Systems”, Connolly and Begg, Addison Wesley, 3rd Edition, 2002

• “PostgreSQL”, Douglas and Douglas, SAMS Publishing, 2003

• “MySQL”, DuBois, SAMS Publishing, 2nd Edition, 2003

• http://www.postgresql.org
  – Recommended website for further reading.

• http://www.mysql.com


Summary

• RDBMSs are good at manipulating data.

• Need to decide if you need one.

• SQL is the standard language.
  – Standard up to a point.


Session 2

Installing and Using a Database.


Outline: Example Systems

1. Choose MySQL or PostgreSQL.
  • Or both if you want to compare them.
  • They will happily co-exist on the same machine.

2. System installation and setup.

3. Command-line interaction with the chosen system.
  • Creating a new database (if necessary).
  • Creating some example tables.

4. Graphical tools.

5. User rights assignment.


Operating Systems

• The lab we are using has Windows-based laptops. So we will be installing these systems on Windows.

• Both databases run on a variety of Unix-based systems.


Resources

• ftp://statgen.ncsu.edu/pub/chris/sql_course

• Course CD


Choosing a Database

• MySQL
  – Install is easier.
  – Has easier tools.

• PostgreSQL
  – More powerful database.
    • But MySQL is catching up.
  – No graphical tools (in the PostgreSQL package itself).
    • There is pgAdmin3 (I haven’t tried this yet).

• For introductory purposes it won’t really matter which you choose.


Recommendation is…

• MySQL


Downloads for MySQL

• http://www.mysql.com
  – Want Windows “with installer” version.
  – Version 4.0.18.
  – Version 4.1 has considerably better SQL support, but is still alpha.

• Also get
  – MySQL ODBC (full version)
  – MySQL Administrator (maybe)
  – MySQL Control Center

• All on the CD.

• Manual at:
  – http://dev.mysql.com/doc/mysql/en/index.html


Downloads for PostgreSQL

• http://www.postgresql.org
  – No Windows build available directly from the PostgreSQL website!

• http://www.cygwin.com
  – Latest (version 1.5.9-1).
  – (Setup program version is 2.416.)

• On the CD.

• Manual at:
  – http://www.postgresql.org/docs/7.4/static/index.html


Command Line Interfaces

• For both MySQL and PostgreSQL we will be using command line interfaces. They are called mysql and psql respectively.

• Both interfaces give you a command line prompt.

• In both cases you can type some commands (e.g. help) and hit the enter key.

• You can also type SQL statements terminated with a semicolon.
  – If you do not give the semicolon the application will assume that you are going to type more and will change the prompt slightly to indicate this. (Try it out.)


MySQL Installation Plan

• Install the program.
  – Using the supplied installer.
• Check that it runs OK.
• Stop the server.
• Perform some extra setup.
• Restart the server.
• Try it out.
• Complete configuration.
  – Create a test database.
• Install supplementary programs.


MySQL Installation: Simple Version

• Really easy!
• Read the printed instructions for alternatives.
• On the CD, directory mysql\mysql-4.0.18-win contains what you need: run setup.exe.
• Choose “Typical” installation.
• Accept default directory.
  – May want to use a non-default directory if you have limited space on your C drive.
  – If so, you need to create an options file for MySQL (use Notepad).


Running MySQL the first time

• Open a command-line window.
• Change directory to c:\mysql\bin. (Or your installation directory.)
• Choose a server(!)
  – mysqld-max-nt recommended.
• Run “mysqld-max-nt --console”
  – “console” option directs messages to the screen.
  – Should see a number of messages ending in:
    mysqld: ready for connections
    Version: '4.0.14-log' socket: '' port: 3306

• It’s running!


Stopping MySQL

• Open a new command-line window.

• In c:\mysql\bin run:
  – “mysqladmin -u root shutdown”

• Server will stop.


Restarting the MySQL server

• Open a command-line window.

• Change directory to c:\mysql\bin.

• Run “mysqld-max-nt”.

• Info and errors will be logged to c:\mysql\data\<machine_name>.err


Running MySQL as a service

• A Windows service starts whenever the system is booted.
  – No need for a user to log on.
  – Correct thing to do on a production server.
  – On a personal machine it’s up to you.

• From a command line:
  – mysqld-max-nt --install
    • Adds the service.

• Use standard Windows tools to control the service.


Running MySQL from a Command Line

• Command-line interface is mysql.
    mysql -u <username> <dbname>
    mysql -p -u <username> <dbname>

• -p means “use password”.

• You will get a prompt:
    mysql>


Security Issue

• MySQL opens a TCP/IP port (3306 by default) on your machine.

• Hackers will try to attack this port – just like they do any other.

• Run a firewall!
  – Locally
    • E.g. ZoneAlarm from ZoneLabs.
    • Try the free version.
  – On a router.


Securing your server

• After installation anyone can connect and have root user privileges (within the database).
• Change directory to c:\mysql\bin.
• To give root a password:
  – Run “mysql -u root”.
  – At the “mysql>” prompt execute:
      SET PASSWORD FOR 'root'@'localhost' = PASSWORD('rootpass');
      SET PASSWORD FOR 'root'@'%' = PASSWORD('rootpass');
      QUIT
• From now on you will need “mysql -p -u root” to start mysql.
• To remove anonymous users:
  – Restart mysql with “mysql -p -u root”.
  – At the “mysql>” prompt execute:
      USE mysql;
      DELETE FROM user WHERE user = '';
      DELETE FROM db WHERE user = '';
      FLUSH PRIVILEGES;


Create your own database

• Create a user to be the administrator of the new database. (Not absolutely necessary, but recommended.)

    mysql> GRANT ALL ON testdb.* TO 'testroot'@'localhost' IDENTIFIED BY 'xxxx';

  – 'xxxx' is the user’s password.

• Create the new database:

    mysql> CREATE DATABASE testdb;

• Restart mysql with the new user:

    mysql -p -u testroot

• Switch to the new database:

    mysql> use testdb;


MyODBC Installation

• From a command line:
  – Run MyODBC-3.51.06.exe
  – No questions asked!


Setting up a DSN

• Open the Windows Control Panel.
• Find the “ODBC Data Sources” program.
• Click the “File DSN” tab.
• Click the Add button.
• Select the “MySQL 3.51 Driver”. Click “Next”.
• Choose a filename. Click “Next”.
• Fill in your MySQL details. Click “OK”.


Reading data into Excel

• Open Excel.
• Open the “Data” menu.
• Click “Import external data”.
• Click “Import data”.
• Find your .dsn file.
• Click “Open”.
• Select the table you want.
  – If there is only one table available you won’t be given a choice.


Installing MySQL Administrator

• Optional.

• “Alpha” software
  – Not yet fully tested.
  – May contain serious bugs.

• GUI administration tool.

• Run “setup.exe” from mysql-administrator-1.0.3-alpha-win.

• No difficult questions.


Using MySQL Administrator

• Run it from “Start”, “All Programs”, “MySQL”.

• Allows:
  – Stopping and starting MySQL.
  – User administration.
  – Backup and restore.
  – …

• Worth a look if you are managing a server.

• Worth trying if you prefer a GUI.


Installing the MySQL Control Center

• Run setup.exe from mysqlcc-0.9.4-win32.

• Can either install translations or turn their installation off.

• Run the control center from “Start”, “All programs”, “MySQL Control Center”.
  – Slightly annoying that it puts it in a separate menu entry from the MySQL Administrator.


Using the MySQL Control Center

• Start it up.
• “Register” a new server.
• Use “localhost” as the host name.
• User name can be “root”.
• Enter your password.
• Click “Test” to make sure it’s OK.
• Click “Add”.
• Select a database and click “SQL”, or double click a table name (and then click “SQL”).
  – Lets you enter queries and see results.
  – You can update the values in the database by editing them.

• Looks like an OK tool for playing with SQL.


PostgreSQL Installation Plan

• PostgreSQL is not a “native” Windows application.
  – Needs some Unix functions.
  – But there’s nothing magic about Unix functions.
  – An upcoming version will have a native Windows binary included (version 7.5 or 8).

• PostgreSQL can be run under Cygwin.
  – A set of libraries for Windows that implement Unix functions.

• Plan is to install Cygwin.
  – It includes a build of PostgreSQL.
  – It also includes a version of perl.


Installing Cygwin (and Postgres)

• Run setup.exe from the cygwin directory.
• Choose “install from local directory”.
• Choose an installation directory (default is c:\cygwin).
• When asked to specify the local package directory click browse and choose the long name beginning with http …
• Click Next.
• Under “Admin” check cygrunsrv.
• Under “Databases” check postgresql.
• Under “Devel” make sure that cygipc is checked.
• When installation completes choose to have an icon placed on the desktop or in the start menu.


Completing PostgreSQL Installation

• Start a cygwin (bash) command line (from the icon that was added).

• Start the IPC daemon:
    ipc-daemon2 &

• Initialize the database:
    initdb -D /var/postgresql/data

• Start the database server:
    postmaster -i -D /var/postgresql/data &

• The “-i” tells Postgres to accept TCP/IP connections.


Stopping PostgreSQL

• At a bash command line type:
    pg_ctl -D /var/postgresql/data stop


Running PostgreSQL as a service

• More complex than for MySQL.

• Need to get ownership of files correct.

• Uses cygrunsrv program from cygwin installation.

• Read the installation notes!


Using PostgreSQL from a Command Line

• The command-line interface is psql.
    psql -U <username> <dbname>

• You will get a prompt:
    dbname=#


Creating your database

• At a bash command line:
    createuser -a -d -P -E testroot
    createdb -O testroot testdb
    psql -U testroot testdb

• Still won’t be prompted for a password: need to change a configuration file.
• Edit file: c:\cygwin\var\postgresql\data\pg_hba.conf
• Change “trust” to “password” in the configuration lines at the bottom of the file.
• Restart the server.
    pg_ctl -D /var/postgresql/data stop
    postmaster -i -D /var/postgresql/data &

• You will now be prompted for a password when you run psql.


Cygwin Notes

• A Unix-style directory structure exists inside your cygwin installation directory.

• C:\cygwin\usr
  C:\cygwin\var
  C:\cygwin\home
  …

• When you cd /usr at the bash command prompt you actually change directory to c:\cygwin\usr.

• To get “out” to a directory that is not a subdirectory of c:\cygwin use “cd /cygdrive/c/…”

• (/cygdrive/d/… for drive d etc.)


Example Data

• Arabidopsis genome
  – (Large) XML files from TIGR.
  – “Flattened” into two relational tables.

• create_ath1.sql
  – Contains definitions for the tables.

• ath1_gene.txt, ath1_feat.txt
  – Contain the data.


Create the tables

• Change to directory with create_ath1.sql

• MySQL
    mysql -p -u testroot testdb
    mysql> source create_ath1.sql
    mysql> show tables;
    mysql> describe ath1_gene;

• PostgreSQL
    psql -U testroot testdb
    \i create_ath1.sql
    \d
    \d ath1_gene


Copy the data

• MySQL
    mysql> load data local infile 'c:/…./ath1_gene.txt' into table ath1_gene;
    mysql> load data local infile 'c:/…./ath1_feat.txt' into table ath1_feat;
  – Note: forward slashes (or double up backslashes).

• PostgreSQL
    testdb=# copy ath1_gene from '/cygdrive/c/…/ath1_gene.txt';
    testdb=# copy ath1_feat from '/cygdrive/c/…/ath1_feat.txt';
  – Note: forward slashes and “/cygdrive/c/…”.


The Example Tables

• ath1_gene
  – One row for each TU (transcription unit) from the original XML files.
  – Some other information:
    • Start and end position (base pair).
    • Annotations from TAIR and Affymetrix.
  – Data is from last year.

• ath1_feat
  – One row for each “feature” in the corresponding TU.
    • UTRs, introns, cds
  – The “model” ids indicate alternative splicing possibilities.


Exercises

1. Get a list of the different type codes found in the ath1_feat table.

2. How many genes are there?

3. Get the number of genes per chromosome.

4. Get a list of gene ids and gene lengths.
  • Watch out for negative lengths!

5. What is the length of the longest gene?

6. What is the length of the shortest gene?

7. What is the average length of a gene?


Harder Exercises

1. Get a list of the lengths of the chromosomes.
  • (Slightly tricky because of the reversals.)

2. Get a list with 3 columns:
  • Chromosome number, chromosome length, number of genes in chromosome.

3. How many genes have more than one model?

4. Which gene has most “features”?

5. Do the numbers in the gene identifiers “At1g01010” etc. always go up as distance along the chromosome increases? (Do this one chromosome at a time.)


Solutions to exercises

1. SELECT DISTINCT type FROM ath1_feat;

2. SELECT COUNT(*) FROM ath1_gene;

3. SELECT chromosome, COUNT(*) FROM ath1_gene GROUP BY chromosome;

4. SELECT tair_id, end_bp - start_bp FROM ath1_gene;
  • Use ABS() to get rid of the negative values.
  • Reverse direction is indicated by the start/end values being reversed.

5. SELECT tair_id, ABS(end_bp - start_bp) AS len FROM ath1_gene ORDER BY len DESC LIMIT 1;

6. SELECT tair_id, ABS(end_bp - start_bp) AS len FROM ath1_gene ORDER BY len ASC LIMIT 1;

7. SELECT AVG(ABS(end_bp - start_bp)) FROM ath1_gene;
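The solutions can be tried without the course data. This sketch runs several of them on three made-up genes in SQLite; the column names are taken from the queries above, while the column types and the rows are hypothetical, and one gene deliberately has its start and end reversed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ath1_gene (
        tair_id TEXT PRIMARY KEY, chromosome INTEGER,
        start_bp INTEGER, end_bp INTEGER
    );
    INSERT INTO ath1_gene VALUES
        ('At1g01010', 1, 100, 400),
        ('At1g01020', 1, 900, 500),   -- reversed: start_bp > end_bp
        ('At2g01010', 2,  50, 150);
""")

# Exercise 3: number of genes per chromosome.
per_chromosome = conn.execute(
    "SELECT chromosome, COUNT(*) FROM ath1_gene "
    "GROUP BY chromosome ORDER BY chromosome"
).fetchall()

# Exercise 4 without ABS(): the reversed gene gets a negative length.
raw_lengths = conn.execute(
    "SELECT tair_id, end_bp - start_bp FROM ath1_gene ORDER BY tair_id"
).fetchall()

# Exercises 5 and 6: longest and shortest gene.
longest = conn.execute(
    "SELECT tair_id, ABS(end_bp - start_bp) AS len FROM ath1_gene "
    "ORDER BY len DESC LIMIT 1"
).fetchone()
shortest = conn.execute(
    "SELECT tair_id, ABS(end_bp - start_bp) AS len FROM ath1_gene "
    "ORDER BY len ASC LIMIT 1"
).fetchone()

# Exercise 7: average gene length.
average = conn.execute(
    "SELECT AVG(ABS(end_bp - start_bp)) FROM ath1_gene"
).fetchone()[0]

print(per_chromosome, raw_lengths, longest, shortest, average)
```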


Solutions to Harder Exercises

1. PostgreSQL:
   SELECT chromosome, MAX(CASE WHEN start_bp > end_bp THEN start_bp ELSE end_bp END) FROM ath1_gene GROUP BY chromosome;
   MySQL:
   SELECT chromosome, MAX(GREATEST(start_bp, end_bp)) FROM ath1_gene GROUP BY chromosome;

2. SELECT chromosome, MAX(GREATEST(start_bp, end_bp)), COUNT(start_bp) FROM ath1_gene GROUP BY chromosome;

3. PostgreSQL:
   SELECT COUNT(*) FROM (SELECT ath1_gene_id, COUNT(model_id) FROM ath1_feat GROUP BY ath1_gene_id HAVING COUNT(model_id) > 1) temp;
   MySQL (can’t use same query since no subselects):
   CREATE TABLE temp AS SELECT ath1_gene_id, COUNT(model_id) FROM ath1_feat GROUP BY ath1_gene_id HAVING COUNT(model_id) > 1;
   (MySQL will tell you how many rows in the result table, or SELECT COUNT(*) FROM temp;)
   DROP TABLE temp;

4. SELECT ath1_gene_id, COUNT(*) AS num FROM ath1_feat GROUP BY ath1_gene_id ORDER BY num DESC LIMIT 1;

5. SELECT a1.ath1_gene_id, a2.ath1_gene_id, a1.start_bp, a2.start_bp FROM ath1_gene a1, ath1_gene a2 WHERE a1.chromosome = 1 AND a2.chromosome = 1 AND a1.tair_id > a2.tair_id AND a1.start_bp < a2.start_bp;
  • Will take some time to run!
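A sketch of harder exercises 1, 3 and 4 on a few made-up rows, using SQLite. The table layouts and data are hypothetical. One point to check against the real data: if ath1_feat really holds one row per feature, then COUNT(model_id) counts features, and COUNT(DISTINCT model_id) is what counts models. The sketch computes both so the difference is visible; that reading of the schema is an assumption, not something the course files confirm.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ath1_gene (
        ath1_gene_id INTEGER PRIMARY KEY, chromosome INTEGER,
        start_bp INTEGER, end_bp INTEGER
    );
    CREATE TABLE ath1_feat (ath1_gene_id INTEGER, model_id INTEGER, type TEXT);
    INSERT INTO ath1_gene VALUES (1, 1, 100, 400), (2, 1, 900, 500), (3, 2, 50, 150);
    INSERT INTO ath1_feat VALUES
        (1, 1, 'cds'), (1, 1, 'intron'), (1, 2, 'cds'),   -- gene 1: two models
        (2, 1, 'cds'), (2, 1, 'utr'),                     -- gene 2: one model, two features
        (3, 1, 'cds');                                    -- gene 3: one model, one feature
""")

# Exercise 1: the CASE form is portable; GREATEST() is MySQL-specific.
chromosome_lengths = conn.execute("""
    SELECT chromosome,
           MAX(CASE WHEN start_bp > end_bp THEN start_bp ELSE end_bp END)
    FROM ath1_gene GROUP BY chromosome ORDER BY chromosome
""").fetchall()

def genes_having(condition):
    """Count genes whose group of feature rows satisfies a HAVING condition."""
    sql = (
        "SELECT COUNT(*) FROM ("
        " SELECT ath1_gene_id FROM ath1_feat"
        " GROUP BY ath1_gene_id HAVING " + condition +
        ") AS g"
    )
    return conn.execute(sql).fetchone()[0]

# Exercise 3: genes with more than one model versus more than one feature row.
multi_model = genes_having("COUNT(DISTINCT model_id) > 1")
multi_feature = genes_having("COUNT(model_id) > 1")

# Exercise 4: the gene with the most features.
most_features = conn.execute(
    "SELECT ath1_gene_id, COUNT(*) AS num FROM ath1_feat "
    "GROUP BY ath1_gene_id ORDER BY num DESC LIMIT 1"
).fetchone()

print(chromosome_lengths, multi_model, multi_feature, most_features)
```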


Session 3

Script Languages and Database Access


Script Languages and Database Access

1. Tidy-up from previous sessions.

2. Perl and the DBI.

3. PHP and the Pear DB.

4. Example script.

5. Users and User Rights.


Errata

• Session 1:
  JOIN types differ on their treatment of mismatches (not NULLs as written on the slide).

• Session 2:
  The MySQL ODBC driver will let you update data through Microsoft Access.


Resetting the MySQL root password

• Exit the mysql client (if you have it running).
• “Kill” the mysql server using the Windows task manager (look for processes named mysqld…).
  – May not be one running.
• Restart the server using:
    mysqld-max-nt --skip-grant-tables
• Run the following commands:
    mysqladmin -u root flush-privileges password "rootpass"
    mysqladmin -p -u root shutdown
• Everyone using one of the lab systems should use “rootpass”, so it doesn’t matter if you get a different machine next time.
• (Enter the new password on the shutdown command.)
• Restart the server (without the skip-grant-tables option).


Disallowing network connections in MySQL

• When starting the server use these options:

    --skip-networking
    --enable-named-pipe


Resetting the PostgreSQL password

• Stop the postmaster.
    pg_ctl -D /var/postgresql/data stop
• Edit /var/postgresql/data/pg_hba.conf.
• Set the authentication mechanism for local and IP address 127.0.0.1 to “trust”.
• Restart the postmaster.
• Use psql to change the password for user “testroot”.
    ALTER USER testroot PASSWORD 'password';
• Stop the postmaster and restore the authentication mechanisms.
• Restart the postmaster.
• Can also tell the postmaster to reload the configuration files using:
    pg_ctl -D /var/postgresql/data reload
• Clearly need to be careful about who can run pg_ctl, and who can edit pg_hba.conf!


Poll

• Who would like me to go through the solutions to the exercises and explain why/how/if they work?


Script Languages

• Interpreted (rather than compiled)
  – Write it and try it.

• Dynamically typed
  – Don’t have to declare the type of a variable.
  – May not have to declare variables at all.

• Perl, PHP, Python, Ruby, JavaScript, …


Perl

• Popular scripting language.

• Somewhat C-like.

• A lot of quirks.

• A lot of add-on modules.
  – BioPerl (http://www.bioperl.org)


Perl Information

• Get Perl info and packages from:
    http://cpan.org

• There is an Apache module mod_perl allowing efficient execution of Perl scripts within a web server.


Installing ActiveState Perl

• Will use ActiveState Perl with MySQL.
  – Built for Windows.
  – Nice installer.

• Use the .msi file:
    ActivePerl-5.8.3.809-MSWin32-x86.msi
  – On Windows 98 you may need to install the Microsoft Installer. On XP or 2000 it is already present.

• Have the installer add the Perl binary directory to your path.

• Install takes some time (but you just have to sit and wait).


Check your install

• From a command line type:
    perl -v

• Should get version information for ActiveState Perl v5.8.3 binary build 809.

• Type:
    perl hello.pl

• hello.pl is in course downloads (but is very simple: print "Hello, world\n";).

• Should get “Hello, world” printed in response.


Installing the Perl DBI

• We will use the Perl DBI (Database Interface).
• With ActiveState Perl there is a “package management” tool “ppm”.
• At a Windows command prompt enter:
    ppm
• You will get a ppm> prompt.
• Type:
    install DBI
• You should get some lines of information ending with something like “successfully installed”.
• Then type:
    install DBD-mysql
• You should get more lines of information and another “successfully installed” message.
• Type “q” to quit from ppm.


Testing the Perl DBI/MySQL DBD

• mysql_test.pl is in the downloaded course files (and on the CD).

• Edit the mysql_test.pl file using Notepad (or another editor).

• You will see lines of Perl code setting the database name, the user name and password. Check that these match your installation.

• Make sure the MySQL server is running.
• At Windows command line type:

perl mysql_test.pl

• Should get 20 lines of results from the ath1_gene table.


Installing Cygwin Perl

• You can also install a ready-built Perl through Cygwin.

• Use this if you installed PostgreSQL under Cygwin.

• Start the Cygwin setup utility.
• Click through to the package selection dialog.
• Add perl (under interpreters).
• Also add gcc and make (under “Devel” - they will be used to install the perl DBD for PostgreSQL).


Adding the DBI under Cygwin

• Check that the Perl you are using is the Cygwin version - type:

perl -v

at the Windows command line.

• Copy DBI-1.40.tar.gz and DBD-Pg-1.31.tar.gz to your hard drive (probably best to make a new directory).
• Untar/unzip these files.
• To install the DBI…
– Change to the DBI-1.40 directory created by unzipping the file above.
– Type:

perl Makefile.PL

– You can ignore messages relating to Windows users and make since we are using Cygwin.
– Type:

make
make test
make install

– Watch for errors! (Some tests may not work.)


Adding the PostgreSQL DBD.

• Untar/unzip the DBD-Pg-1.31.tar.gz file.

• Perform the same steps as for the DBI…

perl Makefile.PL
make
make test
make install


Testing the PostgreSQL DBD

• pg_test.pl is in the downloaded course files (and on the CD).

• Edit the pg_test.pl file using Notepad (or another editor).

• You will see lines of Perl code setting the database name, the user name and password. Check that these match your installation.

• At Windows command line type: perl pg_test.pl

• Should get 20 lines of results from the ath1_gene table.


Scripting and SQL Parameters

• Suppose you want to update a row in a table with some text values, e.g.

UPDATE ath1_gene SET tigr_annotation = 'Hypothetical protein' WHERE tair_id = 'At1g01010'

• You may want to do this many times (with different values for the annotation and the gene identifier).
• You could rebuild the query each time you run it to include the new text values.
• But provision is made for parameters in an SQL statement. These are represented by a ‘?’ character, e.g.

UPDATE ath1_gene SET tigr_annotation = ? WHERE tair_id = ?

• This statement is “prepared” and then “executed”. In the execution step we provide values for the parameters. (PostgreSQL does not support prepare, so the API fakes it. MySQL will support prepared statements in version 4.1.)
• Values are provided in an array which must be in order: the first ? is replaced by the first entry in the array, the second ? by the second entry, etc.
• You do not have to provide quotes round string values.
– This is very useful since it means that you do not have to check all your strings for embedded single quote characters.
• Very common to see:

INSERT INTO table_name VALUES (?,?,?,?,?,?)

(All values are provided as parameters.)
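A minimal sketch of the parameter mechanism described above. It is written in Python with the standard-library sqlite3 module rather than the Perl DBI, purely so that it runs without a database server; the cut-down ath1_gene table and the second gene's annotation are assumptions made for illustration.

```python
import sqlite3

# In-memory database with a cut-down, hypothetical version of ath1_gene.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ath1_gene (tair_id TEXT PRIMARY KEY, tigr_annotation TEXT)")
conn.executemany(
    "INSERT INTO ath1_gene VALUES (?, ?)",
    [("At1g01010", None), ("At1g01020", None)],
)

# One statement text, reused with different parameter values.
# Values are matched to the '?' markers in order.
update_sql = "UPDATE ath1_gene SET tigr_annotation = ? WHERE tair_id = ?"
new_annotations = [
    ("Hypothetical protein", "At1g01010"),
    ("Protein with a 5' leader", "At1g01020"),  # embedded single quote, no escaping needed
]
for annotation, tair_id in new_annotations:
    conn.execute(update_sql, (annotation, tair_id))
conn.commit()

annotations = dict(conn.execute("SELECT tair_id, tigr_annotation FROM ath1_gene"))
print(annotations)
```

The statement text never changes; only the tuple of values does, which is what makes the embedded quote harmless.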


The Perl DBI: Database Handles

• use DBI;
• $dbh = DBI->connect($url, $user, $password);
– Use a connection URL.
    • DBI:<driver>:<options>
    • The format of <options> depends on the driver being used.
    • Returns a database handle (represented as $dbh in the following).
• $sth = $dbh->prepare("…");
– Create a statement handle from SQL text.
– You have to do this before you can execute a SELECT statement. (For other statements you can use “do”.)
• $num = $dbh->do("…");
– Useful for non-SELECT statements.
– Returns number of rows affected.


The Perl DBI: Statement Handles

• $sth->execute();
$sth->execute(@params);
– Execute a prepared statement.
– An array of parameters can be supplied.
– Number of parameters provided should match the number appearing in the SQL statement.
• Check $sth->err after the execute to make sure all is well.
• $sth->rows; will tell you how many rows were affected by the query (for non-SELECT statements only, not reliable for SELECT on all drivers).


Perl DBI: Fetching Rows

• After you have executed a statement, you fetch the result rows from the statement handle:

• @array = $sth->fetchrow_array;
$hash_ref = $sth->fetchrow_hashref;
$array_ref = $sth->fetchrow_arrayref;

• Can also fetch all rows at once.
fetchall_arrayref
fetchall_hashref


Perl DBI: Statement Done

• After you have finished with a statement you should call “finish” to let the API release any resources associated with it.

$sth->finish;
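The connect / prepare / execute / fetch / finish sequence from the last few slides can be sketched end to end. This is an analogy in Python's standard sqlite3 module, not Perl DBI code: the cursor only approximately plays the role of a statement handle, and the table contents and the chromosome column are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")            # roughly DBI->connect
conn.execute("CREATE TABLE ath1_gene (tair_id TEXT, chromosome INTEGER)")
conn.executemany(
    "INSERT INTO ath1_gene VALUES (?, ?)",
    [("At1g01010", 1), ("At1g01020", 1), ("At2g01008", 2)],
)

cur = conn.cursor()                            # plays the role of a statement handle
cur.execute(
    "SELECT tair_id FROM ath1_gene WHERE chromosome = ? ORDER BY tair_id", (1,)
)

first_row = cur.fetchone()                     # one row at a time, like fetchrow_array
remaining_rows = cur.fetchall()                # everything left, like fetchall_arrayref

cur.close()                                    # like $sth->finish
conn.close()                                   # release the connection as well
print(first_row, remaining_rows)
```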


PHP

• Very C/C++-like.
– Easy to pick up if you know C.
– Some similarities to Perl.
– Fewer idiosyncrasies than Perl (in my opinion).
• Replacing Perl for dynamic web sites.
• Some sites use PHP for creating web pages and Perl for background applications.
• Definitely easy to create small sites. Not so sure about large sites (no namespace support).
• I like it (except for the fact that you still have to start every variable name with a ‘$’ sign, yuck!)


PHP Information

• Downloads and documentation at: http://www.php.net

• There is an Apache module mod_php that lets PHP code run efficiently within Apache.


Installing PHP

• Use the “manual install” rather than the installer. (The installer only installs the CGI version.)
– We are not going to be talking about installing a web server. But the code we write in PHP would work just as well from within a web server.
• Downloaded file is php-4.3.6-Win32.zip.
• Unzip this file into C:\.
• You will get a directory named:

c:\php-4.3.6-Win32

• Consider renaming this to c:\php for ease of use!


Installing PHP: Continued

• php4ts.dll must be in your path:
– Put c:\php into your path.
– Alternative is to copy it to somewhere in the path e.g. C:\windows\system32

• Also want C:\php\cli\php.exe in our path.
– Either copy it to C:\php (if you put C:\php in your path).
– Or copy php.exe to C:\windows\system32.
– I renamed mine to phpcli.exe to distinguish it from the CGI version – php.exe.

• Copy php.ini-recommended to C:\windows\php.ini
– Note the name change!


Testing PHP

• There is a file php_test.php in the download directory.

• At a Windows command line type: php php_test.php

• Should get lots of information printed to the console.


Pear DB

• PHP has built-in interfaces to MySQL and PostgreSQL.
• It also has an equivalent of the Perl DBI. This is the Pear DB module.
• At a Windows command line, change directory to C:\php and type:

go-pear

• Let the script update your ini file when it asks.
• The MySQL interface is active by default (on Windows).
• The PostgreSQL interface must be activated by uncommenting the following line in php.ini.

;extension=php_pgsql.dll

– Delete the semi-colon to uncomment.
• Must also make sure PHP can find the PostgreSQL “extension” – set the value of the “extension_dir” option in php.ini to:

c:\php\extensions


Testing the Installation

• In the download directory there are two files: pg_test.php and mysql_test.php.

• Look in them to see that they are very similar to the Perl versions.

• Run them from a Windows command line by typing (one of):

php pg_test.php
php mysql_test.php


Pear DB: Database Handle

• Similar to the Perl DBI database handle.

• Look for details at: http://pear.php.net

• In general check for errors using:

DB::isError($val);

– Where $val is the result from any Pear DB call.



• include "DB.php";
• Get a database handle by connecting to the database using a URL:

$db = DB::connect($url);

• Can prepare and execute a query (just like in the Perl DBI).

$pq = $db->prepare("…");
$res = $db->execute($pq, $parms);

• Don’t forget the error checking.
• $parms is a list of parameters matching any ‘?’ characters in the prepared query.
• Or just use “query”:

$res = $db->query("…", $parms);



• There are also functions letting you execute a query and fetch all result rows, or just one row, or just a single value.
– getAll();
– getRow();
– getOne(); (or getCol();)


Pear DB: Results

• $res = $db->query("…", $parms);
• Can fetch the results as an ordered array (indexed by integers), associative array (indexed by column names), or as a PHP object (members have the same name as the columns).
– With a row fetched as an object ($row), get the value with $row->col_name;

• Default “fetch mode” can be set globally:

$db->setFetchMode(DB_FETCHMODE_OBJECT);

• Can also be set on the fetchRow call:

$row = $res->fetchRow(DB_FETCHMODE_OBJECT);


Pear DB: Tidying Up

• When you have finished with a result set:

$res->free();

• When you have finished talking to the database:

$db->disconnect();


Example Program

• Example script
– In the “microarray” subdirectory of the course downloads.
    • mysql_import.pl, mysql_import.php
    • pg_import.pl, pg_import.php

• Import some microarray data downloaded from the Stanford Microarray Database.

• Data files are in directories named after the experimenters and within “experiment set” subdirectories.


Example Data

• Each data file contains some header lines describing the experimental conditions.

• The column names in these files vary from experimenter to experimenter.

• The number and order of columns in the data files is not fixed.

• The code attempts to find the columns in which we are interested.

• There are 4 data files (from 4 microarrays) containing data for about 170,000 spots.


Example Code

• The example code DROPs an existing table, and recreates it.
• It expects a specific directory structure.
• It then uses INSERT to add new entries to the newly created table.
• With PostgreSQL we use a transaction:
– Without a transaction it runs veeerrrrrrryyyy slllllllooooowwwwwllllyyyy.
• With MySQL:
– Even if we try starting a transaction it doesn’t use one.
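A sketch of the "many INSERTs inside one transaction" pattern, using Python's standard sqlite3 module as a stand-in for the course databases. Outside a transaction each INSERT is typically committed on its own, which is the usual cause of the slowdown; the spot_data table and its readings are invented for the example.

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode, so the
# transaction boundaries below are exactly the ones we write.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE spot_data (spot INTEGER, ch1i REAL, ch2i REAL)")

rows = [(spot, spot * 1.5, spot * 2.5) for spot in range(1000)]  # made-up readings

conn.execute("BEGIN")
try:
    conn.executemany("INSERT INTO spot_data VALUES (?, ?, ?)", rows)
    conn.execute("COMMIT")      # all 1000 rows become permanent together
except sqlite3.Error:
    conn.execute("ROLLBACK")    # on failure none of the rows are kept
    raise

row_count = conn.execute("SELECT COUNT(*) FROM spot_data").fetchone()[0]
print(row_count)
```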


Execution Time

Run Times          MySQL    PostgreSQL    No DB
ASPerl/DBI         92s      -             29s
Cygwin Perl/DBI    -        720s          85s
PHP/Pear DB        215s     175s          20s

No DB = INSERT statement commented out. ASPerl = ActiveState Perl. A “-” marks a combination with no figure given.


Execution Time: Meaning

• A horrible result for PostgreSQL and the Perl DBI under Cygwin.
– But the PostgreSQL and PHP result is good.
• So it isn’t the database itself that is slow.
– Also the “No DB” version with Cygwin Perl is bad, but not horrible.
– Have to point the finger at the Perl DBI under Cygwin – but should investigate further.
• Looks like the MySQL interface under PHP is not as good as it could be (but it isn’t too bad).


User Rights Assignment: GRANT

• GRANT
– Grantable privileges are: SELECT, INSERT, UPDATE, DELETE, REFERENCES, USAGE
• GRANT SELECT ON ath1_gene TO PUBLIC;
– Lets anyone read from the ath1_gene table.
• GRANT INSERT,UPDATE,DELETE ON ath1_gene TO chris;
– Lets a user called ‘chris’ make changes to the table.
• GRANT ALL PRIVILEGES ON ath1_gene TO chris;
– Lets user chris do anything with table ath1_gene.
• GRANT …..TO chris WITH GRANT OPTION;
– Allows user chris to grant other users privileges on the table.


User Rights Assignment: REVOKE

• To remove a privilege from a user:
– REVOKE INSERT,UPDATE,DELETE ON ath1_gene FROM chris;


Managing Users in MySQL

• MySQL extends the GRANT syntax considerably.

• Uses the GRANT command to create users as well as manage privileges.


Managing Users in PostgreSQL

• CREATE USER …

• GRANT is standard SQL.


Keeping Your Database Efficient

• Both MySQL and PostgreSQL tables tend to become fragmented over time (especially if lots of updates are made).

• Both databases provide mechanisms for tidying up.


PostgreSQL VACUUM

• VACUUM [FULL] [ANALYZE];
– VACUUM defragments the database.
– FULL returns space to the disk drive.
– ANALYZE updates PostgreSQL’s statistics (helping the query optimizer give good results).


MySQL OPTIMIZE TABLE

• OPTIMIZE TABLE table_name;
– Can be used to optimize some types of table.
– (I haven’t talked about the different types of MySQL table!)


Session 4

Database Design


Database Design

1. Comments on last session.

2. Entities and Relationships.

3. Normalization.

4. Examples.


Loading Data 1

• Last time we saw a script that used the INSERT statement to load data into a microarray data table.

• We also saw that the PostgreSQL/Cygwin/Perl DBI combination was quite slow at this.

• We could have parsed our microarray data files into plain tab-delimited text files and then used the PostgreSQL COPY command (or the MySQL LOAD DATA command) as we did in the ath1_gene example.

• This would have been faster.


Loading Data 2

• We saw the difference between MySQL and PostgreSQL when loading the microarray data.

• PostgreSQL was loading each array within a transaction. MySQL was not.

• One technique to prevent getting partial data into the table would be to first load the data to a temporary table.
– Then when the temporary table holds the data for one array, copy it to the permanent table using INSERT INTO table_name SELECT … syntax.
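The staging-table idea can be sketched as follows, again in Python with the standard sqlite3 module and with autocommit switched on to mimic a database that is not wrapping the load in a transaction. The table names, the load_array helper and the "None means a bad reading" convention are all assumptions made for the illustration.

```python
import sqlite3

# Autocommit mode: every statement is committed on its own, so a failure
# half-way through a load cannot be undone with a rollback.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE microarray_data (array_id INTEGER, spot INTEGER, intensity REAL)")
conn.execute("CREATE TEMPORARY TABLE staging (array_id INTEGER, spot INTEGER, intensity REAL)")

def load_array(conn, array_id, readings):
    """Load one array into staging, then copy it across in a single statement."""
    conn.execute("DELETE FROM staging")
    for spot, intensity in readings:
        if intensity is None:                       # stand-in for a parse failure
            raise ValueError(f"bad reading for spot {spot}")
        conn.execute("INSERT INTO staging VALUES (?, ?, ?)", (array_id, spot, intensity))
    # Only reached when the whole array was staged successfully.
    conn.execute("INSERT INTO microarray_data SELECT * FROM staging")

load_array(conn, 1, [(1, 10.0), (2, 12.5)])
try:
    load_array(conn, 2, [(1, 11.0), (2, None)])     # fails half-way through
except ValueError:
    pass

loaded_arrays = [r[0] for r in conn.execute(
    "SELECT DISTINCT array_id FROM microarray_data ORDER BY array_id")]
total_rows = conn.execute("SELECT COUNT(*) FROM microarray_data").fetchone()[0]
print(loaded_arrays, total_rows)
```

The partial rows for the failed array stay behind in the staging table, where they do no harm and are cleared by the next load.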


Perl and PHP “Standards”

• Last time someone asked whether Perl and PHP are standardized.

• They aren’t – but there is only (currently) one source for each language.

• So they are effectively standard.

• They do tend to change considerably from version to version.


Entities

• An entity is anything for which we would like to store some data.
– A customer in a store.
– A customer order.
– Microarray experiment.
– An individual microarray.
– A tree.
– …


Relationships

• Some entities are logically associated with other entities.
– E.g. an individual microarray belongs to a specific microarray experiment.

• We say that there is a relationship between the microarray entity and the microarray experiment entity.


ER Diagrams

• We document entities and relationships using an entity-relationship diagram.

• There are a number of different conventions used for drawing these diagrams: Chen, “Information Engineering”.

• Doesn’t really matter which you use.


Example ER Diagram


ER Diagrams

• When creating an ER diagram you are supposed to be documenting the entities and relationships in your domain of interest.

• But the entities are (more-or-less) going to become tables in your database.
– There isn’t a one-one mapping here. We’ll see examples later.


ER Diagram Tools

• Couldn’t find a decent free one.
• DeZign for Databases from Datanamic.
– Supports many different databases.
– Easy to use.
– Generates the SQL DDL code for your database.
– Not too expensive (but not really cheap).
• Visio can do ER-diagrams.
– But won’t generate code for you.

• You can always use a piece of paper and write the code yourself!


Relationship Types

• There are 3 basic types of relationship.
– One-to-one.
    • People (in the US) and Social Security Numbers (trivial).
    • A store is managed by a single person.
– One-to-many.
    • Trees planted on plots of land. Each plot of land can have many trees, but each tree is on just one plot.
– Many-to-many.
    • Genes and microarrays. A gene can appear on many microarrays and each microarray contains many genes.


One-to-Many Relationships

• Very common.

• We have already seen one:
– ath1_gene, ath1_feat
– Each gene has many features, each feature belongs to just one gene.
– Represented in the database by including a gene id in the feature table.


Many-to-Many Relationships

• Can’t be represented directly between 2 tables.
– Would need multiple gene entries per gene!
– Or multiple feature ids in a gene record.
• Solution is to use a third table to represent the relationship.
– Third table contains rows with (essentially) two columns: the primary keys from each of the related tables.
– The entries in this table are known as composite entities.

• If you are using a design tool (such as DeZign) it may do this for you.


Relationship Data

• Sometimes you will need to add extra attributes to the composite entity.
– E.g. In a store you have items and orders for those items. There is a many-many relationship between orders and items. Where do you keep the number of items being ordered?
• Add it to the composite entity created for the many-many relationship.

• You might have modelled this as a “line item” anyway.
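A sketch of a many-to-many relationship carried by a third table, with the quantity stored on the composite entity. It uses Python's standard sqlite3 module; the orders and order_item tables, their columns and the customer names are assumptions for illustration (the item numbers are borrowed from the item_table example in session 1).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item   (item_no TEXT PRIMARY KEY, description TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
    -- The composite entity: one row per (order, item) pair,
    -- carrying the extra attribute that belongs to the relationship.
    CREATE TABLE order_item (
        order_id INTEGER REFERENCES orders(order_id),
        item_no  TEXT    REFERENCES item(item_no),
        quantity INTEGER NOT NULL,
        PRIMARY KEY (order_id, item_no)
    );
    INSERT INTO item VALUES ('011654', 'Mug'), ('011665', 'Cup');
    INSERT INTO orders VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO order_item VALUES (1, '011654', 4), (1, '011665', 2), (2, '011654', 1);
""")

# How many mugs are on order in total, across all orders?
mugs_ordered = conn.execute(
    "SELECT SUM(quantity) FROM order_item WHERE item_no = ?", ("011654",)
).fetchone()[0]

# Which items are on order 1?
items_on_order_1 = [row[0] for row in conn.execute(
    "SELECT i.description FROM order_item oi JOIN item i USING (item_no) "
    "WHERE oi.order_id = ? ORDER BY i.description", (1,))]
print(mugs_ordered, items_on_order_1)
```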


Choosing Primary Keys

• Once you have determined your entities you should look at which columns can be used as primary keys.
– (In practice you won’t do this as a separate step, you’ll be doing it as you go along.)
• As a rule-of-thumb avoid primary keys that contain meaning.
– Keys with embedded meaning have a tendency to change, causing problems in your database.
– Especially don’t use any value that is likely to change, e.g. a telephone number seems like it might be a good identifier but changes when someone moves. (On the other hand it may be a good identifier if you are the telephone company!)
• If there is no obvious (non-meaningful) primary key you can add a column that contains an arbitrary (unique) identifier.
– The database system you are using likely provides a feature that will do this for you. In PostgreSQL it is the “serial” type, in MySQL it is the AUTO_INCREMENT column attribute.
• Primary keys can be constructed from multiple columns.
– For some tables all columns may be in the primary key.


Normalization

• There are a number of design rules in the text books. These are given the name “normal forms”.
– First normal form (already mentioned).
– Second normal form.
– Third normal form.
– Boyce-Codd normal form.
– Fourth normal form.
– Fifth normal form.
– Domain-Key normal form.

• A database doesn’t have to be in any of the normal forms in order to be useful.

• The normal forms do help to avoid problems – usually to do with insertion and deletion.

• Sometimes it is OK to break normal form for performance reasons.


First Normal Form

• One column – one value.
– No “repeating groups” (attribute with multiple values for one instance of an entity).
• Example is “child’s name” in a person table.
– People often have more than one child.
– Papers have more than one author.
• Could use a comma-separated string of names in a single column.
– Difficult to update, difficult to search.
• Could add multiple columns to the table: child1, child2, child3.
– This is bad because it limits the maximum number of children allowed, it wastes space for people with no children, it makes queries difficult to write.
• Right way is to use a separate “child” table (one-to-many relationship),
– Or, in this case, put children into the “person” table with a “parent” foreign key.
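A sketch of the separate “child” table approach, in Python with the standard sqlite3 module. The person and child tables, their column names and the sample people are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
    -- One row per child, however many children a person has.
    CREATE TABLE child (
        child_id  INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES person(person_id),
        name      TEXT
    );
    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO child (parent_id, name) VALUES (1, 'Carl'), (1, 'Dee'), (1, 'Eve');
""")

# Searching by child name is a plain equality test, not string surgery.
parents_of_dee = [r[0] for r in conn.execute(
    "SELECT p.name FROM person p JOIN child c ON c.parent_id = p.person_id "
    "WHERE c.name = ?", ("Dee",))]

# People with no children simply have no rows in child.
child_counts = dict(conn.execute(
    "SELECT p.name, COUNT(c.child_id) FROM person p "
    "LEFT JOIN child c ON c.parent_id = p.person_id GROUP BY p.person_id"))
print(parents_of_dee, child_counts)
```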


Second Normal Form

• A relation is in second normal form if: it is in first normal form and all non-key attributes are functionally dependent on the entire primary key (and not on any subset of the primary key).
• Key word is “entire”.
• Also known as “full functional dependence”.
• Like a mathematical function (one input value, one output value).
• We are trying to eliminate attributes which only depend on part of the key.
• Basically says “don’t repeat values in different rows”.


Not Second Normal Form

• Suppose I had decided to put all my Arabidopsis gene information in one table.
• I could have had one row per feature and repeated the gene information for each feature.
• Primary key would then have to include a gene id and a feature id.
• Gene information only depends on the gene id (not on the feature id). So this table would not be in second normal form.
• Repetition of information is one problem (gene details would be in multiple rows).
• I could no longer add a gene with no features (since I would have a NULL as part of the primary key).
• If I wanted to delete a feature (having found out that it was incorrect) I might end up deleting all the gene information too (if this was the gene’s only feature).
• Right way to go is to split genes and features.
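A sketch of the split into gene and feature tables, showing that the two anomalies listed above go away. It uses Python's standard sqlite3 module; the simplified gene and feature tables and their contents are assumptions, not the course's actual ath1_gene and ath1_feat definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (gene_id TEXT PRIMARY KEY, annotation TEXT);
    CREATE TABLE feature (
        feature_id INTEGER PRIMARY KEY,
        gene_id    TEXT REFERENCES gene(gene_id),
        kind       TEXT
    );
    -- A gene with no features is no longer a problem.
    INSERT INTO gene VALUES ('At1g01010', 'Hypothetical protein'),
                            ('At1g01020', 'No features recorded yet');
    INSERT INTO feature VALUES (1, 'At1g01010', 'exon');
""")

# Deleting a gene's only feature leaves the gene details intact.
conn.execute("DELETE FROM feature WHERE feature_id = ?", (1,))
genes_left = [r[0] for r in conn.execute("SELECT gene_id FROM gene ORDER BY gene_id")]
features_left = conn.execute("SELECT COUNT(*) FROM feature").fetchone()[0]
print(genes_left, features_left)
```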


Third Normal Form

• A relation is in third normal form if: it is in second normal form and no non-primary-key attribute is transitively dependent on the primary key.
• Transitive dependency is where: column B depends on column A, and column C depends on column B.

A → B → C

• E.g. In a realtor’s database we have a “property” table that includes some details of the owner.

Property number → Owner name → Owner phone number

• Problem is redundant information. Duplication of owner phone number (if the owner has multiple properties).
• Solution is to split the property table into “property” and “owner” tables.
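A sketch of the property/owner split, in Python with the standard sqlite3 module. The column names, addresses and phone numbers are invented; the point is only that the phone number is stored once per owner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE owner (owner_id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
    CREATE TABLE property (
        property_no INTEGER PRIMARY KEY,
        address     TEXT,
        owner_id    INTEGER REFERENCES owner(owner_id)
    );
    INSERT INTO owner VALUES (1, 'Pat', '555-0100');
    INSERT INTO property VALUES (10, '1 Elm St', 1), (11, '2 Oak Ave', 1);
""")

# The phone number is stored once, so one UPDATE fixes it for every property.
conn.execute("UPDATE owner SET phone = ? WHERE owner_id = ?", ("555-0199", 1))
phones = {r[0] for r in conn.execute(
    "SELECT o.phone FROM property p JOIN owner o USING (owner_id)")}
print(phones)
```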


Third Normal Form

• In Third Normal Form: all non-key fields are dependent on the key, the whole key and nothing but the key.

• Basically means “don’t mix different entities in one table”.

• For most purposes 3rd Normal Form is enough.


Normalization and Tables

• Each level of normalization involves splitting a table into multiple tables.

• You can end up with a lot of tables!

• You can always get back what you started with (using joins).

• Use views to hide the complexity.
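A sketch of using a view to get back the original wide table after normalization, in Python with the standard sqlite3 module. The tables, the view name property_with_owner and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE owner (owner_id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
    CREATE TABLE property (property_no INTEGER PRIMARY KEY, owner_id INTEGER);
    INSERT INTO owner VALUES (1, 'Pat', '555-0100');
    INSERT INTO property VALUES (10, 1), (11, 1);
    -- The view rebuilds the original wide table from the split tables.
    CREATE VIEW property_with_owner AS
        SELECT p.property_no, o.name AS owner_name, o.phone AS owner_phone
        FROM property p JOIN owner o ON o.owner_id = p.owner_id;
""")

# Users of the view never need to write the join themselves.
wide_rows = conn.execute(
    "SELECT property_no, owner_name, owner_phone FROM property_with_owner "
    "ORDER BY property_no").fetchall()
print(wide_rows)
```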


Example: Microarray Database

• Task: design a database to store microarray results for multiple experiments.

• One experiment consists of multiple arrays.
• Assume that we are using the same microarray for all experiments.
• Assume it is a spotted microarray (two dyes per array - two samples hybridized to each array).
• Assume we have a list associating each spot on the array with a gene identifier.
• The results data consists of 2 intensity readings per spot on the array (real results have a lot more information).


Design Process

• Start by listing the obvious entities.
– Choose the nouns in the description.


Add Attributes


Add Relationships

• Experiment -> microarray: 1-N

• Microarray -> spot: 1-N

• Microarray -> sample: N-2

• Spot -> gene: N-1

• In reality we have to choose more explicit limits on the relationships (0-N, 1-N).


First Draft


Small Problem

• On generating the schema we get an extra table “microarray_sample” for the many-to-many relationship from microarrays to samples.

• We notice that we don’t have anywhere that says which dye was used for which sample.
– Could add this to the composite entity (need to make the relationship explicit as a table).


Second Draft


Possible Change

• We might think that the sample-hybridization-microarray chain is getting a little complicated (we haven’t written any queries with it yet).

• Could combine microarray and hybridization into a single table “microarray_hybridization”.

• May depend on whether we have any other information to store in the microarray table e.g. type of slide used, procedure used, …


Oddities

• There’s something funny going on with the spot table.
• There’s no primary key yet.
• We have channel 1 and channel 2 – but no way of matching channels to dyes.
• Channel-dye mapping belongs in the hybridization table.
• Also – is it OK to have columns ch1i and ch2i?
– If we add a translation from dye to channel in the microarray table we would still be left writing awkward queries – we have to select the column name based on the channel.
– On the downside – if we split each spot row into two we double the size of the table (and it’s big). And we violate second normal form.


Third Draft


Other Possible Changes

• Spot table is not in second normal form!

• We assumed that we have a list mapping microarray spots to genes.

• We could save space in the spot table by taking out the gene_id column and adding a spot number column to the gene table.

• Would this work if we had assumed more than one type of microarray?


Domains

• As mentioned in an earlier session you can have the database check that the values in your columns take a limited set of values.

• Prime candidates for this are the dye and channel columns.
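One common way to get this kind of check is a CHECK constraint on the column. A sketch in Python with the standard sqlite3 module; the hybridization columns and the dye names Cy3 and Cy5 are assumptions, since the slides do not say which dyes were used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hybridization (
        array_id INTEGER,
        channel  INTEGER CHECK (channel IN (1, 2)),
        dye      TEXT    CHECK (dye IN ('Cy3', 'Cy5'))
    )
""")
conn.execute("INSERT INTO hybridization VALUES (?, ?, ?)", (1, 1, "Cy3"))

try:
    conn.execute("INSERT INTO hybridization VALUES (?, ?, ?)", (1, 3, "Cy3"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True     # channel 3 is outside the allowed set

stored_rows = conn.execute("SELECT COUNT(*) FROM hybridization").fetchone()[0]
print(rejected, stored_rows)
```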


Referential Integrity

• When we add relationships to the ER diagram that are (strictly) 1 to N, DeZign adds a “constraint” to the table definition.

• This makes the database check insertions for valid references to other entities.

• E.g. when we insert a row into the microarray table the database will check that the experiment_id we give actually exists in the experiment table.
– Means we have to add entries to the database in the correct order: experiments before microarrays.

• You can add these constraints yourself by adding to the CREATE TABLE statement.
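A sketch of a hand-written foreign key constraint and the insertion-order effect it has, in Python with the standard sqlite3 module. The minimal experiment and microarray tables, the try_insert_array helper and the experiment name are assumptions; the PRAGMA is specific to SQLite, which only enforces foreign keys when asked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite-specific: enable enforcement
conn.executescript("""
    CREATE TABLE experiment (experiment_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE microarray (
        array_id      INTEGER PRIMARY KEY,
        experiment_id INTEGER NOT NULL REFERENCES experiment(experiment_id)
    );
""")

def try_insert_array(array_id, experiment_id):
    """Return True if the row was accepted, False if the constraint rejected it."""
    try:
        conn.execute("INSERT INTO microarray VALUES (?, ?)", (array_id, experiment_id))
        return True
    except sqlite3.IntegrityError:
        return False

before = try_insert_array(1, 42)     # experiment 42 does not exist yet
conn.execute("INSERT INTO experiment VALUES (?, ?)", (42, "cold stress"))
after = try_insert_array(1, 42)      # now the reference is valid
print(before, after)
```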


Example Query 1

• Get all the result data for a specific experiment.

• SELECT r.*
FROM microarray m, microarray_spot r
WHERE m.experiment_id = ? AND r.array_id = m.array_id

• SELECT r.*
FROM microarray m JOIN microarray_spot r USING (array_id)
WHERE m.experiment_id = ?

• Would need some interface for selecting the correct experiment id.
– May be a web page that just lists all experiments.
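The two forms of the query should return the same rows. A sketch that checks this on a tiny made-up data set, in Python with the standard sqlite3 module; the spot and intensity columns and all values are assumptions, and an ORDER BY is added only to make the comparison deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE microarray (array_id INTEGER PRIMARY KEY, experiment_id INTEGER);
    CREATE TABLE microarray_spot (array_id INTEGER, spot INTEGER, intensity REAL);
    INSERT INTO microarray VALUES (1, 100), (2, 100), (3, 200);
    INSERT INTO microarray_spot VALUES (1, 1, 5.0), (2, 1, 6.0), (3, 1, 7.0);
""")

comma_join = """
    SELECT r.* FROM microarray m, microarray_spot r
    WHERE m.experiment_id = ? AND r.array_id = m.array_id
    ORDER BY r.array_id, r.spot
"""
using_join = """
    SELECT r.* FROM microarray m JOIN microarray_spot r USING (array_id)
    WHERE m.experiment_id = ?
    ORDER BY r.array_id, r.spot
"""
rows_comma = conn.execute(comma_join, (100,)).fetchall()
rows_using = conn.execute(using_join, (100,)).fetchall()
print(rows_comma == rows_using, rows_comma)
```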


Example Query 2

• Get all the result data for a specific sample.

• SELECT r.*
FROM sample s, hybridization h, microarray m, microarray_spot r
WHERE s.sample_id = ?
AND s.sample_id = h.sample_id
AND h.array_id = m.array_id
AND m.array_id = r.array_id
AND h.channel = r.channel


Example Query 3

• Same as example 2 but add the gene accession identifier to the results.

• SELECT r.*, g.accession
FROM sample s, hybridization h, microarray m, microarray_spot r, gene g
WHERE s.sample_id = ?
AND s.sample_id = h.sample_id
AND h.array_id = m.array_id
AND m.array_id = r.array_id
AND h.channel = r.channel
AND g.gene_id = r.gene_id


Stanford Microarray Database Schema

• Stanford Microarray Database: Schema
– They have included a lot more detail.
    • Multiple arrays and types of array.
    • Distinction between the short piece of cDNA on the array and the gene it represents.
– http://genome-www5.stanford.edu/schema

• More interested in the printing details than the samples hybridized to the array?


General Advice 1

• Create a simple design and try it out.
– Change the things you don’t like.
– If it is your own personal database the cost of changing it may be small (depending on how much code depends on its structure).
– If lots of users have a copy the cost may be high (getting them all safely updated).

• The rules are more like guidelines really.


Exercises

• The microarray data import scripts from session 3 are wrong in that they don’t include any array identifiers in the data table. There is also no primary key in the table.
– Add the array identifier and choose a primary key (change the create table statement to include the key).

• These scripts paid no attention to columns in the input data that indicated whether the spot is “good” or not.
– These columns are:
    • “FAILED” indicating that PCR failed. (0=OK).
    • “IS_CONTAMINATED” indicates that the sample was contaminated (Y/N/U=Yes/No/Unknown?)
    • “FLAG” indicates whether the spot was good or not.
– Add code to take account of these columns to the scripts.


Harder Exercise

• Update (one of) the scripts to use the microarray data schema designed in this session.

• Design your own database!
