Showing posts with label RDBMS. Show all posts

Friday, July 27, 2012

Understanding JOINs in MySQL

“JOIN” is a SQL keyword used to query data from two or more related tables. Unfortunately, the concept is often explained in abstract terms, and its details differ between database systems. It often confuses me. Developers cope with enough confusion, so this is my attempt to explain JOINs succinctly to myself and anyone who’s interested.

Related Tables

MySQL, PostgreSQL, Firebird, SQLite, SQL Server and Oracle are relational database systems. A well-designed database will provide a number of tables containing related data. A very simple example would be users (students) and course enrollments:

‘user’ table:

| id | name     | course |
|----|----------|--------|
| 1  | Alice    | 1      |
| 2  | Bob      | 1      |
| 3  | Caroline | 2      |
| 4  | David    | 5      |
| 5  | Emma     | (NULL) |

MySQL table creation code:

    CREATE TABLE `user` (
        `id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
        `name` varchar(30) NOT NULL,
        `course` smallint(5) unsigned DEFAULT NULL,
        PRIMARY KEY (`id`)
    ) ENGINE=InnoDB;

The course number relates to a subject being taken in a course table…

‘course’ table:

| id | name       |
|----|------------|
| 1  | HTML5      |
| 2  | CSS3       |
| 3  | JavaScript |
| 4  | PHP        |
| 5  | MySQL      |

MySQL table creation code:

    CREATE TABLE `course` (
        `id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
        `name` varchar(50) NOT NULL,
        PRIMARY KEY (`id`)
    ) ENGINE=InnoDB;

Since we’re using InnoDB tables and know that user.course and course.id are related, we can specify a foreign key relationship:

    ALTER TABLE `user`
    ADD CONSTRAINT `FK_course`
    FOREIGN KEY (`course`) REFERENCES `course` (`id`)
    ON UPDATE CASCADE;

In essence, MySQL will automatically:

  • re-number the associated entries in the user.course column if the course.id changes
  • reject any attempt to delete a course where users are enrolled.
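
Both behaviours can be tried out against the sample data. The following is only a sketch: the new id value 10 is an arbitrary choice for illustration, and everything is wrapped in a transaction that is rolled back so the sample rows stay as shown in the tables.

```sql
START TRANSACTION;

-- Renumber the HTML5 course. ON UPDATE CASCADE propagates the change,
-- so Alice and Bob end up with user.course = 10.
UPDATE `course` SET `id` = 10 WHERE `id` = 1;
SELECT `name`, `course` FROM `user` WHERE `course` = 10;

-- Try to delete a course that still has students. No ON DELETE action was
-- given, so the default (restrict) applies and MySQL rejects this statement
-- with a foreign key constraint error.
DELETE FROM `course` WHERE `id` = 10;

-- Deleting a course nobody is enrolled on (PHP) is allowed.
DELETE FROM `course` WHERE `id` = 4;

-- Undo everything so the sample data is unchanged.
ROLLBACK;
```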

Important: this is terrible database design!

This database is too restrictive for real use. It’s fine for this example, but a student can only be enrolled on zero or one course. A real system would need to overcome this restriction, probably using an intermediate ‘enrollment’ table which maps any number of students to any number of courses.
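
For reference, such an intermediate table might look like the sketch below. The table name, column names and constraint names are assumptions made for illustration; they are not used anywhere else in this article, and the user.course column would become redundant once a table like this is in place.

```sql
-- Hypothetical junction table: one row per (student, course) pair.
CREATE TABLE `enrollment` (
    `user_id`   smallint(5) unsigned NOT NULL,
    `course_id` smallint(5) unsigned NOT NULL,
    -- The composite key stops the same student enrolling twice on one course.
    PRIMARY KEY (`user_id`, `course_id`),
    CONSTRAINT `FK_enrollment_user`
        FOREIGN KEY (`user_id`) REFERENCES `user` (`id`)
        ON UPDATE CASCADE ON DELETE CASCADE,
    CONSTRAINT `FK_enrollment_course`
        FOREIGN KEY (`course_id`) REFERENCES `course` (`id`)
        ON UPDATE CASCADE
) ENGINE=InnoDB;

-- Alice (id 1) can now take both HTML5 (id 1) and CSS3 (id 2).
INSERT INTO `enrollment` (`user_id`, `course_id`) VALUES (1, 1), (1, 2);
```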

JOINs allow us to query this data in a number of ways.


The most frequently used clause is INNER JOIN. This produces a set of records which match in both the user and course tables, i.e. all users who are enrolled on a course:

    SELECT user.name, course.name
    FROM `user`
    INNER JOIN `course` ON user.course = course.id;

Result:

| user.name | course.name |
|-----------|-------------|
| Alice     | HTML5       |
| Bob       | HTML5       |
| Caroline  | CSS3        |
| David     | MySQL       |



What if we require a list of all students and their courses even if they’re not enrolled on one? A LEFT JOIN produces a set of records which matches every entry in the left table (user) regardless of any matching entry in the right table (course):

    SELECT user.name, course.name
    FROM `user`
    LEFT JOIN `course` ON user.course = course.id;

Result:

| user.name | course.name |
|-----------|-------------|
| Alice     | HTML5       |
| Bob       | HTML5       |
| Caroline  | CSS3        |
| David     | MySQL       |
| Emma      | (NULL)      |
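
A common use of this behaviour is to find rows with no match at all. As a small sketch using the same two tables, filtering on a NULL course.id keeps only the students who are not enrolled on any course:

```sql
-- course.id is the primary key, so it can only be NULL in the joined
-- result when the LEFT JOIN found no matching course row.
SELECT user.name
FROM `user`
LEFT JOIN `course` ON user.course = course.id
WHERE course.id IS NULL;
```

With the sample data this returns a single row, Emma.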


Perhaps we require a list of all courses and students even if no one has been enrolled? A RIGHT JOIN produces a set of records which matches every entry in the right table (course) regardless of any matching entry in the left table (user):

    SELECT user.name, course.name
    FROM `user`
    RIGHT JOIN `course` ON user.course = course.id;

Result:

| user.name | course.name |
|-----------|-------------|
| Alice     | HTML5       |
| Bob       | HTML5       |
| Caroline  | CSS3        |
| (NULL)    | JavaScript  |
| (NULL)    | PHP         |
| David     | MySQL       |

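RIGHT JOINs are rarely used since you can express the same result using a LEFT JOIN with the table order swapped. MySQL converts right joins to equivalent left joins when it processes the query, so the two forms should perform the same; the LEFT JOIN version is simply the more common way to write it: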

    SELECT user.name, course.name
    FROM `course`
    LEFT JOIN `user` ON user.course = course.id;

We could, for example, count the number of students enrolled on each course:

    SELECT course.name, COUNT(user.name)
    FROM `course`
    LEFT JOIN `user` ON user.course = course.id
    GROUP BY course.id;

Result:

| course.name | COUNT(user.name) |
|-------------|------------------|
| HTML5       | 2                |
| CSS3        | 1                |
| JavaScript  | 0                |
| PHP         | 0                |
| MySQL       | 1                |
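
It matters which expression is counted. COUNT(column) ignores NULLs, whereas COUNT(*) counts every joined row, including the NULL-padded row a LEFT JOIN produces for a course with no students. The sketch below puts the two side by side; the aliases students and joined_rows are names chosen for illustration.

```sql
SELECT course.name,
       COUNT(user.id) AS students,     -- skips the NULL-padded rows
       COUNT(*)       AS joined_rows   -- counts them as one row each
FROM `course`
LEFT JOIN `user` ON user.course = course.id
GROUP BY course.id;
```

With the sample data, JavaScript and PHP show 0 students but 1 joined row, so COUNT(*) would wrongly suggest each has one student.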





Our last option is the OUTER JOIN (FULL OUTER JOIN in standard SQL; LEFT and RIGHT JOINs are themselves one-sided outer joins), which returns all records in both tables regardless of any match. Where no match exists, the missing side will contain NULL.

OUTER JOIN is less useful than INNER, LEFT or RIGHT and it’s not implemented in MySQL. However, you can work around this restriction using the UNION of a LEFT and RIGHT JOIN, e.g.

    SELECT user.name, course.name
    FROM `user`
    LEFT JOIN `course` ON user.course = course.id
    UNION
    SELECT user.name, course.name
    FROM `user`
    RIGHT JOIN `course` ON user.course = course.id;

Result:

| user.name | course.name |
|-----------|-------------|
| Alice     | HTML5       |
| Bob       | HTML5       |
| Caroline  | CSS3        |
| David     | MySQL       |
| Emma      | (NULL)      |
| (NULL)    | JavaScript  |
| (NULL)    | PHP         |
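
One caveat with the UNION approach: UNION removes duplicate rows, so two different students who share a name and a course would collapse into a single row. A variation, sketched here against the same tables, uses UNION ALL and restricts the second query to the courses the first query cannot return:

```sql
-- Part 1: every student, with their course if they have one.
SELECT user.name, course.name
FROM `user`
LEFT JOIN `course` ON user.course = course.id
UNION ALL
-- Part 2: only the courses with no students at all.
SELECT user.name, course.name
FROM `user`
RIGHT JOIN `course` ON user.course = course.id
WHERE user.id IS NULL;
```

With the sample data this returns the same seven rows as the UNION version, because no two students share both a name and a course.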

I hope that gives you a better understanding of JOINs and helps you write more efficient SQL queries.

Tuesday, June 26, 2012

Codd's Rules (Defining an RDBMS)


Rule 0: The foundation rule:

The system must qualify as relational, as a database, and as a management system.
For a system to qualify as a relational database management system (RDBMS), that system must use its relational facilities (exclusively) to manage the database.


Rule 1: The information rule:
All information in the database is to be represented in only one way, namely by values in column positions within rows of tables.


Rule 2: The guaranteed access rule:

All data must be accessible. This rule is essentially a restatement of the fundamental requirement for primary keys. It says that every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column and the primary key value of the containing row.


Rule 3: Systematic treatment of null values:
The DBMS must allow each field to remain null (or empty). Specifically, it must support a representation of "missing information and inapplicable information" that is systematic, distinct from all regular values (for example, "distinct from zero or any other number", in the case of numeric values), and independent of data type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.
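
As a rough illustration of what "systematic" means in practice, SQL treats NULL as distinct from every value, including another NULL. The sketch below assumes MySQL; the column aliases are names chosen for illustration.

```sql
SELECT NULL = NULL            AS equals_null,        -- NULL, not 1
       NULL <> 0              AS differs_from_zero,  -- NULL, not 1
       NULL IS NULL           AS is_null,            -- 1
       COALESCE(NULL, 'none') AS fallback;           -- 'none'
```

Comparisons involving NULL yield NULL rather than true or false, which is why the dedicated IS NULL test exists.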


Rule 4: Active online catalog based on the relational model:
The system must support an online, inline, relational catalog that is accessible to authorized users by means of their regular query language. That is, users must be able to access the database's structure (catalog) using the same query language that they use to access the database's data.
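
In MySQL, for instance, the catalog is exposed through the information_schema database and can be queried with an ordinary SELECT. This sketch assumes the current database contains a table named `user`; any table name would do.

```sql
-- List the columns of a table using the same language used for data.
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = DATABASE()
  AND table_name = 'user'
ORDER BY ordinal_position;
```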


Rule 5: The comprehensive data sublanguage rule:
The system must support at least one relational language that

1.    Has a linear syntax,

2.    Can be used both interactively and within application programs,

3.    Supports data definition operations (including view definitions), data manipulation operations (update as well as retrieval), security and integrity constraints, and transaction management operations (begin, commit, and rollback).


Rule 6: The view updating rule:
All views that are theoretically updatable must be updatable by the system.
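
A simple case can be sketched in MySQL, which accepts updates through a view that maps directly onto the rows of one table. The view name `enrolled_user` is hypothetical, and the example assumes a `user` table with `id`, `name` and a nullable `course` column. Real systems meet this rule only partially; MySQL, for example, rejects updates through views that use aggregates or DISTINCT.

```sql
-- A view over a single table with no aggregation is updatable in MySQL.
CREATE VIEW `enrolled_user` AS
    SELECT `id`, `name`, `course`
    FROM `user`
    WHERE `course` IS NOT NULL;

-- The update is applied to the underlying `user` row.
UPDATE `enrolled_user` SET `name` = 'Caroline B.' WHERE `id` = 3;

-- The change is visible in the base table.
SELECT `name` FROM `user` WHERE `id` = 3;
```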


Rule 7: High-level insert, update, and delete:
The system must support set-at-a-time insert, update, and delete operators. This means that data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. This rule states that insert, update, and delete operations should be supported for any retrievable set rather than just for a single row in a single table.
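
A single statement acting on a whole set might look like the sketch below, written in MySQL's multi-table UPDATE syntax. It assumes a `user` table whose nullable `course` column refers to the `id` of a `course` table, and it unenrolls every student on the course named HTML5 at once.

```sql
-- The set to update is defined by a join, not by naming rows one at a time.
UPDATE `user`
JOIN `course` ON user.course = course.id
SET user.course = NULL
WHERE course.name = 'HTML5';
```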


Rule 8: Physical data independence:
Changes to the physical level (how the data is stored, whether in arrays or linked lists etc.) must not require a change to an application based on the structure.


Rule 9: Logical data independence:
Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.


Rule 10: Integrity independence:
Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications.


Rule 11: Distribution independence:
The distribution of portions of the database to various locations should be invisible to users of the database. Existing applications should continue to operate successfully:

  • when a distributed version of the DBMS is first introduced; and
  • when existing distributed data are redistributed around the system.


Rule 12: The nonsubversion rule:
If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, for example, bypassing a relational security or integrity constraint.