Database Design

3.2.5.1. Database Design#

In this section, we’ll explore the fundamentals of database design in the SQL world.

At a high level, an SQL database can be thought of as a collection of tables stored together in a structured format. Each table has columns with fixed data types and rows representing individual records. Tables can also have defined relationships between them, allowing us to model complex, real-world data.

Table#

A SQL table is a structure used to store tabular data. Each:

Row represents a single record (observation)
Column represents a feature (field) with a defined data type
Columns may also have constraints like NOT NULL, UNIQUE, or PRIMARY KEY to enforce data integrity

Keys#

A pivotal concept in SQL is the idea of keys. Keys are used to uniquely identify rows in a table and define relationships between tables. There are several types of keys, each serving a specific purpose.

Primary Key#

A primary key is a column (or a set of columns) that uniquely identifies each record in a table. Key characteristics:

Must be unique for every record
Cannot contain NULL values
A table can have only one primary key
Commonly implemented using an id column

The primary key is often an auto-incrementing integer (INTEGER PRIMARY KEY AUTOINCREMENT in SQLite) or a UUID. While it’s possible to use strings as primary keys (like usernames or email addresses), numeric keys are usually more efficient.

Candidate Key#

A candidate key is any column (or set of columns) that can uniquely identify a record. Among all candidate keys, one is chosen as the primary key. Others remain as theoretical alternatives.

There is no SQL syntax to explicitly declare candidate keys - it’s a design concept rather than a database directive.

Example:#

Imagine a users table with:

user_id (unique system-generated ID)
email (unique, required for login)
phone (unique for verification)

All three can uniquely identify a user, so they are candidate keys. One (e.g., user_id) is chosen as the primary key.

Foreign Key#

A foreign key is a column in one table that refers to the primary key of another table. Foreign keys create relationships between tables and are essential for maintaining referential integrity.

They help model relationships such as:

One-to-many: One department has many employees
Many-to-one: Many orders belong to one customer
Many-to-many: Students enrolled in multiple courses (via a join table)

Normalization#

Normalization is the concept that helps reduce data redundancy and ensure data consistency.
It is based on the principle of having a single source of truth for each piece of information, and referring to that true value via links (foreign keys) instead of duplicating it.

Let’s look at an example.

A university has a list of professors.
Think of this as a table called Professors with each row containing details like:

Name
Age
Department

There is also another table that logs which professor taught which course in each semester.
This table stores details such as:

Course name
Semester
Year
Professor name

Imagine that Professor John Smyth had a typo in his name. Later, the Professors table is corrected to John Smith.
However, in the course_professor table, the old name John Smyth remains unchanged.

This inconsistency arises because the name was duplicated rather than linked.

Normalization solves this by avoiding redundant copies of data.
Instead of storing the professor name in course_professor, you would only store a foreign key reference to the Professors table.

Before normalization (redundant):

-- Professors table
| ProfessorID | Name          | Age |
|-------------|---------------|-----|
| 101         | John Smyth    | 45  |

-- course_professor table
| Course       | Semester | Year | ProfessorName |
|--------------|----------|------|---------------|
| Data Science | Fall     | 2023 | John Smyth    |

After normalization:

-- Professors table
| ProfessorID | Name         | Age |
|-------------|--------------|-----|
| 101         | John Smith   | 45  |

-- course_professor table
| Course       | Semester | Year | ProfessorID |
|--------------|----------|------|-------------|
| Data Science | Fall     | 2023 | 101         |

And you would join to the Professors table when you need the name:

SELECT c.Course, c.Semester, c.Year, p.Name
FROM course_professor c
JOIN Professors p ON c.ProfessorID = p.ProfessorID;

Note

With normalization comes the added complexity of querying multiple tables (joins), and in some systems this can also impact performance. This ties into the famous CAP theorem, which in broad terms states that distributed systems must trade off between:

Consistency (always up-to-date and accurate)
Availability (always responsive)
Partition Tolerance (resilient to network issues)

Deciding how much to normalize often depends on your system’s requirements and how you balance these trade-offs.

We won’t cover the details of normalization in this course, but you can read more about it here: Wikipedia – Database Normalization

Indexes#

When you are reading a textbook, you have two options:

Read it sequentially from start to finish
Jump directly to a topic you remember by looking it up in the table of contents (the index)

Similarly, in databases, an index works like a lookup guide. Indexes help speed up queries by using efficient data structures (such as B-trees) that allow fast retrieval of rows without scanning the entire table.

As a data scientist, it is helpful to know which columns have indexes, so you can tailor your queries for better performance.

We will see the performance boost in Example: Measuring Query Speed with and without an Index.

Indexes are one of the most important tools to optimize database performance. However:

They consume extra disk space.
They can slow down INSERT and UPDATE operations (because the index must be updated whenever data changes).

Always consider whether the speedup for reads outweighs the cost of maintaining the index.