SQL Database Optimization| Normalization
Normalization
Normalization is a fundamental concept in database design aimed at organizing data in a database efficiently to reduce redundancy and improve data integrity. It involves structuring tables and their relationships according to a set of rules known as "normal forms."
We can assume it as organizing your digital files on a computer. Instead of having the same document saved in multiple folders, you keep one master copy and create shortcuts or links to it from other locations.
Why is Normalization Important?
Reduces Data Redundancy: Avoids storing the same information in multiple places, saving storage space and reducing the chance of inconsistencies.
Improves Data Integrity: Ensures that data is accurate and consistent across the database. Changes to a piece of information only need to be made in one place.
Eliminates Data Anomalies: Prevents issues that can arise during data modification,
such as:
Insertion Anomalies: Difficulty in adding new data without also adding redundant information.
Update Anomalies: Needing to update the same information in multiple rows.
Deletion Anomalies: Unintentionally losing related data when deleting a record.
Simplifies Database Design: Provides a structured approach to organizing data, making the database easier to understand and maintain.
Enhances Query Performance: Well-normalized databases often lead to more efficient querying.
Supports Scalability: Makes it easier to adapt the database to future needs.
Types of Normalization (Normal Forms):
Normalization is a progressive process, with each normal form building upon the previous one. The most common normal forms are:
First Normal Form (1NF):
- Each column in a table contains atomic (indivisible) values.
- There are no repeating groups of columns within a table.
- Each row is uniquely identified by a primary key.
Original Table (Not in 1NF): Students
StudentID | StudentName | CoursesEnrolled |
---|---|---|
101 | Alice | Math, Physics |
102 | Bob | Chemistry |
103 | Carol | History, English, Art |
Normalized Table (in 1NF): Students
StudentID | StudentName |
---|---|
101 | Alice |
102 | Bob |
103 | Carol |
Normalized Table (in 1NF): StudentCourses
EnrollmentID | StudentID | CourseName |
---|---|---|
1 | 101 | Math |
2 | 101 | Physics |
3 | 102 | Chemistry |
4 | 103 | History |
5 | 103 | English |
6 | 103 | Art |
By breaking down the multi-valued Courses Enrolled into individual rows in a separate table, we've successfully brought the data into First Normal Form, laying the groundwork for further normalization if needed
Second Normal Form (2NF):
- The table must already be in 1NF.
- All non-key attributes (columns that are not part of the primary key) must be fully functionally dependent on the entire primary key. This means a non-key attribute should depend on all parts of a composite primary key, not just a part of it.
Original Table (in 1NF, but Not in 2NF): OrderDetails
This table stores details about customer orders, including the product details.
OrderID | ProductID | ProductName | ProductPrice | Quantity |
---|---|---|---|---|
1 | 101 | Laptop | 1200.00 | 1 |
1 | 102 | Mouse | 25.00 | 2 |
2 | 101 | Laptop | 1200.00 | 1 |
3 | 103 | Keyboard | 75.00 | 1 |
Normalizing to 2NF:
To bring this table into 2NF, we need to remove the partial dependencies by creating new tables for the attributes that depend only on a part of the primary key.
Normalized Tables (in 2NF):
- OrderItems (Contains the full functional dependency)
OrderID | ProductID | Quantity |
---|---|---|
1 | 101 | 1 |
1 | 102 | 2 |
2 | 101 | 1 |
3 | 103 | 1 |
-
Primary Key: (OrderID, ProductID)
-
Non-Key Attribute: Quantity (fully dependent on (OrderID, ProductID))
- Products (Contains the partial dependencies)
ProductID | ProductName | ProductPrice |
---|---|---|
101 | Laptop | 1200.00 |
102 | Mouse | 25.00 |
103 | Keyboard | 75.00 |
-
Primary Key: ProductID
-
Non-Key Attributes: ProductName, ProductPrice (both fully dependent on ProductID)
By doing this, we eliminate redundancy (e.g., "Laptop" and "1200.00" are no longer repeated for every order that includes a laptop) and improve data integrity. If the price of a laptop changes, we only need to update it in one place (the Products table) instead of potentially many rows in the OrderDetails table, thus preventing update anomalies.
Third Normal Form (3NF):
The table must already be in 2NF.
All non-key attributes are not transitively dependent on the primary key. In other words, a non-key attribute should depend directly on the primary key and not on another non-key attribute.
Transitive Dependency: A transitive dependency exists when a non-key attribute is indirectly dependent on the primary key through another non-key attribute. If we have A→B and B→C, then A→C is a transitive dependency. 3NF aims to eliminate this type of dependency.
Original Table (in 2NF, but Not in 3NF): Employees
This table stores employee information, including their department details.
EmployeeID | EmployeeName | DepartmentName | DepartmentLocation |
---|---|---|---|
1 | Alice | HR | New York |
2 | Bob | Sales | London |
3 | Carol | HR | New York |
4 | David | Marketing | Paris |
Normalized Tables (in 3NF):
- Employees
EmployeeID | EmployeeName | DepartmentName |
---|---|---|
1 | Alice | HR |
2 | Bob | Sales |
3 | Carol | HR |
4 | David | Marketing |
-
Primary Key: EmployeeID
-
Non-Key Attributes: EmployeeName, DepartmentName (where DepartmentName now acts as a foreign key referencing the Departments table).
- Departments
DepartmentName | DepartmentLocation |
---|---|
HR | New York |
Sales | London |
Marketing | Paris |
-
Primary Key: DepartmentName
-
Non-Key Attribute: DepartmentLocation (directly dependent on DepartmentName)
By doing this, we eliminate redundancy (e.g., "New York" is no longer repeated for every HR employee) and prevent update anomalies. If the location of the HR department changes, we only need to update it in one place (the Departments table) instead of potentially many rows in the Employees table. This ensures data consistency and integrity.
Fourth Normal Form (4NF)
Beyond 3NF, there are higher normal forms like Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), and Fifth Normal Form (5NF), which address more complex dependencies and are less commonly encountered in typical database design. Achieving 3NF or BCNF is often sufficient for most applications, providing a good balance between data integrity and performance.
To be in 4NF, a table must satisfy two conditions:
-
(BCNF is a stricter version of 3NF. If a table is in BCNF, it's also in 3NF. BCNF states that for every non-trivial functional dependency X→Y, X must be a super key. This means that if an attribute or set of attributes determines another attribute, then the determining attribute(s) must be a candidate key.)
-
It must not contain any non-trivial multi-valued dependencies. A multi-valued dependency A↠B exists if, for a given value of A, there is a fixed set of values for B, and this set of B values is independent of any other attributes in the table.