Normalization vs. Denormalization in Databases

February 20, 2016

Normalized vs. Denormalized

Normalization:
Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.

Denormalization:
Denormalization is the process of attempting to optimize the read performance of a database by adding redundant data or by grouping data. In some cases, denormalization helps cover up the inefficiencies inherent in relational database software. A fully normalized relational database can impose a heavy access load on physical storage, even when it is well tuned for high performance.

Normalization: The first three forms
First Normal Form:
No repeating groups. As an example, it might be tempting to make an invoice table with columns for the first, second, and third line item. This violates the first normal form, and would result in large rows, wasted space (where an invoice had fewer than the maximum number of line items), and *horrible* SQL statements with a separate join for each repetition of the column. First normal form requires you to make a separate line item table, with its own key (in this case the combination of invoice number and line number).
Second Normal Form:
Each column must depend on the *entire* primary key. As an example, the customer information could be put in the line item table. The trouble with that is that the customer goes with the invoice, not with each line on the invoice. Putting customer information in the line item table would cause redundant data, with its inherent overhead and difficult modifications. Second normal form requires you to place the customer information in the invoice table.
Third Normal Form: 
Each column must depend *directly* on the primary key. As an example, the customer address could go in the invoice table, but this would cause data redundancy if several invoices were for the same customer. It would also cause an update nightmare when the customer changes his address, and would require extensive programming to insert the address every time an existing customer gets a new invoice. Third normal form requires the customer address to go in a separate customer table with its own key (the customer identifier), with only that identifier in the invoice table.
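
A minimal sketch of how these tables might look after applying the first three normal forms (the table and column names here are illustrative, not taken from the original article):

-- Customer data in its own table (third normal form: the address depends on
-- the customer, not on any particular invoice).
CREATE TABLE customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100),
    address       VARCHAR(200)
);

-- Invoice-level data (second normal form: the customer belongs to the invoice,
-- not to each line item).
CREATE TABLE invoice (
    invoice_number INT PRIMARY KEY,
    invoice_date   DATE,
    customer_id    INT,
    FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
);

-- One row per line item instead of repeating columns (first normal form:
-- no repeating groups).
CREATE TABLE line_item (
    invoice_number INT,
    line_number    INT,
    product_id     INT,
    quantity       INT,
    PRIMARY KEY (invoice_number, line_number),
    FOREIGN KEY (invoice_number) REFERENCES invoice (invoice_number)
);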


What is the difference between normalized and denormalized, and when do you use which?

When most of us design our databases, we tend to think of them as related (and unrelated) tables that contain data. It is common to normalize our tables in order to create a cleaner, more manageable, and intuitive database design. Normalized means separate tables related via foreign keys. There are, however, things that we should consider before we normalize our system:

Historical Data:

One of the most important is historical data. When we normalize our system, we open the door to historical inaccuracy. What does that mean? It means that we can change the data in a related table, causing us to lose the accuracy of the data that references it.

What do you mean “lose the accuracy”? Let’s look at an example: we have a Member table which holds a foreign key reference to the States table. If we change the data in the States table, which is referenced by the Member table, we have no way of knowing what the Member record originally contained (for example, changing Republic of Congo to Republic of China).
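
A minimal sketch of the situation described above (the Member and States tables come from the example; the column names are assumed for illustration):

-- Normalized: the Member row stores only a pointer to the States row.
CREATE TABLE states (
    state_id   INT PRIMARY KEY,
    state_name VARCHAR(100)
);

CREATE TABLE member (
    member_id   INT PRIMARY KEY,
    member_name VARCHAR(100),
    state_id    INT,
    FOREIGN KEY (state_id) REFERENCES states (state_id)
);

-- Changing the referenced row silently changes it for every member that
-- points to it, past and present; the value the Member record originally
-- referenced is gone.
UPDATE states
SET state_name = 'Republic of China'
WHERE state_name = 'Republic of Congo';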


So why should we care; isn’t this the whole reason why we normalize our database? Yes and no. If we know that we will never need to keep an audit trail or report on the data, then normalizing it is perfectly suitable. However, we all know that software requirements change, and there may come a time when the Marketing Department wants to know the history of how its members’ data has changed over time. Normalization would destroy this historical record (unless we somehow dig it out of a backup, which would be a major pain in the rear); a denormalized design would not.


Reporting:

If you think that you will ever have to do any kind of data mining or reporting on your data, then avoiding over-normalization is important. Reporting on, or performing calculations over, large amounts of data requires that you denormalize your database. Anticipating this during the design phase can save you a lot of time and grief later on.


Pros and cons of a normalized database design.

Normalized databases fare very well under conditions where the application is write-intensive and the write load is higher than the read load. This is because of the following reasons:

  • Normalized tables are usually smaller and have a smaller footprint because the data is divided vertically among many tables. This allows them to perform better, as they are small enough to fit into the buffer.
  • The updates are very fast because the data to be updated is located in a single place and there are no duplicates.
  • Similarly, the inserts are very fast because the data has to be inserted in a single place and does not have to be duplicated.
  • The selects are fast in cases where data has to be fetched from a single table, because normalized tables are normally small enough to fit into the buffer.
  • Because the data is not duplicated, there is less need for heavy-duty GROUP BY or DISTINCT queries.

Although there seems to be much in favor of normalized tables, with all the pros outlined above, the main cause of concern with fully normalized tables is that normalized data means joins between tables. And this joining means that read operations suffer, because indexing strategies do not work well across table joins.
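
To make this trade-off concrete, here is a deliberately simplified sketch that reuses the illustrative customer and invoice tables from the normal-form example above (all names and values are made up):

-- Write side of a normalized design: one row to change, no duplicates to chase.
UPDATE customer
SET address = '42 New Street'
WHERE customer_id = 17;

-- Read side: pulling an invoice together with its customer data needs a join.
SELECT i.invoice_number, i.invoice_date, c.customer_name, c.address
FROM invoice i
INNER JOIN customer c ON c.customer_id = i.customer_id
WHERE i.invoice_number = 1001;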

Now let’s have a look at the pros and cons of a denormalized database design.

Pros and cons of a denormalized database design.

Denormalized databases fare well under heavy read load and when the application is read-intensive. This is because of the following reasons:

  • The data is present in the same table, so there is no need for any joins; hence the selects are very fast.
  • A single table with all the required data allows much more efficient index usage. If the columns are indexed properly, then results can be filtered and sorted using the same index. In the case of a normalized schema, since the data would be spread out across different tables, this would not be possible.

Although, for the reasons mentioned above, selects can be very fast on denormalized tables, the updates and inserts become complex and costly because the data is duplicated.

Having said that, neither one of the approaches can be entirely neglected, because a real-world application is going to have both read loads and write loads. Hence the correct way would be to utilize both the normalized and denormalized approaches depending on the situation.

Using normalized and denormalized approaches together.

The most common way of mixing denormalized and normalized approaches is to duplicate related columns from one table into another table. Let me show you by example:

Suppose you have a products table and an orders table.
The normalized approach would be to only have the product_id in the orders table and all the other product related information in the products table.
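
As a rough sketch of that normalized starting point (the products and orders tables and the product_id, product_name, and order_date columns come from the article; the remaining columns and types are assumed for illustration):

CREATE TABLE products (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100),
    price        DECIMAL(10, 2)
);

CREATE TABLE orders (
    order_id   INT PRIMARY KEY,
    product_id INT,
    order_date DATETIME,
    FOREIGN KEY (product_id) REFERENCES products (product_id)
);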

But that would make a query that filters by product_name and sorts by order_date inefficient, because the two columns are stored in different tables.

In a fully normalized schema, such a query would be performed in the following manner:

SELECT product_name, order_date
FROM orders
INNER JOIN products USING (product_id)
WHERE product_name LIKE 'A%'
ORDER BY order_date DESC;

As you can see, MySQL here will have to scan the order_date index on the orders table and then look up the corresponding product_name in the products table to check whether the name starts with 'A'.

The above query can be drastically improved by denormalizing the schema a little bit, so that the orders table now includes the product_name column as well.

SELECT product_name, order_date
FROM orders
WHERE product_name LIKE 'A%'
ORDER BY order_date DESC;

See how the query has become much simpler: there is no join now, and a single composite index on the columns (product_name, order_date) can be used to do the filtering as well as the sorting.
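
A hedged sketch of the schema change and index this relies on (the statement forms are standard MySQL; the column size and index name are assumptions):

-- Denormalize: copy the product name into the orders table.
ALTER TABLE orders ADD COLUMN product_name VARCHAR(100);

UPDATE orders o
INNER JOIN products p ON p.product_id = o.product_id
SET o.product_name = p.product_name;

-- A single composite index serves both the filter and the sort.
CREATE INDEX idx_orders_product_name_order_date
    ON orders (product_name, order_date);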

So can both techniques be used together? Yes they can, because real-world applications have a mix of read and write loads.

Final words.

Although a denormalized schema can greatly improve performance under extreme read loads, the updates and inserts become complex, as the data is duplicated and hence has to be updated or inserted in more than one place.

One clean way to go about solving this problem is through the use of triggers. For example, in our case, where the orders table has the product_name column as well, an update to the value of product_name can simply be handled in the following way:

  • Have a trigger set up on the products table that updates the product_name column in the orders table on any update to the products table (a sketch of such a trigger follows this list).
  • Execute the update query on the products table. The data would automatically be updated in the orders table because of the trigger.
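
A minimal sketch of such a trigger in MySQL syntax (the trigger name is made up, and it assumes the illustrative products/orders columns used above):

DELIMITER //

CREATE TRIGGER products_name_sync
AFTER UPDATE ON products
FOR EACH ROW
BEGIN
    -- Keep the duplicated column in orders consistent with products.
    IF NEW.product_name <> OLD.product_name THEN
        UPDATE orders
        SET product_name = NEW.product_name
        WHERE product_id = NEW.product_id;
    END IF;
END//

DELIMITER ;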

However, when denormalizing the schema, do take into consideration the number of times you would be updating records compared to the number of times you would be executing SELECTs. When mixing normalization and denormalization, focus on denormalizing tables that are read-intensive, while keeping write-intensive tables normalized.

Some Good Reasons Not To Normalize

That said, there are some good reasons not to normalize your database. Let’s look at a few:

  1. Joins are expensive. Normalizing your database often involves creating lots of tables. In fact, you can easily wind up with what might seem like a simple query spanning five or ten tables. If you’ve ever tried doing a five-table join, you know that it works in principle, but it’s painfully slow in practice. If you’re building a web application that relies upon multiple-join queries against large tables, you might find yourself thinking: “If only this database wasn’t normalized!” When you hear that thought in your head, it’s a good time to consider denormalizing. If you can stick all of the data used by that query into a single table without really jeopardizing your data integrity, go for it! Be a rebel and denormalize your database. You won’t look back!
  2. Normalized design is difficult. If you’re working with a complex database schema, you’ll probably find yourself banging your head against the table over the complexity of normalization. As a simple rule of thumb, if you’ve been banging your head against the table for an hour or two trying to figure out how to move to fourth normal form, you might be taking normalization too far. Step back and ask yourself if it’s really worth continuing.

  3. Quick and dirty should be quick and dirty. If you’re just developing a prototype, just do whatever works quickly. Really. It’s OK. Rapid application development is sometimes more important than elegant design. Just remember to go back and take a careful look at your design once you’re ready to move beyond the prototyping phase. The price you pay for a quick and dirty database design is that you might need to throw it away and start over when it’s time to build for production.


Words of Caution

Database normalization is generally a good idea. You should try to follow the principles of normalization when it seems reasonable to do so. But we shouldn’t let a quasi-religious fanaticism about normalization prevent us from doing the job in the best way possible.

If we choose to vary from the rules of normalization, we need to be extra vigilant about the way we enforce database integrity. If we store redundant information, we should put triggers and other controls in place to make sure that information stays consistent.

