What are Normal Forms and do you need to respect them?

[French version available here]

When comparing RDBMS and NoSQL, the schema is often brought up. In the NoSQL world, "schema-less databases" or "no schema" is presented as an advantage. While that can be true in some cases, your application will rely on a schema one way or another: you always end up presenting data to the end user, and you rely on the UI to correlate data between screens.

The schema definition is a big deal when using a database, as it will help (or hurt!) performance, consistency, storage, and so on. When it comes to designing a schema, you will usually rely on a Normal Form (NF) to ensure it is designed properly, just like you would use UML for your applications.

What is Normal Form?

Normal Forms are a set of rules for defining your schema and how you store your data, to prevent issues when it comes to reading back what you stored.

I won't discuss all the existing principles and Normal Forms, just the five most used ones.

1NF

The first NF is the easy one: all your tables must have a Primary Key, and you can't mix data types or store composite values in a column.

That's it!
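
To make this concrete, here is a minimal SQL sketch (the table and column names are illustrative, not from the article): a column holding a comma-separated list of phone numbers breaks 1NF, while giving each number its own row in a keyed table respects it.

```sql
-- Breaks 1NF: no primary key, and "phones" holds a composite value
CREATE TABLE customers_bad (
    name   TEXT,
    phones TEXT              -- e.g. '555-1234, 555-5678'
);

-- Respects 1NF: a primary key on every table, one atomic value per column
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        TEXT NOT NULL
);

CREATE TABLE customer_phones (
    customer_id INT REFERENCES customers (customer_id),
    phone       TEXT,
    PRIMARY KEY (customer_id, phone)
);
```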

2NF

To respect the 2NF, all non-key attributes must depend on the entire PK.

Imagine a table holding a product, a warehouse, and the quantity of that product in that warehouse.

If you have the PK on ProductID and Warehouse, everything is fine. But if you decide to add the product's rating to that table, you won't respect 2NF and you're exposed to issues.

Issue when updating the rating of a product
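
Here is a minimal sketch of that situation, assuming columns for the product, warehouse, quantity, and rating (the exact names are my assumption): the rating depends on the product alone, not on the whole (product, warehouse) key, so it belongs in its own table.

```sql
-- Breaks 2NF: rating depends only on product_id, not on the full key
CREATE TABLE stock_bad (
    product_id INT,
    warehouse  TEXT,
    quantity   INT,
    rating     NUMERIC,
    PRIMARY KEY (product_id, warehouse)
);

-- Respects 2NF: the rating lives in a table keyed by product_id only,
-- so updating it touches exactly one row
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    rating     NUMERIC
);

CREATE TABLE stock (
    product_id INT REFERENCES products (product_id),
    warehouse  TEXT,
    quantity   INT,
    PRIMARY KEY (product_id, warehouse)
);
```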

3NF

For 3NF, every non-key attribute must depend on the PK, and nothing but the PK!

It means you'll need to introduce more relations between your data to ensure consistency.

Let's imagine you manage a metadata store for a game. For this game, you decide to have a table that stores each player's level.

For a player to reach a certain level, they need to gain a certain rating. The game designer decided that levels 1 to 4 are "low", 5 to 8 are "medium", and 9 to 10 are "advanced". If you add that tier as a column to the players table, you won't respect 3NF and you risk inconsistency.

You can prevent it in the logic of your application, but adopting 3NF prevents this kind of issue by design.
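
As a sketch, assuming a players table keyed by player_id (names are illustrative): storing the tier next to the level creates a transitive dependency, because the tier depends on the level, not on the player.

```sql
-- Breaks 3NF: rating_tier depends on level, not on the key (player_id)
CREATE TABLE players_bad (
    player_id   INT PRIMARY KEY,
    level       INT,
    rating_tier TEXT               -- 'low', 'medium', 'advanced'
);

-- Respects 3NF: the tier is defined once per level, in its own table
CREATE TABLE level_tiers (
    level       INT PRIMARY KEY,   -- 1..10
    rating_tier TEXT NOT NULL      -- 1-4 low, 5-8 medium, 9-10 advanced
);

CREATE TABLE players (
    player_id INT PRIMARY KEY,
    level     INT REFERENCES level_tiers (level)
);
```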

4NF

4NF introduces a new rule about independent multi-valued facts attached to the same key.

Imagine you're managing a car catalog. You define each car by a model name, the available colors, and a type. Not all colors are available for all models, and the same goes for the types. But for a given model, the available colors are the same whatever the type is.

So you can have, for example, Model Z available in blue and green, as a sedan or an SUV. A table not respecting 4NF would list every (model, color, type) combination as a separate row.

But what happens when a new color becomes available for Model Z? To respect 4NF, you need to split it into two tables: one for the available colors and one for the types.
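
A sketch of that split, with illustrative names: colors and types are independent facts about a model, so each gets its own table and a new color is a single insert.

```sql
-- Breaks 4NF: every color is repeated for every type (and vice versa)
CREATE TABLE car_catalog_bad (
    model TEXT,
    color TEXT,
    type  TEXT,
    PRIMARY KEY (model, color, type)
);

-- Respects 4NF: one table per independent multi-valued fact
CREATE TABLE model_colors (
    model TEXT,
    color TEXT,
    PRIMARY KEY (model, color)
);

CREATE TABLE model_types (
    model TEXT,
    type  TEXT,
    PRIMARY KEY (model, type)
);

-- Adding a new color for Model Z is now a single row
INSERT INTO model_colors (model, color) VALUES ('Model Z', 'red');
```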

5NF

Once again, it's all about attributes and key management: in 5NF, every join dependency has to be implied by the candidate keys.

The idea is to prevent redundant tuples, and just like in the 4NF example, you'll end up splitting your tables once again.
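
As a sketch, using the classic supplier/part/project illustration (names and scenario are my assumption, not from the article): when the three pairwise relationships together determine the triples, keeping one wide table stores redundant tuples, and 5NF splits it into binary tables whose join rebuilds it.

```sql
-- Redundant tuples: the same pairwise facts get repeated across many rows
CREATE TABLE supplies_bad (
    supplier TEXT,
    part     TEXT,
    project  TEXT,
    PRIMARY KEY (supplier, part, project)
);

-- 5NF: the join dependency is carried by three tables with clear keys,
-- and the original rows can be reconstructed by joining them
CREATE TABLE supplier_parts    (supplier TEXT, part TEXT,    PRIMARY KEY (supplier, part));
CREATE TABLE supplier_projects (supplier TEXT, project TEXT, PRIMARY KEY (supplier, project));
CREATE TABLE project_parts     (project  TEXT, part TEXT,    PRIMARY KEY (project, part));
```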

Schema-less or Normal Form

Normal Forms are really useful to prevent inconsistencies and anomalies in your dataset by design. Redundancy in your data is not the problem it used to be, thanks to the decrease in storage costs, but preventing anomalies is always better if you want to avoid customer/user issues with your service or application. Don't forget that integrity and consistency are not only about a single row (or document); they are about the relations between your data. As soon as you use a normal form, you benefit from this by design, simply by implementing Foreign Key constraints.
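
For instance, a minimal sketch (illustrative names): once the foreign key exists, the database itself rejects a row pointing at something that doesn't exist, so the anomaly never reaches your dataset.

```sql
CREATE TABLE warehouses (
    warehouse_id INT PRIMARY KEY,
    city         TEXT
);

CREATE TABLE inventory (
    product_id   INT,
    warehouse_id INT NOT NULL REFERENCES warehouses (warehouse_id),
    quantity     INT,
    PRIMARY KEY (product_id, warehouse_id)
);

-- Fails with a foreign key violation if warehouse 99 doesn't exist:
-- the relation between your data is enforced by the database, not by app code
INSERT INTO inventory (product_id, warehouse_id, quantity) VALUES (1, 99, 10);
```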

Working with a document database will generally lead to repeating yourself across multiple documents. To avoid anomalies, you simply store all the relevant and/or linked data in a single document. Don't worry about redundancy, storage is cheap. But is the network cheap? If you're hosting your application in the cloud, egress costs can be expensive. If you have no way to retrieve only a few attributes from your documents, network costs will increase... dramatically! Repeating yourself also means that sometimes the same attribute won't have the same value in every document, which can lead to inconsistent information across different screens of your application.

When it comes to an RDBMS, the cost is paid in compute. A join between 2 tables is easily managed, but when a query spans 10 tables, its latency will increase.

Denormalizing

You can denormalize your schema to improve performance. In that case, you simply store redundant data shaped for specific querying needs. Usually, this is done only for:

  1. Query performance: avoid complex joins by querying a single table by its key (see the sketch after this list).
  2. Simplifying the schema: limit complex schemas with cascading foreign keys.
  3. Handling write-heavy loads: limiting the number of writes required for some inserts/updates can help the database handle the load.
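
A minimal sketch of the first case, with illustrative names: a denormalized "read" table that holds everything one screen needs, fetched with a single key lookup instead of joins. The comments call out what you now have to keep in sync yourself.

```sql
-- Denormalized table shaped for one query: the product page
CREATE TABLE product_page (
    product_id     INT PRIMARY KEY,
    name           TEXT,
    rating         NUMERIC,
    total_stock    INT,     -- pre-aggregated from the stock rows
    warehouse_list TEXT     -- redundant copy; must be refreshed on every stock write
);

-- One key lookup, no joins
SELECT * FROM product_page WHERE product_id = 42;
```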

Keep in mind that denormalization increases the risk of data inconsistencies and makes it harder to maintain the database schema and ensure it still supports your queries.

What to do?

"What you must learn is that these rules are no different than the rules of a computer system. Some of them can be bent. Others can be broken." - Morpheus, 1999

Nothing about NF is written in stone. Working on an application or an API means understanding what you need to achieve, where your hot paths are, which data is used the most, and so on.

Using views or covering indexes can help increase performance by formatting or storing the data differently, while keeping it clean when it comes to writing it.
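
A sketch of both, in PostgreSQL syntax with illustrative names (the orders and customers tables are assumed):

```sql
-- Covering index: the query below can be answered from the index alone
CREATE INDEX idx_orders_customer
    ON orders (customer_id)
    INCLUDE (order_date, total_amount);

SELECT order_date, total_amount
FROM orders
WHERE customer_id = 42;

-- View: writes keep hitting the clean, normalized tables,
-- while readers get a pre-joined shape
CREATE VIEW order_summary AS
SELECT o.order_id, c.name AS customer_name, o.order_date, o.total_amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;
```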

Applying the CQRS pattern in your design will definitely help you segregate your data access needs and think about your schema design. And if you want to keep some room for evolving data, you can benefit from using the JSONB data type directly.
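
For example, a sketch in PostgreSQL (illustrative names): fixed columns for what you join and filter on, a JSONB column for attributes that may evolve without schema migrations.

```sql
CREATE TABLE catalog_items (
    item_id    INT PRIMARY KEY,
    name       TEXT NOT NULL,
    attributes JSONB           -- e.g. '{"color": "blue", "weight_kg": 1.2}'
);

-- JSONB stays queryable (and can be indexed with GIN if needed)
SELECT name
FROM catalog_items
WHERE attributes ->> 'color' = 'blue';
```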

In short, what you need to do is what's best for your specific needs, not blindly follow a set of rules. Go back to the basics of system design and look at what your application needs. Define your schema with all the components needed to retrieve the data, like indexes and views, while respecting your performance and storage constraints. Don't follow any NF strictly; apply them where they are useful and avoid issues, and work around them when required for your system to work well.

We've seen plenty of successful services running on document DBs, graph DBs, etc., so don't be afraid of breaking NF when using an RDBMS. Just think about what you need and how your specific database works.

The write/read ratio has a huge impact on your decision.

If your application is 90% writes, do you really need sub-millisecond response times on reads? Do you need to design your schema around any NF and increase write latency? In this kind of use case, you can avoid overly long writes (and the resulting contention on transactions) by simply not using a normalized schema.

When reads are at 90%, indexes and views will usually do the trick. You'll experience some latency on writes, but reads won't be a roadblock for the application, especially when using covering indexes (pay attention to the storage impact).

Finally, with a 50-50 ratio, have a clear understanding of the application's needs. Not normalizing a schema doesn't mean you can't have normalized data in it! You can normalize where it's better, denormalize where it's useful, and get the "best of both worlds" in your schema.