Database Star

Database Normalization: A Step-By-Step Guide With Examples

Database normalisation is a concept that can be hard to understand.

But it doesn’t have to be.

In this article, I’ll explain what normalisation in a DBMS is and how to do it, in simple terms.

By the end of the article, you’ll know all about it and how to do it.

So let’s get started.

Want to improve your database modelling skills? Click here to get my Database Normalisation Checklist: a list of things to do as you normalise or design your database!


What Is Database Normalization?

Database normalisation, or just normalisation as it’s commonly called, is a process used for data modelling or database creation, where you organise your data and tables so it can be added and updated efficiently.

It’s something a person does manually, as opposed to a system or a tool doing it. It’s commonly done by database developers and database administrators.

It can be done on any relational database, where data is stored in tables that are linked to each other. This means that normalization in a DBMS (Database Management System) can be done in Oracle, Microsoft SQL Server, MySQL, PostgreSQL and any other type of database.

To perform the normalization process, you start with a rough idea of the data you want to store, and apply certain rules to it in order to get it to a more efficient form.

I’ll show you how to normalise a database later in this article.

Why Normalize a Database?

So why would anyone want to normalize their database?

Why do we want to go through this manual process of rearranging the data?

There are a few reasons we would want to go through this process:

  • Make the database more efficient
  • Prevent the same data from being stored in more than one place
  • Prevent records being added with missing or incomplete data (called an “insert anomaly”)
  • Prevent updates being made to some copies of the data but not others (called an “update anomaly”)
  • Prevent wanted data from being lost when other data is deleted (called a “delete anomaly”)
  • Ensure the data is accurate
  • Reduce the storage space that a database takes up
  • Ensure the queries on a database run as fast as possible

Normalization in a DBMS is done to achieve these goals. Without it, the data can be inaccurate and messy, and queries can be slow.

Data Anomalies

Some of these points above relate to “anomalies”.

An anomaly is an issue in the data that is not meant to be there. This can happen if a database is not normalised.

Let’s take a look at the different kinds of data anomalies that can occur and that can be prevented with a normalised database.

Our Example

We’ll be using a student database as an example in this article, which records student, class, and teacher information.

Let’s say our student database looks like this:

This table keeps track of a few pieces of information:

  • The student names
  • The fees a student has paid
  • The classes a student is taking, if any

This is not a normalised table, and there are a few issues with this.

Insert Anomaly

An insert anomaly happens when we try to insert a record into this table without knowing all the data we need to know.

For example, if we wanted to add a new student but did not know their course name.

The new record would look like this:

We would be adding incomplete data to our table, which can cause issues when trying to analyse this data.
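
To make this concrete, here is a rough SQL sketch of that kind of insert (the table and column names are assumptions based on the example above, and the exact date syntax varies between databases):

-- Hypothetical unnormalised table: the course column has to be left NULL
-- because we don't know it yet, so we store an incomplete record.
INSERT INTO student (student_name, fees_paid, date_of_birth, address, course_name)
VALUES ('New Student', 0, DATE '2001-04-15', '12 Example Street', NULL);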

Update Anomaly

An update anomaly happens when we want to update data, and we update some of the data but not other data.

For example, let’s say the class Biology 1 was changed to “Intro to Biology”. We would have to find every row (and every class column) that could contain this value and rename each occurrence.

There’s a risk that we miss a value, which would cause issues.

Ideally, we would only update the value once, in one location.
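
As a rough SQL sketch of the problem (the subject_1 to subject_4 column names are assumptions based on the example table, not a real schema), the same rename has to be repeated for every column that might hold the value:

-- Every copy of the class name has to be found and changed, and it's easy to miss one.
UPDATE student SET subject_1 = 'Intro to Biology' WHERE subject_1 = 'Biology 1';
UPDATE student SET subject_2 = 'Intro to Biology' WHERE subject_2 = 'Biology 1';
UPDATE student SET subject_3 = 'Intro to Biology' WHERE subject_3 = 'Biology 1';
UPDATE student SET subject_4 = 'Intro to Biology' WHERE subject_4 = 'Biology 1';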

Delete Anomaly

A delete anomaly occurs when we want to delete data from the table, but we end up deleting more than what we intended.

For example, let’s say Susan Johnson quits and her record needs to be deleted from the system. We could delete her row:

But, if we delete this row, we lose the record of the Biology 2 class, because it’s not stored anywhere else. The same can be said for the Medicine course.

We should be able to delete one type of data or one record without having impacts on other records we don’t want to delete.
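
In SQL terms, the risky delete might look like the sketch below (table and column names are assumed from the example):

-- Deleting Susan Johnson's row also removes the only place where the
-- Biology 2 class and the Medicine course are recorded.
DELETE FROM student
WHERE student_name = 'Susan Johnson';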

What Are The Normal Forms?

The process of normalization involves applying rules to a set of data. Each of these rules transforms the data to a certain structure, called a normal form.

There are three main normal forms that you should consider (Actually, there are six normal forms in total, but the first three are the most common).

When the first rule is applied, the data is in “first normal form”. Then, the second rule is applied and the data is in “second normal form”. The third rule is then applied and the data is in “third normal form”.

Fourth and fifth normal forms are achieved by applying their own specific rules.

Alright, so there are three main normal forms that we’re going to look at. I’ve written a post on designing a database , but let’s see what is involved in getting to each of the normal forms in more detail.

What Is First Normal Form?

First normal form is the way that your data is represented after it has the first rule of normalization applied to it. Normalization in DBMS starts with the first rule being applied – you need to apply the first rule before applying any other rules.

Let’s start with a sample database. In this case, we’re going to use a student and teacher database at a school. We mentioned this earlier in the article when we spoke about anomalies, but here it is again.

Our Example Database

We have a set of data we want to capture in our database, and this is how it currently looks. It’s a single table called “student” with a lot of columns.

Everything is in one table.

How can we normalise this?

We start with getting the data to First Normal Form.

To apply first normal form to a database, we look at each table, one by one, and ask ourselves the following questions of it:

Does the combination of all columns make a unique row every single time?

What field can be used to uniquely identify the row?

Let’s look at the first question.

No. Two rows could contain exactly the same combination of values and still represent two different students, so the combination of all columns is not guaranteed to be unique (even though duplicates would be rare).

The second question asks: what field can be used to uniquely identify the row?

Could this be the student name? No, as there could be two students with the same name.

Address? No, this isn’t unique either.

Any other field?

We don’t have a field that can uniquely identify the row.

If there is no unique field, we need to create a new field. This is called a primary key, and is a database term for a field that is unique to a single row. (Related: The Complete Guide to Database Keys)

When we create a new primary key, we can call it whatever we like, but it should be obvious and consistently named between tables. I prefer using the ID suffix, so I would call it student ID.

This is our new table:

Student ( student ID , student name, fees paid, date of birth, address, subject 1, subject 2, subject 3, subject 4, teacher name, teacher address, course name)

This can also be represented in an Entity Relationship Diagram (ERD):

Normalisation ERD Examples - 1NF

The way I have written this is a common way of representing tables in text format. The table name is written, and all of the columns are shown in brackets, with the primary key underlined.

This data is now in first normal form.

This example is still in one table, but it’s been made a little better by adding a unique value to it.
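
If you wanted to create this first-normal-form table in SQL, it might look something like the sketch below. The data types are assumptions, and the exact syntax varies slightly between Oracle, SQL Server, MySQL, and PostgreSQL:

CREATE TABLE student (
  student_id      INT PRIMARY KEY,   -- the new unique identifier for each row
  student_name    VARCHAR(100),
  fees_paid       DECIMAL(10, 2),
  date_of_birth   DATE,
  address         VARCHAR(200),
  subject_1       VARCHAR(100),
  subject_2       VARCHAR(100),
  subject_3       VARCHAR(100),
  subject_4       VARCHAR(100),
  teacher_name    VARCHAR(100),
  teacher_address VARCHAR(200),
  course_name     VARCHAR(100)
);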

Want to find a tool that creates these kinds of diagrams? I’ve listed 76 of them in this guide to Data Modeling Tools, along with reviews, prices, and other features. So if you’re looking for one to use, take a look at that list.

What Is Second Normal Form?

The rule of second normal form on a database can be described as:

  • Fulfil the requirements of first normal form
  • Each non-key attribute must be functionally dependent on the whole primary key

What does this even mean?

It means that the first normal form rules have been applied. It also means that each field that is not the primary key is determined by that primary key , so it is specific to that record. This is what “functional dependency” means.

Let’s take a look at our table.

Are all of these columns dependent on and specific to the primary key?

The primary key is student ID, which represents the student. Let’s look at each column:

  • student name: Yes, this is dependent on the primary key. A different student ID means a different student name.
  • fees paid: Yes, this is dependent on the primary key. Each fees paid value is for a single student.
  • date of birth: Yes, it’s specific to that student.
  • address: Yes, it’s specific to that student.
  • subject 1: No, this column is not dependent on the student. More than one student can be enrolled in one subject.
  • subject 2: No, same rule as subject 1.
  • subject 3: No, same rule as subject 1.
  • subject 4: No, same rule as subject 1.
  • teacher name: No, the teacher name is not dependent on the student.
  • teacher address: No, the teacher address is not dependent on the student.
  • course name: No, the course name is not dependent on the student.

We have a mix of Yes and No here. Some fields are dependent on the student ID, and others are not.

How can we resolve those we marked as No?

Let’s take a look.

First, the subject 1 column. It is not dependent on the student, as more than one student can have a subject, and the subject isn’t a part of the definition of a student.

So, we can move it to a new table:

Subject (subject name)

I’ve called it subject name because that’s what the value represents. When we are writing queries on this table or looking at diagrams, it’s clearer what subject name means than if the column were simply called subject.

Now, is this field unique? Not necessarily. Two subjects could have the same name and this would cause problems in our data.

So, what do we do? We add a primary key column, just like we did for student. I’ll call this subject ID, to be consistent with the student ID.

Subject ( subject ID , subject name)

This means we have a student table and a subject table. We can do this for all four of our subject columns in the student table, removing them from the student table so it looks like this:

Student ( student ID , student name, fees paid, date of birth, address, teacher name, teacher address, course name)

But they are in separate tables. How do we link them together?

We’ll cover that shortly. For now, let’s keep going with our student table.

The next column we marked as No was the Teacher Name column. The teacher is separate from the student, so should be captured separately. This means we should move it to its own table.

Teacher (teacher name)

We should also move the teacher address to this table, as it’s a property of the teacher. I’ll also rename teacher address to be just address.

Teacher (teacher name, address)

Just like with the subject table, the teacher name and address are not unique. Sure, in most cases they would be, but to avoid duplication we should add a primary key. Let’s call it teacher ID.

Teacher ( teacher ID , teacher name, address)

The last column we have to look at is the Course Name column. This indicates the course that the student is currently enrolled in.

While the course is related to the student (a student is enrolled in a course), the name of the course itself is not dependent on the student.

So, we should move it to a separate table. This is so any changes to courses can be made independently of students.

The course table would look like this:

Course (course name)

Let’s also add a primary key called course ID.

Course ( course ID , course name)

We now have our tables created from columns that were in the student table. Our database so far looks like this:

Student ( student ID , student name, fees paid, date of birth, address)
Subject ( subject ID , subject name)
Teacher ( teacher ID , teacher name, address)
Course ( course ID , course name)

Using the data from the original table, our data could look like this:

How do we link these tables together? We still need to know which subjects a student is taking, which course they are in, and who their teachers are.

Foreign Keys in Tables

We have four separate tables, capturing different pieces of information. We need to capture the fact that students are enrolled in courses, take subjects, and have teachers. But the data is in different tables.

How can we keep track of this?

We use a concept called a foreign key.

A foreign key is a column in one table that refers to the primary key in another table. (Related: The Complete Guide to Database Keys.)

It’s used to link one record to another based on its unique identifier, without having to store the additional information about the linked record.

Here are our two tables so far:

Student ( student ID , student name, fees paid, date of birth, address)
Subject ( subject ID , subject name)
Teacher ( teacher ID , teacher name, teacher address)
Course ( course ID , course name)

To link the two tables using a foreign key, we need to put the primary key (the underlined column) from one table into the other table.

Let’s start with a simple one: students taking courses. For our example scenario, a student can only be enrolled in one course at a time, and a course can have many students.

We need to either:

  • Add the course ID from the course table into the student table
  • Add the student ID from the student table into the course table

But which one is it?

In this situation, I ask myself a question to work out which way it goes:

Does a table1 have many table2s, or does a table2 have many table1s?

If it’s the first, then the table1 ID goes into table2, and if it’s the second, then the table2 ID goes into table1.

So, if we substitute table1 and table2 for course and student:

Does a course have many students, or does a student have many courses?

Based on our rules, the first statement is true: a course has many students.

This means that the course ID goes into the student table.

Student ( student ID , course ID , student name, fees paid, date of birth, address)
Subject ( subject ID , subject name)
Teacher ( teacher ID , teacher name, teacher address)
Course ( course ID , course name)

I’ve italicised it to indicate it is a foreign key – a value that links to a primary key in another table.

When we actually populate our tables, instead of having the course name in the student table, the course ID goes in the student table. The course name can then be linked using this ID.

This also means that the course name is stored in one place only, and can be added/removed/updated without impacting other tables.
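
In SQL, the foreign key could be declared something like this sketch (data types are assumed, and the parent table must be created before the table that references it):

CREATE TABLE course (
  course_id   INT PRIMARY KEY,
  course_name VARCHAR(100)
);

CREATE TABLE student (
  student_id    INT PRIMARY KEY,
  course_id     INT,                  -- foreign key: which course the student is enrolled in
  student_name  VARCHAR(100),
  fees_paid     DECIMAL(10, 2),
  date_of_birth DATE,
  address       VARCHAR(200),
  CONSTRAINT fk_student_course
    FOREIGN KEY (course_id) REFERENCES course (course_id)
);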

I’ve created a YouTube video that explains how to identify and diagram one-to-many relationships like this.

We’ve linked the student to the course. Now let’s look at the teacher.

How are teachers related? Depending on the scenario, they could be related in one of a few ways:

  • A student can have one teacher that teaches them all subjects
  • A subject could have a teacher that teaches it
  • A course could have a teacher that teaches all subjects in a course

In our scenario, a teacher is related to a course. We need to relate these two tables using a foreign key.

Does a teacher have many courses, or does a course have many teachers?

In our scenario, the first statement is true. So the teacher ID goes into the course table:

Student ( student ID , course ID , student name, fees paid, date of birth, address)
Subject ( subject ID , subject name)
Teacher ( teacher ID , teacher name, teacher address)
Course ( course ID , teacher ID , course name)

The table data would look like this:

This allows us to change the teacher’s information without impacting the courses or students.

Student and Subject

So we’ve linked the course, teacher, and student tables together so far.

What about the subject table?

Does a subject have many students, or does a student have many subjects?

The answer is both.

How is that possible?

A student can be enrolled in many subjects at a time, and a subject can have many students in it.

How can we represent that? We could try to put one table’s ID in the other table:

But if we do this, we’re storing many pieces of information in one column, possibly separated by commas.

This makes it hard to maintain and is very prone to errors.

If we have this kind of relationship, one that goes both ways, it’s called a many to many relationship. It means that many records in one table can be related to many records in the other table.

Many to Many Relationships

A many to many relationship is common in databases. Some examples where it can happen are:

  • Students and subjects
  • Employees and companies (an employee can have many jobs at different companies, and a company has many employees)
  • Actors and movies (an actor is in multiple movies, and a movie has multiple actors)

If we can’t represent this relationship by putting a foreign key in each table, how can we represent it?

We use a joining table.

This is a table that is created purely for storing the relationships between the two tables.

It works like this. Here are our two tables:

Student ( student ID , course ID , student name, fees paid, date of birth, address)
Subject ( subject ID , subject name)

And here is our joining table:

Subject_Student ( student ID , subject ID )

It has two columns. Student ID is a foreign key to the student table, and subject ID is a foreign key to the subject table.
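
A SQL sketch of this joining table could look like the following (assuming the student and subject tables already exist; data types are assumptions). The composite primary key stops the same student/subject pair from being stored twice:

CREATE TABLE subject_student (
  student_id INT NOT NULL,
  subject_id INT NOT NULL,
  PRIMARY KEY (student_id, subject_id),                     -- one row per student/subject pair
  FOREIGN KEY (student_id) REFERENCES student (student_id),
  FOREIGN KEY (subject_id) REFERENCES subject (subject_id)
);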

Each row in this table would look like this:

Each row represents a relationship between a student and a subject.

Student 1 is linked to subject 1.

Student 1 is linked to subject 2.

Student 2 is linked to subject 2.

This has several advantages:

  • It allows us to store many subjects for each student, and many students for each subject.
  • It separates the data that describes the records (subject name, student name, address, etc.) from the relationship of the records (linking ID to ID).
  • It allows us to add and remove relationships easily.
  • It allows us to add more information about the relationship. We could add an enrolment date, for example, to this table, to capture when a student enrolled in a subject.

You might be wondering, how do we see the data if it’s in multiple tables? How can we see the student name and the name of the subjects they are enrolled in?

Well, that’s where the magic of SQL comes in. We use a SELECT query with JOINs to show the data we need. But that’s outside the scope of this article – you can read the articles on my Oracle Database page to find out more about writing SQL.
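
As a small illustration (not a full SQL tutorial), a query along these lines would list each student with the subjects they are enrolled in, assuming the table and column names above with spaces written as underscores:

SELECT s.student_name,
       sub.subject_name
FROM student s
JOIN subject_student ss ON ss.student_id = s.student_id
JOIN subject sub        ON sub.subject_id = ss.subject_id
ORDER BY s.student_name, sub.subject_name;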

One final thing I have seen added to these joining tables is a primary key of its own: an ID field that represents the record. This is an optional step – a primary key on a single new column works in a similar way to defining the primary key on the two ID columns. I’ll leave it out in this example.

So, our final table structure looks like this:

Student ( student ID , course ID , student name, fees paid, date of birth, address)
Subject ( subject ID , subject name)
Subject Enrolment ( student ID , subject ID )
Teacher ( teacher ID , teacher name, teacher address)
Course ( course ID , teacher ID , course name)

I’ve called the table Subject Enrolment. I could have left it as the concatenation of both of the related tables (student subject), but I feel it’s better to rename the table to what it actually captures – the fact a student has enrolled in a subject. This is something I recommend in my SQL Best Practices post.

I’ve also underlined both columns in this table, as they represent the primary key. They can also represent a foreign key, which is why they are also italicised.

An ERD of these tables looks like this:

Normalisation ERD Examples - 2NF

This database structure is in second normal form. We almost have a normalised database.

Now, let’s take a look at third normal form.

What Is Third Normal Form?

Third normal form is the final stage of the most common normalization process. The rule for this is:

  • Fulfils the requirements of second normal form
  • Has no transitive functional dependency

What does this even mean? What is a transitive functional dependency?

It means that every attribute that is not the primary key must depend on the primary key and the primary key only.

For example:

  • Column A determines column B
  • Column B determines column C
  • Therefore, column A determines C

This means that column A determines column B, which determines column C. This is a transitive functional dependency, and it should be removed. Column C should be in a separate table.

We need to check if this is the case for any of our tables.

Student ( student ID , course ID , student name, fees paid, date of birth, address)

Do any of the non-primary-key fields depend on something other than the primary key?

No, none of them do. However, if we look at the address, we can see something interesting:

We can see that there is a relationship between the ZIP code and the city or suburb. This is common with addresses, and you may have noticed this if you have filled out any forms for addresses online recently.

How are they related? The ZIP code, or postal code, determines the city, state, and suburb.

In this case, 56128 is South Boston, and 56125 is North Boston. (I just made this up so this is probably inaccurate data).

This falls into the pattern we mentioned earlier: A determines B which determines C.

Student determines the address ZIP code which determines the suburb.

So, how can we improve this?

We can move the ZIP code to another table, along with everything it identifies, and link to it from the student table.

Our table could look like this:

Student ( student ID , course ID , student name, fees paid, date of birth, street address, address code ID )
Address Code ( address code ID , ZIP code, suburb, city, state)

I’ve created a new table called Address Code, and linked it to the student table. I created a new column for the address code ID, because the ZIP code may refer to more than one suburb. This way we can capture that fact, and it’s in a separate table to make sure it’s only stored in one place.
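
A SQL sketch of this change could look like the following (data types assumed):

CREATE TABLE address_code (
  address_code_id INT PRIMARY KEY,
  zip_code        VARCHAR(10),
  suburb          VARCHAR(100),
  city            VARCHAR(100),
  state           VARCHAR(50)
);

-- The student table then keeps only the street address plus a reference, for example:
--   street_address  VARCHAR(200),
--   address_code_id INT REFERENCES address_code (address_code_id)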

Let’s take a look at the other tables:

Subject ( subject ID , subject name)
Subject Enrolment ( student ID , subject ID )

Both of these tables have no columns that aren’t dependent on the primary key.

Teacher ( teacher ID , teacher name, teacher address)

The teacher table also has the same issue as the student table when we look at the address. We can, and should, use the same approach for storing addresses.

So our table would look like this:

Teacher ( teacher ID , teacher name, street address, address code ID )
Address Code ( address code ID , ZIP code, suburb, city, state)

It uses the same Address Code table as mentioned above. We aren’t creating a new address code table.

Finally, the course table:

Course ( course ID , teacher ID , course name)

This table is OK. The course name is dependent on the course ID.

So, what does our database look like now?

Student ( student ID , course ID , student name, fees paid, date of birth, street address, address code ID )
Address Code ( address code ID , ZIP code, suburb, city, state)
Subject ( subject ID , subject name)
Subject Enrolment ( student ID , subject ID )
Teacher ( teacher ID , teacher name, street address, address code ID )
Course ( course ID , teacher ID , course name)

So, that’s how third normal form could look if we had this example.

An ERD of third normal form would look like this:

Normalisation ERD Examples - 3NF

Stopping at Third Normal Form

For most database normalisation exercises, stopping after achieving Third Normal Form is enough.

It satisfies good relational design rules and will greatly improve your data structure compared to having no normalisation at all.

There are a couple of steps after third normal form that are optional. I’ll explain them here so you can learn what they are.

Fourth Normal Form and Beyond

Fourth normal form is the next step after third normal form.

What does it mean?

It needs to satisfy two conditions:

  • Meet the criteria of third normal form.
  • There are no non-trivial multivalued dependencies other than those on a candidate key.

So, what does this mean?

A multivalued dependency is probably better explained with an example, which I’ll show you shortly. Roughly, it means that one attribute in the table can have several independent values for the same primary key, and those values can be moved to another table.

Our database looks like this:

This meets the third normal form rules.

However, let’s take a look at the address fields: street address and address code.

  • Both the student and teacher table have these
  • What if a student moves addresses? Do we update the address in this field? If we do, then we lose the old address.
  • If an address is updated, is it because they moved? Or is it because there was an error in the old address?
  • What if two students have the same street address? Are they actually at the same address? What if we update one and not the other?
  • What if a teacher and a student are at the same address?
  • What if we want to capture a student or a teacher having multiple addresses (for example, postal and residential)?

There are a lot of “what if” questions here. There is a way we can resolve them and improve the quality of the data.

This is a multivalued dependency.

We can solve this by moving the address to a separate table.

The address can then be linked to the teacher and student tables.

Let’s start with the address table.

Address ( address ID , street address, address code ID )

In this table, we have a primary key of address ID, and we have stored the street address here. The address code table stays the same.

We need to link this to the student and teacher tables. How do we do this?

Do we also want to capture the fact that a student or teacher can have multiple addresses? It may be a good idea to future proof the design. It’s something you would want to confirm in your organisation.

For this example, we will design it so there can be multiple addresses for a single student.

Our tables could look like this:

Student ( student ID , course ID , student name, fees paid, date of birth)
Address ( address ID , street address, address code ID )
Address Code ( address code ID , ZIP code, suburb, city, state)
Student Address ( address ID, student ID )
Subject ( subject ID , subject name)
Subject Enrolment ( student ID , subject ID )
Teacher ( teacher ID , teacher name)
Teacher Address ( teacher ID, address ID )
Course ( course ID , teacher ID , course name)

An ERD would look like this:

Normalisation ERD Examples - 4NF

A few changes have been made here:

  • The address code ID has been removed from the Student table, because the relationship between student and address is now captured in the joining table called Student Address.
  • The teacher’s address is also captured in the joining table Teacher Address, and the address code ID has been removed from the Teacher table. I couldn’t think of a better name for each of these tables.
  • The Address table still links to the Address Code table via address code ID.

This design will let you do a few things:

  • Store multiple addresses for each student or teacher
  • Store additional information about an address relationship to a teacher or student, such as an effective date (to track movements over time) or an address type (postal, residential) – see the sketch after this list
  • Determine which students or teachers live at the same address with certainty (it’s linked to the same record).
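
As a sketch of how the Student Address joining table could be written in SQL, including the optional columns mentioned above (names and data types are assumptions, and the student and address tables are assumed to exist):

CREATE TABLE student_address (
  student_id     INT NOT NULL,
  address_id     INT NOT NULL,
  effective_date DATE,                        -- optional: track movements over time
  address_type   VARCHAR(20),                 -- optional: 'POSTAL', 'RESIDENTIAL', etc.
  PRIMARY KEY (student_id, address_id),
  FOREIGN KEY (student_id) REFERENCES student (student_id),
  FOREIGN KEY (address_id) REFERENCES address (address_id)
);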

So, that’s how you can achieve fourth normal form on this database.

There are a few enhancements you can make to this design, but it depends on your business rules:

  • Combine the student and teacher tables into a person table, as they are both effectively people, but teachers teach a class and students take a class. This table could then link to subjects and define that relationship as “teaches” or “is enrolled in”, to cater for this.
  • Relate a course to a subject, so you can see which subjects are in a course
  • Split the address into separate fields for unit number, street number, address line 1, address line 2, and so on.
  • Split the student name and teacher name into first and last names to help with displaying data and sorting.

These changes could improve the design, but I haven’t detailed them in any of these steps as they aren’t required for fourth normal form.

I hope this explanation has helped you understand what the normal forms are and what normalization in DBMS is. Do you have any questions on this process? Share them in the section below.

Lastly, if you enjoy the information and career advice I’ve been providing,  sign up to my newsletter below  to stay up-to-date on my articles. You’ll also receive a fantastic bonus. Thanks!

68 thoughts on “Database Normalization: A Step-By-Step Guide With Examples”


Thanks so much for explaining this concept Ben. To me as a learner, this is the best way to grab this concept.

You broke this down to the last atom. Keep up the good work!


Thanks James, glad you found it helpful!


Absolutely, the best and easiest explanation I have seen. Very helpful.


Greetings Ben, good post. Could you please check the symbols you used in the relationship between the Student and Course tables, given that you commented in the lines above that “the course ID goes into the student table”. Let me know whether the relationship should be: Course -< Student

Hi Ronald, Sure, I’ll check this and update the post.

Greetings Ben, the symbols for Courses and Student as you drew them are “1 to n” – check whether it should be “-<”. Good post.

Thanks Ronald!


thank you for sharing these things with us, damn I really love it. You guys are really awesome


I don’t understand the second normal form – linking student ID to course ID isn’t clear.


Excellent working example

Glad you found it useful!


This is a nice compilation and a good illustration of how to carry out table normalization. I wish you can provide script samples that represent the ERD you presented here. It will be so much helpful.

Hi DJ, Glad you like the article. Good idea – I’ll create some sample scripts and add them to the post. Thanks, Ben


Good job! This is a great way of explaining this topic. You made it look easy to understand. But one question I have for you: what is a good real-life scenario where the fourth normal form is used?

Hi Nati, Thanks, I’m glad you like the article. I’m not sure what would be a realistic example of using fourth normal form. I think many 3NF databases can be turned into 4NF, depending on how you want to record values (either a value in a table or an ID to another table). I haven’t used it that often.


Dear Sir, can we call the Fourth Normal Form a BCNF (Boyce-Codd Normal Form), or not?

Hi Subzar, I think Fourth Normal Form is slightly different to BCNF. I haven’t used either method but I know they are a little different. I think 4NF comes after BCNF.


Hey Ben, Your notes are too helpful. Will recommend my other friends for sure. Thanks a lot :)


You’ve done a truly superb job on the majority of this. You’ve used far more details than most people who provide examples do.

Particularly good is the splitting of addresses into a separate table. That is just not done nearly enough.

However, two tables have critical problems: (1) The Student table should not contain Course ID (nor fees paid); there should be a separate Student_Course intersection table. (2) Similarly, the Course table should not have a Teacher ID, likewise a separate Course_Teacher intersection table should be created.

Reasoning (condensed): (1) A student must be able to enroll first, without yet specifying a course. The student’s name and other enrollment data are not related to any specific course. (2) A course has data of its own not related to the teacher: # of credit/lab hours, cost(s), first term offered, last term offered, etc. Should the only teacher currently teaching a course withdraw from it, the course data should not be lost. Courses have prerequisites, sometimes complex ones, that have nothing to do with who is teaching the course.

Thanks for the feedback Scott and glad you like the post! I understand your reasoning, and yes ideally there would be those two intersection tables. These scenarios are things that we, as developers, would clarify during the design. The example I went with was a simple one where a student must have a course and a course must have a teacher, but in the real world it would cater to your scenarios. Thanks again, Ben


Simple & powerfully conveyed. Thank you.

Glad you found it helpful!


As a grade 11 teacher, I am well aware of the complexities students face when teaching/explaining the concepts within this topic. This article is brilliant and breaks down such a confusing topic quite nicely with examples along the way. My students will definitely benefit from this, thank you so much.


I am just starting out in SQL and this was quite helpful. Thanks.


What tremendous insights about normalisation you have explained, and it’s given me a lot of confidence. Thank you so much Ben, my deepest gratitude for sharing knowledge. You truly proved sharing is caring. With regards, Chandu, India


Came across your material while searching for normalisation material, as I want to use my time to improve my Club Membership records. Having gained a ‘Database qualification’ some 20 to 30 years ago, I think, I needed to refresh my memory! Anyway – some queries:

1. Shouldn’t the Student Name be broken down or decomposed into StudentForename and StudentSurname, since doesn’t 1NF require this? 2. Shouldn’t Teacher Name be converted as per 121 above as well?

This would enable records to be retrieved using Surname or Forename

Hi Tim, yes that’s a good point and it would be better to break it into two names for that purpose.


Wow, this is the very first website i finally thoroughly understood normalization. Thanks a lot.

Hello again

I have thought through the data I need to record, I think – time will tell I suspect. Anyway we run up to 18 competitions, played by teams of 1, 2, 3 or 4 members, thus I think that there may be a many to many relationship between the Member and Competition tables, as in many Member records may be related to many Competition records [potential for a Pairs or Triples or Fours team to win all the Competitions they enter], am I correct?

Also should I design the Competition table as CompID, Year, Comp1, Comp 2, Comp3, each Header having the name of the Competition, then I presume a table that links the two, along the lines of:

CompID, Year, MemberID OR MemberID, Year, CompID

Regards, Tim

Hello again, thinking further, I presume that I could create 18 tables, one per Competition to capture the annual results.

Again though, presume my single Comp table (see above) shouldn’t have a column per comp, as this is a repeating group

So do I create a ‘joining table’, that records Year and Comp, another that records Member and Comp and one that records Member and Year.

I may be over thinking it, but as No Comp this year, I think I need to be able to record this, I think

Hi Tim, good question. You could create one table per competition, but you would have 18 tables which have the same structure and different data, which is quite repetitive. I would suggest having a single table for comp, and a column that indicates which competition it is for. It’s OK for a table to have a value that is repeated like this, as it identifies which competition something refers to. Yes you could create joining tables for Year and Comp (if there is a many to many relationship between them) and Member and Comp as well. What’s the relationship between Member and Year?

Thanks for reply, however, would it be easier to say create a Comp table of 18 records, a Comp Type table which has 2 records, that is Annual and One Day, another table for Comp Year, which will record the annual competition results based on:

CompYear – CompTypeID – CompID – MemID

thus each year would create 32 records [10 x 1 + 4 x 2 + 2 x 3 + 2 x 4], e.g.

2019 – 1 – 4 – 1
2019 – 1 – 4 – 2
2019 – 1 – 4 – 3
2019 – 1 – 4 – 4
2019 – 1 – 3 – 1
2019 – 1 – 3 – 2
2019 – 1 – 3 – 3
2019 – 2 – 1 – 1

so members 1 to 4 won the 2019 one day comp Bickford Fours; and members 1 to 3 won the 2019 one day comp Arthur Johnson Triples; and member 1 won the 2019 annual singles championship

Would this work and have I had a light bulb moment?

Hi Tim, yes I think that would work! Storing the comp data separately from the comp type (and so on) will ensure the data is not repeated and is only stored in one place. Good luck!


Ben, Thank you so much! I was only able to grasp the concept of normalization in one hour or so because how you simplified the concepts through a simple running example.


Thanks a lot sir Daniel, I have really understood this, you are a great teacher


Firstly: thank you for your website. Secondly: I still have a problem with understanding the second normal form. For example you have written: “student name: Yes, this is dependent on the primary key. A different student ID means a different student name.” So in your design am I not allowed to have 2 students with the same name? What will happen when this not so uncommon situation occurs?

Glad you like it Wojtek! Yes, in this design you can have two students with the same name. They would have different IDs and different rows in the table, as two people with the same name is common like you mentioned.


This is the best explanation on why and how to normalize Tables… excellent work, maybe the best explanation out there….


Hi, Thanks for the post. That is exactly what I was looking for. But I have a question: how would I insert into student address and teacher address? Best regards


This is amazing, very well explained. Simple example made it easy to understand. Thank you so much!


Thanks sir, this is very helpful


Sorry but your example is not in 1NF. 1NF dictates that you cannot have a repeating group. The subjects are a repeating group. Not saying you got the design wrong at the end just that you failed to remove the repeating group in 1NF. When you went to 2NF you made rules for the repeating group multiple times and took care of it but it should have been taken care of in 1NF.


It’s so helpful and easy to understand


This is great! You make it easy and simple to learn where I can understand.


If you’re going to break address into atoms (unit, street, city, zip, etc.), would you also break down telephone numbers (country code, area code, exchange, unit number) into separate fields in another table? How elemental do you go?


Clearly explained :)


Wow, this is really very useful – it has just added to my knowledge. Thank you very much for this article.


Great post! Just wanted to check – for the ERD diagram in the 2NF example, wouldn’t the relationship between Course and Student be the other way around? A course can have many students, but a student can only be enrolled in one course. So the crows feet symbol should be on the Student table.


Hi Ben, I am hoping you can respond to the question above. I noticed the same and I was hoping to find a correction or explanation in the notes. I second everyone else’s compliments on a great post, too.


From the table Student (student ID, course ID, student name, fees paid, date of birth, street address, address code ID) – why did you put a “street address” column? Remember that “address code ID” is available to identify the address of the student. I don’t understand this part.


Hi Ben! I’ve got a question for you. I didn’t understand this part:

So that means in first normal form duplicate rows are allowed?


I stumbled on this article explaining so succinctly the normalization of databases. I must admit that it helped me understand the concept using an example far better than just theory.

Kudos to you @Ben – You are really a DB star.

Will recommend it to my friends. I am in the process of putting this knowledge into an Employee Management System.


Mmnh ! Thank you sir, it’s very awesome tutor. 👍


The model breaks as soon as you add a 2nd course for a student. CourseID should not be in the student table. You need a separate table that ties studentID and courseID together

That’s a good point. Yes you would need a separate table for that scenario.


The ERDs for the 2nd / 3rd / 4th normal form show the crows foot at the COURSE table, but the STUDENT table receives the FK course_id. The crows foot represents the many end of the relationship, but there is no representation in the course table of any student data point.

Imho, this is where it is broken in addition to what Dan has pointed to.


Loved it – easy explanation. Thank you – love from India


Well done Ben, this is the best normalization tutorial I have ever seen. Thanks so much. keep up the good work.


Thanks for taking the time to create this. This is one of the best breakdowns that I’ve read regarding DB Normalization. However, I can’t really wrap my mind around 4NF as I would like to know how to use it in a real life scenario. Thanks again as this was extremely helpful to me!


Thank you. Very clear explanation


If the statement is true that each student can only have 1 course, then the relationship is not shown correctly in the ERD.


Hi, I don’t quite understand how your example satisfies third normal form. It seems that street address could depend on address ID, which depends upon the primary key. In other words, if address ID were to change, I would assume street address would also have to change. So doesn’t that create transitive dependency?

ii) I also don’t understand why you made two address tables, couldn’t they be combined into one?


Hi Ben, thanks for the superb work.

What I always wonder is how to determine functional dependency. In this example, after 1NF, we know that course is not functionally dependent on student ID because it is a day-to-day example and it makes sense.

However, while designing a DB for a client whose domain is unknown to us, how do we effectively work out the functional dependencies from the client? What questions should we ask as laymen?


Thank you very much


This is my last stop to understanding data normalisation after two years of searching and searching. Thousands of tons of thanks Ben.


Thanks so much sir, I have really enjoyed the class and gained a lot


The Agile Data (AD) Method


Introduction to Data Normalization: Database Design 101

  • Why Data Normalization?
  • First Normal Form (1NF)
  • Second Normal Form (2NF)
  • Third Normal Form (3NF)
  • Denormalization
  • Acknowledgements

1. Why Data Normalization?

There are two primary advantages of having a highly normalized data schema:

  • Increased consistency. Information is stored in one place and one place only, reducing the possibility of inconsistent data.
  • Easier object-to-data mapping. Highly-normalized data schemas in general are closer conceptually to object-oriented schemas because the object-oriented goals of promoting high cohesion and loose coupling between classes result in similar solutions (at least from a data point of view).

You typically want to have highly normalized operational data stores (ODSs) and data warehouses (DWs).

The primary disadvantage of normalization is slower reporting performance. You will want to have a denormalized schema to support reporting, particularly in data marts.

2. The Steps of Data Normalization

Table 1 summarizes the three most common forms of normalization (first normal form (1NF), second normal form (2NF), and third normal form (3NF)), describing how to put entity types into a series of increasing levels of normalization. Higher levels of data normalization are beyond the scope of this article. With respect to terminology, a data schema is considered to be at the level of normalization of its least normalized entity type. For example, if all of your entity types are at second normal form (2NF) or higher then we say that your data schema is at 2NF.

Table 1. Data Normalization Rules.

2.1. Data Normalization: First Normal Form (1NF)

Let's consider an example. An entity type is in first normal form (1NF) when it contains no repeating groups of data. For example, in Figure 1 you see that there are several repeating attributes in the Order0NF table – the ordered item information repeats nine times and the contact information is repeated twice, once for shipping information and once for billing information. Although this initial version of orders could work, what happens when an order has more than nine order items? Do you create additional order records for them? What about the vast majority of orders that only have one or two items? Do we really want to waste all that storage space in the database for the empty fields? Likely not. Furthermore, do you want to write the code required to process the nine copies of item information, even if it is only to marshal it back and forth between the appropriate number of objects? Once again, likely not.

Figure 1. An Initial Data Schema for Order (UML Notation).

Figure 2 presents a reworked data schema where the order schema is put in first normal form. The introduction of the OrderItem1NF table enables us to have as many, or as few, order items associated with an order, increasing the flexibility of our schema while reducing storage requirements for small orders (the majority of our business). The ContactInformation1NF table offers a similar benefit: when an order is shipped and billed to the same person (once again the majority of cases) we could use the same contact information record in the database to reduce data redundancy. OrderPayment1NF was introduced to enable customers to make several payments against an order – Order0NF could accept up to two payments, the type being something like "MC" and the description "MasterCard Payment", although with the new approach far more than two payments could be supported, potentially one per payment type. Multiple payments are accepted only when the total of an order is large enough that a customer must pay via more than one approach, perhaps paying some by check and some by credit card.

Figure 2. An Order Data Schema in 1NF (UML Notation).

2.2. Data Normalization: Second Normal Form (2NF)

Although the solution presented in Figure 2 is improved over that of Figure 1, it can be normalized further. Figure 3 presents the data schema of Figure 2 in second normal form (2NF). An entity type is in second normal form (2NF) when it is in 1NF and when every non-key attribute, any attribute that is not part of the primary key, is fully dependent on the primary key. This was definitely not the case with the OrderItem1NF table, therefore we need to introduce the new table Item2NF. The problem with OrderItem1NF is that item information, such as the name and price of an item, does not depend upon an order for that item. For example, if Hal Jordan orders three widgets and Oliver Queen orders five widgets, the facts that the item is called a "widget" and that the unit price is $19.95 are constant. This information depends on the concept of an item, not the concept of an order for an item, and therefore should not be stored in the order items table – therefore the Item2NF table was introduced. OrderItem2NF retained the TotalPriceExtended column, a calculated value that is the number of items ordered multiplied by the price of the item. The value of the SubtotalBeforeTax column within the Order2NF table is the total of the values of the total price extended for each of its order items.

Figure 3. An Order in 2NF (UML Notation).

2.3. Data Normalization: Third Normal Form (3NF)

An entity type is in third normal form (3NF) when it is in 2NF and when all of its attributes are directly dependent on the primary key. A better way to word this rule might be that the attributes of an entity type must depend on all portions of the primary key. In this case there is a problem with the OrderPayment2NF table: the payment type description (such as "Mastercard" or "Check") depends only on the payment type, not on the combination of the order id and the payment type. To resolve this problem the PaymentType3NF table was introduced in Figure 4, containing a description of the payment type as well as a unique identifier for each payment type.

Figure 4. An Order in 3NF (UML Notation).

2.4. Data Normalization: Beyond 3NF

The data schema of Figure 4 can still be improved upon, at least from the point of view of data redundancy, by removing attributes that can be calculated/derived from other ones. In this case we could remove the SubtotalBeforeTax column within the Order3NF table and the TotalPriceExtended column of OrderItem3NF, as you see in Figure 5.

Figure 5. An Order Without Calculated Values (UML Notation).

3. Denormalization

From a purist point of view you want to normalize your data structures as much as possible, but from a practical point of view you will find that you need to "back out" of some of your normalizations for performance reasons. This is called "denormalization". For example, with the data schema of Figure 1 all the data for a single order is stored in one row (assuming orders of up to nine order items), making it very easy to access. With the data schema of Figure 1 you could quickly determine the total amount of an order by reading the single row from the Order0NF table. To do so with the data schema of Figure 5 you would need to read data from a row in the Order table, data from all the rows from the OrderItem table for that order and data from the corresponding rows in the Item table for each order item. For this query, the data schema of Figure 1 very likely provides better performance.

4. Acknowledgements

I'd like to thank Jon Heggland and Nebojsa Trninic for their thoughtful review and feedback. They found several bugs which had gotten by both myself and my tech reviewers.


A Comprehensive Guide to Database Normalization with Examples

Posted on September 15, 2023

Introduction

Database normalization is a crucial concept in the world of database management. It is a process that optimizes database structure by reducing data redundancy and improving data integrity. Normalization is a set of rules and guidelines that help organize data efficiently and prevent common data anomalies like update anomalies, insertion anomalies, and deletion anomalies.

In this article, we will delve into the fundamentals of database normalization, the various normal forms, and provide practical examples to illustrate each level of normalization.

Why Normalize a Database?

Before we dive into the details of database normalization, it’s essential to understand why it’s necessary. Normalization offers several advantages:

  • Data Integrity: Normalization helps maintain data accuracy and consistency by reducing redundancy. When data is stored in a non-repetitive manner, it is less prone to errors.
  • Efficient Storage: Normalized databases tend to occupy less storage space as duplicate data is minimized. This reduces the overall cost of storage.
  • Query Optimization: Queries become more efficient in normalized databases because they need to access smaller, well-structured tables instead of large, denormalized ones.
  • Flexibility: Normalized databases are more flexible when it comes to accommodating changes in data requirements or business rules.

Levels of Normalization


  • First Normal Form (1NF): Ensures that each column in a table contains atomic, indivisible values. There should be no repeating groups, and each column should have a unique name.
  • Second Normal Form (2NF): Building on 1NF, 2NF eliminates partial dependencies. A table is in 2NF if it’s in 1NF and all non-key attributes are functionally dependent on the entire primary key.
  • Third Normal Form (3NF): Building on 2NF, 3NF eliminates transitive dependencies. A table is in 3NF if it’s in 2NF and all non-key attributes are functionally dependent on the primary key, but not on other non-key attributes.
  • Boyce-Codd Normal Form (BCNF): A stricter version of 3NF, BCNF ensures that the determinant of every non-trivial functional dependency is a superkey. This means that no partial dependencies or transitive dependencies are allowed.
  • Fourth Normal Form (4NF): 4NF deals with multi-valued dependencies, where an attribute depends on another attribute but is not a function of the primary key.
  • Fifth Normal Form (5NF) or Project-Join Normal Form (PJNF): These forms deal with cases where a table is in 4NF, but there are join dependencies that can be further optimized.

Now, let’s illustrate these normal forms with examples:

First Normal Form (1NF)

Consider an unnormalized table that stores customer orders:

This table violates 1NF because the Products column contains a list of items. To bring it to 1NF, we split the products into separate rows:

Now, each cell contains an atomic value, and the table is in 1NF.
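
Since the original tables are not shown here, the following SQL sketch assumes a simple shape for the order data (column names are illustrative only):

-- Before 1NF (assumed): one row per order, with all products crammed into one column
--   orders(order_id, customer_name, products)   e.g. products = 'Pen, Notebook, Eraser'

-- After 1NF: one row per order/product combination, so every cell holds a single value
CREATE TABLE order_lines (
  order_id      INT,
  customer_name VARCHAR(100),
  product       VARCHAR(100),
  PRIMARY KEY (order_id, product)
);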

Second Normal Form (2NF)

Consider a table that stores information about students and their courses:

This table violates 2NF because the Instructor attribute depends only on CourseID, which is just part of the composite primary key (StudentID, CourseID). To achieve 2NF, we split the table into two separate tables:

Students Table:

Courses Table:

Now, the Instructor attribute depends only on the CourseID , and the table is in 2NF.
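
A rough SQL sketch of the split, assuming the attribute names mentioned above (the original enrollment table is taken to have a composite key of StudentID and CourseID):

-- Assumed original: enrollment(student_id, course_id, student_name, instructor)
-- where instructor depends only on course_id (a partial dependency).

CREATE TABLE students (
  student_id   INT PRIMARY KEY,
  student_name VARCHAR(100)
);

CREATE TABLE courses (
  course_id  INT PRIMARY KEY,
  instructor VARCHAR(100)   -- now depends only on the course
);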

Third Normal Form (3NF)

Consider a table that stores information about employees and their projects:

This table violates 3NF because the Manager attribute depends on the ProjectID, a non-key attribute, rather than directly on the primary key – a transitive dependency. To bring it to 3NF, we split the table into separate tables:

Employees Table:

Projects Table:

EmployeeProjects Table:

Now, the Manager attribute depends on the ProjectID , and the table is in 3NF.
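
Sketched in SQL (again with assumed column names and types), the 3NF decomposition could be declared like this, with the manager stored once per project:

CREATE TABLE Employees (
    EmployeeID   INT PRIMARY KEY,
    EmployeeName VARCHAR(100)
);

CREATE TABLE Projects (
    ProjectID   INT PRIMARY KEY,
    ProjectName VARCHAR(100),
    Manager     VARCHAR(100)   -- a fact about the project, not the employee
);

CREATE TABLE EmployeeProjects (
    EmployeeID INT NOT NULL REFERENCES Employees (EmployeeID),
    ProjectID  INT NOT NULL REFERENCES Projects (ProjectID),
    PRIMARY KEY (EmployeeID, ProjectID)
);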

Boyce-Codd Normal Form (BCNF)

BCNF is a stricter version of 3NF. To illustrate BCNF, consider a table that stores information about professors and their research areas:

This table violates BCNF because there is a non-trivial functional dependency from ResearchArea to OfficeNumber (the office number depends on the research area), and ResearchArea is not a superkey. To achieve BCNF, we split the table into three separate tables:

Professors Table:

ResearchAreas Table:

ProfessorResearch Table:

Now, the design is in BCNF because the determinant of every remaining non-trivial functional dependency is a key.
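
One possible SQL sketch of this BCNF decomposition (column names and types assumed) is shown below; the office number now lives only in the table whose key determines it:

CREATE TABLE Professors (
    ProfessorID   INT PRIMARY KEY,
    ProfessorName VARCHAR(100)
);

CREATE TABLE ResearchAreas (
    ResearchArea VARCHAR(100) PRIMARY KEY,
    OfficeNumber VARCHAR(20)   -- determined solely by the research area
);

CREATE TABLE ProfessorResearch (
    ProfessorID  INT          NOT NULL REFERENCES Professors (ProfessorID),
    ResearchArea VARCHAR(100) NOT NULL REFERENCES ResearchAreas (ResearchArea),
    PRIMARY KEY (ProfessorID, ResearchArea)
);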

Fourth Normal Form (4NF)

4NF deals with multi-valued dependencies. Consider a table that stores information about books and their authors:

This table violates 4NF because there is a multi-valued dependency between BookID and Authors . To achieve 4NF, we split the table into three separate tables:

Books Table:

Authors Table:

BookAuthors Table:

Now, each table is in 4NF, and multi-valued dependencies are removed.
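
As a sketch (with assumed names and types), the three 4NF tables could be declared like this; each book/author pairing is recorded exactly once in the cross-reference table:

CREATE TABLE Books (
    BookID INT PRIMARY KEY,
    Title  VARCHAR(200)
);

CREATE TABLE Authors (
    AuthorID   INT PRIMARY KEY,
    AuthorName VARCHAR(100)
);

CREATE TABLE BookAuthors (
    BookID   INT NOT NULL REFERENCES Books (BookID),
    AuthorID INT NOT NULL REFERENCES Authors (AuthorID),
    PRIMARY KEY (BookID, AuthorID)
);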

Fifth Normal Form (5NF) or Project-Join Normal Form (PJNF)

5NF or PJNF deals with join dependencies, which are beyond the scope of this introductory article. Achieving 5NF typically involves further decomposition and is often necessary for complex databases.

Database normalization is a critical process in database design, aimed at optimizing data storage, improving data integrity, and reducing data anomalies. By organizing data into normalized tables, you can enhance the efficiency and maintainability of your database system.

Remember that achieving higher normal forms, such as BCNF and 4NF, may not always be necessary for all databases. The level of normalization depends on the specific requirements of your application and the trade-offs between data integrity and performance.

When designing a database, it’s essential to strike a balance between normalization and practicality. In many cases, achieving 3NF is sufficient to ensure data integrity while maintaining good query performance.

Understanding the principles of normalization and practicing them with real-world examples is crucial for database administrators and developers to create efficient and robust database systems.

DEV Community

Mohammad Alhraki

Posted on Mar 15

Understanding Database Normalization and the Importance of 1NF

What Is Database Normalization?

Database normalization is a fundamental principle in database design. Its purpose is to organize data efficiently, reduce redundancy, and maintain data integrity. By adhering to normalization rules, we create well-structured databases that facilitate data processing and querying.

Why Choose Normalization?

The main reasons for employing database normalization include:

  • Avoiding Complexity : Breaking down data into smaller, related tables simplifies management.
  • Eliminating Redundancy : Ensuring each piece of information appears only once.
  • Consistent Organization : Structuring data according to specific rules.

Types of Normalization

Let's focus on the first three levels of normalization:

1. First Normal Form (1NF)

1NF establishes the foundation for more complex normalization strategies. Here are its key characteristics:

  • Each cell in a table holds only one value (this is called atomicity ).
  • A primary key uniquely identifies each row.
  • No duplicated rows or columns.
  • Each column contains only one value for each row.

2. Second Normal Form (2NF)

2NF builds upon 1NF:

  • All non-key attributes are fully functionally dependent on the entire primary key.
  • Helps eliminate partial dependencies.

3. Third Normal Form (3NF)

3NF further refines the design:

  • All non-key attributes are directly dependent on the primary key.
  • Helps eliminate transitive dependencies.

Examples of 1NF, 2NF, and 3NF

1NF Example

Consider an employee database with the following table:

In this 1NF-compliant table:

  • Each cell holds a single value.
  • The primary key is the Employee ID .
  • There are no duplicate rows or columns.

2NF Example

Suppose we have another table for employee data:

In this 2NF-compliant table:

3NF Example

Continuing with the employee_data table, let's further normalize it:

In this 3NF-compliant structure:

  • We've separated the Department information into a separate table.
  • Transitive dependencies are eliminated.

Why We Love 1NF

  • Data Integrity : 1NF ensures that each piece of data is atomic and indivisible. This prevents anomalies during data insertion, update, or deletion.
  • Efficient Queries : With 1NF, querying data becomes straightforward. There's no need to deal with complex nested structures or repeating groups.
  • Simplicity : 1NF simplifies data management. Each value has a clear place, making the database easier to understand and maintain.
  • Consistency : By adhering to 1NF, we achieve consistent data organization across tables, leading to better overall system reliability.

In summary, 1NF sets the groundwork for a well-organized, efficient, and reliable database. It's the first step toward creating a robust data structure.

In conclusion

Database normalization is not just a theoretical concept; it's a practical approach to designing robust databases that stand the test of time. By implementing the principles of 1NF, 2NF, and 3NF, we lay the groundwork for databases that are logical, scalable, and efficient. These normalization forms help us avoid data anomalies, streamline database operations, and ensure that our data remains consistent and reliable.

As we advance in the digital age, the importance of well-structured data cannot be overstated. Normalization is a key tool in our arsenal, enabling us to handle the ever-increasing volumes of data with grace and precision. Whether you're a database designer, developer, or administrator, embracing normalization is a step towards creating data systems that are not only functional but also future-proof.

In essence, normalization is the cornerstone of database excellence, and its disciplined application is a hallmark of quality in data management. It's a journey worth taking for anyone vested in the integrity and performance of their data systems.

Normal Forms and Normalization

Marcelo Arenas

A normal form defines a condition over a set of data dependencies, or semantic constraints, that has been specified as part of a database schema. This condition is used to check whether the design of the database has some desirable properties (for example, the database does not store redundant information), and if this is not the case, then it can also be used to convert the poorly designed database into an equivalent well-designed one.

Historical Background

Information is one of the most – if not the most – valuable assets of a company. Therefore, organizations need tools to allow them to structure, query and analyze their data, and, in particular, they need tools providing simple and fast access to their information.

During the last 30 years, relational databases have become the most popular computer application for storing and analyzing information. The simplicity and elegance of the relational model, where information is stored just as tables, has largely contributed to...



Normalization in DBMS: 1NF, 2NF, 3NF, and BCNF [Examples]

When developing the schema of a relational database, one of the most important aspects to be taken into account is to ensure that the duplication of data is minimized. We do this by carrying out database normalization, an important part of the database schema design process.

Here, we explain normalization in DBMS, explaining 1NF, 2NF, 3NF, and BCNF with explanations. First, let’s take a look at what normalization is and why it is important.

What is Normalization in DBMS?

Database normalization is a technique that helps design the schema of the database in an optimal way. The core idea of database normalization is to divide the tables into smaller subtables and store pointers to data rather than replicating it. 

Why Do We Carry out Database Normalization?

There are two primary reasons why database normalization is used. First, it helps reduce the amount of storage needed to store the data. Second, it prevents data conflicts that may creep in because of the existence of multiple copies of the same data.

If a database isn’t normalized, then it can result in less efficient and generally slower systems, and potentially even inaccurate data. It may also lead to excessive disk I/O usage and bad performance. 

What is a Key?

You should also be aware of what a key is. A key is an attribute (or set of attributes) that helps identify a row in a table. There are several different types, some of which you’ll see used in the explanation of the various normalizations:

  • Candidate Key
  • Primary Key
  • Foreign Key
  • Alternate Key
  • Composite Key
Database Normalization Example

To understand (DBMS)normalization with example tables, let's assume that we are storing the details of courses and instructors in a university. Here is what a sample database could look like:

Here, the data basically stores the course code, course venue, instructor name, and instructor’s phone number. At first, this design seems fine. However, issues start to develop once we need to modify information. For instance, suppose Prof. George changed his mobile number. In that situation, we would have to make edits in 2 places.

What if someone just edited the mobile number against CS101, but forgot to edit it for CS154? This will lead to stale/wrong information in the database. This problem can be easily tackled by dividing our table into 2 simpler tables:

Table 1 (Instructor):

  • Instructor ID
  • Instructor Name
  • Instructor mobile number

Table 2 (Course):

  • Course code
  • Course venue

Now, our data will look like the following:

Basically, we store the instructors separately and in the course table, we do not store the entire data of the instructor. Rather, we store the ID of the instructor. Now, if someone wants to know the mobile number of the instructor, they can simply look up the instructor table. Also, if we were to change the mobile number of Prof. George, it can be done in exactly one place. This avoids the stale/wrong data problem.

Further, if you observe, the mobile number now need not be stored 2 times. We have stored it in just 1 place. This also saves storage. This may not be obvious in the above simple example. However, think about the case when there are hundreds of courses and instructors and for each instructor, we have to store not just the mobile number, but also other details like office address, email address, specialization, availability, etc. In such a situation, replicating so much data will increase the storage requirement unnecessarily.
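
Using the attributes listed above, a minimal SQL sketch of the two tables might look like this (the data types and key names are assumptions); updating Prof. George’s number then touches exactly one row:

CREATE TABLE Instructor (
    InstructorID     INT PRIMARY KEY,
    InstructorName   VARCHAR(100) NOT NULL,
    InstructorMobile VARCHAR(20)
);

CREATE TABLE Course (
    CourseCode   VARCHAR(10) PRIMARY KEY,
    CourseVenue  VARCHAR(100),
    InstructorID INT REFERENCES Instructor (InstructorID)   -- store the ID, not the instructor's details
);

-- A single-row update instead of one edit per course
-- (the number below is a placeholder for illustration):
UPDATE Instructor
SET InstructorMobile = '9999999999'
WHERE InstructorName = 'George';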


Types of DBMS Normalization

There are various normal forms in DBMS. Each normal form has an importance that helps optimize the database to save storage and reduce redundancies. We explain normalization in DBMS with examples below.

First Normal Form (1NF)

The first normal form simply says that each cell of a table should contain exactly one value. Assume we are storing the courses that a particular instructor takes, we can store it like this:

Here, the issue is that in the first row, we are storing 2 courses against Prof. George. This isn’t the optimal way, since that’s not how SQL databases are designed to be used. A better method would be to store the courses separately. For instance:

This way, if we want to edit some information related to CS101, we do not have to touch the data corresponding to CS154. Also, observe that each row stores unique information. There is no repetition. This is the First Normal Form.

Data redundancy is higher in 1NF because the same values (such as the instructor’s details) may be repeated across multiple rows. 1NF is not so focused on eliminating redundancy as it is on eliminating repeating groups.

Second Normal Form (2NF)

For a table to be in second normal form, the following 2 conditions must be met:

  • The table should be in the first normal form.
  • Every non-key attribute should depend on the entire primary key, not just part of it. (This article uses a simplified rule of thumb: give the table a primary key of exactly 1 column.)

The first point is straightforward, since we just studied 1NF. Let us understand the second point. A primary key is a set of columns that uniquely identifies a row; no 2 rows can share the same primary key value. When the primary key is composite (made up of several columns), 2NF requires that no non-key column depends on only part of it, which is easy to guarantee with a single-column key.

In this table, the course code is unique so that becomes our primary key. Let us take another example of storing student enrollment in various courses. Each student may enroll in multiple courses. Similarly, each course may have multiple enrollments. A sample table may look like this (student name and course code):

Here, the first column is the student name and the second column is the course taken by the student. 

Clearly, the student name column isn’t unique as we can see that there are 2 entries corresponding to the name ‘Rahul’ in row 1 and row 3. Similarly, the course code column is not unique as we can see that there are 2 entries corresponding to course code CS101 in row 2 and row 4. 

However, the tuple (student name, course code) is unique since a student cannot enroll in the same course more than once. So, these 2 columns when combined form the primary key for the database.

By the simplified rule above, our enrollment table isn’t in the second normal form, because its only key is the 2-column combination (student name, course code). To take it from 1NF to 2NF, we can break it into 2 tables:

Here the second column is unique and it indicates the enrollment number for the student. Clearly, the enrollment number is unique. Now, we can attach each of these enrollment numbers with course codes.

These 2 tables together provide us with the exact same information as our original table.
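
A sketch of this split in SQL might look like the following (the table names, the enrollment-number column, and the data types are assumptions based on the description above):

CREATE TABLE StudentEnrollment (
    EnrollmentNumber INT PRIMARY KEY,       -- the unique, single-column key
    StudentName      VARCHAR(100) NOT NULL
);

CREATE TABLE EnrollmentCourse (
    EnrollmentNumber INT PRIMARY KEY
        REFERENCES StudentEnrollment (EnrollmentNumber),
    CourseCode       VARCHAR(10) NOT NULL   -- the course attached to this enrollment
);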

Third Normal Form (3NF)

Before we delve into the details of third normal form, let us understand the concept of a functional dependency on a table.

Column A is said to be functionally dependent on column B if the value of B determines the value of A; in other words, changing the value of B may require a change in the value of A. As an example, consider the following table:

Here, the department column is dependent on the professor name column. This is because if in a particular row, we change the name of the professor, we will also have to change the department value. As an example, suppose MA214 is now taken by Prof. Ronald who happens to be from the mathematics department, the table will look like this:

Here, when we changed the name of the professor, we also had to change the department column. This is not desirable since someone who is updating the database may remember to change the name of the professor, but may forget updating the department value. This can cause inconsistency in the database.

Third normal form avoids this by breaking this into separate tables:

Here, the third column is the ID of the professor who’s taking the course.

Here, in the above table, we store the details of the professor against his/her ID. This way, whenever we want to reference the professor somewhere, we don’t have to put the other details of the professor in that table again. We can simply use the ID.
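
In SQL, a sketch of this 3NF arrangement could be as follows (column names and types assumed); the course row carries only the professor’s ID, while the professor’s own details, such as the department, live in one place:

CREATE TABLE Professor (
    ProfessorID   INT PRIMARY KEY,
    ProfessorName VARCHAR(100) NOT NULL,
    Department    VARCHAR(100)             -- stored once, with the professor
);

CREATE TABLE Course (
    CourseCode  VARCHAR(10) PRIMARY KEY,
    ProfessorID INT REFERENCES Professor (ProfessorID)   -- a reference, not a copy of the details
);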

Therefore, in the third normal form, the following conditions are required:

  • The table should be in the second normal form.
  • There should not be any transitive dependency: no non-key column should depend on another non-key column.

Boyce-Codd Normal Form (BCNF)

The Boyce-Codd Normal Form is a stronger generalization of the third normal form. A table is in Boyce-Codd Normal Form if and only if at least one of the following conditions is met for each functional dependency A → B:

  • A is a superkey
  • It is a trivial functional dependency.

Let us first understand what a superkey means. To understand BCNF in DBMS, consider the following BCNF example table:

Here, the first column (course code) is unique across various rows. So, it is a superkey. Consider the combination of columns (course code, professor name). It is also unique across various rows. So, it is also a superkey. A superkey is basically a set of columns such that the value of that set of columns is unique across various rows. That is, no 2 rows have the same set of values for those columns. Some of the superkeys for the table above are:

  • Course code
  • Course code, professor name
  • Course code, professor mobile number

A minimal superkey, one from which no column can be removed while still identifying each row uniquely, is called a candidate key. For instance, the first superkey above has just 1 column, while the second and third have 2 columns each. So, the first superkey (course code) is a candidate key.

Boyce-Codd Normal Form says that if there is a functional dependency A → B, then either A is a superkey or it is a trivial functional dependency. A trivial functional dependency means that all columns of B are contained in the columns of A. For instance, (course code, professor name) → (course code) is a trivial functional dependency because when we know the value of course code and professor name, we do know the value of course code and so, the dependency becomes trivial.

Let us understand what’s going on:

A is a superkey: this means that other columns should depend only on a superkey. Basically, if a set of columns (B) can be determined by knowing some other set of columns (A), then A should be a superkey. A superkey determines each row uniquely.

It is a trivial functional dependency: this means that any dependency whose determinant is not a superkey must be trivial. For instance, we saw how the professor’s department was dependent on the professor’s name. This may create integrity issues, since someone may edit the professor’s name without changing the department, which may lead to an inconsistent database.

Another example would be a company whose employees can work in more than one department. The corresponding table can be decomposed so that the determinants of its functional dependencies, such as the employee ID and the department ID, become keys of their own tables.

Fourth normal form

A table is said to be in fourth normal form if it does not contain two or more independent multi-valued facts about the entity it describes.

Fifth normal form

A table is in fifth normal form if:

  • It is in its fourth normal form.
  • It cannot be subdivided into any smaller tables without losing some form of information.
Normalization is Important for Database Systems

Normalization in DBMS is useful for designing the schema of a database such that there is no data replication which may possibly lead to inconsistencies. While designing the schema for applications, we should always think about how we can make use of these forms.

If you want to learn more about SQL , check out our post on the best SQL certifications . You can also read about SQL vs MySQL to learn about what the two are. To become a data engineer , you’ll need to learn about normalization and a lot more, so get started today.

Frequently Asked Questions

1. Does database normalization reduce the database size?

Yes, database normalization does reduce database size. Redundant data is removed, so the database disk storage use becomes smaller.

2. Which normal form can remove all the anomalies in DBMS?

5NF will remove all anomalies. However, generally, most 3NF tables will be free from anomalies.

3. Can database normalization reduce the number of tables?

Database normalization increases the number of tables. This is because we split tables into sub-tables in order to eliminate redundant data.

4. What is the Difference between BCNF and 3NF?

BCNF is a stricter extension of 3NF. The primary difference is that BCNF requires every determinant in the relation to be a candidate key, which removes certain dependencies (typically involving overlapping composite candidate keys) that 3NF still allows.


Entrepreneur, Coder, Speed-cuber, Blogger, fan of Air crash investigation! Aman Goel is a Computer Science Graduate from IIT Bombay. Fascinated by the world of technology he went on to build his own start-up - AllinCall Research and Solutions to build the next generation of Artificial Intelligence, Machine Learning and Natural Language Processing based solutions to power businesses.


Database Normalization

Introduction

What are the best choices when designing the schema for a relational database, and what is the rationale for deciding on one and against some other alternative? Given the amount of vendor-specific recommendations, it is all too easy to overlook basic relational database fundamentals. In this mini-series, I would like to discuss some general Best Practices that I have found to be particularly helpful. Nothing in it is specific to any one vendor’s product, and everything should therefore be applicable, regardless of which database implementation is being used.

In this article, I attempt to give an approachable introduction to the topic of database normalization and to the five Normal Forms.

Normalization

No discussion of relational database design is complete without a section on normalization. A normalized DB schema avoids certain anomalies when inserting, updating, or deleting data and therefore helps to keep the data in the database consistent. However, the absence of anomalies is only the tangible result of a deeper benefit of normalization: correct identification and modeling of entities. The insert, update, and delete anomalies referred to above are the consequences of the redundancy, which is introduced by improper or lacking separation between distinct entities. The normalization procedure is therefore not just a technical chore to be done out of principle, but can actively help to improve the understanding of the business domain.

Unfortunately, the treatment of normalization is often prohibitively formal, and suffers from a special, rather non-intuitive terminology. This is unfortunate, since the outcome of a normalization procedure often evokes the reaction that it all is nothing more than “Common Sense”. I will try to give explanations of expressions that the reader is likely to encounter in the literature as they come up in the discussion below.

Normalization is a process in which an initial DB design is transformed (or decomposed) into a different, but equivalent, design. The resulting schema is equivalent to the original one in the sense that no information is lost when going from one to the other. The normalization procedure consists of a sequence of “projections”, i.e. tables are split up “vertically”: some attributes are extracted from one table to form a new one. The decomposition is lossless only if we can restore the original table by joining its projections.

Through such non-loss decompositions it is possible to transform an original schema into a resulting one satisfying certain conditions, known as Normal Forms . The Normal Forms span a hierarchy, in such a way, that a schema in a higher normal form automatically fulfills all the criteria for all of the lower normal forms.

  • The First Normal Form (1NF) addresses the structure of an isolated table.
  • The Second (2NF), Third (3NF), and Boyce-Codd (BCNF) Normal Forms address one-to-one and one-to-many relationships.
  • The Fourth (4NF) and Fifth (5NF) Normal Forms deal with many-to-many relationships.

The Fifth Normal Form is the ultimate normal form with respect to projections and joins, i.e. it is guaranteed to be free of anomalies that can be eliminated by taking projections.

In the following discussion, any mention of keys refers to the conceptual keys formed from business data, not to any plainly technical surrogate keys, which might have been defined!

First Normal Form

A table is said to be in First Normal Form, if all entries in it are scalar-valued. Relational database tables are 1NF by construction, since vector-valued entries are forbidden. Vector-valued data (i.e. entries which have more than one value in each row) are referred to as repeating groups .

The following relation violates 1NF, since the SupplierID forms a repeating group (here and in the following, primary key fields are bold face):

Repeating groups indicate a one-to-many relationship — in other words a type of relationship that in relational databases is treated using foreign keys. Note that the problem of repeating groups cannot be solved by adding any number of fields to a record: even if the number of elements of the vector-valued data was fixed, finite, and predetermined, searching for a value in all these parallel fields would be prohibitively cumbersome.

To achieve 1NF, eliminate repeating groups by creating separate tables for each set of related data.
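
For example, a sketch of such a decomposition in SQL (table and column names are assumptions) replaces the repeating SupplierID group with one row per part/supplier combination:

CREATE TABLE Part (
    PartID   INT PRIMARY KEY,
    PartName VARCHAR(100)
);

CREATE TABLE PartSupplier (
    PartID     INT NOT NULL REFERENCES Part (PartID),
    SupplierID INT NOT NULL,
    PRIMARY KEY (PartID, SupplierID)   -- one scalar SupplierID per row
);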

Second and Third Normal Form

To demonstrate the typical anomalies that occur in tables that are only 1NF, consider the following example:

Note the following problems:

  • Insert: It is not possible to add a record for a customer who has never placed an order.
  • Update: To change the address for a customer, this change has to be repeated for all of the customer’s existing orders.
  • Delete: Deleting the last order for a customer loses all information about the customer.

Functional Dependency

Second and Third Normal Form address dependencies among attributes, specifically between key and non-key fields.

By definition, a key uniquely determines a record: knowing the key determines the values of all the other attributes in the table row, i.e. given a key, the values of all the other attributes in the row are fixed.

This kind of relationship can be formalized as follows: Let X and Y be attributes (or sets of attributes) of a given relation. Then Y is functionally dependent on X, if, whenever two records agree on their X-values, they must also agree on their Y-values. In this case, X is called the determinant and Y is called the dependent. Since for any X there must be a single Y, this relationship represents a single-valued functional dependency. If the set of attributes in the determinant is the smallest possible (in the sense that after dropping one or more of the attributes from X, the remaining set of attributes no longer uniquely determines Y), the dependency is called irreducible.

Note that functional dependency is a semantic relationship: It is the business logic of the problem domain, represented by the relation, which determines whether a certain X determines Y.

Second Normal Form

A table is 2NF if every non-key field is a fact about the entire key. Equivalently: A table is 2NF if it is 1NF and all non-key attributes are functionally dependent on the entire primary key (i.e. the dependency is irreducible).

Clearly, 2NF is only relevant when the key is composite, i.e. consists of several fields. The following example describes a table which is not 2NF, since the WarehouseAddress attribute depends only on WarehouseID but not on PartID :

To achieve 2NF, create separate tables for sets of values that apply to multiple records and relate these tables through foreign keys. The determinants of the initial table become the PKs of the resulting tables.
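
A possible SQL sketch of that 2NF decomposition (the extra columns and the data types are assumed) moves WarehouseAddress into a table keyed by WarehouseID alone:

CREATE TABLE Warehouse (
    WarehouseID      INT PRIMARY KEY,
    WarehouseAddress VARCHAR(200)   -- a fact about the entire key of this table
);

CREATE TABLE WarehousePart (
    WarehouseID INT NOT NULL REFERENCES Warehouse (WarehouseID),
    PartID      INT NOT NULL,
    Quantity    INT,                -- assumed non-key attribute of the pair
    PRIMARY KEY (WarehouseID, PartID)
);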

Third Normal Form

A relation is 3NF, if it is 2NF and none of its attributes is a fact about another non-key field. In other words, no non-key field functionally depends on any other non-key field. (Such indirect dependencies are known as transitive dependencies .)

The following example violates 3NF, since the Location is functionally dependent on the DepartmentID .

To achieve 3NF, eliminate fields that do not depend on the key from the original table and add them to the table whose primary key is their determinant.
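
Sketched in SQL (with assumed table names and surrounding columns), the 3NF fix places Location in a table whose primary key, DepartmentID, is its determinant:

CREATE TABLE Department (
    DepartmentID INT PRIMARY KEY,
    Location     VARCHAR(100)   -- depends on the department, not on any other non-key field
);

CREATE TABLE Employee (
    EmployeeID   INT PRIMARY KEY,
    DepartmentID INT REFERENCES Department (DepartmentID)
);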

To summarize the normalization procedure up to and including Third Normal Form: Every field in a record must depend on The Key (1NF), the Whole Key (2NF), and Nothing But The Key (3NF).

Boyce-Codd Normal Form

Boyce-Codd Normal Form is an extension of 3NF to the case that there are two or more candidate keys, which are composite and overlapping (i.e. have at least one field in common). If these conditions are not fulfilled, 3NF and BCNF are equivalent.

A table is BCNF, if and only if its only determinants are candidate keys.

In the following table, both {SupplierID, PartID} as well as {SupplierName, PartID} are candidate keys. The table is not BCNF, since it contains two determinants (SupplierID and SupplierName) which are not candidate keys. (SupplierID and SupplierName are determinants, since they determine each other.)

However, either of the following decompositions is BCNF:

To achieve BCNF, remove the determinants which are not candidate keys.
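
One of the BCNF decompositions mentioned above could be sketched in SQL as follows (data types are assumed); SupplierID and SupplierName, which determine each other, are kept together in the supplier table, and only one of them appears in the cross-reference table:

CREATE TABLE Supplier (
    SupplierID   INT PRIMARY KEY,
    SupplierName VARCHAR(100) NOT NULL UNIQUE   -- the two mutually determining columns live here
);

CREATE TABLE SupplierPart (
    SupplierID INT NOT NULL REFERENCES Supplier (SupplierID),
    PartID     INT NOT NULL,
    PRIMARY KEY (SupplierID, PartID)
);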

Fourth and Fifth Normal Form

Fourth and Fifth Normal Form apply to situations involving many-to-many relationships. In relational databases, many-to-many relationships are expressed through cross-reference tables.

As example, consider a case of class enrollment. Each student can be enrolled in one or more classes, and each class can contain one or more students. Clearly, there is a many-to-many relationship between classes and students. This relationship can be represented by a Student/Class cross-reference table:

The key for this table is the combination of StudentID and ClassID . To avoid violation of 2NF, all other information about each student and each class is stored in separate Student- and Class-tables, respectively.

Note that each StudentID determines not a unique ClassID , but a well-defined, finite set of values. This kind of behaviour is referred to as multi-valued dependency of ClassID on StudentID .

Fourth Normal Form

A table is in Fourth Normal Form if it is 3NF and it does not represent two or more independent many-to-many relationships.

Consider an example where there are two many-to-many relationships: between students and classes, and between classes and teachers. There is also an implied many-to-many relationship between students and teachers. However, the business rules do not constrain this relationship in any way — the combination of StudentID and TeacherID does not contain any additional information, beyond the information implied by the student/class and class/teacher relationships. Consequentially, the student/class and class/teacher relationships are independent of each other — there are no additional constraints for these relationships. The following table is then in violation of 4NF:

As an example of the anomalies that can occur, realize that it is not possible to add a new class taught by some teacher, without adding at least one student who is enrolled in this class.

To achieve 4NF, represent each independent many-to-many relationship through its own cross-reference table.
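
As a sketch in SQL (names assumed), the two independent many-to-many relationships get one cross-reference table each, so a class/teacher pairing can be recorded without inventing an enrolled student:

CREATE TABLE StudentClass (
    StudentID INT NOT NULL,
    ClassID   INT NOT NULL,
    PRIMARY KEY (StudentID, ClassID)
);

CREATE TABLE ClassTeacher (
    ClassID   INT NOT NULL,
    TeacherID INT NOT NULL,
    PRIMARY KEY (ClassID, TeacherID)
);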

Fifth Normal Form

A table is in Fifth Normal Form if it is 4NF and its information content cannot be reconstructed from several tables containing fewer attributes.

Consider again the student/class/teacher example, but now assume that there is an additional relationship between students and teachers. The example table above is now 4NF, since all the relationships it describes are interrelated. However, it is not 5NF, since it can be reconstructed from three cross-reference tables, each representing one of the three many-to-many relationships:

To achieve 5NF, isolate interrelated many-to-many relationships, introducing the required number of new tables to represent all business domain constraints.

Normalization in Context

In practice, many databases contain entities that are at least partially violating some of the normalization conditions. The reason is often performance or querying convenience: a de-normalized database may require fewer joins, and can therefore be faster for retrievals. While this may be true, the usual caveats against premature optimization apply here as well as everywhere else: determine sufficiently that a performance problem exists, and that the proposed de-normalization improves it, before introducing a conceptually suboptimal design.

Furthermore, a de-normalized schema can be harder to update. The additional integrity checks that are necessary in this case may offset the performance gains for queries obtained through denormalization.

Originally published on IBM developerWorks in 2003.

Normalization in Relational Databases: First Normal Form (1NF), Second Normal Form (2NF), and Third Normal Form (3NF)

Agnieszka is a Chief Content Officer at Vertabelo. Before coming to Vertabelo, she worked as a Java programmer. She has a PhD in mathematics and over 10 years of experience in teaching mathematics and computer science at the University of Warsaw. In her free time, she enjoys reading a good book, going mountain hiking and practicing yoga.

  • database normalization
  • first-normal-form
  • second-normal-form
  • third-normal-form

What is database normalization? What are the different normal forms, and what do they do? Find out in this article.

Normalization in relational databases is a design process that minimizes data redundancy and avoids update anomalies . Basically, you want each piece of information to be stored exactly once; if the information changes, you only have to update it in one place.

The theory of normal forms gives rigorous meaning to these informal concepts. There are many normal forms . In this article, we’ll review the most basic:

  • First normal form (1NF)
  • Second normal form (2NF)
  • Third normal form (3NF)

There are normal forms higher than 3NF, but in practice you usually normalize your database to the third normal form or to the Boyce-Codd normal form , which we won’t cover here.

So, what is this theory of normal forms? It deals with the mathematical construct of relations (which are a little bit different from relational database tables). The normalization process consists of modifying the design through different stages, going from an unnormalized set of relations (tables), to the first normal form, then to the second normal form, and then to the third normal form.

Don’t worry if this sounds complicated; I promise it will get clearer as we go through each step. Let’s start with 1NF – the first step.

First Normal Form (1NF)

A relation is in first normal form (1NF) if (and only if):

  • Each attribute contains only one value.
  • All attribute values are atomic, which means they can’t be broken down into anything smaller.

In practice, 1NF means that you should not have lists or other composite structures as attribute values. Below is an example of a relation that does not satisfy 1NF criteria:

Student courses

This relation is not in 1NF because the courses attribute has multiple values. Jane Smith is assigned to two courses (Databases and Mathematics), and they are stored in one field as a comma-separated list. This list can be broken down into smaller elements (i.e. course subjects: databases as one element, mathematics as another), so it’s not an atomic value.

To transform this relation to the first normal form, we should store each course subject as a single value, so that each student-course assignment is a separate tuple:
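
As a rough SQL sketch of the 1NF version (the exact table shape is an assumption, but the sample data comes from the description above), each student/course assignment becomes its own row:

CREATE TABLE StudentCourses (
    StudentName VARCHAR(100) NOT NULL,
    Course      VARCHAR(100) NOT NULL,   -- one atomic course subject per row
    PRIMARY KEY (StudentName, Course)
);

INSERT INTO StudentCourses (StudentName, Course) VALUES
    ('Jane Smith', 'Databases'),
    ('Jane Smith', 'Mathematics');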

If you’re interested in reading more about the first normal form, I recommend the article What Is the Actual Definition of First Normal Form? by my colleague Konrad Zdanowski.

Second Normal Form (2NF)

A relation is in second normal form (2NF) if and only if:

  • It is in 1NF.
  • No non-prime attribute is functionally dependent on a proper subset of any candidate key. In other words, any column that’s not a key column must be dependent on the whole information in the candidate key.

What does this mean? If the value of attribute A is determined by the value of attribute S, then A is functionally dependent on S. For example, your age is functionally dependent on your date of birth. For more on functional dependencies, see this article.

Let’s go back to the idea of candidate keys and non-prime attributes. What are they?

  • A candidate key is a minimal set of attributes that determines the other attributes included in the relation. It’s minimal in that if you removed one attribute, the remaining attributes do not form a candidate key. For a more detailed exploration of keys, see this article.
  • A non-prime attribute is an attribute that is not part of the candidate key. However, for a relation to be 2NF, the information stored by non-prime attributes must be related to the whole information in the candidate key.

Informally, the second normal form states that all attributes must depend on the entire candidate key .

Let’s see an example of a relation that does not satisfy 2NF. The underlined attributes are the candidate key.

Bike parts warehouse

  • The candidate key is the part and supplier set, which is expressed like this {part, supplier} .
  • The non-prime attributes (which are not part of the candidate key) are quantity and supplier country .
  • There are functional dependencies between part, supplier, and quantity (expressed as part, supplier → quantity ) and between supplier and supplier country ( expressed as supplier → supplier country) .

Why doesn’t this satisfy 2NF? The set {part, supplier} is the only candidate key of this relation. The value of supplier country is functionally dependent on supplier . Supplier country is not part of the candidate key, so it is a non-prime attribute and it is functionally dependent on part of the candidate key, not the entire candidate key {part, supplier} .

To transform this relation into 2NF, we need to split it into two relations: Bike parts (with the attributes part , supplier , and quantity ) and Suppliers (with the attributes supplier and supplier country ). This would look like as follows:
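
In SQL terms, a minimal sketch of the two relations (data types assumed) could be:

CREATE TABLE BikeParts (
    part     VARCHAR(100) NOT NULL,
    supplier VARCHAR(100) NOT NULL,
    quantity INT,
    PRIMARY KEY (part, supplier)          -- quantity depends on the whole key
);

CREATE TABLE Suppliers (
    supplier         VARCHAR(100) PRIMARY KEY,
    supplier_country VARCHAR(100)         -- depends on supplier alone
);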

The relation Bike parts is in 2NF because, as before, the quantity attribute depends on the pair supplier and part .

The Suppliers relation is in 2NF because supplier country is functionally dependent on supplier , which is the candidate key of this relation.

Let’s see one more example of a non-2NF relation.

Student course fees

  • Candidate key: {student, course}
  • Non-prime attributes: grade, course fee
  • Functional dependencies: student, course → grade; course → course fee

This relation does not satisfy 2NF. The set {student, course} is the relation’s candidate key, but the value of course fee is functionally dependent on course alone. Course fee is a non-prime attribute, which is functionally dependent on only part of the candidate key.

To transform this into 2NF, we again split it into two relations: Student courses (with the attributes student, course , and grade ) and Courses (with the attributes course and course fee ). Thus, we avoid the partial dependency in the non-2NF relation above.

Student course

Why not try verifying for yourself that these relations are indeed 2NF?

Note that the 2NF partial dependency rule only kicks in if your relation has a composite candidate key (i.e. one that consists of multiple attributes). All relations that have a single-attribute key are by definition in 2NF.

Third Normal Form (3NF)

A relation is in third normal form (3NF) if and only if:

  • It is in second normal form (2NF).
  • All non-prime attributes are directly (non-transitively) dependent on the entire candidate key.

In other words, non-prime attributes must be functionally dependent on the key(s), but they must not depend on another non-prime attribute. 3NF non-prime attributes depend on “nothing but the key”.

Let’s see a non-3NF relation:

Order information

  • Candidate key: order_id
  • Non-prime attributes: date, customer, customer email
  • Functional dependencies: date depends on order_id (order_id → date) ; customer depends on order_id (order_id → customer) , and customer email depends on customer (customer → customer email) .

This relation does not satisfy 3NF. The only candidate key in this relation is order_id . The value of customer email is functionally dependent on the customer attribute, which is a non-prime attribute. Thus, the relation violates 3NF.

Once again, we split this into two relations: Orders (with the attributes order_id , date , and customer ) and Customers (with the attributes customer and customer email ):
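
A minimal SQL sketch of this split (data types assumed) could be:

CREATE TABLE Customers (
    customer       VARCHAR(100) PRIMARY KEY,
    customer_email VARCHAR(100)            -- depends directly on the customer
);

CREATE TABLE Orders (
    order_id   INT PRIMARY KEY,
    order_date DATE,                       -- the "date" attribute, renamed since DATE is a reserved word
    customer   VARCHAR(100) REFERENCES Customers (customer)
);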

Orders is in 3NF because the date and customer attributes do not violate the rule of 3NF; their values depend on the order_id number. Customers is in 3NF because customer email is functionally dependent on customer , which is the candidate key of this relation. In both cases, all non-prime attributes depend on the candidate key .

Let’s see one more non-3NF example.

  • Candidate key: {course, year}
  • Non-prime attributes: teacher, teacher date of birth
  • Functional dependencies: teacher depends on course and year (course, year → teacher) ; teacher date of birth depends on teacher (teacher → teacher date of birth)

This relation does not satisfy 3NF. The only candidate key in this relation is {course, year} , but the value of teacher date of birth is functionally dependent on teacher – a non-prime attribute. This violates 3NF.

Guess how we’ll transform this into 3NF? That’s right; we split the relation. Courses gets the attributes course, year, and teacher ; Teachers gets the attributes teacher and teacher date of birth :

Try verifying that these relations are indeed in 3NF for yourself. How would you explain the changes made?

Database Normalization: Summary

First, second, and third normal forms are the basic normal forms in database normalization:

  • The first normal form (1NF) states that each attribute in the relation is atomic.
  • The second normal form (2NF) states that non-prime attributes must be functionally dependent on the entire candidate key.
  • The third normal form (3NF) states that non-prime attributes must be directly (non-transitively) dependent on candidate keys.

Stay tuned to our blog for more articles on database normalization !


Aspects of Database Normalization

Database normalization is a database design technique that eliminates undesirable characteristics such as insertion, update, and deletion anomalies and reduces data redundancy. Normalization rules divide larger tables into smaller tables and use relationships to connect them. SQL normalization aims to eliminate redundant (repetitive) data and ensure logical data storage. This essay explains database normalization and describes a miniature database’s dependencies, keys, and the level of normalization applied to it.

A relational database is a collection of data items linked together by pre-defined relationships. These items are laid out in a table format with columns and rows. Tables store data about the objects that will be represented in the database. A field keeps the actual value of an attribute, while each column in a table holds a specific type of data. The table’s rows represent a group of related values for a single object or entity (Kalinin and Severin, 2019). A primary key is a unique identifier assigned to each row in a table, and foreign keys link rows from different tables. Data can be accessed in various ways without reorganizing the database tables.

The database contains several tables: a produce table, an animal products table, a grains table, a suppliers table, and a purchases table. The tables have various attributes, including ITEMID, SUPPLIERID, GRAINED, TYPE, STOCK QTY, and NEXT DELIVERY for the grains table. The suppliers table contains SUPPLIERID, LAST DELIVERY, SPECIALITY, and ACTIVE attributes. The purchases table has ITEMID, TOTALBOUGHT, TOTAL SOLD, TOTALREV, and MARGIN attributes. The produce table contains ITEMID, SUPPLIERID, PLU CODE, PRODUCT NAME, TYPE, STOCK QTY, and NXTDELIVERY. Lastly, the animal products table has the attributes ITEMID, SUPPLIERID, ANPRDNAME, TYPE, STOCK QTY, and NXTDELIVERY. The data storage format that can hold a specific type or range of values is referred to as a database data type. When data is stored in variables in computer programs, each variable must have its own data type. The database data types include char, number, and date, which define the attributes of the various entities.

In normalization, there are three standard levels, commonly known as rules; a “normal form” is the name given to each level. The database is said to be in “first normal form” if the first rule is followed. The database is said to be in “third normal form” if the first three rules are followed (Kalinin and Severin, 2019). The database belongs to the third level of normalization because it observes all three rules. For example, it does not use a single table to store similar data, its tables use foreign keys to relate to each other, and it has no fields that do not depend on the key.

A constraint that applies to or defines the relationship between attributes is called a dependency. The SUPPLIERID in the suppliers table is such a dependent attribute: the relation makes the suppliers table dependent on entries in the other tables. Since this is a relational database, the tables rely heavily on primary and foreign keys for their relationships; for example, ITEMID is both a primary and a foreign key. The miniature database also demonstrates the use of the NULL condition, which is satisfied if a column contains a null value or if a condition evaluates over one or more NULL values. The database contains many fields that are allowed to be empty, as indicated in the NULLABLE {NO} column of the table definitions. The language used for scripting the database is SQL (Structured Query Language).
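
As an illustration of the NULL condition mentioned above (a sketch only; the column names come from the attribute lists earlier in the essay, while the table name "Produce" is assumed), such a check might be written as:

-- Items in the produce table with no scheduled next delivery:
SELECT ITEMID
FROM Produce
WHERE NXTDELIVERY IS NULL;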

Kalinin, I. V., & Severin, M. V. (2019). Evaluating the effectiveness of database normalization constraints in media stream processing applications. International Multidisciplinary Scientific GeoConference: SGEM, 19(2.1), 257–261.



Good Database Normalization Essay Example

Published: 03/02/2020


- Describe the steps that you would use in order to convert database tables to the First Normal Form, the Second Normal Form, and the Third Normal Form.

Normalization is an approach to decomposing tables with the aim of eliminating redundant data and ensuring that data dependencies make sense through logical arrangement and storage. Normalizing a database makes it easier to handle and update. Each piece of data should be stored only once, and data that can be calculated from existing data should not be stored at all; this prevents data loss. During the normalization process, removing redundancies is important, and data integrity rules must be followed so that all integrity constraints are satisfied. There are rules followed in database normalization: the first normal form, the second normal form, the third normal form, and BCNF (Sumathi & Esakkirajan, 2007). The normal form of the data indicates how much redundancy exists in it. The normalization process involves the following steps:

  • Specify the key that establishes the relationships.
  • Specify the functional dependencies of the relationship.
  • Apply the definition of each normal form.
  • If the relations do not meet the definition of a normal form, change them to form new relations until they meet the definition.
  • Test the new relations to ensure that they meet the definitions required by each of the normal forms.

- Provide one (1) example that is relevant to a college environment that illustrates reasons for converting database tables to the First, Second, and Third Normal Forms.

First Normal Form

A table that has not been normalized is difficult to handle and update in the database without encountering insertion, deletion, and data loss anomalies. A good example is a student table in which the student name Gad appears twice and the subject Mathematics appears twice. This repetition is contrary to the First Normal Form. To bring the table into the First Normal Form, it must be broken into two tables, as illustrated in the code sketch at the end of this answer.

New Student Table

Both the student table and the subject table possess the attributes of the First Normal Form. A table normalized to the Second Normal Form must meet all the requirements of the First Normal Form, and no partial dependency of a column on the primary key may exist; that is, each column in the table that is not part of the primary key must depend upon the entire key for its existence.

- Explain typical situations when denormalizing a table is acceptable. Provide one (1) example of denormalizing a database table to justify your response.

Denormalization is used to enhance performance, mainly by removing joins and avoiding expensive queries. Before denormalizing, however, one must look for indicators that help identify the tables that require denormalization (Ricardo, 2012). The indicators include large primary keys that are clumsy to query and that consume a large volume of space when carried as key columns, as well as the existence of repeating groups. There are also critical issues involved in deciding to denormalize: planning how to keep the data synchronized and refactoring the queries that use the denormalized fields.

- Explain the significant manner in which business rules influence both database normalization and the decision to denormalize database tables.

The design and implementation of a database should satisfy the needs of the organization. Organizations require logical data modeling that is consistent and accurate, and database normalization should not allow data loss, because organizations know the importance of having accurate information (Pratt & Adamski, 2000). All businesses have rules, and everyone who deals with the business relies on the information provided to make decisions. Database systems should give decision makers the correct type of information, and they should also supply information to suppliers, customers, and other stakeholders of the business.
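Returning to the college example above, the short sketch below shows one way the repeated student and subject data could be decomposed. It uses Python's built-in sqlite3 module, and the table and column names (student_raw, student, enrolment, fees_paid) are hypothetical choices for illustration, not a schema taken from the essay.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Unnormalized starting point: the student name and fee are repeated
# on every row for each subject the student takes.
cur.execute("CREATE TABLE student_raw (student_name TEXT, fees_paid REAL, subject TEXT)")
cur.executemany("INSERT INTO student_raw VALUES (?, ?, ?)", [
    ("Gad", 100.0, "Mathematics"),
    ("Gad", 100.0, "Biology"),
    ("Mary", 150.0, "Mathematics"),
])

# Decomposition: student facts live once in STUDENT, and each enrolment is a
# row in ENROLMENT whose columns depend on the whole (student_id, subject) key.
cur.execute("""CREATE TABLE student (
    student_id INTEGER PRIMARY KEY, student_name TEXT, fees_paid REAL)""")
cur.execute("""CREATE TABLE enrolment (
    student_id INTEGER REFERENCES student(student_id), subject TEXT,
    PRIMARY KEY (student_id, subject))""")

cur.execute("""INSERT INTO student (student_name, fees_paid)
               SELECT DISTINCT student_name, fees_paid FROM student_raw""")
cur.execute("""INSERT INTO enrolment (student_id, subject)
               SELECT s.student_id, r.subject
               FROM student_raw r JOIN student s ON s.student_name = r.student_name""")

print(cur.execute("SELECT * FROM student").fetchall())    # one row per student
print(cur.execute("SELECT * FROM enrolment").fetchall())  # one row per enrolment
```

After the split, updating a student's fee touches a single row, and deleting an enrolment no longer risks losing the student record, which is exactly the kind of anomaly normalization is meant to prevent.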

References

Pratt, P., & Adamski, J. (2000). Concepts of database management (1st ed.). Cambridge, Mass.: Course Technology.

Ricardo, C. M. (2012). Databases illuminated. Sudbury, MA: Jones & Bartlett Learning.

Sumathi, S., & Esakkirajan, S. (2007). Fundamentals of relational database management systems. Berlin: Springer.

Open access | Published: 06 May 2024

Data normalization for addressing the challenges in the analysis of single-cell transcriptomic datasets

Raquel Cuevas-Diaz Duran, Haichao Wei & Jiaqian Wu

BMC Genomics, volume 25, Article number: 444 (2024)

Normalization is a critical step in the analysis of single-cell RNA-sequencing (scRNA-seq) datasets. Its main goal is to make gene counts comparable within and between cells. To do so, normalization methods must account for technical and biological variability. Numerous normalization methods have been developed addressing different sources of dispersion and making specific assumptions about the count data.

The selection of a normalization method has a direct impact on downstream analysis, for example differential gene expression and cluster identification. Thus, the objective of this review is to guide the reader in making an informed decision on the most appropriate normalization method to use. To this aim, we first give an overview of the different single cell sequencing platforms and methods commonly used including isolation and library preparation protocols. Next, we discuss the inherent sources of variability of scRNA-seq datasets. We describe the categories of normalization methods and include examples of each. We also delineate imputation and batch-effect correction methods. Furthermore, we describe data-driven metrics commonly used to evaluate the performance of normalization methods. We also discuss common scRNA-seq methods and toolkits used for integrated data analysis.

Conclusions

According to the correction performed, normalization methods can be broadly classified as within and between-sample algorithms. Moreover, with respect to the mathematical model used, normalization methods can further be classified into global scaling methods, generalized linear models, mixed methods, and machine learning-based methods. Each of these methods has its own pros and cons and makes different statistical assumptions. However, no single normalization method performs best in all cases. Instead, metrics such as the silhouette width, the K-nearest neighbor batch-effect test, or Highly Variable Genes are recommended to assess the performance of normalization methods.

Single-cell RNA-sequencing (scRNA-seq) has become a powerful approach to simultaneously quantify the transcription of hundreds or even thousands of features (genes, transcripts, exons) at an unprecedented resolution. These high-throughput transcriptomic profiling assays have helped scientists to study important biological questions, for example, cellular heterogeneity, dynamics of cellular processes and pathways, novel cell type discovery, and cell fate decisions and differentiation [ 1 , 2 , 3 , 4 ].

While the expression matrices obtained from bulk RNA-seq are structurally very similar to those derived from scRNA-seq experiments, there are distinct features that characterize scRNA-seq datasets mainly due to the scarcity of starting material and the high resolution. These features include an unusually high abundance of zeros, an increased cell-to-cell variability, and complex expression distributions. This high intercellular variability of read counts or overdispersion is derived from biological and technical factors [ 5 ]. Understanding the contribution of each of these factors to the global dispersion is important since technical variability can be confounded by biological differences. Thus, statistical and computational methods used for analyzing scRNA-seq datasets face the challenge of separating wanted from unwanted variation.

Many normalization methods exist for bulk RNA-seq and have been applied to scRNA-seq. However, the specific features of scRNA-seq datasets have triggered the development of specific normalization strategies. Herein, we briefly describe the commonly used methods for scRNA-seq, including isolation and library preparation protocols. We also discuss the causes and effects of technical and biological sources of variability, focusing mainly on those derived from measurement inefficiencies. Next, we summarize state-of-the-art normalization methods, incorporating those that have been specifically tailored to scRNA-seq datasets. We also delineate imputation and batch-effect correction methods. Furthermore, we describe data-driven metrics that are commonly used to evaluate the performance of normalization methods. Finally, we highlight commonly used toolkits and provide practical recommendations for scRNA-seq users.

Single-cell RNA-sequencing methods

The first step in a scRNA-seq experiment is the preparation of a high-quality single-cell suspension. Single nuclei can also be isolated; however, for simplicity we will refer to both as single-cells. The condition of the isolated cells is critical for a successful experiment. Isolation methods can expose cells to harsh enzymatic methods or chemical conditions that can stress cells and generate unwanted variations in gene expression [ 6 ]. Single-cells can be isolated from suspensions (e.g. blood) or from solid tissues (e.g. tumor). Samples can be obtained from fresh (e.g. resection surgeries, cell cultures) or preserved sources (e.g. postmortem brains). The protocols for preparing cell suspensions depend on the source of cells, and pilot experiments may be required to ensure the optimal condition of cells.

Isolating single-cells

Cells within the suspension need to be isolated or captured to obtain individual reaction volumes. To date, numerous isolation methods have been used including manual methods (serial limited dilution, microdissection or pipetting [ 7 ]) and automated technologies (fluorescence/magnetic-activated cell sorting (FACS/MACS) [ 8 , 9 ] or microfluidics [ 10 ]). Depending on the research question, certain isolation methods are better suited to the application. For example, profiling of cancer cells requires the exclusion of blood cells, thus FACS or MACS may be used to filter the cellular suspension. Applications in which an unbiased view of the cellular composition is desired do not require filtering. In this case, the cellular suspension can be used directly as input to a microfluidics system at an adequate dilution. The three most common workflows used to isolate single cells are microtiter plates, microfluidics, and droplets/nanowells, as shown in Fig.  1 . See Additional file 1 for an extended list of methods and characteristics.

Figure 1. Overview of common scRNA-seq workflows and their characteristics. *Only in cases where the volumes to be added for each reagent can be modified and do not depend on the design of the reaction chamber. IFC = integrated fluidics circuit, RT = reverse transcription, TSO = template-switching oligonucleotide, UMI = unique molecular identifier, PCR = polymerase chain reaction, IVT = in vitro transcription, NA = not available, KOAc = potassium acetate, MgOAc = magnesium acetate. The figure in the droplets/nanowells column was adapted from [ 12 ].

The most representative example of microplate-based cell-isolation methods is FACS. FACS sorts cells into wells or microtiter plates where they are ready for manual or automated library preparation. The advantage of this method is that it allows the exclusion of dead or damaged cells and the enrichment of cells depicting specific antibody-labelled proteins. Furthermore, microplates can be imaged to ensure that no doublets or empty wells are present. Reagents required for lysis and library preparation are then added to each well. Microfluidics based cell-isolation methods include the use of integrated fluidics circuits (IFCs), typically the Fluidigm C1 System. An IFC consists of a chip with miniature lanes that contain traps. The cellular suspension flows through the chip and cells are caught in each trap. Then, reagents for lysis and library preparation flow through the chip and cells are processed in consecutive nanoliter reaction chambers. Another method also uses a microfluidic system but instead of using traps, it encapsulates cells in droplets or captures them in nanowells. Droplet-based systems use a water-in-oil emulsion to encapsulate single-cells. This drop of emulsion contains reagents for RT (reverse transcription) as well as randomly introduced barcodes for tagging cells. Common droplet-based platforms include inDrops [ 11 ], Drop-Seq [ 12 ] and 10X Genomics [ 13 ]. In nanowell platforms cells are loaded into nanowells with pre-loaded barcoded beads. A typical example of this platform is Seq-Well [ 14 ].

Capturing mRNA molecules and generating cDNA

Once the cells are isolated, they are lysed and exogenous spike-in RNA molecules, for example the External RNA Control Consortium (ERCCs) spike-ins [ 15 ], may be added. Spike-in RNAs are used to create a standard baseline measurement for counting and normalization [ 16 ]. As will be described in the next section, the addition of spike-ins is not feasible for all platforms. After cell lysis, poly(A)-tailed mRNA is captured by poly(T) oligonucleotides and then reverse transcribed into cDNA. Importantly, the poly(T) oligonucleotides may include single-cell-specific barcodes for cell identification and a random nucleotide sequence that will be used as a unique molecule identifier (UMI). UMIs are used for efficiently counting mRNA molecules and correcting PCR-induced artifacts [ 17 ] as will be described in the next section.

Amplifying cDNA

After RT, cDNA is amplified typically by PCR or T7-based in vitro transcription (IVT). PCR amplification is commonly performed using two methods: Tang protocol and template-switching oligonucleotides (TSO). In the Tang protocol [ 18 ], mRNAs are reverse transcribed into cDNA using poly(T) primers with an anchor sequence (UP1). Then poly(A) tails are added to the 3’ ends of cDNAs, and second strands are synthesized using poly(T) primers with another anchor (UP2). Finally, cDNAs are PCR-amplified using both anchors. In the TSO protocol, the reverse transcriptase adds cytosines to the cDNA allowing the template switching reaction and the addition of PCR adaptor sequences. Variants of the TSO protocol are implemented in single-cell tagged reverse transcription sequencing (STRT-seq) [ 19 ], switching mechanisms at the 5’-end of the RNA transcript sequencing (Smart-seq) [ 20 ], and Smart-seq2 [ 21 ]. These sequencing protocols can be performed in the microtiter plate and IFC platforms in combination with tagmentation methods for sequencing library preparation. Tagmentation involves using an enzyme that simultaneously generates fragments and adds cell indexes.

PCR-based methods are exponential and non-linear amplification techniques. They are more efficient than IVT methods, however both introduce technical biases as will be described. IVT requires the addition of a T7 promoter in the poly(T) primer and it doesn’t require template switching. Numerous platforms use T7-based IVT amplification, for example cell expression by linear amplification and sequencing (CEL-seq) [ 7 ], CEL-seq2 [ 22 ], massively parallel single-cell RNA sequencing (MARS-seq) [ 23 ], and indexing droplets RNA sequencing (inDrops-seq) [ 24 ]. Amplified cDNA or RNA (PCR or IVT) is fragmented during library preparation and adaptors are added. Different fragmentation methods can be used, for example, tagmentation or mechanical fragmentation.

Transcript coverage

Transcript coverage is an important consideration when selecting a scRNA-seq protocol. Expression profiling of single-cells can be done by sequencing full-length transcripts or by merely counting 3’ or 5’ molecule ends, referred to as digital counting (see Fig.  1 and Additional file 1 ). Full-length scRNA-seq protocols offer several advantages, for example, the detection of low-expressed transcripts [ 25 ], splice variants and isoforms, single-nucleotide variants [ 26 , 27 ], and fusion transcripts [ 28 ]. However, full-length sequencing methods are limited by lower cellular throughputs and higher costs [ 29 ]. Moreover, until recently, commercial plate-based full-length sequencing protocols did not incorporate UMIs [ 30 ]. Novel full-length sequencing methods now integrate UMIs in the TSO sequence, increasing transcript quantification accuracy. Examples of these methods include Smart-Seq3 [ 31 ], Smart-seq3xpress [ 32 ], and Flash-seq [ 33 ]. Another disadvantage of full-length protocols is that they do not allow early cell barcoding and thus, pooling can’t be performed. Droplet-based methods rely on digital counting, representing a cost-effective alternative. However, since these methods sequence only a small fragment of the 3’ or 5’ end of transcripts, isoform identification becomes highly challenging [ 34 , 35 ]. Methods for quantifying isoforms from 3’ droplet-based assays (e.g. 10X Genomics) are emerging. For example, Scasa [ 35 ], a method that estimates isoform expression based on transcription clusters and isoform paralogs, and STARsolo [ 36 ], a mapping/quantification tool that has been used to quantify splicing events in 3’ droplet-based datasets.

General scRNA-seq approaches

Overall, there are two common approaches to scRNA-seq: isolating a large number of cells and sequencing libraries at a low depth (e.g. droplet-based) or isolating fewer cells and sequencing at a higher depth (e.g. microplate-based). Detailed descriptions of each platform have been reviewed in [ 25 , 37 , 38 , 39 ]. A prominent multicenter benchmarking study was performed to evaluate the performance of 13 commonly used scRNA-seq protocols including plate-based methods and microfluidic systems (droplets, nanowells, and IFC) [ 29 ]. In this study, a complex reference sample (high cell-type heterogeneity, closely related subpopulations, known cell composition and cell markers) was used to compare the capability of these protocols in describing tissue complexity [ 29 ]. The authors demonstrated differences among the protocols in library complexity and in their ability to detect cell subpopulation markers. Therefore, users should make informed decisions when designing a single-cell RNA-seq study to detect an adequate number and complexity of RNA molecules that can predict the cell phenotypes and infer their function.

The challenges of single-cell datasets

Compared to bulk RNA-seq, scRNA-seq suffers from a high cell-to-cell variability, also referred to as “overdispersion”. The dispersion observed in gene counts of cells from the same type is a combination of two sources of variability, technical and biological (see Fig.  2 a). Technical variability or noise is derived from an imperfect measurement process, as is the case of scRNA-seq [ 40 ]. Sources of technical variability include capture inefficiency (Fig.  2 b), zero counts (Fig.  2 c), amplification bias (Fig.  2 d), sequencing depth and coverage (Fig.  2 e), library size (Fig.  2 f), sequencing inefficiency (Fig.  2 g), and batch effects. Additionally, individual cells’ read counts depict biological variability due to various factors, for example, transcriptional bursting, cell subpopulation, cell cycle stage, cell size, cell transient stages, and gender differences.

Figure 2. Sources of variability in scRNA-seq datasets. (a) Technical and biological variability. (b) Capture inefficiency resulting from the selection of single cells and random reverse transcription of mRNA molecules; colored lines represent different mRNAs. (c) Density plot depicting a typical bimodal distribution with zero-inflated behavior, representing the number of gene counts across cells. (d) Bias observed in IVT or PCR amplification; certain mRNA molecules are amplified more efficiently than others. (e) Sequencing depth and coverage. (f) Histogram depicting variable library sizes across cells. (g) Sequencing inefficiency showing numerous cells sampled but not sequenced due to errors in measurement.

Capture inefficiency and zero counts

A typical mammalian cell contains between 50,000 and 300,000 different transcripts with each molecule depicting between 1 and 30 copies per cell [ 41 , 42 ]. Due to these very low amounts of transcripts per cell, the methods used to capture, reverse transcribe, amplify, and prepare the sequencing libraries, are inefficient in faithfully representing the number of mRNA molecules per gene per cell. For example, after cell lysis, mRNA is converted into the more stable cDNA generally through RT, also known as first strand synthesis. It has been demonstrated that the small concentration of initial mRNA increases the probability of missing transcripts in the RT stage thus, generating “dropout” events [ 43 ]. Kharchenko et al. referred to dropouts as events in which a gene appears highly expressed in one cell but not detected in another one, due to inaccuracies in the RT step [ 43 ]. In most protocols, RT is initiated from the poly-A tails of mRNAs through oligo-(dT) priming, commonly including overhangs with adapter sequences, cell barcodes and UMIs. Importantly, it has been demonstrated that the efficiency by which the oligo-(dT) primers capture mRNAs is correlated to the length of the poly-A tails [ 44 ] which may undergo changes in response to physiological and pathological processes [ 45 ]. Thus, the RT process is a source of stochasticity.

Dropout events frequently lead to excessive zeros, one of the most prominent features of scRNA-seq datasets. These are mainly due to the low amounts of starting material, capturing and amplification inefficiencies, and the low sequencing depths which are commonly used. Intriguingly, even deeply sequenced datasets depict up to 50% of expression values with zero counts [ 46 ]. Overall, the efficiency of capturing an mRNA molecule, converting it to cDNA, and successfully amplifying it is low and variable, ranging from 10 to 40% [ 7 , 38 , 47 , 48 ]. This is why genes that depict a low expression have a high probability of not being detected and becoming a dropout. Thus, scRNA-seq computational methods face the challenge of distinguishing real zero counts from those generated from technical variations (measurement errors) [ 40 ].

Amplification bias, sequencing depth, coverage, and library size

After RT, second strand synthesis takes place from either a random position or from the end of the first-strand as part of the amplification process. Importantly, both RT and DNA polymerase are processive enzymes that can incorporate large numbers of nucleotides in consecutive reactions before the reaction stops [ 49 , 50 ]. Consequently, the exact stopping points are unknown. This introduces complex positional dependencies and generates global bias, affecting sequencing coverage [ 49 ]. Sequencing depth and coverage are closely related terms referring respectively to the number of times a specific base of the DNA is sequenced and to the proportion of the genome that was sequenced with a certain depth (Fig.  2 f). Sequencing depth can be configured as a parameter of the sequencer. A higher sequencing depth may increase coverage at the expense of cost. However, capture inefficiencies and amplification biases have an impact on coverage no matter the sequencing depth, and they need to be corrected.

Given the minimum amount of starting material, the library preparation process requires more than a million-fold amplification [ 43 ]. This extensive amplification (either PCR or IVT) leads to additional technical variability, given that some genes may experience preferential amplification [ 51 , 52 , 53 ]. Capture inefficiencies and amplification biases generate variable library sizes, defined as the total number of reads per cell. Normalization methods aim at estimating a “library size factor” to correct cell-specific biases related to the number of reads per cell.

The amplification process can also generate dropout events. Thus, UMIs are introduced and they have been reported to substantially reduce unwanted variation due to differences in gene lengths and amplification efficiencies [ 17 ]. UMIs are random sequences that are used for tagging cDNA molecules in the 5’ end during RT enabling the accurate quantification of mRNAs by establishing a specific identity for each molecule. Adding UMIs to the reactions before PCR amplification also allows for the bioinformatic identification of PCR duplicates. To date, the majority of scRNA-seq protocols allow transcript UMI-tagging (See Additional file 1 ).

Batch effects

Another important source of technical variation comes with batch effects. Batch effects are common in high-throughput experiments, and they occur when cells from one group or condition are processed (cultured, isolated, prepared library and sequenced) separately (space or time) from cells of another condition [ 54 ]. Batch effects also occur when single-cell datasets are compiled from multiple experiments, for example, when integrating large single-cell atlases [ 55 , 56 ]. In these cases, experiments are most likely performed with different technologies, capturing times, handling personnel, reagents, and equipment. Removing batch effects is a critical and challenging step. Furthermore, studies have demonstrated that batch effects can be highly nonlinear, therefore it is difficult to adjust technical variability without introducing artifacts or confounding real biological variation [ 54 ]. Batch effect adjustment methods will be described in the next section.

Spike-ins may account for technical variability

An alternative proposed to account for the sources of technical variability described is the use of spike-ins [ 57 ]. Spike-ins are non-biological RNA molecules that are added in a fixed concentration to each cell’s lysate and undergo the same processing as endogenous transcripts. In this way, spike-in transcripts are affected by the same inefficient capturing and amplification processes, and after sequencing, the number of spike-in molecules can be compared to the counts obtained and used as a scaling factor for normalization [ 16 , 58 , 59 ]. However, spike-ins can’t easily be incorporated into high throughput cell isolation protocols (e.g. Droplet-based) and in other cases, it is not feasible to consistently add the same quantity of spike-in RNA to every cell [ 60 ]. Furthermore, the use of spike-ins has been questioned arguing that synthetic spike-ins behave differently than endogenous transcripts [ 61 ].

Sources of biological variability

Biological variability is one of the main interests in scRNA-seq and it is the basis of numerous downstream analyses, for example, clustering and differential gene expression. An important complication in addressing biological variability, besides separating it from technical noise, is that gene expression is inherently stochastic. Researchers have demonstrated substantial variability in the amount of mRNA even between genetically identical cells grown under the same conditions [ 62 , 63 ]. This variability has been explained partially by a stochastic phenomenon known as transcriptional “bursting” [ 64 ]. Using gene trap and transgenic cell lines, Suter et al. found that most genes appear to have dynamic fluctuations of expression separated by silent intervals, generating gene-specific temporal transcription patterns [ 64 ]. Furthermore, it has been established that gene transcription and protein translation are regulated by combinatorial interactions between molecules undergoing random biochemical reactions [ 63 , 65 ]. Additionally, the same gene will not be transcribed simultaneously in different cells since individual cells are engaged in dynamic physiological processes, for example, stress response, cell cycle or transient cellular states. Overall, scRNA-seq computational methods must be able to separate the wanted from unwanted variability in datasets characterized by noise (dispersion), abundant zeros, and high-magnitude outliers.

Normalization methods

An essential first step in the analysis of scRNA-seq data is normalization, whose main aim is to make expression counts comparable within and between cells. Normalization has a strong impact on the detection of differentially expressed genes [ 66 , 67 , 68 ] and thus on the number of cell clusters identified. Adequate normalization methods are essential since they underlie the validity of downstream analysis. A normalization pipeline generally includes a combination of imputation, normalization, and batch effect correction processes. However, certain normalization methods, for example, ZINB-WaVE [ 69 ] and Seurat [ 70 ], perform all processes.

An early decision in the normalization pipeline selection is whether an imputation method should be included. Recently developed single-cell isolation methods, for example, droplet-based methods yield an incredibly high number of zeros (sometimes exceeding 90%) in the expression matrix [ 71 ]. Thus, imputation methods have been proposed. A comprehensive compendium of imputation methods is described by Lähnemann et al. [ 72 ]. The main aim of these methods is to predict read counts in cases where experimental or technical noise has led to zero counts, thus generating adjusted data values that better represent true expression. Data smoothing methods, such as Markov Affinity-based Graph Imputation of Cells (MAGIC) [ 73 ], detect all zeros as “missing data” and output a matrix with zeros smoothed out. However, the main challenge of these methods is preserving biological zeros. This is especially important in cases where the lack of expression of marker genes is needed to identify a subpopulation of cells [ 74 , 75 ]. In such cases, the use of model-based or data reconstruction methods that can selectively preserve zeros, for example ALRA [ 75 ], SAVER [ 76 ], and scImpute [ 77 ], is suggested [ 72 ].

Some imputation tools use raw scRNA-seq UMI or read counts as input, while others require a normalized count matrix, typically a log-transformation. A log transformation of read counts attempts to reduce the skewness. Researchers have demonstrated that directly processing an expression matrix with a high incidence of zeros may be detrimental for downstream analysis such as clustering and visualization [ 75 ]. However, an extensive evaluation found no improvement in the performance of imputation methods against no imputation when comparing clustering and trajectory analysis results [ 78 ]. Another study found that some imputation methods introduced false positive signals when identifying differentially expressed genes [ 74 ]. Imputation methods are beneficial when the amount of sparsity (biological and technical zeros) is unusually high or when downstream algorithms can’t handle sparse count data [ 72 ]. Nevertheless, there is no consensus on the advantages of using imputation algorithms.

Normalization methods are performed after imputation or at the beginning of the pipeline in case imputation was not selected. Broadly, normalization methods can be classified as within and between-sample algorithms according to the correction performed. In the former, counts are adjusted to account for gene-specific features, for example GC-content and gene length, yielding comparable gene expression values within each cell. In the latter, cell-specific features are addressed, for example sequencing depth, resulting in comparable gene expression values across cells [ 79 ]. Most methods can use read counts or UMI counts. UMI counts remove amplification biases in non-zero gene count measurements [ 17 ]. However, UMIs do not recover sampling zeros. The choice of normalization method (with or without UMIs) is a statistical consideration and is not correlated to the proportion of zeros or the distinction between technical and biological zeros [ 80 ].

According to the mathematical model used, normalization approaches can further be classified into global scaling methods, generalized linear models (GLMs), mixed methods, and machine learning-based methods. Additional examples of methods from each category are included in Additional file 2 . Furthermore, a compilation of independent benchmarking studies evaluating the performance of normalization methods is found in Additional file 3 . Given the importance of normalization methods on the validity of downstream analysis, we will describe common methods belonging to each category. We also discuss batch effect correction methods as the last step in a normalization pipeline.

Global scaling methods

Global scaling normalization methods assume that the RNA content is constant for all cells and therefore, a scale factor can be applied to all genes so that there is no difference in expression between cells. These methods are based on the calculation of size factors for each cell to account for differences in library size. For each cell, counts are divided by their corresponding size factors, generating relative abundances. The simplest approach using this assumption is library size normalization, for example transcripts or counts per million (TPM [ 81 ], CPM [ 82 ]) or reads per kilobase of exon model per million mapped reads (RPKM) [ 83 ]. However, these methods are affected by a small proportion of highly expressed genes and can bias differential gene expression results [ 66 ]. Normalization methods that address gene length bias, for example TPM and RPKM, are suggested for plate-based full-length sequencing methods. In contrast, droplet-based methods that use UMIs, tag only 3’ or 5’ ends of transcripts and are not affected by gene length [ 84 ].
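To make the library-size idea concrete, the short NumPy sketch below computes counts per million for a hypothetical genes-by-cells matrix; TPM and RPKM follow the same principle but additionally divide by gene length before scaling. This is an illustrative sketch, not the implementation used by any specific tool.

```python
import numpy as np

def counts_per_million(counts):
    """Library-size normalization: scale each cell's counts so they sum to one million.

    counts: genes x cells matrix of raw read or UMI counts.
    """
    library_size = counts.sum(axis=0, keepdims=True)  # total counts per cell
    return counts / library_size * 1e6

# Toy example: three genes, two cells with very different library sizes.
raw = np.array([[10, 100],
                [ 0,   0],
                [90, 900]], dtype=float)
print(counts_per_million(raw))  # identical columns: the library-size effect is removed
```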

A set of global scaling methods rely on the use of external spike-ins added in a known concentration and processed in parallel with endogenous transcripts. The number of read or UMI counts for spike-in transcripts is then used to scale the counts for each cell, making spike-in gene counts the same across all cells [ 59 ]. The caveats of using spike-ins have been previously described. An alternative to spike-in normalization is using a set of genes that have constant expression across cells. These can be housekeeping genes or stably expressed genes. The use of housekeeping genes has been criticized because they may be affected by transcriptional bursting. Lin et al. proposed the ISnorm (Internal Spike-in-like-genes normalization) algorithm to select stably expressed genes based on their pairwise variance and use them to estimate unbiased size factors [ 85 ]. A pioneering approach expected to become a gold standard for single-cell RNA counting consists of using molecular spikes [ 86 ]. Molecular spikes are RNA spike-ins that contain built-in UMIs enabling the detection, quantification, and correction of artifactual RNA counting even in experiments lacking UMIs. Researchers demonstrated that molecular spikes allow the accurate estimation of total mRNA counts across cells [ 86 ].

Other global scaling methods have been adopted from bulk RNA-seq analysis, for example DESeq’s median of ratios [ 87 ] and EdgeR’s trimmed mean of M values (TMM) [ 60 ]. In DESeq2’s method, a pseudo reference sample is created from the geometric mean of genes across cells, and it is used to generate a sample-specific scaling factor [ 87 , 88 ]. TMM filters out highly expressed genes as well as those with a large variation and a weighted average of the remaining genes is used to calculate a normalization factor [ 60 , 88 ]. These methods rely on the assumption that most genes are not differentially expressed. Furthermore, the high frequency of zeros in scRNA-seq datasets may result in nonsensical scaling factors (DESeq2) or undefined M values (TMM) [ 89 ].
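A minimal sketch of the median-of-ratios idea is given below (NumPy; a simplification for illustration, not the DESeq2 implementation): a pseudo-reference is built from gene-wise geometric means, and each cell's size factor is the median of its gene-wise ratios to that reference. Note how the usable gene set shrinks as zeros become more frequent, which is the scRNA-seq failure mode mentioned above.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """DESeq-style size factors for a genes x cells count matrix (illustrative sketch).

    Genes with a zero in any cell are excluded, because the geometric mean
    (and hence the ratio) is undefined for them.
    """
    counts = np.asarray(counts, dtype=float)
    usable = np.all(counts > 0, axis=1)                      # genes detected in every cell
    log_counts = np.log(counts[usable])
    log_reference = log_counts.mean(axis=1, keepdims=True)   # log geometric mean per gene
    log_ratios = log_counts - log_reference
    return np.exp(np.median(log_ratios, axis=0))             # one size factor per cell
```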

Most methods implemented for between-sample normalization calculate global scaling factors which are applied to all gene counts of a cell to adjust for sequencing depth. However, these methods fail due to the technical biases described. One of the most common systematic variations observed in scRNA-seq is the unequal relationship between transcript expression and sequencing depth. Global scaling normalization methods cannot accurately adjust cell counts with respect to sequencing depth when the ratio is uneven and depends on the expression level. These methods will generate an over-correction for genes with low to moderate expression as well as an under-normalization for highly expressed genes [ 66 ]. To circumvent this problem, some global scaling normalization methods rely on pre-clustering (pooling) strategies as will be described.

One of the first normalization methods specifically developed for scRNA-seq was BASiCS (Bayesian Analysis of Single-Cell Sequencing Data) [ 90 ]. BASiCS implements an integrated Bayesian hierarchical model to infer cell-specific normalizing constants based on distinguishing technical noise from biological variability [ 90 ]. The original implementation of BASiCS relied on the use of spike-ins to estimate technical noise; however, the method was extended to work with multiple independent replicates [ 91 ]. It is important to note that BASiCS was designed to be implemented in scenarios where the cell types under study are known a priori, thus unsupervised settings are not recommended ( https://github.com/catavallejos/BASiCS ). Another highly used scaling-based normalization method is scran. Compared to other methods, scran groups cells with similar library sizes (pre-clustering), estimates a pool-specific factor by summing expression values across pools, and then estimates cell-specific size factors by deconvolving pooled factors [ 89 ]. This deconvolution method is implemented in the computeSumFactors function of the scran R package [ 92 ].
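The pooling intuition behind scran can be illustrated with the toy sketch below (NumPy). This is a deliberate simplification and not the computeSumFactors algorithm: cells are summed into overlapping pools, each pool receives a robust factor against an average pseudo-cell, and per-cell factors are then recovered by solving the resulting linear system.

```python
import numpy as np

def deconvolution_size_factors(counts, pool_size=10):
    """Toy illustration of pooling-and-deconvolution size factors (not scran itself).

    counts: genes x cells raw count matrix; pool_size must not exceed the number of cells.
    """
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[1]
    reference = counts.mean(axis=1)                  # average "pseudo-cell"
    expressed = reference > 0
    order = np.argsort(counts.sum(axis=0))           # ring of cells ordered by library size

    pools, pool_factors = [], []
    for start in range(n_cells):
        members = order[np.arange(start, start + pool_size) % n_cells]
        pooled = counts[:, members].sum(axis=1)
        pool_factors.append(np.median(pooled[expressed] / reference[expressed]))
        pools.append(members)

    # Each pool factor is modeled as the sum of its members' cell-specific factors.
    design = np.zeros((len(pools), n_cells))
    for row, members in enumerate(pools):
        design[row, members] = 1.0
    cell_factors, *_ = np.linalg.lstsq(design, np.array(pool_factors), rcond=None)
    return cell_factors / cell_factors.mean()        # rescale to unit mean
```

Because pooled counts contain far fewer zeros than individual cells, the pool-level ratios are more stable, which is the key advantage the deconvolution strategy exploits.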

A study performed by Buttner et al. compared the batch correction performance of 7 global scaling normalization methods including CPM based on library size, relative log expression, TMM, TPM, qsmooth [ 93 ], mean ratios, and scran size factor estimation, and demonstrated that scran outperformed other normalization methods [ 94 ]. Another benchmarking study assessed the performance of scran, SCnorm, Linnorm, Census, MR, and TMM [ 95 ]. Researchers concluded that scran was the best normalization method due to its good performance in common scRNA-seq scenarios with a high number of DEGs and differing levels of mRNA between cells [ 95 ]. Interesting results were reported by Ahlmann-Eltze and Huber in a benchmarking study where 22 transformations were applied to UMI-based scRNA-seq datasets [ 96 ]. The transformations used had the objective of adjusting UMI counts for variance stabilization, and they included delta method-based, residuals-based, latent gene expression-based, and count-based factor analysis transformations. The best performing transformation was the logarithm with a pseudo-count followed by PCA according to k-nearest neighbor (k-NN) based metrics.

A major caveat of global scaling factor normalization methods is that they assume that RNA content is constant for all cells and use the same scaling factor for all genes. Therefore, alternative normalization methods have been proposed.

Generalized linear models

Initial comparisons of the expression of genes between cells of the same type demonstrated that they were lognormally [ 97 ] or Gamma distributed [ 98 ]. Others have suggested that models of gene expression should incorporate the thermodynamic contribution to technical noise, which follows a Poisson distribution [ 99 ]. Mixed Poisson distributions have been widely used to model non-homogeneous scRNA-seq datasets, for example Beta-Poisson [ 100 ] and Gamma-Poisson [ 101 , 102 , 103 , 104 ]. The implementations of these models can be extended to allow variations between cells using GLMs.

GLMs are a statistical tool used to model the contribution of systematic and random components to a response variable (gene or UMI counts). GLMs include classical linear regression models and count-based models. Furthermore, GLMs allow the modeler to express a relationship between covariates, which will be regressed out, and a response variable in a linear, additive manner [ 105 ]. In this sense, covariates account for unwanted technical variability, for example sequencing depth, while biological variability is captured in the response variable. A commonly used regression model is Linnorm, a linear model and normality-based transformation method. Linnorm calculates normalization and transformation parameters based on stably expressed genes across different cells and fits the log-transformed expression data to a linear model [ 106 ]. Other common regression-based normalization approaches are PsiNorm and SCnorm. PsiNorm performs normalization between samples by fitting data into a Pareto power-law distribution, providing performance comparable to scran and Linnorm with shorter runtime and memory efficiency [ 107 ]. SCnorm first performs a quantile regression for every gene to determine the dependence of gene-specific expression on sequencing depth, and then a second quantile regression estimates scale factors for groups of genes [ 108 ]. A benchmarking study systematically compared the performance of combinations of methods for normalization and imputation, clustering, trajectory analysis, and data integration [ 109 ]. The authors evaluated 8 popular normalization methods including BASiCS, scran, SCnorm, and Linnorm using mixtures of cells or RNA by calculating the silhouette width of clusters and the Pearson correlation coefficient of normalized gene expression. This pioneering mixology experiment demonstrated that scran and Linnorm had consistent satisfactory results and Linnorm’s performance was invariant to the input dataset [ 109 ].

GLMs have also been proposed to model read counts using probability distributions. Commonly used count distributions to model gene counts across single-cells include non-zero inflated: Poisson and negative binomial (NB), and zero-inflated: Poisson (ZIP) and NB (ZINB). These methods have slight differences in how they calculate the probability of zero counts. Poisson methods have only one parameter, λ, corresponding to both the mean and the variance. The assumption of Poisson normalization methods is that the frequency of a given transcript is uniform across cells and variation is derived from independent statistical sampling. However, as previously explained, variations in counts are rooted in both technical and biological factors, making the use of this distribution inappropriate. ZIP and NB incorporate one additional parameter each (p and ψ, modeling the proportion of non-Poisson zeros and the overdispersion of the variance relative to the mean, respectively), whereas ZINB incorporates both. It has been demonstrated that the sampling distribution of UMI counts (plate-based or droplet-based) is not zero inflated, as compared to read counts [ 104 , 110 ]. Thus, if UMIs are used, normalization methods involving zero inflation are not appropriate [ 104 , 111 , 112 ]. NB provides a better approximation to model UMI count data [ 113 ]. It assumes random transcript frequencies and includes a parameter to quantify overdispersion. NB regression models account for cell-specific covariates, for example sequencing depth [ 101 ]. However, researchers have demonstrated that modeling single-cell data with a NB distribution may lead to overfitting [ 102 ]. By comparing these four distributions (Poisson, NB, ZIP, ZINB) using the same mean, Jiang et al. showed that ZINB depicts the highest proportion of zeros ( ∼  64%) whereas NB and ZINB depict bigger probabilities of finding larger values [ 114 ].
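The practical difference between these distributions can be seen by comparing the probability of observing a zero count at the same mean. The SciPy snippet below is a small illustration with arbitrary values of the mean mu and NB dispersion theta (parameterized so that the variance is mu + mu²/theta).

```python
from scipy.stats import nbinom, poisson

mu, theta = 2.0, 0.5                 # same mean; theta controls the NB overdispersion
p = theta / (theta + mu)             # SciPy's nbinom(n, p) with n = theta then has mean mu

print("P(count = 0), Poisson          :", round(poisson.pmf(0, mu), 3))
print("P(count = 0), negative binomial:", round(nbinom.pmf(0, theta, p), 3))
# The overdispersed NB places much more mass at zero than a Poisson with the
# same mean, without needing an explicit zero-inflation component.
```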

Variations of count-based GLMs have been proposed. Hafemeister et al. developed scTransform, a regularized NB regression in which UMI-based gene counts are the response variable and sequencing depth is a covariable [ 102 ]. The Pearson residuals from this regularized NB regression accurately represent the normalized data values and can be used as an input to dimensionality reduction algorithms. scTransform v2 effectively performs variance stabilization and outperforms other methods for variable gene identification and differential expression analysis [ 115 ]. It is available as an R package and can be used through the Seurat toolkit. Researchers also modelled gene counts per cell as a random variable following a zero-inflated NB (ZINB) distribution, while allowing the inclusion of cell- and gene-level covariates [ 69 ]. This method was named ZINB-based Wanted Variation Extraction (ZINB-WaVE) [ 69 ]. Covariates are introduced as parameters in regression equations, and they are inferred through a penalized maximum likelihood procedure. Interestingly, this method can also be used for dimensionality reduction. In another approach, a Gamma Regression Model (GRM) was proposed to reduce the noise in scRNA-seq data [ 116 ]. GRM relies on spike-ins to train a model that fits a GRM between sequencing reads and spike-in concentrations.
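The core of the residuals-based idea can be sketched as follows (NumPy): expected counts are formed from each cell's depth and each gene's overall count fraction, and Pearson residuals are computed under an NB variance with a fixed dispersion theta. This is a simplified, fixed-theta illustration, not scTransform's regularized per-gene regression.

```python
import numpy as np

def nb_pearson_residuals(counts, theta=100.0):
    """Pearson residuals under a simple NB model for a cells x genes raw count matrix.

    mu_ij = depth_i * gene_fraction_j approximates the expected count, and the NB
    variance is mu + mu^2 / theta. A single fixed theta is used for illustration,
    unlike scTransform's gene-wise regularized fits. Genes with zero total counts
    are assumed to have been filtered out beforehand.
    """
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)           # total counts per cell
    gene_fraction = counts.sum(axis=0) / counts.sum()   # share of all counts per gene
    mu = depth * gene_fraction
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    clip = np.sqrt(counts.shape[0])                     # clip extremes for stability
    return np.clip(residuals, -clip, clip)
```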

Mixed methods

In mixed methods, normalization is performed through the combined implementation of different approaches. Mixed methods are very important in addressing the characteristic bimodal expression pattern of single cells, where abundant genes appear to either have a high expression or to be undetected. These methods can model various sources of technical variability independently using different probability distributions for each. One of the first approaches in using this class of normalization was single cell differential expression (SCDE) proposed by Kharchenko et al. [ 43 ]. SCDE models cell counts as a mixture of two probabilistic processes: a negative binomial corresponding to normal gene amplification and detection, and a Poisson distribution accounting for zero counts. The optimal parameters corresponding to each distribution are then determined through a multinomial logistic regression. SCDE is implemented in the pathway and gene set overdispersion analysis (PAGODA) in which cell-specific error models are used to estimate residual gene expression variance allowing the identification of pathways and gene sets depicting significant coordinated variability [ 117 ].

Similarly, the “Model-based Analysis of Single-cell Transcriptomics” (MAST) [ 118 ], uses a hurdle model implemented as a two-part generalized linear model that simultaneously models the fraction of genes that are detectably expressed in each cell (cellular detection rate: CDR) and the positive gene expression values. MAST models gene expression rate using a logistic regression and a Gaussian distribution is used to model the expression level depending on a gene being expressed in a specific cell. MAST is available as an R library in Bioconductor, and it includes functions for cell filtering, adaptive noise thresholding, univariate differential gene expression with covariate adjustment, gene-gene correlations and co-expression, and gene set enrichment analysis.

Deep learning-based methods

Deep learning, a subclass of machine learning, has been recently used to analyze high-throughput omics data, including scRNA-seq [ 119 ]. Deep learning uses neural network architectures with thousands of trainable parameters to discover latent and informative patterns in complex data and to find transformations that can effectively normalize counts while preserving biological information [ 120 ]. Deep learning approaches for scRNA-seq data normalization include autoencoders, variational autoencoders, and graph neural networks [ 121 ]. Variational autoencoders are a popular class of unsupervised learning methods. For example, single cell variational inference (scVI) learns cell-specific scaling factors by modeling the expression of a gene in a cell as a sample from a ZINB distribution incorporating a batch annotation of each cell and two unobserved random variables [ 122 ]. Deep learning methods proposed for scRNA-seq data analysis have been reviewed by Brendel et al. [ 119 ]. While these emerging methods are promising, independent benchmarking studies comparing their performance against traditional statistical methods are needed.

Batch effect correction methods

Batch effect correction methods aim at removing technical variability derived from experimental design without altering biological variability. Technical variability is systematic, and it is introduced from multiple sources, as previously described. This variability can be confounded as biological and thus, its removal is essential. Methods developed for microarray and bulk RNA-seq data batch correction such as ComBat [ 123 ] and limma [ 124 ] have been used. These methods use a linear regression to model the relationship between batch and gene expression. Other methods, for example ZINB-WaVE extend the linear model based on a zero-inflated negative binomial distribution, accounting for data sparsity, over-dispersion, and non-linear batch effects [ 69 ]. A caveat of linear regression methods is that they assume that the composition of cell subpopulations is identical from batch to batch, making them prone to overcorrection [ 54 ]. However, in scRNA-seq, subpopulation composition is not the same across batches. Therefore, methods relying on the identification of shared cell types across batches have been developed, for example mutual nearest neighbors (MNN) [ 125 ]. This method identifies cells with similar expression profiles between two batches and then estimates a correction vector using the mean differences in gene expression between cells in MNN pairs. The correction vector is then used to align datasets in a shared space, eliminating batch effects. Since the MNN search is performed in a high dimensional space, this method’s caveat is a high memory consumption and CPU runtime. To overcome this problem, numerous algorithms have been developed with the characteristic that the nearest neighbor search is performed in a common reduced dimensional embedding using for example PCA [ 126 ], canonical correlation analysis (CCA) [ 127 ], non-negative matrix factorization (NMF) [ 128 ], and singular value decomposition (SVD) [ 129 ]. Common examples of these methods include fastMNN [ 125 ] and Harmony [ 130 ] which use PCA, Seurat MultiCCA [ 70 ] that captures correlated pairs in a CCA dimensionally reduced space, LIGER [ 131 ] which uses integrative NMF to transform data into a low-dimensional space, and Scanorama [ 132 ] that implements SVD for neighbor search. These unsupervised methods based on MNN may incorrectly match neighboring cells from different clusters across batches, leading to spurious results.
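The pair-finding step at the heart of MNN-based methods can be sketched with scikit-learn's NearestNeighbors, as below. This is only the mutual-neighbor search; the correction-vector estimation and smoothing performed by fastMNN and related tools are omitted, and the batch arrays and k value are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mutual_nearest_neighbors(batch_a, batch_b, k=20):
    """Return (i, j) pairs where cell i of batch_a and cell j of batch_b are each
    among the other's k nearest neighbors. Inputs are cells x features arrays
    (typically a shared low-dimensional embedding such as PCs)."""
    a_index = NearestNeighbors(n_neighbors=k).fit(batch_a)
    _, a_idx = a_index.kneighbors(batch_b)   # for each B cell: its nearest A cells
    b_index = NearestNeighbors(n_neighbors=k).fit(batch_b)
    _, b_idx = b_index.kneighbors(batch_a)   # for each A cell: its nearest B cells

    a_to_b = {(i, j) for i in range(batch_a.shape[0]) for j in b_idx[i]}
    b_to_a = {(i, j) for j in range(batch_b.shape[0]) for i in a_idx[j]}
    return sorted(a_to_b & b_to_a)           # mutual pairs only
```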

Supervised MNN methods have also been proposed, for example SMNN [ 133 ] and iSMNN [ 134 ]. These methods require the same cell type across batches since they incorporate cell-type specific information to restrict the detection of MNNs. Cell type labels across shared cells in all batches are determined through prior knowledge or inferred by an unsupervised clustering approach. Deep learning-based methods have also become popular for batch effect correction. For example, deepMNN [ 135 ] attempts to remove batch effects using a residual neural network that minimizes batch loss, defined as the sum of the Euclidean distances between MNN pairs in PCA space. However, most of the methods based on MNN only analyze two batches at a time, introducing a batch correction order bias. Furthermore, most of these algorithms remove batch effect and then cluster cells, increasing the probability of missing rare cell types. To solve these issues, the batch alignment of single cell transcriptomics data using a deep metric learning (scDML) model has recently been proposed [ 136 ]. scDML uses deep metric learning to remove batch effects, guided by the initial clusters and MNN information within and between batches.

Normalization performance assessment

Given the prevalence of confounding factors in single-cell experiments, the lack of gold-standard normalization methods and the ambiguity in selecting parameters used in such methods, a set of metrics and guidelines have been proposed to aid in the selection of the most suitable normalization method. Pilot experiments must be performed to evaluate and compare normalization pipelines. A list of benchmarking studies and the metrics used for evaluating normalization methods is included in Additional file 3 . Popular evaluation metrics are described next.

Silhouette width

The silhouette width is an established metric used to determine clustering validity [ 137 ]. However, it has also been used to compare the performance of normalization methods [ 94 , 109 ]. In this method, a silhouette width value is calculated for each cluster using the normalized average distance between its cells and the cells belonging to other clusters. The first two or three principal components (PCs) of normalized counts are generally used to calculate the Euclidean distances between cells. Larger silhouette widths correspond to a better separation between clusters. A known mixture of cells should be used to identify the best performing normalization method according to the experimental conditions.
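Silhouette widths are straightforward to compute with scikit-learn. The sketch below assumes a normalized cells x genes matrix and known mixture or cluster labels; here both are stand-in random data, used only to show the calls involved.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
norm_counts = rng.poisson(2.0, size=(200, 500)).astype(float)  # stand-in normalized matrix
labels = rng.integers(0, 3, size=200)                          # stand-in cluster labels

pcs = PCA(n_components=3).fit_transform(norm_counts)  # distances taken on the first PCs
overall = silhouette_score(pcs, labels)               # mean silhouette width over all cells
per_cell = silhouette_samples(pcs, labels)            # per-cell values, to summarize per cluster
print(overall, per_cell[:5])
```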

K-nearest neighbor batch effect test

Performance can also be evaluated through the K-nearest neighbor batch-effect test (kBET) and a PC regression [ 94 ]. These two methods are used to evaluate batch effect correction methods. However, authors have tested these methods by sequencing two technical replicates of the same cell type and introducing a known batch effect. Then, data sets have been processed with combinations of imputation, normalization, and batch effect correction methods to determine which pipeline better removes the batch effect preserving biological variability. The kBET relies on the assumption that in a well-mixed replicated experiment, subsets of a fixed number of neighboring cells have the same distribution of batch labels as the complete dataset. To compare batch label distributions, a Pearson’s χ² test is suggested, and a rejection rate is calculated. Intuitively, lower rejection rates are obtained when batch effects have been properly removed and the normalization method is adequate. Alternatively, the scaled explained variance of all PCs significantly correlated with batch effect may also be used to evaluate normalization method performance. In this method, the variance explained by the top 50 PCs is used as a scaling factor. Furthermore, a linear regression between the loadings of each PC and the batch covariate is used to determine a PC’s significance. The amount of scaled variance explained is correlated with the degree of batch effect present in the dataset.
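A rough, illustrative version of the kBET idea (not the published implementation) can be written with SciPy and scikit-learn: for each cell's k-nearest neighborhood, the observed batch composition is compared with the global batch frequencies using a chi-squared test, and the fraction of rejected neighborhoods is reported.

```python
import numpy as np
from scipy.stats import chisquare
from sklearn.neighbors import NearestNeighbors

def kbet_rejection_rate(embedding, batch, k=25, alpha=0.05):
    """Fraction of k-nearest neighborhoods whose batch composition differs
    significantly from the global batch frequencies (rough kBET-style metric).

    embedding: cells x dims (e.g. top PCs); batch: array of batch labels per cell.
    """
    batch = np.asarray(batch)
    labels, global_counts = np.unique(batch, return_counts=True)
    expected = global_counts / global_counts.sum() * k   # expected neighbors per batch

    _, neighbors = NearestNeighbors(n_neighbors=k).fit(embedding).kneighbors(embedding)
    rejected = 0
    for neighborhood in neighbors:
        observed = np.array([(batch[neighborhood] == lab).sum() for lab in labels])
        _, p_value = chisquare(observed, f_exp=expected)
        rejected += p_value < alpha
    return rejected / len(neighbors)
```

Lower rejection rates indicate better mixing of batches, mirroring the interpretation described above.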

Highly variable genes

Biological heterogeneity in the datasets should be conserved even after the implementation of imputation, normalization, and/or batch-effect correction methods. By comparing Highly Variable Genes (HVG) before and after normalization pipelines, scientists may determine if biological heterogeneity was preserved [ 57 , 94 ]. The variability of a gene is obtained through the squared coefficient of variation (CV²) of normalized read counts across cells. HVGs are those whose variation is greater than a fixed threshold and they account for the heterogeneity between cells. HVGs should be maintained after implementing a normalization pipeline and no new HVGs should be introduced. A schematic representation of silhouette width, kBET, scaled explained variance, and HVG metrics for evaluating normalization pipelines is depicted in Fig.  3 . Plots such as t-Distributed Stochastic Neighbor Embedding (t-SNE) [ 138 ] and Uniform Manifold Approximation and Projection (UMAP) [ 139 ] are generally used to visualize cell clusters before and after a normalization pipeline.
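A minimal version of the CV²-based HVG call is sketched below (NumPy; the fixed threshold is an illustrative choice, not a recommended value). Running it on the raw replicates and on the normalized, combined data and intersecting the resulting gene sets gives the comparison described above.

```python
import numpy as np

def highly_variable_genes(norm_counts, cv2_threshold=1.0):
    """Boolean mask of HVGs for a cells x genes matrix of normalized counts.

    A gene is flagged when its squared coefficient of variation
    (variance / mean^2) across cells exceeds a fixed threshold.
    """
    norm_counts = np.asarray(norm_counts, dtype=float)
    mean = norm_counts.mean(axis=0)
    variance = norm_counts.var(axis=0)
    cv2 = np.divide(variance, mean**2, out=np.zeros_like(mean), where=mean > 0)
    return cv2 > cv2_threshold

# Example comparison: HVGs preserved by a normalization pipeline.
# preserved = highly_variable_genes(raw) & highly_variable_genes(normalized)
```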

Figure 3. Data-driven metrics used to assess the performance of normalization methods. (a) Violin plots depicting the normalized silhouette widths obtained with different normalization methods; larger silhouette widths correspond to better separation between clusters and thus a better normalization. (b) HVGs are identified independently from the raw replicates and from the normalized combined dataset; better-performing pipelines retain more HVGs in the intersection of all datasets. (c) Schematic representation of the scaled explained variance obtained from the first two principal components before and after normalization; counts in the before-normalization scenario were log-transformed.

Scone, a tool for systematic comparison of normalization pipelines

An important tool, Scone, was recently developed by Cole et al. for the comparison of normalization pipelines [140]. Scone is a flexible, modular framework for preprocessing scRNA-seq datasets with multiple normalization strategies and systematically evaluating them through a panel of data-driven metrics. Scone ranks a range of normalization pipelines according to performance metrics, including, for example, the silhouette width, and it can also incorporate user-defined normalization pipelines.

Recently, over 1000 tools for analyzing scRNA-seq data have been developed [141, 142]. Based on the analysis procedure, Zappia et al. separate single-cell data analysis into four phases: data acquisition, data cleaning, cell assignment, and gene identification [142]. The majority of these tools are developed in R or Python, and an increasing share are being written in Python [141]. Here, we introduce some toolkits that can perform a complete analysis of scRNA-seq datasets (Additional file 4).

Seurat is widely used by researchers and starts from a gene expression (read count) matrix (https://satijalab.org/seurat/). It can compare scRNA-seq datasets from different conditions, technologies, or species. Seurat has two main normalization methods, LogNormalize [143] and sctransform [102]. For integrating different scRNA-seq datasets, Seurat offers two methods to remove batch effects: CCA (canonical correlation analysis) [143] and RPCA (reciprocal PCA) [144]; RPCA is optimized for large numbers of samples and cells [144]. Seurat provides cell clusters, marker gene expression, and differentially expressed genes among clusters. Furthermore, Seurat results can be transferred to other platforms or pipelines, for example, Monocle's pseudotime analysis [145, 146, 147], RNA velocity analysis [148], single-cell regulatory network inference and clustering (SCENIC) [149], and cell-cell communication analysis (e.g., CellChat [150]).

SCANPY is another, similar toolkit for scRNA-seq analysis [151]. It is a Python-based tool that starts from a gene expression matrix and integrates many scRNA-seq analysis methods, including gene/cell preprocessing, clustering, and pseudotime and trajectory inference. SCANPY's built-in normalization is based only on library size. Four algorithms are available for removing batch variation: Regress_Out [151], ComBat [123], Scanorama [132], and MNN_Correct [125, 152]. Compared with the R-based Seurat, the Python-based SCANPY offers better processing efficiency and running speed [152]. SCANPY also integrates PAGA [153], so it can perform trajectory analysis directly.
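
For illustration only, the sketch below strings together the SCANPY steps just mentioned (library-size normalization, ComBat batch correction, and clustering). It assumes `adata` is an AnnData object of raw counts with a "batch" column in `adata.obs`; adapt the names and parameters to your own data.

```python
# Minimal SCANPY sketch (assumes `adata` holds raw counts and has adata.obs["batch"]).
import scanpy as sc

def basic_scanpy_pipeline(adata):
    sc.pp.filter_genes(adata, min_cells=3)           # drop genes detected in very few cells
    sc.pp.normalize_total(adata, target_sum=1e4)     # library-size normalization
    sc.pp.log1p(adata)                               # log(1 + x) transform
    sc.pp.combat(adata, key="batch")                 # ComBat batch-effect correction
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    adata = adata[:, adata.var.highly_variable].copy()
    sc.pp.pca(adata, n_comps=50)
    sc.pp.neighbors(adata, n_neighbors=15)
    sc.tl.umap(adata)                                # visualize clusters before/after normalization
    sc.tl.leiden(adata)                              # clustering (requires the leidenalg package)
    return adata
```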

The use of these toolkits requires programming experience. With the growth of scRNA-seq data analysis, several graphical user interface (GUI) analysis tools have also been developed, such as scOrange [154], SCTK (Single Cell Toolkit) [155], Granatum [156], and ASAP (Automated Single-cell Analysis Pipeline) [157]. These web-based analysis tools integrate several normalization and batch-effect removal methods. For example, Granatum has four normalization methods (quantile normalization, geometric mean normalization, size-factor normalization, and Voom) and two batch-effect removal methods (ComBat and median alignment) [156]. SCTK is built into the singleCellTK R package; however, it can also analyze sc/snRNA-seq data through a graphical user interface (https://sctk.bu.edu/) implemented as a Shiny app [155]. It includes several normalization methods from Seurat (e.g., LogNormalize, sctransform) and Scater (e.g., CPM, LogNormCounts), and batch-effect removal methods (e.g., MNN, scMerge, Scanorama, and ComBat-Seq).

Major advances in single-cell sequencing technologies have greatly improved our understanding of the complexity of organs and tissues and the dynamism of biological processes. A critical step in scRNA-seq data analysis is normalization, a process that aims at making gene counts comparable within and between cells and among biological replicates. Recent pioneering work by Choudhary and Satija demonstrated that the degree of overdispersion within 59 scRNA-seq datasets varied widely across datasets, systems, and gene abundances, suggesting that parameter estimation is dataset-specific [115]. Thus, the selection of a normalization method is not trivial, and it has a direct impact on downstream analysis. For example, a study by Squair et al. [158] found that the most frequently used methods for differential expression analysis (including each method's normalization) identified differentially expressed genes even when biological differences were absent. The authors demonstrated a systematic tendency of single-cell methods to identify highly expressed but unchanged genes as differentially expressed. Moreover, falsely differentially expressed genes will affect clustering and trajectory analysis. These results underscore the importance of selecting normalization methods that adequately account for technical noise and for variability between biological replicates. Another intriguing observation from benchmarking studies is that normalization methods perform differently depending on the input dataset, likely because of differences in technical noise sources and in sample heterogeneity. Instead of comparing normalization performance on numerous real-world or simulated datasets, benchmarking studies should use well-designed mixture control experiments, as previously proposed [109].

In scRNA-seq count data, cell-to-cell biological variation is related to cell type and state and is encoded in cellular transcriptomes. This heterogeneity is the main signal of interest, and it should be modeled together with the covariates that influence gene expression. Normalization methods take different approaches to account for sources of technical variability. Global scaling methods estimate a size factor for each cell to account for differences in library size; because the same factor is applied to all genes of a cell, biological variability may be affected. Global scaling methods that rely on pre-clustering, or on pooling cells with similar library sizes and estimating a pool-specific factor (for example, scran), perform better, as demonstrated by benchmarking studies [94, 95]. In contrast, generalized linear models use probability distributions to model the contribution of systematic and random components to a response variable corresponding to gene counts. Covariates account for technical variation, such as sequencing depth, and are regressed out, while the true biological variability is expected to be captured in the response variable. Mixed methods extend linear models by addressing each source of technical variability with an independent probability distribution or error model. Emerging deep learning-based methods use neural network architectures to learn underlying patterns of gene expression with complex, non-linear relationships; they can efficiently model sources of technical variation, including batch effects, and find transformations that normalize counts while preserving biological variability. Studies using mixture control experiments to benchmark deep learning-based normalization methods are still needed.
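
As a bare-bones numeric illustration of the global scaling idea (toy data only; pooling-based estimators such as scran refine the size-factor estimate but keep the one-factor-per-cell principle):

```python
# Global scaling normalization in miniature: one size factor per cell, applied to all genes.
import numpy as np

rng = np.random.default_rng(1)
# Simulate 200 cells x 1000 genes with cell-specific capture efficiency (toy data).
depth = rng.uniform(0.5, 5.0, size=(200, 1))
counts = rng.poisson(depth * np.ones((1, 1000))).astype(float)

library_size = counts.sum(axis=1)                       # total counts per cell
size_factors = library_size / np.median(library_size)   # relative size factor per cell
normalized = counts / size_factors[:, None]             # the same factor scales every gene of a cell

print(normalized.sum(axis=1)[:5])                       # library sizes are now comparable across cells
```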

The selection of the most appropriate normalization method is strongly dependent on the experimental design, protocol, and platform, and assumptions about technical and biological variability need to be made. Thus, there is no single best-performing normalization pipeline. Instead, pilot experiments should be performed to evaluate a series of normalization pipelines using the recommended metrics. These experiments should closely resemble the final experiment; for instance, the same experimental platform and sequencing technology should be used. The best-suited normalization method may then be selected through the assessment of the data-driven metrics described herein. Moreover, the use of frameworks such as Scone is also recommended to simultaneously evaluate the performance of numerous normalization pipelines.

Further work is needed to develop new tools that provide accurate diagnostics of whether the statistical assumptions hold for the observed data. Novel approaches such as the introduction of molecular spikes for more accurate molecule counting have the potential to become a gold standard, reducing technical variability and facilitating the selection of a normalization method.

Data availability

There are no new data associated with this article.

Abbreviations

single-cell RNA-sequencing

fluorescence/magnetic-activated cell sorting

integrated fluidics circuits

reverse transcription

External RNA Control Consortium

unique molecule identifier

in vitro transcription

template-switching oligonucleotides

single-cell tagged reverse transcription sequencing

switching mechanisms at the 5’-end of the RNA transcript sequencing

cell expression by linear amplification and sequencing

massively parallel single-cell RNA sequencing

indexing droplets RNA sequencing

generalized linear models

transcripts or counts per million

reads per kilobase of exon model per million mapped reads

Internal Spike-in-like-genes normalization

trimmed mean of M values

Bayesian Analysis of Single-Cell Sequencing Data

zero-inflated NB

ZINB-based Wanted Variation Extraction

Gamma Regression Model

single cell differential expression

pathway and gene set overdispersion analysis

Model-based Analysis of Single-cell Transcriptomics

mutual nearest neighbors

canonical correlation analysis

non-negative matrix factorization

singular value decomposition

principal components

K-nearest neighbor batch-effect test

t-Distributed Stochastic Neighbor Embedding

Highly Variable Genes

Uniform Manifold Approximation and Projection

reciprocal PCA

single-cell regulatory network inference and clustering

Single Cell Toolkit

Automated Single-cell Analysis Pipeline

polymerase chain reaction

potassium acetate

magnesium acetate

not available

Choi YH, Kim JK. Dissecting Cellular Heterogeneity using single-cell RNA sequencing. Mol Cells. 2019;42(3):189–99.

He J, Babarinde IA, Sun L, Xu S, Chen R, Shi J, Wei Y, Li Y, Ma G, Zhuang Q, et al. Identifying transposable element expression dynamics and heterogeneity during development at the single-cell level with a processing pipeline scTE. Nat Commun. 2021;12(1):1456.

Wilkerson BA, Zebroski HL, Finkbeiner CR, Chitsazan AD, Beach KE, Sen N, Zhang RC, Bermingham-McDonogh O. Novel cell types and developmental lineages revealed by single-cell RNA-seq analysis of the mouse crista ampullaris. Elife 2021, 10.

Jerber J, Seaton DD, Cuomo ASE, Kumasaka N, Haldane J, Steer J, Patel M, Pearce D, Andersson M, Bonder MJ, et al. Population-scale single-cell RNA-seq profiling across dopaminergic neuron differentiation. Nat Genet. 2021;53(3):304–12.

Vallejos CA, Richardson S, Marioni JC. Beyond comparisons of means: understanding changes in gene expression at the single-cell level. Genome Biol. 2016;17:70.

van den Brink SC, Sage F, Vertesy A, Spanjaard B, Peterson-Maduro J, Baron CS, Robin C, van Oudenaarden A. Single-cell sequencing reveals dissociation-induced gene expression in tissue subpopulations. Nat Methods. 2017;14(10):935–6.

Hashimshony T, Wagner F, Sher N, Yanai I. CEL-Seq: single-cell RNA-Seq by multiplexed linear amplification. Cell Rep. 2012;2(3):666–73.

Basu S, Campbell HM, Dittel BN, Ray A. Purification of specific cell population by fluorescence activated cell sorting (FACS). J Vis Exp 2010(41).

Schmitz B, Radbruch A, Kummel T, Wickenhauser C, Korb H, Hansmann ML, Thiele J, Fischer R. Magnetic activated cell sorting (MACS)--a new immunomagnetic method for megakaryocytic cell isolation: comparison of different separation techniques. Eur J Haematol. 1994;52(5):267–75.

Prakadan SM, Shalek AK, Weitz DA. Scaling by shrinking: empowering single-cell ‘omics’ with microfluidic devices. Nat Rev Genet. 2017;18(6):345–61.

Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187–201.

Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, et al. Highly parallel genome-wide expression profiling of individual cells using Nanoliter droplets. Cell. 2015;161(5):1202–14.

Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049.

Gierahn TM, Wadsworth MH 2nd, Hughes TK, Bryson BD, Butler A, Satija R, Fortune S, Love JC, Shalek AK. Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput. Nat Methods. 2017;14(4):395–8.

External RNA Controls Consortium. Proposed methods for testing and selecting the ERCC external RNA controls. BMC Genomics. 2005;6:150.

Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet. 2015;16(3):133–45.

Islam S, Zeisel A, Joost S, La Manno G, Zajac P, Kasper M, Lonnerberg P, Linnarsson S. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Methods. 2014;11(2):163–6.

Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, Wang X, Bodeau J, Tuch BB, Siddiqui A, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods. 2009;6(5):377–82.

Aicher TP, Carroll S, Raddi G, Gierahn T, Wadsworth MH 2nd, Hughes TK, Love C, Shalek AK. Seq-Well: a Sample-Efficient, portable Picowell platform for massively parallel single-cell RNA sequencing. Methods Mol Biol. 2019;1979:111–32.

Ramskold D, Luo S, Wang YC, Li R, Deng Q, Faridani OR, Daniels GA, Khrebtukova I, Loring JF, Laurent LC, et al. Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells. Nat Biotechnol. 2012;30(8):777–82.

Picelli S, Bjorklund AK, Faridani OR, Sagasser S, Winberg G, Sandberg R. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat Methods. 2013;10(11):1096–8.

Yanai I, Hashimshony T. CEL-Seq2-Single-cell RNA sequencing by Multiplexed Linear amplification. Methods Mol Biol. 2019;1979:45–56.

Jaitin DA, Kenigsberg E, Keren-Shaul H, Elefant N, Paul F, Zaretsky I, Mildner A, Cohen N, Jung S, Tanay A, et al. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science. 2014;343(6172):776–9.

Zilionis R, Nainys J, Veres A, Savova V, Zemmour D, Klein AM, Mazutis L. Single-cell barcoding and sequencing using droplet microfluidics. Nat Protoc. 2017;12(1):44–73.

Ziegenhain C, Vieth B, Parekh S, Reinius B, Guillaumet-Adkins A, Smets M, Leonhardt H, Heyn H, Hellmann I, Enard W. Comparative analysis of single-cell RNA sequencing methods. Mol Cell. 2017;65(4):631–e643634.

Tirosh I, Izar B, Prakadan SM, Wadsworth MH 2nd, Treacy D, Trombetta JJ, Rotem A, Rodman C, Lian C, Murphy G, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science. 2016;352(6282):189–96.

Tirosh I, Venteicher AS, Hebert C, Escalante LE, Patel AP, Yizhak K, Fisher JM, Rodman C, Mount C, Filbin MG, et al. Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma. Nature. 2016;539(7628):309–13.

Giustacchini A, Thongjuea S, Barkas N, Woll PS, Povinelli BJ, Booth CAG, Sopp P, Norfo R, Rodriguez-Meira A, Ashley N, et al. Single-cell transcriptomics uncovers distinct molecular signatures of stem cells in chronic myeloid leukemia. Nat Med. 2017;23(6):692–702.

Mereu E, Lafzi A, Moutinho C, Ziegenhain C, McCarthy DJ, Alvarez-Varela A, Batlle E, Sagar, Grun D, Lau JK, et al. Benchmarking single-cell RNA-sequencing protocols for cell atlas projects. Nat Biotechnol. 2020;38(6):747–55.

Picelli S, Faridani OR, Bjorklund AK, Winberg G, Sagasser S, Sandberg R. Full-length RNA-seq from single cells using Smart-seq2. Nat Protoc. 2014;9(1):171–81.

Hagemann-Jensen M, Ziegenhain C, Chen P, Ramskold D, Hendriks GJ, Larsson AJM, Faridani OR, Sandberg R. Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nat Biotechnol. 2020;38(6):708–14.

Hagemann-Jensen M, Ziegenhain C, Sandberg R. Scalable single-cell RNA sequencing from full transcripts with Smart-seq3xpress. Nat Biotechnol. 2022;40(10):1452–7.

Hahaut V, Pavlinic D, Carbone W, Schuierer S, Balmer P, Quinodoz M, Renner M, Roma G, Cowan CS, Picelli S. Fast and highly sensitive full-length single-cell RNA sequencing using FLASH-seq. Nat Biotechnol. 2022;40(10):1447–51.

Tian L, Jabbari JS, Thijssen R, Gouil Q, Amarasinghe SL, Voogd O, Kariyawasam H, Du MRM, Schuster J, Wang C, et al. Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing. Genome Biol. 2021;22(1):310.

Pan L, Dinh HQ, Pawitan Y, Vu TN. Isoform-level quantification for single-cell RNA sequencing. Bioinformatics. 2022;38(5):1287–94.

Kaminow B, Yunusov D, Dobin A. STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus RNA-seq data. bioRxiv; 2021.

Svensson V, Natarajan KN, Ly LH, Miragaia RJ, Labalette C, Macaulay IC, Cvejic A, Teichmann SA. Power analysis of single-cell RNA-sequencing experiments. Nat Methods. 2017;14(4):381–7.

Wu AR, Neff NF, Kalisky T, Dalerba P, Treutlein B, Rothenberg ME, Mburu FM, Mantalas GL, Sim S, Clarke MF, et al. Quantitative assessment of single-cell RNA-sequencing methods. Nat Methods. 2014;11(1):41–6.

Zhang X, Li T, Liu F, Chen Y, Yao J, Li Z, Huang Y, Wang J. Comparative analysis of Droplet-based Ultra-high-throughput single-cell RNA-Seq systems. Mol Cell. 2019;73(1):130–e142135.

Sarkar A, Stephens M. Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis. Nat Genet. 2021;53(6):770–7.

Marinov GK, Williams BA, McCue K, Schroth GP, Gertz J, Myers RM, Wold BJ. From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing. Genome Res. 2014;24(3):496–510.

Zenklusen D, Larson DR, Singer RH. Single-RNA counting reveals alternative modes of gene expression in yeast. Nat Struct Mol Biol. 2008;15(12):1263–71.

Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to single-cell differential expression analysis. Nat Methods. 2014;11(7):740–2.

Cabada MO, Darnbrough C, Ford PJ, Turner PC. Differential accumulation of two size classes of poly(A) associated with messenger RNA during oogenesis in Xenopus laevis. Dev Biol. 1977;57(2):427–39.

Weill L, Belloc E, Bava FA, Mendez R. Translational control by changes in poly(A) tail length: recycling mRNAs. Nat Struct Mol Biol. 2012;19(6):577–85.

Andrews TS, Kiselev VY, McCarthy D, Hemberg M. Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data. Nat Protoc. 2021;16(1):1–9.

Shalek AK, Satija R, Shuga J, Trombetta JJ, Gennert D, Lu D, Chen P, Gertner RS, Gaublomme JT, Yosef N, et al. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature. 2014;510(7505):363–9.

Islam S, Kjallquist U, Moliner A, Zajac P, Fan JB, Lonnerberg P, Linnarsson S. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res. 2011;21(7):1160–7.

Archer N, Walsh MD, Shahrezaei V, Hebenstreit D. Modeling enzyme Processivity reveals that RNA-Seq libraries are biased in characteristic and correctable ways. Cell Syst. 2016;3(5):467–e479412.

Von Hippel PH, Fairfield FR, Dolejsi MK. On the processivity of polymerases. Ann N Y Acad Sci. 1994;726:118–31.

Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G + C)-biased genomes. Nat Methods. 2009;6(4):291–5.

Mamanova L, Andrews RM, James KD, Sheridan EM, Ellis PD, Langford CF, Ost TW, Collins JE, Turner DJ. FRT-seq: amplification-free, strand-specific transcriptome sequencing. Nat Methods. 2010;7(2):130–2.

Lahens NF, Kavakli IH, Zhang R, Hayer K, Black MB, Dueck H, Pizarro A, Kim J, Irizarry R, Thomas RS, et al. IVT-seq reveals extreme bias in RNA sequencing. Genome Biol. 2014;15(6):R86.

Tran HTN, Ang KS, Chevrier M, Zhang X, Lee NYS, Goh M, Chen J. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 2020;21(1):12.

Lotfollahi M, Naghipourfar M, Luecken MD, Khajavi M, Buttner M, Wagenstetter M, Avsec Z, Gayoso A, Yosef N, Interlandi M, et al. Mapping single-cell data to reference atlases by transfer learning. Nat Biotechnol. 2022;40(1):121–30.

Luecken MD, Buttner M, Chaichoompu K, Danese A, Interlandi M, Mueller MF, Strobl DC, Zappia L, Dugas M, Colome-Tatche M, et al. Benchmarking atlas-level data integration in single-cell genomics. Nat Methods. 2022;19(1):41–50.

Brennecke P, Anders S, Kim JK, Kolodziejczyk AA, Zhang X, Proserpio V, Baying B, Benes V, Teichmann SA, Marioni JC, et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat Methods. 2013;10(11):1093–5.

Bacher R, Kendziorski C. Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol. 2016;17:63.

Katayama S, Tohonen V, Linnarsson S, Kere J. SAMstrt: statistical test for differential expression in single-cell transcriptome with spike-in normalization. Bioinformatics. 2013;29(22):2943–5.

Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25.

Grun D, van Oudenaarden A. Design and analysis of single-cell sequencing experiments. Cell. 2015;163(4):799–810.

Kaern M, Elston TC, Blake WJ, Collins JJ. Stochasticity in gene expression: from theories to phenotypes. Nat Rev Genet. 2005;6(6):451–64.

Raj A, van Oudenaarden A. Nature, nurture, or chance: stochastic gene expression and its consequences. Cell. 2008;135(2):216–26.

Suter DM, Molina N, Gatfield D, Schneider K, Schibler U, Naef F. Mammalian genes are transcribed with widely different bursting kinetics. Science. 2011;332(6028):472–4.

Fuda NJ, Ardehali MB, Lis JT. Defining mechanisms that regulate RNA polymerase II transcription in vivo. Nature. 2009;461(7261):186–92.

Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010;11:94.

Risso D, Schwartz K, Sherlock G, Dudoit S. GC-content normalization for RNA-Seq data. BMC Bioinformatics. 2011;12:480.

Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform. 2013;14(6):671–83.

Risso D, Perraudeau F, Gribkova S, Dudoit S, Vert JP. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun. 2018;9(1):284.

Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36(5):411–20.

Hicks SC, Townes FW, Teng M, Irizarry RA. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics. 2018;19(4):562–78.

Lahnemann D, Koster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, Vallejos CA, Campbell KR, Beerenwinkel N, Mahfouz A, et al. Eleven grand challenges in single-cell data science. Genome Biol. 2020;21(1):31.

van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, Burdziak C, Moon KR, Chaffer CL, Pattabiraman D, et al. Recovering gene interactions from single-cell data using data Diffusion. Cell. 2018;174(3):716–e729727.

Andrews TS, Hemberg M. False signals induced by single-cell imputation. F1000Res. 2018;7:1740.

Linderman GC, Zhao J, Roulis M, Bielecki P, Flavell RA, Nadler B, Kluger Y. Zero-preserving imputation of single-cell RNA-seq data. Nat Commun. 2022;13(1):192.

Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, Murray JI, Raj A, Li M, Zhang NR. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods. 2018;15(7):539–42.

Li WV, Li JJ. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat Commun. 2018;9(1):997.

Hou W, Ji Z, Ji H, Hicks SC. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol. 2020;21(1):218.

Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szczesniak MW, Gaffney DJ, Elo LL, Zhang X, et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016;17:13.

Jiang R, Sun T, Song D, Li JJ. Statistics or biology: the zero-inflation controversy about scRNA-seq data. Genome Biol. 2022;23(1):31.

Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323.

Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5(7):621–8.

Phipson B, Zappia L, Oshlack A. Gene length and detection bias in single cell RNA sequencing protocols. F1000Res. 2017;6:595.

Lin L, Song M, Jiang Y, Zhao X, Wang H, Zhang L. Normalizing single-cell RNA sequencing data with internal spike-in-like genes. NAR Genom Bioinform. 2020;2(3):lqaa059.

Ziegenhain C, Hendriks GJ, Hagemann-Jensen M, Sandberg R. Molecular spikes: a gold standard for single-cell RNA counting. Nat Methods. 2022;19(5):560–6.

Love MI, Huber W, Anders S. Moderated estimation of Fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550.

Wang T, Li B, Nelson CE, Nabavi S. Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data. BMC Bioinformatics. 2019;20(1):40.

Lun AT, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016;17:75.

Vallejos CA, Marioni JC, Richardson S. BASiCS: Bayesian Analysis of Single-Cell Sequencing Data. PLoS Comput Biol. 2015;11(6):e1004333.

Eling N, Richard AC, Richardson S, Marioni JC, Vallejos CA. Correcting the Mean-Variance dependency for Differential Variability Testing using single-cell RNA sequencing data. Cell Syst. 2018;7(3):284–e294212.

Lun AT, McCarthy DJ, Marioni JC. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 2016;5:2122.

Paulson JN, Chen CY, Lopes-Ramos CM, Kuijjer ML, Platig J, Sonawane AR, Fagny M, Glass K, Quackenbush J. Tissue-aware RNA-Seq processing and normalization for heterogeneous and sparse data. BMC Bioinformatics. 2017;18(1):437.

Buttner M, Miao Z, Wolf FA, Teichmann SA, Theis FJ. A test metric for assessing single-cell RNA-seq batch correction. Nat Methods. 2019;16(1):43–9.

Vieth B, Parekh S, Ziegenhain C, Enard W, Hellmann I. A systematic evaluation of single cell RNA-seq analysis pipelines. Nat Commun. 2019;10(1):4667.

Ahlmann-Eltze C, Huber W. Comparison of transformations for single-cell RNA-seq data. Nat Methods. 2023;20(5):665–72.

Bengtsson M, Stahlberg A, Rorsman P, Kubista M. Gene expression profiling in single cells from the pancreatic islets of Langerhans reveals lognormal distribution of mRNA levels. Genome Res. 2005;15(10):1388–92.

Taniguchi Y, Choi PJ, Li GW, Chen H, Babu M, Hearn J, Emili A, Xie XS. Quantifying E. Coli proteome and transcriptome with single-molecule sensitivity in single cells. Science. 2010;329(5991):533–8.

Wills QF, Livak KJ, Tipping AJ, Enver T, Goldson AJ, Sexton DW, Holmes C. Single-cell gene expression analysis reveals genetic associations masked in whole-tissue experiments. Nat Biotechnol. 2013;31(8):748–52.

Vu TN, Wills QF, Kalari KR, Niu N, Wang L, Rantalainen M, Pawitan Y. Beta-Poisson model for single-cell RNA-seq data analyses. Bioinformatics. 2016;32(14):2128–35.

Grun D, Kester L, van Oudenaarden A. Validation of noise models for single-cell transcriptomics. Nat Methods. 2014;11(6):637–40.

Hafemeister C, Satija R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 2019;20(1):296.

Silverman JD, Roche K, Mukherjee S, David LA. Naught all zeros in sequence count data are the same. Comput Struct Biotechnol J. 2020;18:2789–98.

Svensson V. Droplet scRNA-seq is not zero-inflated. Nat Biotechnol. 2020;38(2):147–50.

Nelder JA, Wedderburn RWM. Generalized Linear models. J Royal Stat Soc Ser (General). 1972;135(3):370–84.

Yip SH, Wang P, Kocher JA, Sham PC, Wang J. Linnorm: improved statistical analysis for single cell RNA-seq expression data. Nucleic Acids Res. 2017;45(22):e179.

Borella M, Martello G, Risso D, Romualdi C. PsiNorm: a scalable normalization for single-cell RNA-seq data. Bioinformatics. 2021;38(1):164–72.

Bacher R, Chu LF, Leng N, Gasch AP, Thomson JA, Stewart RM, Newton M, Kendziorski C. SCnorm: robust normalization of single-cell RNA-seq data. Nat Methods. 2017;14(6):584–6.

Tian L, Dong X, Freytag S, Le Cao KA, Su S, JalalAbadi A, Amann-Zalcenstein D, Weber TS, Seidi A, Jabbari JS, et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat Methods. 2019;16(6):479–87.

Chen W, Li Y, Easton J, Finkelstein D, Wu G, Chen X. UMI-count modeling and differential expression analysis for single-cell RNA sequencing. Genome Biol. 2018;19(1):70.

Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biol. 2019;20(1):295.

Cao Y, Kitanovski S, Kuppers R, Hoffmann D. UMI or not UMI, that is the question for scRNA-seq zero-inflation. Nat Biotechnol. 2021;39(2):158–9.

Kim TH, Zhou X, Chen M. Demystifying “drop-outs” in single-cell UMI data. Genome Biol. 2020;21(1):196.

Jiang R, Sun T, Song D, Li J. Statistics or biology: the zero-inflation controversy about scRNA-seq data. Genome Biol. 2022;23(1):31.

Choudhary S, Satija R. Comparison and evaluation of statistical error models for scRNA-seq. Genome Biol. 2022;23(1):27.

Ding B, Zheng L, Zhu Y, Li N, Jia H, Ai R, Wildberg A, Wang W. Normalization and noise reduction for single cell RNA-seq experiments. Bioinformatics. 2015;31(13):2225–7.

Fan J, Salathia N, Liu R, Kaeser GE, Yung YC, Herman JL, Kaper F, Fan JB, Zhang K, Chun J, et al. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat Methods. 2016;13(3):241–4.

Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, Slichter CK, Miller HW, McElrath MJ, Prlic M, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 2015;16:278.

Brendel M, Su C, Bai Z, Zhang H, Elemento O, Wang F. Application of deep learning on single-cell RNA sequencing data analysis: a review. Genomics Proteom Bioinf. 2022;20(5):814–35.

Ma Q, Xu D. Deep learning shapes single-cell data analysis. Nat Rev Mol Cell Biol. 2022;23(5):303–4.

Erfanian N, Heydari AA, Feriz AM, Ianez P, Derakhshani A, Ghasemigol M, Farahpour M, Razavi SM, Nasseri S, Safarpour H, et al. Deep learning applications in single-cell genomics and transcriptomics data analysis. Biomed Pharmacother. 2023;165:115077.

Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nat Methods. 2018;15(12):1053–8.

Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27.

Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.

Haghverdi L, Lun ATL, Morgan MD, Marioni JC. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol. 2018;36(5):421–7.

Jolliffe IT. Principal component analysis and Factor Analysis. Principal component analysis. New York, NY: Springer New York; 2002. pp. 150–66.

Hardoon DR, Szedmak S, Shawe-Taylor J. Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 2004;16(12):2639–64.

Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999;401(6755):788–91.

Alter O, Brown PO, Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci U S A. 2000;97(18):10101–6.

Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, Baglaenko Y, Brenner M, Loh PR, Raychaudhuri S. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods. 2019;16(12):1289–96.

Welch JD, Kozareva V, Ferreira A, Vanderburg C, Martin C, Macosko EZ. Single-cell multi-omic integration compares and contrasts features of Brain Cell Identity. Cell. 2019;177(7):1873–e18871817.

Hie B, Bryson B, Berger B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat Biotechnol. 2019;37(6):685–91.

Yang Y, Li G, Qian H, Wilhelmsen KC, Shen Y, Li Y. SMNN: batch effect correction for single-cell RNA-seq data via supervised mutual nearest neighbor detection. Brief Bioinform 2021, 22(3).

Yang Y, Li G, Xie Y, Wang L, Lagler TM, Yang Y, Liu J, Qian L, Li Y. iSMNN: batch effect correction for single-cell RNA-seq data via iterative supervised mutual nearest neighbor refinement. Brief Bioinform 2021, 22(5).

Zou B, Zhang T, Zhou R, Jiang X, Yang H, Jin X, Bai Y. deepMNN: deep learning-based single-cell RNA sequencing data batch correction using mutual nearest neighbors. Front Genet. 2021;12:708981.

Yu X, Xu X, Zhang J, Li X. Batch alignment of single-cell transcriptomics data using deep metric learning. Nat Commun. 2023;14(1):960.

Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65.

van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579–605.

McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Preprint at https://arxiv.org/abs/1802.03426; 2018.

Cole MB, Risso D, Wagner A, DeTomaso D, Ngai J, Purdom E, Dudoit S, Yosef N. Performance Assessment and Selection of normalization procedures for single-cell RNA-Seq. Cell Syst. 2019;8(4):315–e328318.

Zappia L, Theis FJ. Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape. Genome Biol. 2021;22(1):301.

Zappia L, Phipson B, Oshlack A. Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database. PLoS Comput Biol. 2018;14(6):e1006245.

Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM 3rd, Hao Y, Stoeckius M, Smibert P, Satija R. Comprehensive Integration of Single-Cell Data. Cell. 2019;177(7):1888–902. e1821.

Hao Y, Hao S, Andersen-Nissen E, Mauck WM 3rd, Zheng S, Butler A, Lee MJ, Wilk AJ, Darby C, Zager M, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573–e35873529.

Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014;32(4):381–6.

Qiu X, Hill A, Packer J, Lin D, Ma YA, Trapnell C. Single-cell mRNA quantification and differential analysis with Census. Nat Methods. 2017;14(3):309–15.

Qiu X, Mao Q, Tang Y, Wang L, Chawla R, Pliner HA, Trapnell C. Reversed graph embedding resolves complex single-cell trajectories. Nat Methods. 2017;14(10):979–82.

La Manno G, Soldatov R, Zeisel A, Braun E, Hochgerner H, Petukhov V, Lidschreiber K, Kastriti ME, Lonnerberg P, Furlan A, et al. RNA velocity of single cells. Nature. 2018;560(7719):494–8.

Aibar S, Gonzalez-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, Rambow F, Marine JC, Geurts P, Aerts J, et al. SCENIC: single-cell regulatory network inference and clustering. Nat Methods. 2017;14(11):1083–6.

Jin S, Guerrero-Juarez CF, Zhang L, Chang I, Ramos R, Kuan CH, Myung P, Plikus MV, Nie Q. Inference and analysis of cell-cell communication using CellChat. Nat Commun. 2021;12(1):1088.

Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19(1):15.

Li J, Yu C, Ma L, Wang J, Guo G. Comparison of Scanpy-based algorithms to remove the batch effect from single-cell RNA-seq data. Cell Regen. 2020;9(1):10.

Wolf FA, Hamey FK, Plass M, Solana J, Dahlin JS, Gottgens B, Rajewsky N, Simon L, Theis FJ. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 2019;20(1):59.

Strazar M, Zagar L, Kokosar J, Tanko V, Erjavec A, Policar PG, Staric A, Demsar J, Shaulsky G, Menon V, et al. scOrange-a tool for hands-on training of concepts from single-cell data analytics. Bioinformatics. 2019;35(14):i4–12.

Hong R, Koga Y, Bandyadka S, Leshchyk A, Wang Y, Akavoor V, Cao X, Sarfraz I, Wang Z, Alabdullatif S, et al. Comprehensive generation, visualization, and reporting of quality control metrics for single-cell RNA sequencing data. Nat Commun. 2022;13(1):1688.

Zhu X, Wolfgruber TK, Tasato A, Arisdakessian C, Garmire DG, Garmire LX. Granatum: a graphical single-cell RNA-Seq analysis pipeline for genomics scientists. Genome Med. 2017;9(1):108.

Gardeux V, David FPA, Shajkofci A, Schwalie PC, Deplancke B. ASAP: a web-based platform for the analysis and interactive visualization of single-cell RNA-seq data. Bioinformatics. 2017;33(19):3123–5.

Squair JW, Gautier M, Kathe C, Anderson MA, James ND, Hutson TH, Hudelle R, Qaiser T, Matson KJE, Barraud Q, et al. Confronting false discoveries in single-cell differential expression. Nat Commun. 2021;12(1):5692.

Acknowledgements

Not applicable.

Funding

This work was supported by grants from the NIH, United States (R01 NS088353, R21 NS113068, and 2RF1NS093652), the Amy and Edward Knight Fund, a program of the UTHSC Senator Lloyd Bentsen Stroke Center, The Staman Ogilvie Fund-Memorial Hermann Foundation, and Mission Connect, a program of The Institute for Rehabilitation and Research (TIRR) Foundation.

Author information

Raquel Cuevas-Diaz Duran and Haichao Wei contributed equally.

Authors and Affiliations

Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud, Monterrey, Nuevo Leon, 64710, Mexico

Raquel Cuevas-Diaz Duran

The Vivian L. Smith Department of Neurosurgery, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA

Haichao Wei & Jiaqian Wu

Center for Stem Cell and Regenerative Medicine, UT Brown Foundation Institute of Molecular Medicine, Houston, TX, 77030, USA

MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX, 77030, USA

Contributions

The corresponding authors RCDD and JQW had the original idea of writing a review article on normalization methods of single-cell RNA-seq datasets. RCDD and HW performed the literature review, writing, and conceptualization of figures and tables. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Raquel Cuevas-Diaz Duran or Jiaqian Wu.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Supplementary Material 3

Supplementary Material 4

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Cuevas-Diaz Duran, R., Wei, H. & Wu, J. Data normalization for addressing the challenges in the analysis of single-cell transcriptomic datasets. BMC Genomics 25, 444 (2024). https://doi.org/10.1186/s12864-024-10364-5

Received: 02 September 2023

Accepted: 29 April 2024

Published: 06 May 2024

DOI: https://doi.org/10.1186/s12864-024-10364-5

Keywords

  • Single-cell sequencing
  • Normalization
  • Technical variability
  • Biological variability

Title: xLSTM: Extended Long Short-Term Memory

Abstract: In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.


Photo Essay: My Spring 2024 Semester at CDS

Hi there! My name is Isabella Boncser, and I'm currently a sophomore in the six-year Accelerated BS/DPT program in Boston University's Sargent College (2026/2028). In addition to my academic pursuits, I have a passion for photography, and am currently the CDS student event photographer. I love capturing student life within CDS, whether that be taking pictures of students studying in the building on a rainy day, attending a 24-hour civic tech hackathon on the 17th floor, or a faculty and staff appreciation event. Over this past semester, I had the honor of working with the CDS communications team, led by Maureen McCarthy, director, and Alessandra Augusto, events & communications manager.

I was asked to highlight some of my best and brightest work from the semester. The following images were captured this spring and are some of my favorites. They showcase the versatility of student life within CDS and BU Spark!

BU Spark! hosted a Tech For Change Civic Tech Hackathon, where students spent 24 hours at BU developing new projects with teamwork and technical skills at the forefront. I had the opportunity to meet students from 19 different schools, all of whom spent (literally) day and night on the 17th floor of the Center for Computing & Data Sciences working together and using their hacking skills to create a difference in the world. Pictured here are two students celebrating after discussing their individual projects and asking for some advice regarding their presentations.

CDS serves as home for a variety of people and their furry friends! This image shows Miss Belle, the beautiful English Setter (who loves birds) who shares office space with her owner, Chris DeVits, CDS Director of Administration.

The Center for Computing & Data Sciences truly has a place for everyone at BU. The main level has become the campus living room, where students can meet to chat over coffee, or catch up on emails on the staircase. On a rainy day, students can find a "cozy corner" and focus on their work in a relaxing environment. This is a glimpse of the "sit steps" - the large staircase with over two dozen conversation spaces that has become popular for students to relax and get some work done between classes.

You may have heard people refer to the Center for Computing & Data Sciences as the "Jenga Building" because of its Jenga-like architecture. The building, which is home to the Faculty of Computing & Data Sciences, the Departments of Mathematics & Statistics and Computer Sciences, and the renowned Rafik B. Hariri Institute for Computing and Computation Science & Engineering, embraces its beautiful yet fun architecture while focusing on community! Next time you are craving a fun study break, join the CDS Events Team for a night of Jenga and try some delicious popcorn!

Driving down Commonwealth Avenue, the building stands out amongst its peers and shines bright along the Boston city skyline. Illuminating the streets during dusk, the building is one of my favorites to photograph. The 17th floor is home to many events hosted by CDS faculty and staff, as well as the general BU community.

The students pictured had been working tirelessly on their TFC Civic Tech Hackathon project. This photo exemplifies teamwork, collaboration, and partnership. Although students were working on their projects for 24 hours on the 17th floor of CDS, they were all smiles for the camera during final presentations!

Yoga at the Top of BU has become a staple for students to come and enjoy a one-hour yoga session. The class is open to all students across BU, and is a great way to take a study break and get your body moving. If you are a zen master, or have never taken a yoga class before, come join us for the next session!

The BU Spark! team gathered for a group picture during the Civic Tech Hackathon which took place on the 17th floor in February. Over the Spring 2024 semester, I've had the pleasure of getting to know the ambassadors from each track, and their passion for their work within the BU community is truly inspiring. BU Spark! hosts numerous events, talks, and community-building programs like Cookie O'clock, town halls, and much more. Visit the BU Spark! space on the second floor to learn more about their involvement on campus!

Computational Humanities, Arts & Social Sciences ( CHASS ) hosted a variety of tutorials ranging from "An Analysis on Emerson's Work" to large language model discussions throughout the Spring 2024 semester. These sessions are a great way to learn about the data science industry and how your skills will be used in the real world. Check out the CHASS video tutorial library on YouTube .

I am heading to Dublin, Ireland to live and study abroad for the Fall 2024 semester! I am so thankful to Maureen McCarthy who gave me the opportunity to work with and celebrate the CDS community. I would also like to shoutout Sebastian Bak (QST'25) who recommended the position to me, and spoke so highly of the CDS community!

NeurIPS 2024 Datasets and Benchmarks Track

If you'd like to become a reviewer for the track, or recommend someone, please use this form.

The Datasets and Benchmarks track serves as a venue for high-quality publications, talks, and posters on highly valuable machine learning datasets and benchmarks, as well as a forum for discussions on how to improve dataset development. Datasets and benchmarks are crucial for the development of machine learning methods, but also require their own publishing and reviewing guidelines. For instance, datasets can often not be reviewed in a double-blind fashion, and hence full anonymization will not be required. On the other hand, they do require additional specific checks, such as a proper description of how the data was collected, whether they show intrinsic bias, and whether they will remain accessible.

The previous editions of the Datasets and Benchmarks track were highly successful; you can view the accepted papers from 2021, 2022, and 2023, and the winners of the best paper awards in 2021, 2022, and 2023.

CRITERIA. We are aiming for an equally stringent review as the main conference, yet better suited to datasets and benchmarks. Submissions to this track will be reviewed according to a set of criteria and best practices specifically designed for datasets and benchmarks, as described below. A key criterion is accessibility: datasets should be available and accessible, i.e., the data can be found and obtained without a personal request to the PI, and any required code should be open source. We encourage the authors to use the Croissant format (https://mlcommons.org/working-groups/data/croissant/) to document their datasets in a machine-readable way. Next to a scientific paper, authors should also submit supplementary materials such as detail on how the data was collected and organised, what kind of information it contains, how it should be used ethically and responsibly, as well as how it will be made available and maintained.

RELATIONSHIP TO NeurIPS.  Submissions to the track will be part of the main NeurIPS conference , presented alongside the main conference papers. Accepted papers will be officially published in the NeurIPS proceedings .

SUBMISSIONS.  There will be one deadline this year. It is also still possible to submit datasets and benchmarks to the main conference (under the usual review process), but dual submission to both is not allowed (unless you retracted your paper from the main conference). We also cannot transfer papers from the main track to the D&B track. Authors can choose to submit either single-blind or double-blind . If it is possible to properly review the submission double-blind, i.e., reviewers do not need access to non-anonymous repositories to review the work, then authors can also choose to submit the work anonymously. Papers will not be publicly visible during the review process. Only accepted papers will become visible afterward. The reviews themselves are not visible during the review phase but will be published after decisions have been made. The datasets themselves should be accessible to reviewers but can be publicly released at a later date (see below). New authors cannot be added after the abstract deadline and they should have an OpenReview profile by the paper deadline. NeurIPS does not tolerate any collusion whereby authors secretly cooperate with reviewers, ACs or SACs to obtain favourable reviews.

SCOPE.  This track welcomes all work on data-centric machine learning research (DMLR), covering ML datasets and benchmarks as well as algorithms, tools, methods, and analyses for working with ML data. This includes but is not limited to:

  • New datasets, or carefully and thoughtfully designed (collections of) datasets based on previously available data.
  • Data generators and reinforcement learning environments.
  • Data-centric AI methods and tools, e.g. to measure and improve data quality or utility, or studies in data-centric AI that bring important new insight.
  • Advanced practices in data collection and curation that are of general interest even if the data itself cannot be shared.
  • Frameworks for responsible dataset development, audits of existing datasets, identifying significant problems with existing datasets and their use
  • Benchmarks on new or existing datasets, as well as benchmarking tools.
  • In-depth analyses of machine learning challenges and competitions (by organisers and/or participants) that yield important new insight.
  • Systematic analyses of existing systems on novel datasets yielding important new insight.

Read our original blog post for more about why we started this track.

Important dates

  • Abstract submission deadline: May 29, 2024
  • Full paper submission and co-author registration deadline: Jun 5, 2024
  • Supplementary materials submission deadline: Jun 12, 2024
  • Review deadline - Jul 24, 2024
  • Release of reviews and start of Author discussions on OpenReview: Aug 07, 2024
  • End of author/reviewer discussions on OpenReview: Aug 31, 2024
  • Author notification: Sep 26, 2024
  • Camera-ready deadline: Oct 30, 2024 AOE

Note: The site will start accepting submissions on April 15, 2024.

FREQUENTLY ASKED QUESTIONS

Q: My work is in scope for this track but possibly also for the main conference. Where should I submit it?

A: This is ultimately your choice. Consider the main contribution of the submission and how it should be reviewed. If the main contribution is a new dataset, benchmark, or other work that falls into the scope of the track (see above), then it is ideally reviewed accordingly. As discussed in our blog post, the reviewing procedures of the main conference are focused on algorithmic advances, analysis, and applications, while the reviewing in this track is equally stringent but designed to properly assess datasets and benchmarks. Other, more practical considerations are that this track allows single-blind reviewing (since anonymization is often impossible for hosted datasets), and the intended audience: submitting here makes your work more visible to people looking for datasets and benchmarks.

Q: How will papers accepted to this track be cited?

A: Accepted papers will appear as part of the official NeurIPS proceedings.

Q: Do I need to submit an abstract beforehand?

A: Yes, please check the important dates section for more information.

Q: My dataset requires open credentialized access. Can I submit to this track?

A: This will be possible on the condition that credentialization is necessary for the public good (e.g. because of ethically sensitive medical data), and that an established credentialization procedure is in place that is 1) open to a large section of the public, 2) provides rapid response and access to the data, and 3) is guaranteed to be maintained for many years. A good example is PhysioNet Credentialing, which requires users to first learn how to handle data involving human subjects, yet is open to anyone who has completed that training and agrees to the rules. This should be seen as an exceptional measure, and NOT as a way to limit access to data for other reasons (e.g. to shield data behind a Data Transfer Agreement). Misuse would be grounds for desk rejection. During submission, you can indicate that your dataset involves open credentialized access, in which case the necessity, openness, and efficiency of the credentialization process itself will also be checked.

SUBMISSION INSTRUCTIONS

A submission consists of:

  • Please carefully follow the LaTeX template for this track when preparing proposals. We follow the NeurIPS format, but with the appropriate headings and without hiding the names of the authors. Download the template as a bundle here.
  • Papers should be submitted via OpenReview
  • Reviewing is in principle single-blind, hence the paper need not be anonymized. In cases where the work can be reviewed equally well anonymously, anonymous submission is also allowed.
  • During submission, you can add a public link to the dataset or benchmark data. If the dataset can only be released later, you must include instructions for reviewers on how to access the dataset. This can only be done after the first submission by sending an official note to the reviewers in OpenReview. We highly recommend making the dataset publicly available immediately or before the start of the NeurIPS conference. In select cases, requiring solid motivation, the release date can be stretched up to a year after the submission deadline.
  • Dataset documentation and intended uses. Recommended documentation frameworks include datasheets for datasets, dataset nutrition labels, data statements for NLP, data cards, and accountability frameworks.
  • URL to website/platform where the dataset/benchmark can be viewed and downloaded by the reviewers. 
  • URL to Croissant metadata record documenting the dataset/benchmark available for viewing and downloading by the reviewers. You can create your Croissant metadata using e.g. the Python library available here: https://github.com/mlcommons/croissant
  • Author statement that they bear all responsibility in case of violation of rights, etc., and confirmation of the data license.
  • Hosting, licensing, and maintenance plan. The choice of hosting platform is yours, as long as you ensure access to the data (possibly through a curated interface) and will provide the necessary maintenance.
  • Links to access the dataset and its metadata. This can be hidden upon submission if the dataset is not yet publicly available, but must be added in the camera-ready version. In select cases, e.g. when the data can only be released at a later date, this can be added afterward (up to a year after the submission deadline). Simulation environments should link to open-source code repositories.
  • The dataset itself should ideally use an open and widely used data format. Provide a detailed explanation of how the dataset can be read. For simulation environments, use existing frameworks or explain how they can be used.
  • Long-term preservation: It must be clear that the dataset will be available for a long time, either by uploading to a data repository or by explaining how the authors themselves will ensure this.
  • Explicit license: Authors must choose a license, ideally a CC license for datasets, or an open source license for code (e.g. RL environments). An overview of licenses can be found here: https://paperswithcode.com/datasets/license
  • Add structured metadata to a dataset's metadata page using Web standards (like schema.org and DCAT): this allows it to be discovered and organized by anyone; a minimal sketch is shown after this list. A guide can be found here: https://developers.google.com/search/docs/data-types/dataset . If you use an existing data repository, this is often done automatically.
  • Highly recommended: a persistent dereferenceable identifier (e.g. a DOI  minted by a data repository or a prefix on identifiers.org ) for datasets, or a code repository (e.g. GitHub, GitLab,...) for code. If this is not possible or useful, please explain why.
  • For benchmarks, the supplementary materials must ensure that all results are easily reproducible. Where possible, use a reproducibility framework such as the ML reproducibility checklist , or otherwise guarantee that all results can be easily reproduced, i.e. all necessary datasets, code, and evaluation procedures must be accessible and documented.
  • For papers introducing best practices in creating or curating datasets and benchmarks, the above supplementary materials are not required.
  • For papers resubmitted after being retracted from another venue: a brief discussion on the main concerns raised by previous reviewers and how you addressed them. You do not need to share the original reviews.
  • For the dual submission and archiving, the policy follows the NeurIPS main track paper guideline .
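
The earlier bullet on structured metadata refers to embedding a schema.org Dataset description in the dataset's landing page so that catalogues and search engines can discover it. The snippet below is a minimal, hypothetical sketch of how such a JSON-LD block could be generated and embedded; the dataset name, URL, and license are placeholders, and the Google guide linked above remains the authoritative reference.

```python
import json

# Hypothetical schema.org Dataset description for a dataset landing page.
dataset_markup = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "example-student-records",
    "description": "Toy dataset used to illustrate dataset discovery markup.",
    "url": "https://example.org/datasets/example-student-records",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# Embedding the JSON-LD in the landing page as a script tag is what allows
# crawlers that understand schema.org markup to index the dataset.
html_snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(dataset_markup, indent=2)
    + "\n</script>"
)
print(html_snippet)
```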

Use of Large Language Models (LLMs): We welcome authors to use any tool that is suitable for preparing high-quality papers and research. However, we ask authors to keep in mind two important criteria. First, we expect papers to fully describe their methodology, and any tool that is important to that methodology, including the use of LLMs, should be described also. For example, authors should mention tools (including LLMs) that were used for data processing or filtering, visualization, facilitating or running experiments, and proving theorems. It may also be advisable to describe the use of LLMs in implementing the method (if this corresponds to an important, original, or non-standard component of the approach). Second, authors are responsible for the entire content of the paper, including all text and figures, so while authors are welcome to use any tool they wish for writing the paper, they must ensure that all text is correct and original.

REVIEWING AND SELECTION PROCESS

Reviewing will be single-blind, although authors can also submit anonymously if the submission allows that. A datasets and benchmarks program committee will be formed, consisting of experts on machine learning, dataset curation, and ethics. We will ensure diversity in the program committee, both in terms of background as well as technical expertise (e.g., data, ML, data ethics, social science expertise). Each paper will be reviewed by the members of the committee. In select cases where ethical concerns are flagged by reviewers, an ethics review may be performed as well.

Papers will not be publicly visible during the review process. Only accepted papers will become visible afterward. The reviews themselves are also not visible during the review phase but will be published after decisions have been made. Authors can choose to keep the datasets themselves hidden until a later release date, as long as reviewers have access.

The factors that will be considered when evaluating papers include:

  • Utility and quality of the submission: impact, originality, novelty, and relevance to the NeurIPS community will all be considered.
  • Reproducibility: All submissions should be accompanied by sufficient information to reproduce the results described, i.e. all necessary datasets, code, and evaluation procedures must be accessible and documented. We encourage the use of a reproducibility framework such as the ML reproducibility checklist to guarantee that all results can be easily reproduced. Benchmark submissions in particular should take care to provide sufficient detail to ensure reproducibility. If submissions include code, please refer to the NeurIPS code submission guidelines.
  • Was code provided (e.g. in the supplementary material)? If provided, did you look at the code? Did you consider it useful in guiding your review? If not provided, did you wish code had been available?
  • Ethics: Any ethical implications of the work should be addressed. Authors should rely on NeurIPS ethics guidelines as guidance for understanding ethical concerns.  
  • Completeness of the relevant documentation: Per NeurIPS ethics guidelines, datasets must be accompanied by documentation communicating the details of the dataset as part of their submissions via structured templates. Sufficient detail must be provided on how the data was collected and organized, what kind of information it contains, how it should be used ethically and responsibly, and how it will be made available and maintained.
  • Licensing and access: Per NeurIPS ethics guidelines, authors should provide licenses for any datasets released. Authors should consider the intended use and limitations of the dataset and develop licenses and terms of use that prevent misuse or inappropriate use.
  • Consent and privacy: Per  NeurIPS ethics guidelines , datasets should minimize the exposure of any personally identifiable information, unless informed consent from those individuals is provided to do so. Any paper that chooses to create a dataset with real data of real people should ask for the explicit consent of participants, or explain why they were unable to do so.
  • Ethics and responsible use: Any ethical implications of new datasets should be addressed and guidelines for responsible use should be provided where appropriate. Note that, if your submission includes publicly available datasets (e.g. as part of a larger benchmark), you should also check these datasets for ethical issues. You remain responsible for the ethical implications of including existing datasets or other data sources in your work.
  • Legal compliance: For datasets, authors should ensure awareness and compliance with regional legal requirements.

ADVISORY COMMITTEE

The following committee will provide advice on the organization of the track over the coming years: Sergio Escalera, Isabelle Guyon, Neil Lawrence, Dina Machuve, Olga Russakovsky, Joaquin Vanschoren, Serena Yeung.

DATASETS AND BENCHMARKS CHAIRS

Lora Aroyo, Google
Francesco Locatello, Institute of Science and Technology Austria
Lingjuan Lyu, Sony AI

Contact: [email protected]

Database Normalization and Implementation Essay

Tables in the Queensland state soccer carnival database are normalized to the third normal form, meaning that they do not exhibit repeating groups, partial dependencies, or transitive dependencies. To demonstrate this, we consider the tbl_Player and tbl_Team tables.

The tbl_Player table is created to hold information on the players participating in the Queensland state soccer carnival. It comprises seven fields: Player Number, Full Name, Address, Contact, Age, Medical Notes, and Team Number. The Team Number field is a foreign key linking tbl_Player to tbl_Team.

In tbl_Player, the primary key is Player Number, which is also the only candidate key, since Full Name, Address, Contact, Age, and Medical Notes are not guaranteed to be unique. The Player Number field holds a unique number assigned to each player taking part in the tournament.

There will be one full name, address, contact, age, and set of medical notes per player number. This means that Player Number functionally determines every other attribute in tbl_Player; there are no repeating groups, and thus the first normal form is met (Mendes, n.d., 1). There is a single, simple candidate key (Player Number), which is also the primary key; therefore, no partial dependencies are possible.

No field in tbl_Player other than Player Number can functionally determine any other attribute in the table, since players can share the same full name, address, contact, age, and medical notes; thus the second normal form is met (Mendes, n.d., 1).

There are no transitive dependencies among tbl_Player's non-key attributes (Full Name, Address, Contact, Age, Medical Notes), and thus the third normal form is met (Mendes, n.d., 1).

The tbl_Team table is created to hold information on the teams nominated for the Queensland state soccer carnival. It comprises two fields: Team Name and Team Colours. The tbl_Team table does not have any foreign key. In tbl_Team, the primary key is Team Name, which is also the only candidate key, since Team Colours is not guaranteed to be unique.

The Team Name field holds the unique name of each team participating in the tournament. There will be one set of team colours per team name. This means that Team Name functionally determines every other attribute in tbl_Team; there are no repeating groups, and thus the first normal form is met (Mendes, n.d., 1).

There is a single, simple candidate key (Team Name), which is also the primary key; therefore, no partial dependencies exist in the table.

Team Colours does not functionally determine any other attribute in the table, since there are only two fields in tbl_Team and it is not a candidate key; thus the second normal form is met (Mendes, n.d., 1). No transitive dependencies exist in tbl_Team, as there are only two fields in the table, and thus the third normal form is met (Mendes, n.d., 1).
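
To make this structure concrete, here is a minimal sketch of the two tables as they might be declared in SQL, run through Python's built-in sqlite3 module. The column names and types are assumptions based on the description above (the essay implements the database in MS Access, so the actual definitions may differ), and since the essay names the foreign key Team Number while tbl_Team is keyed on Team Name, the sketch simply links the tables on the team name.

```python
import sqlite3

# In-memory database used only to illustrate the table design described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Team Name is the primary key; Team Colours is the only other attribute.
CREATE TABLE tbl_Team (
    team_name    TEXT PRIMARY KEY,
    team_colours TEXT
);

-- Player Number is the primary key; the team reference is a foreign key
-- linking each player to a team.  Every non-key attribute depends on the
-- key, the whole key, and nothing but the key (3NF).
CREATE TABLE tbl_Player (
    player_number INTEGER PRIMARY KEY,
    full_name     TEXT,
    address       TEXT,
    contact       TEXT,
    age           INTEGER,
    medical_notes TEXT,
    team_name     TEXT REFERENCES tbl_Team (team_name)
);
""")
conn.close()
```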

Implementation report

The Queensland state soccer carnival database comprises a total of nine tables. Junction tables have been included in the database to break down the many-to-many relationships existing between certain relations.

The junction tables are tbl_Match&Team and tbl_Referee&Match, which break down the many-to-many relationships between tbl_Match and tbl_Team, and between tbl_Referee and tbl_Match, respectively.
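
A junction table resolves a many-to-many relationship by holding one row per pairing of the two related keys. The sketch below shows, under assumed column names, how tbl_Match&Team could be declared (named tbl_Match_Team here, since "&" is awkward in SQL identifiers); tbl_Referee&Match would follow the same pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_Team  (team_name TEXT PRIMARY KEY);
CREATE TABLE tbl_Match (match_id  INTEGER PRIMARY KEY);

-- Junction table: one row per (match, team) pairing.  The composite primary
-- key prevents the same team being recorded twice for the same match, and
-- the foreign keys maintain referential integrity in both directions.
CREATE TABLE tbl_Match_Team (
    match_id  INTEGER REFERENCES tbl_Match (match_id),
    team_name TEXT    REFERENCES tbl_Team (team_name),
    PRIMARY KEY (match_id, team_name)
);
""")
conn.close()
```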

In the MS Access implementation of the database, lookup lists have been used to make data entry easier and more accurate. Referential integrity has been enforced so that the integrity of the data held in the database is maintained.

As a further measure to improve data integrity, integrity constraints have been enforced throughout the database to ensure correct and accurate input of data.

Queries to the database follow the standard structure of SQL queries, that is, starting with the SELECT clause, followed by the FROM clause, then the WHERE clause, the GROUP BY clause, the HAVING clause, and finally the ORDER BY clause (Garcia-Molina et al., 2000, 15).
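
As an illustration of that clause order, the following hypothetical query (against the tbl_Player sketch shown earlier, with made-up filter values) uses every clause in the sequence described.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE tbl_Player (
    player_number INTEGER PRIMARY KEY,
    full_name     TEXT,
    age           INTEGER,
    team_name     TEXT
)
""")

# Hypothetical query using every clause in the order described above:
# SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY.
query = """
SELECT   team_name, COUNT(*) AS squad_size
FROM     tbl_Player
WHERE    age >= 16
GROUP BY team_name
HAVING   COUNT(*) >= 11
ORDER BY team_name
"""
for row in conn.execute(query):
    print(row)
conn.close()
```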

Works Cited

Garcia-Molina, Hector, Jeffrey D. Ullman, and Jennifer Widom. Database System Implementation. Singapore: Addison Wesley Longman, 2000. Print.

Mendes, Emilia. Normalization Rules: 1NF, 2NF, 3NF. Web. <https://www.auckland.ac.nz/en.html>.



