How to Remove Duplicates Based on ID in PostgreSQL

Programming & Development / April 18, 2025

How to Remove Duplicates Based on ID in PostgreSQL

remove duplicates PostgreSQL ROW_NUMBER() delete duplicates SQL query duplicate rows database cleanup

When working with databases, you might come across situations where your table contains duplicate rows. In PostgreSQL, you can efficiently remove duplicate rows while retaining one unique entry based on a specific criterion (e.g., the first occurrence or the one with the most recent timestamp). This is especially useful when you have tables with duplicate IDs and want to keep only one of each.

In this article, we'll guide you through the process of removing duplicate rows based on the id column, using PostgreSQL's ROW_NUMBER() window function.

🧱 Solution Overview

The key steps in removing duplicate rows are:

Identifying duplicates: Use the ROW_NUMBER() function to assign a sequential number to rows with the same ID.
Deleting duplicates: Once duplicates are identified, delete rows where the ROW_NUMBER() is greater than 1, keeping only the first occurrence of each ID.

🔨 SQL Query to Remove Duplicates

Let's assume you have a table called my_table with an id column, and you want to remove duplicate rows while keeping the one with the earliest created_at timestamp.

Step 1: Identify Duplicates

First, identify the duplicates using the ROW_NUMBER() window function:

sql

SELECT
    id,
    created_at,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at) AS row_num
FROM
    my_table;

Explanation:

ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at) assigns a unique number (row_num) to each row within the same id group, ordering by created_at.
The row with the row_num of 1 is kept, and the rest are considered duplicates.

Step 2: Delete Duplicates

Next, delete all rows where row_num is greater than 1, effectively keeping only the first occurrence of each ID:

sql

DELETE FROM my_table
WHERE id IN (
    SELECT id
    FROM (
        SELECT id, ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at) AS row_num
        FROM my_table
    ) subquery
    WHERE row_num > 1
);

Explanation:

The subquery assigns row_num to each row and selects those where row_num is greater than 1.
The outer query deletes these duplicate rows from my_table.

🧑‍💻 Practical Example

Let's use a concrete example with a table named employees. Suppose the employees table has the following columns: employee_id, name, and created_at. You want to remove duplicate employee_id rows, keeping the earliest record based on the created_at timestamp.

Step 1: Identify Duplicates

sql

SELECT
    employee_id,
    name,
    created_at,
    ROW_NUMBER() OVER (PARTITION BY employee_id ORDER BY created_at) AS row_num
FROM
    employees;

This query shows which rows are duplicates by assigning a row number to each employee_id group.

Step 2: Delete Duplicates

sql

DELETE FROM employees
WHERE employee_id IN (
    SELECT employee_id
    FROM (
        SELECT employee_id, ROW_NUMBER() OVER (PARTITION BY employee_id ORDER BY created_at) AS row_num
        FROM employees
    ) subquery
    WHERE row_num > 1
);

This query deletes all duplicates, keeping the first occurrence of each employee_id based on the earliest created_at timestamp.

🔧 Customizing the Approach

Different ordering criteria: If you want to keep the most recent record instead of the earliest, simply change the ORDER BY clause to ORDER BY created_at DESC.
Multiple columns for duplication check: If your definition of "duplicate" includes multiple columns, you can modify the PARTITION BY clause to include those columns.

🧠 Conclusion

Using the ROW_NUMBER() window function, PostgreSQL allows you to identify and remove duplicate rows in a clean, efficient manner. By following the steps outlined in this article, you can easily manage and clean up your database, ensuring that only unique entries remain based on your desired criteria.

Share this article:

Comments

No comments yet

NUHMAN.com