How to Remove Duplicate IDs in PostgreSQL

Programming & Development / April 18, 2025

How to Remove Duplicate IDs in PostgreSQL

remove duplicates PostgreSQL duplicate IDs SQL query delete duplicates common table expression data cleanup

When working with databases, it’s common to encounter tables that contain duplicate rows based on a certain column, such as the id. You might want to keep only the first occurrence of each unique ID and remove the rest to maintain data integrity.

In PostgreSQL, this can be efficiently achieved using the ROW_NUMBER() window function along with a Common Table Expression (CTE). In this article, we'll walk through how to remove duplicate rows based on a specific column (like id) while retaining only the first occurrence.

🔨 Solution Overview

The approach to remove duplicate rows based on the id column is:

Identify duplicates using ROW_NUMBER(): This function assigns a unique sequential number to rows within a partition of the id column.
Delete duplicate rows: By keeping only the first occurrence (where ROW_NUMBER() is 1) and removing the rest.

Let’s dive into the detailed SQL queries for this task.

🧑‍💻 SQL Query to Remove Duplicate IDs

Suppose you have a table called my_table with an id column, and you want to remove duplicate rows while keeping the first occurrence based on the id.

Step 1: Identify Duplicates Using `ROW_NUMBER()`

We will first create a Common Table Expression (CTE) to assign a row number to each row, partitioned by id and ordered by id (or any other column if needed):

sql

WITH duplicates AS (
    SELECT 
        id, 
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS row_num
    FROM 
        my_table
)

The ROW_NUMBER() function generates a sequential number starting from 1 for each partition of id.
The PARTITION BY id ensures that row numbering resets for each unique id.

Step 2: Delete Duplicate Rows

Next, we’ll use the CTE to delete rows where the row_num is greater than 1, keeping only the first occurrence of each id:

sql

DELETE FROM my_table
WHERE id IN (
    SELECT id
    FROM duplicates
    WHERE row_num > 1
);

This query will:

Identify rows where the row_num is greater than 1 (which are duplicates).
Delete those rows from my_table.

🧑‍💻 Practical Example

Let’s consider a table called employees with the following columns: employee_id, name, and created_at. You want to remove duplicates based on employee_id, keeping only the first occurrence of each.

Step 1: Identify Duplicates

sql

WITH duplicates AS (
    SELECT 
        employee_id,
        ROW_NUMBER() OVER (PARTITION BY employee_id ORDER BY created_at) AS row_num
    FROM 
        employees
)

This query assigns a row number to each employee_id group based on the created_at timestamp, keeping the earliest entry as row_num = 1.

Step 2: Delete Duplicates

sql

DELETE FROM employees
WHERE employee_id IN (
    SELECT employee_id
    FROM duplicates
    WHERE row_num > 1
);

This query deletes the duplicate employee_id rows, leaving only the first occurrence based on the created_at timestamp.

🔧 Customizing the Approach

Different ordering criteria: You can adjust the ORDER BY clause to prioritize other columns, such as created_at DESC to keep the most recent entry.
Multiple columns for duplication check: If you define duplicates by multiple columns, you can modify the PARTITION BY clause to include those columns.

For example, if you consider a row duplicate when both employee_id and name are the same, you can update the query as follows:

sql

WITH duplicates AS (
    SELECT 
        employee_id,
        name,
        ROW_NUMBER() OVER (PARTITION BY employee_id, name ORDER BY created_at) AS row_num
    FROM 
        employees
)

🧠 Conclusion

Removing duplicate rows in PostgreSQL is straightforward when using the ROW_NUMBER() window function along with a Common Table Expression (CTE). By partitioning the data based on a column (such as id) and ordering the rows, you can easily identify duplicates and delete them while keeping the first occurrence.

With this approach, you can maintain clean and consistent data in your PostgreSQL database, avoiding issues related to duplicate entries.

Share this article:

Comments

No comments yet

NUHMAN.com