When working with databases, it’s common to encounter tables that contain duplicate rows based on a certain column, such as the id. You might want to keep only the first occurrence of each unique ID and remove the rest to maintain data integrity.
In PostgreSQL, this can be efficiently achieved using the ROW_NUMBER() window function along with a Common Table Expression (CTE). In this article, we'll walk through how to remove duplicate rows based on a specific column (like id) while retaining only the first occurrence.
🔨 Solution Overview
The approach to remove duplicate rows based on the id column is:
- Identify duplicates using ROW_NUMBER(): This function assigns a sequential number to the rows within each partition of the id column.
- Delete duplicate rows: Keep only the first occurrence (where ROW_NUMBER() is 1) and remove the rest.
Let’s dive into the detailed SQL queries for this task.
🧑💻 SQL Query to Remove Duplicate IDs
Suppose you have a table called my_table with an id column, and you want to remove duplicate rows while keeping the first occurrence of each id.
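Before deleting anything, it can help to confirm which ids are actually duplicated. The following is a minimal sketch, assuming the my_table and id names from the paragraph above; the copies alias is introduced here only for illustration.

```sql
-- List each id that appears more than once, with its number of copies.
SELECT id, COUNT(*) AS copies
FROM my_table
GROUP BY id
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```

If this returns no rows, the table has no duplicate ids and there is nothing to clean up.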
Step 1: Identify Duplicates Using ROW_NUMBER()
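We will first create a Common Table Expression (CTE) that assigns a row number to each row, partitioned by id. Ordering by id inside a partition leaves the order of the duplicates arbitrary, so order by another column (such as a timestamp) if it matters which copy counts as the first. Note that a WITH clause is not a complete statement on its own; it must be followed by the query that uses it:
sql
WITH duplicates AS (
    SELECT
        id,
        ctid,  -- PostgreSQL system column: physical row location, distinguishes rows sharing an id
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS row_num
    FROM
        my_table
)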
- The ROW_NUMBER() function generates a sequential number starting from 1 for each partition of id.
- The PARTITION BY id ensures that row numbering resets for each unique id.
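Because a WITH clause cannot run by itself, previewing the numbering requires attaching a query to it. The sketch below lists the rows that would be treated as duplicates. Selecting ctid, PostgreSQL's system column for a row's physical location, is an addition made here so that otherwise identical rows can be told apart.

```sql
-- Preview only: nothing is modified by this statement.
WITH duplicates AS (
    SELECT
        id,
        ctid,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS row_num
    FROM my_table
)
SELECT id, ctid, row_num
FROM duplicates
WHERE row_num > 1          -- every copy after the first one
ORDER BY id, row_num;
```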
Step 2: Delete Duplicate Rows
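Next, we’ll delete the rows where row_num is greater than 1, keeping only the first occurrence of each id. Two details matter here. First, a CTE exists only for the statement it belongs to, so the DELETE has to be part of the same statement as the WITH clause. Second, the rows must be matched on ctid (PostgreSQL's physical row identifier) rather than on id: a filter such as id IN (...) matches every row that shares a duplicated id, including the first occurrence, and would delete them all. The complete statement is:
sql
WITH duplicates AS (
    SELECT ctid,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS row_num
    FROM my_table
)
DELETE FROM my_table
WHERE ctid IN (            -- match individual rows, not id values
    SELECT ctid
    FROM duplicates
    WHERE row_num > 1      -- every copy after the first
);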
This query will:
- Identify rows where the row_num is greater than 1 (which are duplicates).
- Delete those rows from my_table.
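Since a DELETE is destructive, one cautious way to try it is inside a transaction that is rolled back until the result looks right. This is a sketch under the same my_table assumption. It matches rows on ctid (PostgreSQL's physical row identifier) instead of id, because an id-based filter would also match the first occurrence of each duplicated id.

```sql
BEGIN;

WITH duplicates AS (
    SELECT ctid,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS row_num
    FROM my_table
)
DELETE FROM my_table
WHERE ctid IN (SELECT ctid FROM duplicates WHERE row_num > 1)
RETURNING id;              -- one output row per deleted duplicate

-- Sanity check: no id should appear more than once now (expect zero rows).
SELECT id
FROM my_table
GROUP BY id
HAVING COUNT(*) > 1;

ROLLBACK;                  -- switch to COMMIT once the output looks right
```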
🧑💻 Practical Example
Let’s consider a table called employees with the following columns: employee_id, name, and created_at. You want to remove duplicates based on employee_id, keeping only the first occurrence of each.
Step 1: Identify Duplicates
sql
WITH duplicates AS (
    SELECT
        employee_id,
        ctid,  -- physical row identifier, used to single out individual copies
        ROW_NUMBER() OVER (PARTITION BY employee_id ORDER BY created_at) AS row_num
    FROM
        employees
)
- This query assigns a row number to each row within an employee_id group, ordered by the created_at timestamp, so the earliest entry gets row_num = 1.
Step 2: Delete Duplicates
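sql
-- One statement: a CTE is visible only to the query it is attached to.
WITH duplicates AS (
    SELECT ctid,
           ROW_NUMBER() OVER (PARTITION BY employee_id ORDER BY created_at) AS row_num
    FROM employees
)
DELETE FROM employees
WHERE ctid IN (            -- ctid targets single rows; employee_id would match every copy
    SELECT ctid
    FROM duplicates
    WHERE row_num > 1
);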
- This query deletes the duplicate employee_id rows, leaving only the first occurrence based on the created_at timestamp.
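To see the effect end to end, the sketch below runs the technique against a few made-up rows. The employees_demo table and its contents are hypothetical, and a temporary table is used so that no real data is touched. Rows are matched on ctid so that exactly one copy per employee_id survives.

```sql
-- Hypothetical sample data in a session-local temporary table.
CREATE TEMP TABLE employees_demo (
    employee_id integer,
    name        text,
    created_at  timestamp
);

INSERT INTO employees_demo (employee_id, name, created_at) VALUES
    (1, 'Alice', '2024-01-01 09:00'),
    (1, 'Alice', '2024-02-01 09:00'),   -- later copy of employee 1
    (2, 'Bob',   '2024-01-15 09:00');

WITH duplicates AS (
    SELECT ctid,
           ROW_NUMBER() OVER (PARTITION BY employee_id ORDER BY created_at) AS row_num
    FROM employees_demo
)
DELETE FROM employees_demo
WHERE ctid IN (SELECT ctid FROM duplicates WHERE row_num > 1);

SELECT employee_id, name, created_at
FROM employees_demo
ORDER BY employee_id;
-- Two rows remain: employee 1 with created_at 2024-01-01 09:00:00
-- and employee 2 with created_at 2024-01-15 09:00:00.
```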
🔧 Customizing the Approach
- Different ordering criteria: You can adjust the ORDER BY clause to prioritize other columns, such as created_at DESC to keep the most recent entry.
- Multiple columns for duplication check: If you define duplicates by multiple columns, you can modify the PARTITION BY clause to include those columns.
For example, if you consider a row a duplicate when both employee_id and name are the same, you can update the query as follows:
sql
WITH duplicates AS (
    SELECT
        employee_id,
        name,
        ctid,  -- physical row identifier, used to single out individual copies
        ROW_NUMBER() OVER (PARTITION BY employee_id, name ORDER BY created_at) AS row_num
    FROM
        employees
)
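Both customizations can be combined into one complete statement. The sketch below treats rows as duplicates when employee_id and name both match, and keeps the most recent row of each group instead of the earliest. It assumes the employees table described earlier and matches rows on ctid, PostgreSQL's physical row identifier.

```sql
WITH duplicates AS (
    SELECT ctid,
           ROW_NUMBER() OVER (
               PARTITION BY employee_id, name
               ORDER BY created_at DESC   -- newest row gets row_num = 1
           ) AS row_num
    FROM employees
)
DELETE FROM employees
WHERE ctid IN (SELECT ctid FROM duplicates WHERE row_num > 1);
```

If two rows in a group share the same created_at, which of them is kept is arbitrary; adding a further tie-breaking column to ORDER BY makes the choice deterministic.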
🧠 Conclusion
Removing duplicate rows in PostgreSQL is straightforward when using the ROW_NUMBER() window function along with a Common Table Expression (CTE). By partitioning the data based on a column (such as id) and ordering the rows, you can easily identify duplicates and delete them while keeping the first occurrence.
With this approach, you can maintain clean and consistent data in your PostgreSQL database, avoiding issues related to duplicate entries.