When working with databases, you might come across situations where your table contains duplicate rows. In PostgreSQL, you can efficiently remove duplicate rows while retaining one unique entry based on a specific criterion (e.g., the first occurrence or the one with the most recent timestamp). This is especially useful when you have tables with duplicate IDs and want to keep only one of each.
In this article, we'll guide you through the process of removing duplicate rows based on the id
column, using PostgreSQL's ROW_NUMBER()
window function.
🧱 Solution Overview
The key steps in removing duplicate rows are:
- Identifying duplicates: Use the
ROW_NUMBER()
function to assign a sequential number to rows with the same ID. - Deleting duplicates: Once duplicates are identified, delete rows where the
ROW_NUMBER()
is greater than 1, keeping only the first occurrence of each ID.
🔨 SQL Query to Remove Duplicates
Let's assume you have a table called my_table
with an id
column, and you want to remove duplicate rows while keeping the one with the earliest created_at
timestamp.
Step 1: Identify Duplicates
First, identify the duplicates using the ROW_NUMBER()
window function:
sql
SELECT
id,
created_at,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at) AS row_num
FROM
my_table;
Explanation:
ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at)
assigns a unique number (row_num
) to each row within the same id
group, ordering by created_at
.- The row with the
row_num
of 1 is kept, and the rest are considered duplicates.
Step 2: Delete Duplicates
Next, delete all rows where row_num
is greater than 1, effectively keeping only the first occurrence of each ID:
sql
DELETE FROM my_table
WHERE id IN (
SELECT id
FROM (
SELECT id, ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at) AS row_num
FROM my_table
) subquery
WHERE row_num > 1
);
Explanation:
- The subquery assigns
row_num
to each row and selects those where row_num
is greater than 1. - The outer query deletes these duplicate rows from
my_table
.
🧑💻 Practical Example
Let's use a concrete example with a table named employees
. Suppose the employees
table has the following columns: employee_id
, name
, and created_at
. You want to remove duplicate employee_id
rows, keeping the earliest record based on the created_at
timestamp.
Step 1: Identify Duplicates
sql
SELECT
employee_id,
name,
created_at,
ROW_NUMBER() OVER (PARTITION BY employee_id ORDER BY created_at) AS row_num
FROM
employees;
This query shows which rows are duplicates by assigning a row number to each employee_id
group.
Step 2: Delete Duplicates
sql
DELETE FROM employees
WHERE employee_id IN (
SELECT employee_id
FROM (
SELECT employee_id, ROW_NUMBER() OVER (PARTITION BY employee_id ORDER BY created_at) AS row_num
FROM employees
) subquery
WHERE row_num > 1
);
This query deletes all duplicates, keeping the first occurrence of each employee_id
based on the earliest created_at
timestamp.
🔧 Customizing the Approach
- Different ordering criteria: If you want to keep the most recent record instead of the earliest, simply change the
ORDER BY
clause to ORDER BY created_at DESC
. - Multiple columns for duplication check: If your definition of "duplicate" includes multiple columns, you can modify the
PARTITION BY
clause to include those columns.
🧠 Conclusion
Using the ROW_NUMBER()
window function, PostgreSQL allows you to identify and remove duplicate rows in a clean, efficient manner. By following the steps outlined in this article, you can easily manage and clean up your database, ensuring that only unique entries remain based on your desired criteria.