How to delete duplicates from a table

To delete duplicates from a table, you first identify the duplicate rows and then delete them. The approach varies depending on whether you’re working in SQL on a specific database management system (e.g., MySQL, PostgreSQL, SQL Server) or in a tool like Excel or Python. This article covers SQL.

1. SQL – Deleting Duplicates

The general approach to deleting duplicates in a table involves:

Identifying the duplicates.
Keeping one row (usually the first or last occurrence).
Deleting the extra duplicate rows.
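
Before deleting anything, it can help to confirm which value combinations actually repeat. The following is a minimal sketch of that first step, not part of the original method. It assumes Python's built-in sqlite3 module with an in-memory database, and the helper names make_employees and find_duplicate_groups are hypothetical. The sample rows are the employees rows used in Method 1.

```python
import sqlite3

# Sample rows matching the employees table used in this article.
ROWS = [
    (1, "John", "Doe", "HR"),
    (2, "Jane", "Smith", "IT"),
    (3, "John", "Doe", "HR"),
    (4, "Mark", "Lee", "Marketing"),
    (5, "Jane", "Smith", "IT"),
    (6, "John", "Doe", "HR"),
]


def make_employees(rows=ROWS):
    """Build an in-memory employees table (SQLite is an assumption of this sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def find_duplicate_groups(conn):
    """Return (first_name, last_name, department, copies) for every repeated combination."""
    return conn.execute(
        """
        SELECT first_name, last_name, department, COUNT(*) AS copies
        FROM employees
        GROUP BY first_name, last_name, department
        HAVING COUNT(*) > 1
        ORDER BY first_name, last_name, department
        """
    ).fetchall()


conn = make_employees()
for group in find_duplicate_groups(conn):
    print(group)  # one line per combination that occurs more than once
```

With the sample data, only the Jane Smith / IT and John Doe / HR combinations are reported. Mark Lee appears once and is left out.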

Method 1: Using `ROW_NUMBER()` to Delete Duplicates (Recommended Method)

Consider the following table, employees, which contains some duplicate rows based on the combination of first_name, last_name, and department:

id	first_name	last_name	department
1	John	Doe	HR
2	Jane	Smith	IT
3	John	Doe	HR
4	Mark	Lee	Marketing
5	Jane	Smith	IT
6	John	Doe	HR

The goal is to delete the duplicate rows based on first_name, last_name, and department, while keeping the first occurrence of each combination.

WITH CTE AS (
    SELECT 
        id,
        first_name,
        last_name,
        department,
        ROW_NUMBER() OVER (
            PARTITION BY first_name, last_name, department
            ORDER BY id
        ) AS rn
    FROM employees
)
DELETE FROM employees
WHERE id IN (
    SELECT id
    FROM CTE
    WHERE rn > 1
);

Explanation:

ROW_NUMBER():
- We use ROW_NUMBER() to assign a unique sequential number (rn) to each row in a group of rows with the same values in the columns first_name, last_name, and department.
- The PARTITION BY clause groups rows by these columns, and the ORDER BY id ensures that the row with the smallest id (the first occurrence) gets assigned rn = 1.
CTE (Common Table Expression):
- The CTE (WITH clause) creates a temporary result set that contains the id, first_name, last_name, department, and the row number (rn) for each row.
Delete Duplicates:
- The DELETE statement removes rows from the employees table where the row number (rn) is greater than 1, meaning these are duplicates.
- The IN subquery selects the id values where rn > 1, which identifies the duplicates that need to be deleted.
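
To see what the explanation above describes, the row numbers can be inspected without deleting anything. This is a sketch under assumptions: it runs the same ROW_NUMBER() expression through Python's sqlite3 module (which needs a bundled SQLite of 3.25 or later for window functions), and the helper names make_employees and preview_row_numbers are hypothetical.

```python
import sqlite3

ROWS = [
    (1, "John", "Doe", "HR"),
    (2, "Jane", "Smith", "IT"),
    (3, "John", "Doe", "HR"),
    (4, "Mark", "Lee", "Marketing"),
    (5, "Jane", "Smith", "IT"),
    (6, "John", "Doe", "HR"),
]


def make_employees(rows=ROWS):
    """Build an in-memory employees table (SQLite is an assumption of this sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def preview_row_numbers(conn):
    """Return (id, rn) pairs; rows with rn > 1 are the ones the DELETE would remove."""
    return conn.execute(
        """
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY first_name, last_name, department
                   ORDER BY id
               ) AS rn
        FROM employees
        ORDER BY id
        """
    ).fetchall()


for row_id, rn in preview_row_numbers(make_employees()):
    action = "keep" if rn == 1 else "delete"
    print(row_id, rn, action)
```

For the sample data, ids 1, 2, and 4 receive rn = 1, while ids 3, 5, and 6 receive rn values of 2, 2, and 3 and would be deleted.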

Result after deletion:

id	first_name	last_name	department
1	John	Doe	HR
2	Jane	Smith	IT
4	Mark	Lee	Marketing
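
The result can be reproduced end to end on a throwaway database. The sketch below assumes Python's sqlite3 module (SQLite 3.25 or later) as a stand-in for whatever database you actually use. The helper names are hypothetical, and the statement is the Method 1 query with only the columns the DELETE needs.

```python
import sqlite3

ROWS = [
    (1, "John", "Doe", "HR"),
    (2, "Jane", "Smith", "IT"),
    (3, "John", "Doe", "HR"),
    (4, "Mark", "Lee", "Marketing"),
    (5, "Jane", "Smith", "IT"),
    (6, "John", "Doe", "HR"),
]

# Method 1: number the rows in each duplicate group, then delete every row after the first.
DELETE_SQL = """
WITH cte AS (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY first_name, last_name, department
               ORDER BY id
           ) AS rn
    FROM employees
)
DELETE FROM employees
WHERE id IN (SELECT id FROM cte WHERE rn > 1)
"""


def make_employees(rows=ROWS):
    """Build an in-memory employees table (SQLite is an assumption of this sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def delete_duplicates(conn):
    """Run the ROW_NUMBER() delete and return the rows that remain, ordered by id."""
    conn.execute(DELETE_SQL)
    conn.commit()
    return conn.execute(
        "SELECT id, first_name, last_name, department FROM employees ORDER BY id"
    ).fetchall()


for row in delete_duplicates(make_employees()):
    print(row)
```

Running the delete a second time removes nothing, because every remaining combination is already unique.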

Method 2: Using `JOIN` to Delete Duplicates

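This method uses a self-join to match rows that have the same first_name, last_name, and department, then deletes the duplicates by comparing their id values. The DELETE ... JOIN syntax shown here is accepted by MySQL and SQL Server; PostgreSQL uses DELETE ... USING instead.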

DELETE t1
FROM employees t1
JOIN employees t2
ON t1.first_name = t2.first_name
   AND t1.last_name = t2.last_name
   AND t1.department = t2.department
   AND t1.id > t2.id;

Explanation:

Self-Join:
- We join the employees table (t1) with itself (t2) on the columns first_name, last_name, and department.
- This identifies rows in t1 that match rows in t2 (i.e., duplicates).
Condition to Keep One Row:
- The condition t1.id > t2.id ensures that we keep the row with the smallest id (i.e., the first occurrence), and delete all other rows that have the same first_name, last_name, and department.
Delete:
- The DELETE statement removes every row in t1 (the first table in the join) for which a matching row with a smaller id exists in t2, that is, every duplicate after the first occurrence.
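
The DELETE ... JOIN form is not accepted by every engine. SQLite, used here only so the example is runnable, does not support it. The sketch below expresses the same rule (delete any row for which a matching row with a smaller id exists) as a correlated EXISTS subquery. This is a swapped-in equivalent rather than the statement shown above, and the helper names are hypothetical.

```python
import sqlite3

ROWS = [
    (1, "John", "Doe", "HR"),
    (2, "Jane", "Smith", "IT"),
    (3, "John", "Doe", "HR"),
    (4, "Mark", "Lee", "Marketing"),
    (5, "Jane", "Smith", "IT"),
    (6, "John", "Doe", "HR"),
]

# Same condition as the self-join: a row is a duplicate if an identical row
# with a smaller id exists. The row with the smallest id never matches.
DELETE_SQL = """
DELETE FROM employees
WHERE EXISTS (
    SELECT 1
    FROM employees AS t2
    WHERE t2.first_name = employees.first_name
      AND t2.last_name  = employees.last_name
      AND t2.department = employees.department
      AND t2.id < employees.id
)
"""


def make_employees(rows=ROWS):
    """Build an in-memory employees table (SQLite is an assumption of this sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def delete_duplicates_by_comparison(conn):
    """Delete later duplicates and return the ids that remain."""
    conn.execute(DELETE_SQL)
    conn.commit()
    return [row[0] for row in conn.execute("SELECT id FROM employees ORDER BY id")]


print(delete_duplicates_by_comparison(make_employees()))
```

On the sample data this keeps ids 1, 2, and 4, the same rows that Method 1 keeps.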

Method 3: Using Temporary Table

-- Step 1: Create a temporary table with distinct rows
CREATE TEMPORARY TABLE temp_table AS
SELECT DISTINCT first_name, last_name, department
FROM employees;

-- Step 2: Delete all rows from the original table
DELETE FROM employees;

-- Step 3: Insert unique rows back from the temporary table
INSERT INTO employees (first_name, last_name, department)
SELECT first_name, last_name, department
FROM temp_table;

-- Step 4: Drop the temporary table
DROP TABLE temp_table;

Explanation:

Temporary Table:
- The CREATE TEMPORARY TABLE statement creates a new table (temp_table) that contains only the distinct combinations of first_name, last_name, and department from the employees table. The id column is left out on purpose: including it would make every row distinct, so no duplicates would be removed.
Delete Original Rows:
- The DELETE FROM employees statement deletes all rows from the original employees table.
Insert Cleaned Data:
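- The INSERT INTO employees statement copies the unique rows from the temporary table back into the original employees table. Because id is not copied, this assumes the id column is generated automatically (for example, an auto-increment column), and the surviving rows receive new id values instead of keeping their original ones.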
Drop Temporary Table:
- Finally, the DROP TABLE statement removes the temporary table after the data has been transferred.
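
Because step 2 empties the table, a failure between steps 2 and 3 would lose data. One way to guard against that is to run the four steps as a single transaction. This is a sketch under assumptions: the transaction wrapper is an addition to the method, it uses Python's sqlite3 module in place of your real database, the id column is assumed to be auto-generated, and the helper names are hypothetical.

```python
import sqlite3

ROWS = [
    (1, "John", "Doe", "HR"),
    (2, "Jane", "Smith", "IT"),
    (3, "John", "Doe", "HR"),
    (4, "Mark", "Lee", "Marketing"),
    (5, "Jane", "Smith", "IT"),
    (6, "John", "Doe", "HR"),
]

STEPS = [
    # Step 1: copy the distinct combinations into a temporary table.
    "CREATE TEMPORARY TABLE temp_table AS "
    "SELECT DISTINCT first_name, last_name, department FROM employees",
    # Step 2: empty the original table.
    "DELETE FROM employees",
    # Step 3: copy the unique rows back; id is regenerated by the database.
    "INSERT INTO employees (first_name, last_name, department) "
    "SELECT first_name, last_name, department FROM temp_table",
    # Step 4: drop the temporary table.
    "DROP TABLE temp_table",
]


def make_employees(rows=ROWS):
    """Build an in-memory employees table with manual transaction control."""
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)
    return conn


def rebuild_without_duplicates(conn):
    """Run the four steps atomically; undo everything if any step fails."""
    conn.execute("BEGIN")
    try:
        for statement in STEPS:
            conn.execute(statement)
        conn.execute("COMMIT")
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise


conn = make_employees()
rebuild_without_duplicates(conn)
for row in conn.execute("SELECT * FROM employees ORDER BY id"):
    print(row)
```

In this sketch the three surviving rows are renumbered 1 to 3, so Mark Lee no longer has id 4. If other tables reference employees.id, Method 1 or Method 2 is the safer choice because both preserve the original id values.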

By Vishal Jadhav