To delete duplicates from a table, first identify the duplicate rows, then delete them. This section covers SQL; the exact syntax varies slightly between database management systems (e.g., MySQL, PostgreSQL, SQL Server), and tools like Excel or Python take a different approach.
1. SQL – Deleting Duplicates
The general approach to deleting duplicates in a table involves:
- Identifying the duplicates.
- Keeping one row (usually the first or last occurrence).
- Deleting the extra duplicate rows.
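As a rough sketch of the first step, the snippet below lists each group of duplicates together with its number of copies, using GROUP BY and HAVING. It runs against an in-memory SQLite database through Python's sqlite3 module. The sample rows, the employees table layout, and the helper name find_duplicate_groups are illustrative assumptions, not part of any particular database setup.

```python
import sqlite3

# Illustrative sample data (assumed): duplicates are rows that share
# first_name, last_name, and department.
ROWS = [
    (1, "John", "Doe", "HR"),
    (2, "Jane", "Smith", "IT"),
    (3, "John", "Doe", "HR"),
    (4, "Mark", "Lee", "Marketing"),
    (5, "Jane", "Smith", "IT"),
    (6, "John", "Doe", "HR"),
]


def find_duplicate_groups(conn):
    """Return (first_name, last_name, department, copies) for groups with more than one row."""
    return conn.execute(
        """
        SELECT first_name, last_name, department, COUNT(*) AS copies
        FROM employees
        GROUP BY first_name, last_name, department
        HAVING COUNT(*) > 1
        ORDER BY first_name, last_name, department
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", ROWS)
conn.commit()

duplicate_groups = find_duplicate_groups(conn)
print(duplicate_groups)
```

Running a read-only query like this before any DELETE is a cheap way to confirm which column combination defines a duplicate.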
Method 1: Using ROW_NUMBER() to Delete Duplicates (Recommended Method)
Consider the following table, employees, which contains some duplicate rows based on the combination of first_name, last_name, and department:
id | first_name | last_name | department |
---|---|---|---|
1 | John | Doe | HR |
2 | Jane | Smith | IT |
3 | John | Doe | HR |
4 | Mark | Lee | Marketing |
5 | Jane | Smith | IT |
6 | John | Doe | HR |
The goal is to delete the duplicate rows based on first_name, last_name, and department, while keeping the first occurrence of each combination.
WITH CTE AS (
    SELECT
        id,
        first_name,
        last_name,
        department,
        ROW_NUMBER() OVER (
            PARTITION BY first_name, last_name, department
            ORDER BY id
        ) AS rn
    FROM employees
)
DELETE FROM employees
WHERE id IN (
    SELECT id
    FROM CTE
    WHERE rn > 1
);
Explanation:
- ROW_NUMBER():
  - We use ROW_NUMBER() to assign a sequential number (rn) to each row within a group of rows that share the same values in the columns first_name, last_name, and department.
  - The PARTITION BY clause groups rows by these columns, and the ORDER BY id ensures that the row with the smallest id (the first occurrence) gets assigned rn = 1.
- CTE (Common Table Expression):
  - The CTE (WITH clause) creates a temporary result set that contains the id, first_name, last_name, department, and the row number (rn) for each row.
- Delete Duplicates:
  - The DELETE statement removes rows from the employees table where the row number (rn) is greater than 1, meaning these are duplicates.
  - The IN subquery selects the id values where rn > 1, which identifies the duplicates that need to be deleted.
Result after deletion:
id | first_name | last_name | department |
---|---|---|---|
1 | John | Doe | HR |
2 | Jane | Smith | IT |
4 | Mark | Lee | Marketing |
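To check this result end to end, here is a minimal sketch that runs the same ROW_NUMBER() pattern against an in-memory SQLite database from Python. It assumes an SQLite build with window-function support (3.25 or later). The helper names make_sample_db and delete_duplicates_keep_first are hypothetical, and the CTE selects only the columns the DELETE needs.

```python
import sqlite3


def make_sample_db():
    """Build an in-memory employees table holding the six sample rows (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
    )
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?, ?, ?)",
        [
            (1, "John", "Doe", "HR"),
            (2, "Jane", "Smith", "IT"),
            (3, "John", "Doe", "HR"),
            (4, "Mark", "Lee", "Marketing"),
            (5, "Jane", "Smith", "IT"),
            (6, "John", "Doe", "HR"),
        ],
    )
    conn.commit()
    return conn


def delete_duplicates_keep_first(conn):
    """Delete every row that is not the lowest-id row of its duplicate group."""
    conn.execute(
        """
        WITH ranked AS (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY first_name, last_name, department
                       ORDER BY id
                   ) AS rn
            FROM employees
        )
        DELETE FROM employees
        WHERE id IN (SELECT id FROM ranked WHERE rn > 1)
        """
    )
    conn.commit()


conn = make_sample_db()
delete_duplicates_keep_first(conn)
remaining = conn.execute(
    "SELECT id, first_name, last_name, department FROM employees ORDER BY id"
).fetchall()
print(remaining)
```

Ordering by id inside the window is what decides which copy survives; switching to ORDER BY id DESC would keep the last occurrence instead.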
Method 2: Using JOIN to Delete Duplicates
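This method uses a self-join to match rows that have the same first_name, last_name, and department, and then deletes the duplicates by comparing their id values. The DELETE ... JOIN form shown here is accepted by MySQL and SQL Server; PostgreSQL does not support it and uses DELETE ... USING instead.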
DELETE t1
FROM employees t1
JOIN employees t2
    ON t1.first_name = t2.first_name
    AND t1.last_name = t2.last_name
    AND t1.department = t2.department
    AND t1.id > t2.id;
Explanation:
- Self-Join:
  - We join the employees table (t1) with itself (t2) on the columns first_name, last_name, and department.
  - This identifies rows in t1 that match rows in t2 (i.e., duplicates).
- Condition to Keep One Row:
  - The condition t1.id > t2.id ensures that we keep the row with the smallest id (i.e., the first occurrence), and delete all other rows that have the same first_name, last_name, and department.
- Delete:
  - The DELETE statement removes rows from t1 (the first table in the join) where t1.id > t2.id, meaning the rows that are duplicates.
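SQLite does not accept the DELETE ... JOIN form, so the sketch below swaps in a correlated EXISTS subquery that expresses the same condition: delete a row when another row with the same three column values and a smaller id exists. This is a substitute technique chosen so the example can run with Python's sqlite3 module, not the join syntax itself. The helper names make_sample_db and delete_duplicates_by_self_match are hypothetical.

```python
import sqlite3


def make_sample_db():
    """Build an in-memory employees table holding the six sample rows (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
    )
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?, ?, ?)",
        [
            (1, "John", "Doe", "HR"),
            (2, "Jane", "Smith", "IT"),
            (3, "John", "Doe", "HR"),
            (4, "Mark", "Lee", "Marketing"),
            (5, "Jane", "Smith", "IT"),
            (6, "John", "Doe", "HR"),
        ],
    )
    conn.commit()
    return conn


def delete_duplicates_by_self_match(conn):
    """Delete each row for which a matching row with a smaller id exists."""
    conn.execute(
        """
        DELETE FROM employees
        WHERE EXISTS (
            SELECT 1
            FROM employees AS t2
            WHERE t2.first_name = employees.first_name
              AND t2.last_name = employees.last_name
              AND t2.department = employees.department
              AND t2.id < employees.id  -- same role as t1.id > t2.id in the join
        )
        """
    )
    conn.commit()


conn = make_sample_db()
delete_duplicates_by_self_match(conn)
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM employees ORDER BY id")]
print(remaining_ids)
```

The row with the smallest id in each group never satisfies the EXISTS condition, which is why exactly one copy per group survives.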
Method 3: Using Temporary Table
-- Step 1: Create a temporary table with distinct rows
CREATE TEMPORARY TABLE temp_table AS
SELECT DISTINCT first_name, last_name, department
FROM employees;
-- Step 2: Delete all rows from the original table
DELETE FROM employees;
-- Step 3: Insert unique rows back from the temporary table
-- (id is omitted, so it must be auto-generated; original id values are not preserved)
INSERT INTO employees (first_name, last_name, department)
SELECT first_name, last_name, department
FROM temp_table;
-- Step 4: Drop the temporary table
DROP TABLE temp_table;
Explanation:
- Temporary Table:
  - The CREATE TEMPORARY TABLE statement creates a new table (temp_table) that contains only distinct rows from the employees table. This step removes duplicates based on the combination of first_name, last_name, and department.
- Delete Original Rows:
  - The DELETE FROM employees statement deletes all rows from the original employees table.
- Insert Cleaned Data:
  - The INSERT INTO employees statement copies the unique rows from the temporary table back into the original employees table.
- Drop Temporary Table:
  - Finally, the DROP TABLE statement removes the temporary table after the data has been transferred.
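The four steps can be tried out with the sketch below, which runs them against an in-memory SQLite database from Python. Wrapping them in an explicit BEGIN ... COMMIT is an added precaution, not part of the steps above, so that the table is not left empty if a later step fails. The schema (id as INTEGER PRIMARY KEY, which SQLite fills in automatically) and the helper names make_sample_db and rebuild_without_duplicates are assumptions for illustration.

```python
import sqlite3


def make_sample_db():
    """Build an in-memory employees table holding the six sample rows (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
    )
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?, ?, ?)",
        [
            (1, "John", "Doe", "HR"),
            (2, "Jane", "Smith", "IT"),
            (3, "John", "Doe", "HR"),
            (4, "Mark", "Lee", "Marketing"),
            (5, "Jane", "Smith", "IT"),
            (6, "John", "Doe", "HR"),
        ],
    )
    conn.commit()
    return conn


def rebuild_without_duplicates(conn):
    """Copy distinct rows to a temp table, empty employees, and copy them back."""
    conn.executescript(
        """
        BEGIN;
        CREATE TEMPORARY TABLE temp_table AS
            SELECT DISTINCT first_name, last_name, department
            FROM employees;
        DELETE FROM employees;
        -- id is omitted here, so SQLite assigns fresh values
        INSERT INTO employees (first_name, last_name, department)
            SELECT first_name, last_name, department
            FROM temp_table;
        DROP TABLE temp_table;
        COMMIT;
        """
    )


conn = make_sample_db()
rebuild_without_duplicates(conn)
unique_people = sorted(
    conn.execute("SELECT first_name, last_name, department FROM employees").fetchall()
)
print(unique_people)
```

Because the rows are re-inserted, their id values are regenerated, so this method is a poor fit when other tables reference employees.id.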