To delete duplicates from a table, first identify the duplicate rows, then delete them with SQL. The exact syntax varies slightly between database management systems (e.g., MySQL, PostgreSQL, SQL Server), and tools like Excel or Python offer their own ways to remove duplicates.
1. SQL – Deleting Duplicates
The general approach to deleting duplicates in a table involves:
- Identifying the duplicates.
- Keeping one row (usually the first or last occurrence).
- Deleting the extra duplicate rows.
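Before deleting anything, it can help to list the duplicate groups and see which row would be kept. The following is a minimal sketch, assuming an in-memory SQLite database accessed through Python's built-in sqlite3 module and a small hypothetical employees table (the same shape as the sample table used in the methods of this section). The copies and keep_id column aliases are illustrative names, not part of the original queries.

```python
import sqlite3

# Assumed setup: an in-memory SQLite database with a hypothetical employees table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John", "Doe", "HR"),
        (2, "Jane", "Smith", "IT"),
        (3, "John", "Doe", "HR"),
        (4, "Mark", "Lee", "Marketing"),
        (5, "Jane", "Smith", "IT"),
        (6, "John", "Doe", "HR"),
    ],
)

# Group on the columns that define a duplicate and keep only groups
# that contain more than one row. MIN(id) is the row we would keep.
duplicate_groups = conn.execute(
    """
    SELECT first_name, last_name, department,
           COUNT(*) AS copies,
           MIN(id)  AS keep_id
    FROM employees
    GROUP BY first_name, last_name, department
    HAVING COUNT(*) > 1
    ORDER BY keep_id
    """
).fetchall()

for group in duplicate_groups:
    print(group)
```

For this sample data the query reports two groups: John Doe in HR with three copies (keeping id 1) and Jane Smith in IT with two copies (keeping id 2). Running a read-only check like this first makes it easier to confirm that the chosen columns really define a duplicate before any rows are removed.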
Method 1: Using ROW_NUMBER() to Delete Duplicates (Recommended Method)
Consider the following table, employees, which contains some duplicate rows based on the combination of first_name, last_name, and department:
| id | first_name | last_name | department |
|---|---|---|---|
| 1 | John | Doe | HR |
| 2 | Jane | Smith | IT |
| 3 | John | Doe | HR |
| 4 | Mark | Lee | Marketing |
| 5 | Jane | Smith | IT |
| 6 | John | Doe | HR |
The goal is to delete the duplicate rows based on first_name, last_name, and department, while keeping the first occurrence of each combination.
WITH CTE AS (
    SELECT
        id,
        first_name,
        last_name,
        department,
        ROW_NUMBER() OVER (
            PARTITION BY first_name, last_name, department
            ORDER BY id
        ) AS rn
    FROM employees
)
DELETE FROM employees
WHERE id IN (
    SELECT id
    FROM CTE
    WHERE rn > 1
);
Explanation:
- ROW_NUMBER():
  - We use ROW_NUMBER() to assign a unique sequential number (rn) to each row in a group of rows with the same values in the columns first_name, last_name, and department.
  - The PARTITION BY clause groups rows by these columns, and the ORDER BY id ensures that the row with the smallest id (the first occurrence) gets assigned rn = 1.
- CTE (Common Table Expression):
  - The CTE (WITH clause) creates a temporary result set that contains the id, first_name, last_name, department, and the row number (rn) for each row.
- Delete Duplicates:
  - The DELETE statement removes rows from the employees table where the row number (rn) is greater than 1, meaning these are duplicates.
  - The IN subquery selects the id values where rn > 1, which identifies the duplicates that need to be deleted.
Result after deletion:
| id | first_name | last_name | department |
|---|---|---|---|
| 1 | John | Doe | HR |
| 2 | Jane | Smith | IT |
| 4 | Mark | Lee | Marketing |
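To check the ROW_NUMBER() approach end to end, the sketch below runs the same query against the sample data. It assumes an in-memory SQLite database via Python's built-in sqlite3 module and a SQLite build with window-function support (3.25 or later); the table definition and the variable names are assumptions made for this illustration. The row counts are taken with COUNT(*) before and after the delete rather than relying on the cursor's reported row count.

```python
import sqlite3

# Assumed setup: in-memory SQLite database holding the sample employees rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John", "Doe", "HR"),
        (2, "Jane", "Smith", "IT"),
        (3, "John", "Doe", "HR"),
        (4, "Mark", "Lee", "Marketing"),
        (5, "Jane", "Smith", "IT"),
        (6, "John", "Doe", "HR"),
    ],
)

# Number the rows inside each duplicate group, then delete every row
# that is not the first (smallest id) in its group.
dedupe_sql = """
WITH CTE AS (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY first_name, last_name, department
               ORDER BY id
           ) AS rn
    FROM employees
)
DELETE FROM employees
WHERE id IN (SELECT id FROM CTE WHERE rn > 1)
"""

rows_before = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
conn.execute(dedupe_sql)
conn.commit()
rows_after = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

remaining = conn.execute(
    "SELECT id, first_name, last_name, department FROM employees ORDER BY id"
).fetchall()

print(f"deleted {rows_before - rows_after} duplicate rows")
for row in remaining:
    print(row)
```

With the sample data, three rows are removed and the rows with ids 1, 2, and 4 remain, matching the result table above.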
Method 2: Using JOIN to Delete Duplicates
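This method uses a self-join to match rows that have the same first_name, last_name, and department, and then deletes the duplicates by comparing their id values. The DELETE ... FROM ... JOIN syntax shown here is supported by MySQL and SQL Server; PostgreSQL expresses the same idea with DELETE ... USING.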
DELETE t1
FROM employees t1
JOIN employees t2
    ON t1.first_name = t2.first_name
    AND t1.last_name = t2.last_name
    AND t1.department = t2.department
    AND t1.id > t2.id;
Explanation:
- Self-Join:
  - We join the employees table (t1) with itself (t2) on the columns first_name, last_name, and department.
  - This identifies rows in t1 that match rows in t2 (i.e., duplicates).
- Condition to Keep One Row:
  - The condition t1.id > t2.id ensures that we keep the row with the smallest id (i.e., the first occurrence), and delete all other rows that have the same first_name, last_name, and department.
- Delete:
  - The DELETE statement removes rows from t1 (the first table in the join) where t1.id > t2.id, meaning the rows that are duplicates.
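Not every database accepts a JOIN inside DELETE. As a hedged alternative, the sketch below expresses the same self-comparison with a correlated EXISTS subquery, which is a different formulation from the JOIN statement above, not the same statement. It assumes an in-memory SQLite database via Python's built-in sqlite3 module; the table definition and variable names are assumptions made for this illustration.

```python
import sqlite3

# Assumed setup: in-memory SQLite database holding the sample employees rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John", "Doe", "HR"),
        (2, "Jane", "Smith", "IT"),
        (3, "John", "Doe", "HR"),
        (4, "Mark", "Lee", "Marketing"),
        (5, "Jane", "Smith", "IT"),
        (6, "John", "Doe", "HR"),
    ],
)

# A row is deleted when another row with the same three values and a
# smaller id exists, so only the smallest id in each group survives.
conn.execute(
    """
    DELETE FROM employees
    WHERE EXISTS (
        SELECT 1
        FROM employees AS t2
        WHERE t2.first_name = employees.first_name
          AND t2.last_name  = employees.last_name
          AND t2.department = employees.department
          AND t2.id < employees.id
    )
    """
)
conn.commit()

remaining = conn.execute(
    "SELECT id, first_name, last_name, department FROM employees ORDER BY id"
).fetchall()

for row in remaining:
    print(row)
```

The comparison t2.id < employees.id plays the same role as t1.id > t2.id in the JOIN version: the row with the smallest id never finds a smaller match, so it is the one that is kept.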
Method 3: Using a Temporary Table
-- Step 1: Create a temporary table with distinct rows
CREATE TEMPORARY TABLE temp_table AS
SELECT DISTINCT first_name, last_name, department
FROM employees;
-- Step 2: Delete all rows from the original table
DELETE FROM employees;
-- Step 3: Insert unique rows back from the temporary table
INSERT INTO employees (first_name, last_name, department)
SELECT first_name, last_name, department
FROM temp_table;
-- Step 4: Drop the temporary table
DROP TABLE temp_table;
Explanation:
- Temporary Table:
  - The CREATE TEMPORARY TABLE statement creates a new table (temp_table) that contains only distinct rows from the employees table. This step removes duplicates based on the combination of first_name, last_name, and department.
- Delete Original Rows:
  - The DELETE FROM employees statement deletes all rows from the original employees table.
- Insert Cleaned Data:
  - The INSERT INTO employees statement copies the unique rows from the temporary table back into the original employees table.
  - Because id is not copied, the original id values are not preserved. This step assumes the database generates id automatically on insert (e.g., an auto-increment column).
- Drop Temporary Table:
  - Finally, the DROP TABLE statement removes the temporary table after the data has been transferred.
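Because this method empties the table before refilling it, a failure between steps 2 and 3 could leave employees empty. The sketch below runs the four steps inside one explicit transaction so they succeed or fail together. It assumes an in-memory SQLite database via Python's built-in sqlite3 module, with id declared as INTEGER PRIMARY KEY so SQLite assigns new ids on insert; the table definition and variable names are assumptions made for this illustration.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# transaction boundaries below are controlled explicitly with BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John", "Doe", "HR"),
        (2, "Jane", "Smith", "IT"),
        (3, "John", "Doe", "HR"),
        (4, "Mark", "Lee", "Marketing"),
        (5, "Jane", "Smith", "IT"),
        (6, "John", "Doe", "HR"),
    ],
)

conn.execute("BEGIN")
try:
    # Step 1: copy the distinct combinations into a temporary table.
    conn.execute(
        "CREATE TEMPORARY TABLE temp_table AS "
        "SELECT DISTINCT first_name, last_name, department FROM employees"
    )
    # Step 2: empty the original table.
    conn.execute("DELETE FROM employees")
    # Step 3: insert the unique rows back; ids are generated anew.
    conn.execute(
        "INSERT INTO employees (first_name, last_name, department) "
        "SELECT first_name, last_name, department FROM temp_table"
    )
    # Step 4: drop the temporary table.
    conn.execute("DROP TABLE temp_table")
    conn.execute("COMMIT")
except Exception:
    # Undo all four steps if any of them fails.
    conn.execute("ROLLBACK")
    raise

people = sorted(
    conn.execute(
        "SELECT first_name, last_name, department FROM employees"
    ).fetchall()
)
for person in people:
    print(person)
```

The three unique combinations remain, but their id values are newly assigned and need not match the originals. If other tables reference employees.id, Method 1 or Method 2 is likely the safer choice.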