To delete duplicates from a table, first identify the duplicate rows and then delete them. This guide uses SQL; the exact syntax varies slightly between database management systems (e.g., MySQL, PostgreSQL, SQL Server), and tools like Excel or Python have their own approaches.

1. SQL – Deleting Duplicates

The general approach to deleting duplicates in a table involves:

  1. Identifying the duplicates.
  2. Keeping one row (usually the first or last occurrence).
  3. Deleting the extra duplicate rows.
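
Step 1 can be tried on its own before anything is deleted. The following is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The employees table, its columns, and the sample rows are assumptions for illustration; a GROUP BY with HAVING COUNT(*) > 1 lists each combination that appears more than once.

```python
import sqlite3

# Assumed setup: an in-memory SQLite database with a small employees table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John", "Doe", "HR"),
        (2, "Jane", "Smith", "IT"),
        (3, "John", "Doe", "HR"),
        (4, "Mark", "Lee", "Marketing"),
        (5, "Jane", "Smith", "IT"),
        (6, "John", "Doe", "HR"),
    ],
)

# Identify the duplicates: one result row per combination that occurs more than once.
duplicate_groups = conn.execute(
    """
    SELECT first_name, last_name, department, COUNT(*) AS copies
    FROM employees
    GROUP BY first_name, last_name, department
    HAVING COUNT(*) > 1
    ORDER BY first_name, last_name, department
    """
).fetchall()

for group in duplicate_groups:
    print(group)
```

Running a check like this first gives a count of how many rows a later DELETE should remove.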

Method 1: Using ROW_NUMBER() to Delete Duplicates (Recommended Method)

Consider the following table, employees, which contains some duplicate rows based on the combination of first_name, last_name, and department:

id  first_name  last_name  department
1   John        Doe        HR
2   Jane        Smith      IT
3   John        Doe        HR
4   Mark        Lee        Marketing
5   Jane        Smith      IT
6   John        Doe        HR

The goal is to delete the duplicate rows based on first_name, last_name, and department, while keeping the first occurrence of each combination.

WITH CTE AS (
    SELECT
        id,
        first_name,
        last_name,
        department,
        ROW_NUMBER() OVER (
            PARTITION BY first_name, last_name, department
            ORDER BY id
        ) AS rn
    FROM employees
)
DELETE FROM employees
WHERE id IN (
    SELECT id
    FROM CTE
    WHERE rn > 1
);

Explanation:

  1. ROW_NUMBER():
    • We use ROW_NUMBER() to assign a unique sequential number (rn) to each row in a group of rows with the same values in the columns first_name, last_name, and department.
    • The PARTITION BY clause groups rows by these columns, and the ORDER BY id ensures that the row with the smallest id (the first occurrence) gets assigned rn = 1.
  2. CTE (Common Table Expression):
    • The CTE (WITH clause) creates a temporary result set that contains the id, first_name, last_name, department, and the row number (rn) for each row.
  3. Delete Duplicates:
    • The DELETE statement removes rows from the employees table where the row number (rn) is greater than 1, meaning these are duplicates.
    • The IN subquery selects the id values where rn > 1, which identifies the duplicates that need to be deleted.
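
To see what ROW_NUMBER() assigns before deleting anything, the CTE's SELECT can be run by itself. This is a hedged sketch using Python's sqlite3 module with an in-memory copy of the sample table. It assumes the bundled SQLite is version 3.25 or newer, since older versions lack window functions.

```python
import sqlite3

# Assumed setup: in-memory SQLite copy of the sample employees table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John", "Doe", "HR"),
        (2, "Jane", "Smith", "IT"),
        (3, "John", "Doe", "HR"),
        (4, "Mark", "Lee", "Marketing"),
        (5, "Jane", "Smith", "IT"),
        (6, "John", "Doe", "HR"),
    ],
)

# Same numbering as the CTE: rn restarts at 1 within each
# (first_name, last_name, department) group, ordered by id.
numbered = conn.execute(
    """
    SELECT id, first_name, last_name, department,
           ROW_NUMBER() OVER (
               PARTITION BY first_name, last_name, department
               ORDER BY id
           ) AS rn
    FROM employees
    ORDER BY id
    """
).fetchall()

# Rows with rn > 1 are the ones the DELETE would remove.
ids_to_delete = [row[0] for row in numbered if row[4] > 1]

for row in numbered:
    print(row)
print("would delete ids:", ids_to_delete)
```

Previewing the rn values this way makes it easy to confirm that only the later copies in each group are marked for deletion.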

Result after deletion:

id  first_name  last_name  department
1   John        Doe        HR
2   Jane        Smith      IT
4   Mark        Lee        Marketing

Method 2: Using JOIN to Delete Duplicates

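This method uses a self-join to match rows that share the same first_name, last_name, and department, and then deletes the duplicates by comparing their id values. The DELETE t1 FROM ... JOIN form shown here is accepted by MySQL and SQL Server; PostgreSQL does not support it and uses DELETE ... USING instead.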

DELETE t1
FROM employees t1
JOIN employees t2
ON t1.first_name = t2.first_name
   AND t1.last_name = t2.last_name
   AND t1.department = t2.department
   AND t1.id > t2.id;

Explanation:

  1. Self-Join:
    • We join the employees table (t1) with itself (t2) on the columns first_name, last_name, and department.
    • This identifies rows in t1 that match rows in t2 (i.e., duplicates).
  2. Condition to Keep One Row:
    • The condition t1.id > t2.id matches a row in t1 whenever another row with the same first_name, last_name, and department has a smaller id. The row with the smallest id in each group (the first occurrence) never satisfies the condition, so it is the one that is kept.
  3. Delete:
    • The DELETE statement removes rows from t1 (the first table in the join) where t1.id > t2.id, meaning the rows that are duplicates.
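
On engines that do not accept a join in DELETE, the same comparison can be written as a correlated EXISTS subquery. The sketch below is an alternative formulation, not the statement above, and is run against SQLite through Python's sqlite3 module with an assumed in-memory copy of the sample table. MySQL may reject a subquery that reads the table being deleted from, so this form is aimed at engines such as SQLite and PostgreSQL.

```python
import sqlite3

# Assumed setup: in-memory SQLite copy of the sample employees table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John", "Doe", "HR"),
        (2, "Jane", "Smith", "IT"),
        (3, "John", "Doe", "HR"),
        (4, "Mark", "Lee", "Marketing"),
        (5, "Jane", "Smith", "IT"),
        (6, "John", "Doe", "HR"),
    ],
)

# Delete a row when an earlier row (smaller id) has the same three column values.
# This mirrors the t1.id > t2.id condition of the self-join.
with conn:
    conn.execute(
        """
        DELETE FROM employees
        WHERE EXISTS (
            SELECT 1
            FROM employees AS t2
            WHERE t2.first_name = employees.first_name
              AND t2.last_name = employees.last_name
              AND t2.department = employees.department
              AND t2.id < employees.id
        )
        """
    )

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM employees ORDER BY id")]
print(remaining_ids)
```

As with the self-join, rows whose compared columns contain NULL are not matched by the = comparisons and are therefore left in place.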

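Method 3: Using Temporary Table

This method copies the distinct rows into a temporary table, empties the original table, and then reloads it from the temporary copy.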

-- Step 1: Create a temporary table with distinct rows
CREATE TEMPORARY TABLE temp_table AS
SELECT DISTINCT first_name, last_name, department
FROM employees;

-- Step 2: Delete all rows from the original table
DELETE FROM employees;

-- Step 3: Insert unique rows back from the temporary table
INSERT INTO employees (first_name, last_name, department)
SELECT first_name, last_name, department
FROM temp_table;

-- Step 4: Drop the temporary table
DROP TABLE temp_table;

Explanation:

  1. Temporary Table:
    • The CREATE TEMPORARY TABLE statement creates a new table (temp_table) that holds one row for each distinct combination of first_name, last_name, and department in the employees table. The id column is left out on purpose: it differs between duplicates, so including it in SELECT DISTINCT would keep every row.
  2. Delete Original Rows:
    • The DELETE FROM employees statement deletes all rows from the original employees table.
  3. Insert Cleaned Data:
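    • The INSERT INTO employees statement copies the unique rows from the temporary table back into the original employees table. Because id is not copied, this assumes the id column is generated automatically, and the surviving rows receive new id values.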
  4. Drop Temporary Table:
    • Finally, the DROP TABLE statement removes the temporary table after the data has been transferred.
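
Because Step 2 empties the table, a failure between Steps 2 and 3 would leave it without data. One possible safeguard is to run those two steps inside a single transaction. The sketch below is one way to do that with Python's sqlite3 module, where the connection's context manager commits on success and rolls back on an exception. The in-memory table is an assumed setup, and in SQLite the INTEGER PRIMARY KEY column supplies the new id values.

```python
import sqlite3

# Assumed setup: in-memory SQLite copy of the sample employees table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John", "Doe", "HR"),
        (2, "Jane", "Smith", "IT"),
        (3, "John", "Doe", "HR"),
        (4, "Mark", "Lee", "Marketing"),
        (5, "Jane", "Smith", "IT"),
        (6, "John", "Doe", "HR"),
    ],
)
conn.commit()

# Step 1: copy the distinct combinations into a temporary table.
conn.execute(
    "CREATE TEMPORARY TABLE temp_table AS "
    "SELECT DISTINCT first_name, last_name, department FROM employees"
)

# Steps 2 and 3 in one transaction: if the INSERT fails, the DELETE is rolled back.
with conn:
    conn.execute("DELETE FROM employees")
    conn.execute(
        "INSERT INTO employees (first_name, last_name, department) "
        "SELECT first_name, last_name, department FROM temp_table"
    )

# Step 4: drop the temporary table.
conn.execute("DROP TABLE temp_table")

remaining = conn.execute(
    "SELECT first_name, last_name, department FROM employees "
    "ORDER BY first_name, last_name, department"
).fetchall()
print(remaining)
```

The id values are left out of the final query because they are regenerated by the INSERT and need not match the originals.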
