SQL tips: de-duping a table
Duplicate rows are one of those headaches every data engineer runs into. They creep in through messy source files, flawed joins, or upstream jobs that didn’t quite behave. Once inside, they can throw off reports, inflate metrics, and generally waste compute.
The tricky part isn’t just removing duplicates, but making sure you’re keeping the right record.
We’ll walk through a clean, repeatable way to remove duplicates using SQL window functions.
Setting up the example
The idea is simple. We'll start by identifying the duplicates with ROW_NUMBER(). Then, we keep the first record per group based on our business key, and safely delete the rest, leaving a tidy dataset.
Let's create a sample table with some duplicates to play with:
Here's the data setup (duplicates made bold):
EntityId | EntitySite | EntityType | ObjectId |
|---|---|---|---|
10021 | Investec | JOB | 696969 |
10411 | Investec | JOB | 254815 |
1085 | Living Room | PROPERTY | 221915 |
10026 | Lovells | PROPERTY | 676869 |
10215605 | MawsonCollins | PROPERTY | 1269 |
10215605 | MawsonCollins | PROPERTY | 102461 |
10-Labourers | RavenscroftConstruction | JOB | 253328 |
10539 | Source | JOB | 6969 |
10539 | Source | JOB | 6987 |
10539 | Source | JOB | 253044 |
The approach: tag, then delete
We use the ROW_NUMBER() window function to assign a sequential number to each row within the same business key (in this case, EntitySite and EntityId). Duplicates get a number greater than 1.
Sample output:
EntitySite | EntityId | EntityType | ObjectId | row_num |
|---|---|---|---|---|
Investec | 10021 | JOB | 696969 | 1 |
Investec | 10411 | JOB | 254815 | 1 |
Living Room | 1085 | PROPERTY | 221915 | 1 |
Lovells | 10026 | PROPERTY | 676869 | 1 |
MawsonCollins | 10215605 | PROPERTY | 1269 | 1 |
MawsonCollins | 10215605 | PROPERTY | 102461 | 2 |
RavenscroftConstruction | 10-Labourers | JOB | 253328 | 1 |
Source | 10539 | JOB | 6969 | 1 |
Source | 10539 | JOB | 6987 | 2 |
Source | 10539 | JOB | 253044 | 3 |
Deleting the duplicates
Using a common table expression (CTE), we delete all rows with row_num > 1, keeping only the first occurrence of each duplicate set, prioritising by ObjectId descending (keeping the latest):
Result:
Only unique rows remain, with duplicates removed.
Final table (duplicates removed)
EntityId | EntitySite | EntityType | ObjectId |
|---|---|---|---|
10021 | Investec | JOB | 696969 |
10411 | Investec | JOB | 254815 |
10026 | Lovells | PROPERTY | 676869 |
10215605 | MawsonCollins | PROPERTY | 102461 |
10539 | Source | JOB | 253044 |
1085 | Living Room | PROPERTY | 221915 |
10-Labourers | RavenscroftConstruction | JOB | 253328 |
Summary
This pattern is simple, reliable, and easy to adapt whether you’re cleaning a staging table or tightening controls in production pipelines. Keep it in your toolkit, it’s one of those small fixes that saves big headaches later.