Bragi Docs Help

SQL tips: de-duping a table

Duplicate rows are one of those headaches every data engineer runs into. They creep in through messy source files, flawed joins, or upstream jobs that didn’t quite behave. Once inside, they can throw off reports, inflate metrics, and generally waste compute.

The tricky part isn’t just removing duplicates, but making sure you’re keeping the right record.

We’ll walk through a clean, repeatable way to remove duplicates using SQL window functions.

Setting up the example

The idea is simple. We'll start by identifying the duplicates with ROW_NUMBER(). Then, we keep the first record per group based on our business key, and safely delete the rest, leaving a tidy dataset.

Let's create a sample table with some duplicates to play with:

CREATE TABLE MBDupes ( EntityId NVARCHAR(128) NOT NULL, EntitySite NVARCHAR(128) NOT NULL, EntityType NVARCHAR(32) NOT NULL, ObjectId INT NOT NULL ); GO -- Insert data including duplicates INSERT INTO MBDupes(EntityId, EntitySite, EntityType, ObjectId) VALUES (N'10021', N'Investec', N'JOB', 696969), (N'10411', N'Investec', N'JOB', 254815), (N'10026', N'Lovells', N'PROPERTY', 676869), (N'10215605', N'MawsonCollins', N'PROPERTY', 102461), (N'10539', N'Source', N'JOB', 253044), (N'1085', N'Living Room', N'PROPERTY', 221915), (N'10-Labourers', N'RavenscroftConstruction', N'JOB', 253328), -- Our duplicate rows (N'10215605', N'MawsonCollins', N'PROPERTY', 1269), (N'10539', N'Source', N'JOB', 6987), (N'10539', N'Source', N'JOB', 6969);

Here's the data setup (duplicates made bold):

EntityId

EntitySite

EntityType

ObjectId

10021

Investec

JOB

696969

10411

Investec

JOB

254815

1085

Living Room

PROPERTY

221915

10026

Lovells

PROPERTY

676869

10215605

MawsonCollins

PROPERTY

1269

10215605

MawsonCollins

PROPERTY

102461

10-Labourers

RavenscroftConstruction

JOB

253328

10539

Source

JOB

6969

10539

Source

JOB

6987

10539

Source

JOB

253044

The approach: tag, then delete

We use the ROW_NUMBER() window function to assign a sequential number to each row within the same business key (in this case, EntitySite and EntityId). Duplicates get a number greater than 1.

SELECT EntitySite, EntityId, EntityType, ObjectId, ROW_NUMBER() OVER (PARTITION BY EntitySite, EntityId ORDER BY ObjectId) AS row_num FROM dbo.MBDupes;

Sample output:

EntitySite

EntityId

EntityType

ObjectId

row_num

Investec

10021

JOB

696969

1

Investec

10411

JOB

254815

1

Living Room

1085

PROPERTY

221915

1

Lovells

10026

PROPERTY

676869

1

MawsonCollins

10215605

PROPERTY

1269

1

MawsonCollins

10215605

PROPERTY

102461

2

RavenscroftConstruction

10-Labourers

JOB

253328

1

Source

10539

JOB

6969

1

Source

10539

JOB

6987

2

Source

10539

JOB

253044

3

Deleting the duplicates

Using a common table expression (CTE), we delete all rows with row_num > 1, keeping only the first occurrence of each duplicate set, prioritising by ObjectId descending (keeping the latest):

WITH cte AS ( SELECT EntitySite, EntityId, ROW_NUMBER() OVER (PARTITION BY EntitySite, EntityId ORDER BY ObjectId DESC) AS row_num FROM dbo.MBDupes ) DELETE FROM cte WHERE row_num > 1;

Result:

Only unique rows remain, with duplicates removed.

Final table (duplicates removed)

EntityId

EntitySite

EntityType

ObjectId

10021

Investec

JOB

696969

10411

Investec

JOB

254815

10026

Lovells

PROPERTY

676869

10215605

MawsonCollins

PROPERTY

102461

10539

Source

JOB

253044

1085

Living Room

PROPERTY

221915

10-Labourers

RavenscroftConstruction

JOB

253328

Summary

This pattern is simple, reliable, and easy to adapt whether you’re cleaning a staging table or tightening controls in production pipelines. Keep it in your toolkit, it’s one of those small fixes that saves big headaches later.

09 March 2026