Efficient Data Deduplication and Null Value Handling in SQL Server

July 3, 2023
3 min read

Data merging and deduplication are tasks that I frequently encounter. Today, I encountered a situation where I needed to combine 40 separate Excel and CSV files into one cohesive and usable dataset. In this blog post, I will share my experience of tackling this task and explain the steps I took to achieve a successful outcome. From creating a master Excel file to importing the data into SQL Server and utilizing efficient techniques for data cleansing and deduplication, I will guide you through the entire process.

I first came up with common set of fields and created a master xlsx.  After the laborious task of merging all the files, I imported the file into SQL Server. There are a variety of ways to import data, this is just what I did today.  

Now with my data in a table, I could run SQL scripts to cleanse my data.  Since the original data was coming from a bunch of files, I figured there might be some overlap.  I did a few quick queries and found that to be the case.  I noticed that there were duplicate email addresses in the table, however, some of the extra fields were NULL and some has values.  My duty is to pull a list of records, de-dupped by email address, that have as much data as possible from the other fields.

I could so some sort of loop or cursor and GROUP BY email, but my table has about a million records and either of those options would take forever.  As it turns out, you can use COALESCE and pull the list in one quick query.

Let's try it out.

 

Example

Let's start out by creating a TABLE and filling it up with some test data:

CREATE TABLE dbo.tablename (
	email nvarchar(50) NOT NULL, 
	firstname VARCHAR(50) NULL, 
	lastname VARCHAR(50) NULL, 
	company NVARCHAR(100) NULL,
	title NVARCHAR(100) NULL
)
INSERT INTO tablename (email, firstname, lastname, company, title) VALUES
  ('[email protected]', 'Jennifer', 'Doe', NULL, NULL),
  ('[email protected]', 'Steve', NULL, NULL, 'Janitor'),
  ('[email protected]', NULL, NULL, 'Clean Up', 'Janitor'),
  ('[email protected]', 'Steve', 'Doe', 'Tidy Up', 'SEO')
  

Now notice that we have 2 duplicate email addresses with varying data in each record.  My goal is to only pull each email address once, but also grab all the data even though it's spread out over multiple records.  COALESCE and GROUP BY to the rescue:

SELECT email,
	COALESCE(MIN(firstname), MIN(firstname)) AS firstname,
	COALESCE(MIN(lastname), MIN(lastname)) AS lastname,
	COALESCE(MIN(company), MIN(company)) AS company,
	COALESCE(MIN(title), MIN(title)) AS title
FROM dbo.tablename 
GROUP BY email
ORDER BY email


There you go.  It's now properly pulling:

emailfirstnamelastnamecompanytitle
[email protected]JenniferDoeClean UpJanitor
[email protected]SteveDoeTidy UpJanitor

 

Hope you learned something new and it saves you time.