sia.hackernoon.com

In the age of Agentic AI - where autonomous systems launch campaigns, personalise content, and trigger actions without human approval - data quality is everything.

These AI systems act on what's in your CRM, ad platform, or analytics table. If that data is outdated, duplicated, or inconsistent, they don't just make small mistakes - they make them at scale.

And yes, the results from these autonomous systems aren’t always reliable - hallucinations happen, decisions occasionally miss the mark. But the pace of progress is fast. What feels experimental today could become a default workflow tomorrow.

That’s why it’s time to get serious about your data. If your team hasn’t made it a priority yet, this is a good place to start. Clean, consistent, reliable data isn’t a nice-to-have anymore - it’s the baseline for modern, AI - powered marketing.

According to Gartner, poor data quality costs organisations an average of $12.9 million annually, cutting into efficiency and clouding decision - making. And with AI now driving more of the day-to-day execution - from content to targeting - those costs are likely to climb. Bad data doesn't just stay in the background any more; it gets amplified.

I wrote this article as a practical SQL query cookbook for marketers using SQL Server Management Studio (SSMS) from Microsoft. Whether you’re technical or just data - curious, these examples will help you solve common issues fast. And with ChatGPT - or any large language model (LLM) - writing and explaining the queries, you don’t need deep SQL knowledge to get started.

With clean data and the ability to work with it directly, you’ll move faster - and your AI will, too.

You Don’t Need to Be a Developer Anymore

Historically, data clean-up in SQL required technical support. That’s no longer true.

Thanks to ChatGPT, marketers can now:

Describe a data issue in plain English
Get a SQL query instantly
Run or adapt it in SQL Server Management Studio
Ask ChatGPT to explain or rewrite the query for a different structure

Even if you only understand basic SQL - how SELECT, WHERE, and JOIN work - you can start writing powerful queries and learning as you go.

Basic SQL + ChatGPT = a fast track to becoming a professional, SQL - literate marketer.

How to Use This Cookbook

Open SQL Server Management Studio (SSMS)
Copy a query below
Adjust table or column names to match your schema
Run a SELECT to preview changes before applying DELETE or UPDATE
Use ChatGPT to adapt or explain any query

Note: All queries use SELECT so you can preview the changes first. Always confirm before converting to UPDATE or DELETE.

🍳 The Marketing Data Cleaning Query Cookbook

Name Formatting

1. Fix Name Capitalization (Proper Case)

SELECT 
  first_name,
  last_name,
  UPPER(LEFT(first_name, 1)) + LOWER(SUBSTRING(first_name, 2, LEN(first_name))) AS fixed_first_name,
  UPPER(LEFT(last_name, 1)) + LOWER(SUBSTRING(last_name, 2, LEN(last_name))) AS fixed_last_name
FROM contacts
WHERE first_name IS NOT NULL AND last_name IS NOT NULL;

This turns:

JOHN → John
sMiTh → Smith
anna → Anna

2. Flag Suspicious or Swapped Names

Flags names that look unusual in length (as a clue for errors like swapped fields or invalid entries).

SELECT *
FROM contacts
WHERE LEN(first_name) > 15 OR LEN(last_name) < 2;

This flags:

first_name = JOHNSONANDJOHNSON, last_name = A
first_name = ADMINACCOUNT, last_name = x

3. Build Full Name from First and Last Name

SELECT 
  first_name,
  last_name,
  UPPER(LEFT(first_name, 1)) + LOWER(SUBSTRING(first_name, 2, LEN(first_name))) + ' ' +
  UPPER(LEFT(last_name, 1)) + LOWER(SUBSTRING(last_name, 2, LEN(last_name))) AS full_name
FROM contacts;

This turns:

first_name = JOHN, last_name = SMITH → full_name = John Smith
first_name = aNNa, last_name = bROWN → full_name = Anna Brown

4. Trim Extra Spaces

Remove leading/trailing spaces.

SELECT 
  LTRIM(RTRIM(first_name)) AS trimmed_first_name,
  LTRIM(RTRIM(last_name)) AS trimmed_last_name
FROM contacts;

This turns:

' John ' → 'John'
' smith ' → 'smith'

Fuzzy matching and sound-based duplicates

Find similar-sounding duplicates using SOUNDEX + DIFFERENCE

Compares phonetically similar names. Example: Jon Smith and John Smyth

WITH PossibleDupes AS (
  SELECT 
    a.contact_id AS id1,
    b.contact_id AS id2,
    a.first_name AS name1,
    b.first_name AS name2,
    a.email AS email1,
    b.email AS email2,
    ROW_NUMBER() OVER (PARTITION BY a.first_name ORDER BY a.last_updated DESC) AS rn
  FROM contacts a
  JOIN contacts b
    ON a.contact_id < b.contact_id
   AND SOUNDEX(a.first_name) = SOUNDEX(b.first_name)
   AND DIFFERENCE(a.last_name, b.last_name) >= 3
)
SELECT * FROM contacts
WHERE contact_id IN (
  SELECT id2 FROM PossibleDupes WHERE rn > 1
);

This flags potential duplicates like:

Jon Smith and John Smyth → same SOUNDEX and high DIFFERENCE score
Sara Adams and Sarah Adamz → close pronunciation, potential duplicate
Keeps only the most recent entry by last_updated and flags others for review

Caution: Fuzzy logic can create false positives. Always inspect the results before deciding on deduplication.

String unification

1. Add leading zeros to numbers

Ensures that numeric fields are always the same length (e.g., 5-digit ZIPs, 8-digit IDs) by padding with leading zeros.

SELECT 
  zip_code,
  REPLICATE('0', 5 - LEN(zip_code)) + zip_code AS padded_zip
FROM leads
WHERE LEN(zip_code) < 5;

This turns:

123 → '00123'
7890 → '07890'
9 → '00009'

2. Standardize text using REPLICATE + UPPER

Standardizes values with consistent formatting — great for codes, campaign labels, or any inconsistent text input.

SELECT 
  campaign_code,
  UPPER(REPLICATE('0', 5 - LEN(campaign_code)) + campaign_code) AS formatted_code
FROM campaigns
WHERE LEN(campaign_code) < 5 
   OR campaign_code COLLATE Latin1_General_CS_AS != UPPER(campaign_code);

This turns:

abc → 00ABC
x9 → 000X9

3. Standardize country names

Maps common country variants to a standard form.

SELECT 
  country,
  CASE 
    WHEN country IN ('United States', 'US', 'U.S.') THEN 'USA' 
    ELSE country 
  END AS standardized_country
FROM contacts;

This turns:

'United States' → 'USA'
'US' → 'USA'
'U.S.' → 'USA'

4. Standardize Phone Numbers (e.g. US)

Formats 10-digit numbers for readability.

SELECT 
  phone,
  '(' + SUBSTRING(phone, 1, 3) + ') ' + SUBSTRING(phone, 4, 3) + '-' + SUBSTRING(phone, 7, 4) AS formatted_phone
FROM contacts
WHERE LEN(phone) = 10 AND phone NOT LIKE '(%';

This turns:

'2125551234' → '(212) 555-1234'

5. Classify free vs business email providers

SELECT 
  email,
  CASE 
    WHEN email LIKE '%gmail.com%' OR email LIKE '%yahoo.com%' OR email LIKE '%hotmail.com%' THEN 'free' 
    ELSE 'business' 
  END AS email_type
FROM leads;

This flags:

gmail.com, yahoo.com, hotmail.com emails as 'free'

Note: This is a preview-only query—it classifies but does not modify your data

6. Normalize job titles

SELECT 
  job_title,
  CASE 
    WHEN LOWER(job_title) LIKE '%manager%' 
      OR LOWER(job_title) LIKE '%lead%' 
      OR LOWER(job_title) LIKE '%head%' THEN 'Manager' 
    ELSE 'Other' 
  END AS job_level
FROM contacts;

This classifies:

'Product Manager', 'Team Lead', 'Head of Growth' → 'Manager' level

Dealing with Missing Data

1. Find records with missing critical fields

SELECT *
FROM contacts
WHERE email IS NULL OR first_name IS NULL OR last_name IS NULL;

This finds:

Contacts missing any of the key identity fields: email, first name, or last name

2. Fill missing values with defaults using ISNULL

Fill Missing Industry with 'Unknown'

SELECT 
  industry,
  ISNULL(industry, 'Unknown') AS cleaned_industry
FROM leads;

Fill Missing UTM Medium with 'email'

SELECT 
  utm_medium,
  ISNULL(utm_medium, 'email') AS cleaned_medium
FROM sessions;

Fill NULL Lead Scores with 0

SELECT 
  lead_score,
  ISNULL(lead_score, 0) AS final_score
FROM leads;

This turns:

NULL → 'Unknown'
NULL → 'email'
NULL → 0 (helps avoid failed scoring logic)

3. Use COALESCE to create fallback values

Combine full name from available fields.

SELECT 
  full_name,
  first_name,
  last_name,
  COALESCE(full_name, first_name + ' ' + last_name, 'Anonymous') AS display_name
FROM contacts;

This turns:

full_name = NULL, first_name = 'John', last_name = 'Smith' → 'John Smith'
All fields NULL → 'Anonymous'

Fill missing company from backup field

SELECT 
  company,
  employer,
  COALESCE(company, employer, 'Unknown') AS known_company
FROM contacts;

Avoiding Duplicate Data

1. Select duplicate contacts by email

WITH RankedContacts AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY email ORDER BY Create_Date DESC) AS rn
  FROM contacts
)
SELECT * FROM RankedContacts 
WHERE rn > 1;

This finds:

Multiple records with the same email: [email protected] (3 entries).

2. Detect Duplicate Full Names

WITH RankedNames AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY created_at DESC) AS rn
  FROM contacts
)
SELECT *
FROM RankedNames
WHERE rn > 1;

This flags:

People with the same name entered multiple times.

3. Detect duplicate email + name combo

WITH ComboDupes AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY email, first_name, last_name 
           ORDER BY contact_id
         ) AS rn
  FROM contacts
)
SELECT *
FROM ComboDupes
WHERE rn > 1;

This identifies:

Repeat entries for the same person based on name and email.

Out-of-range values and inaccurate data

1. Find out-of-range values

Flags unrealistic values (age, lead score, conversion rate).
Example:conversion_rate > 1 or age < 18

SELECT *
FROM leads
WHERE age NOT BETWEEN 18 AND 100
   OR lead_score NOT BETWEEN 0 AND 100
   OR conversion_rate > 1;

This flags:

age = 7 → flagged (too young)
lead_score = 130 → flagged (should be between 0 and 100)
conversion_rate = 1.2 → flagged (over 100% is not valid)

2. Detect contradictory values

Example:

Email opened after unsubscribe
UK phone number in US record

Lead score = 0 but multiple conversions

SELECT *
FROM contacts
WHERE (email_status = 'Unsubscribed' AND email_opened = 1)
   OR (country = 'USA' AND phone LIKE '+44%')
   OR (lead_score = 0 AND conversion_count > 0);

This flags:

email_status = 'Unsubscribed' and email_opened = 1 → person opened email after opting out
country = 'USA' and phone = '+447911123456' → UK number in US segment
lead_score = 0 and conversion_count = 3 → score doesn't reflect activity

3. Detect invalid emails

Flags values without @ or domain.

SELECT *
FROM leads
WHERE email NOT LIKE '%@%.%' OR email IS NULL OR email = '';

This finds:

‘johnexample.com’
NULL
““ (empty string)

4. Find internal or test contacts

SELECT *
FROM contacts
WHERE email LIKE '%test%' OR email LIKE '%internal%' OR name LIKE '%demo%';

This flags:

internal accounts used for QA/testing
[email protected]

5. Detect invalid dates

SELECT *
FROM events
WHERE event_date > GETDATE() OR event_date IS NULL;

This finds:

Future dates
Empty or null event dates

Splitting data of one column into more columns

1. Split Full Name into First and Last

SELECT 
  full_name,
  LEFT(full_name, CHARINDEX(' ', full_name) - 1) AS first_name,
  SUBSTRING(full_name, CHARINDEX(' ', full_name) + 1, LEN(full_name)) AS last_name
FROM contacts
WHERE full_name LIKE '% %';

This turns:

'John Smith' → first_name = 'John', last_name = 'Smith'

2. Split Location Field into City and Country

Assuming format: ”City, Country”

SELECT 
  location,
  LEFT(location, CHARINDEX(',', location) - 1) AS city,
  LTRIM(SUBSTRING(location, CHARINDEX(',', location) + 1, LEN(location))) AS country
FROM leads
WHERE location LIKE '%,%';

This turns:

Paris, France → city = ‘Paris’, country =’France’

Bonus: Train GPT on Your Schema

For even more accurate and relevant SQL, create a custom GPT assistant trained on:

A list of your CRM’s table and column names
Common cleaning queries from your team
Notes on formatting rules or naming patterns

Now ChatGPT doesn’t just write generic SQL. It writes your SQL.

Final Thoughts

Agentic AI is changing how marketers work. It’s fast, autonomous, and powerful—but it depends entirely on the quality of your data.

Thanks to SQL and ChatGPT, marketers can now clean that data themselves—without bottlenecks or technical debt. You don’t need to become an engineer. You just need:

The right questions
The right queries
And the right assistant

Use this cookbook. Prompt ChatGPT. Learn as you go.

Clean data isn’t just an optimization anymore. It’s the key to making AI work the way it’s supposed to.