In the age of Agentic AI - where autonomous systems launch campaigns, personalise content, and trigger actions without human approval - data quality is everything.
These AI systems act on what's in your CRM, ad platform, or analytics table. If that data is outdated, duplicated, or inconsistent, they don't just make small mistakes - they make them at scale.
And yes, the results from these autonomous systems aren’t always reliable - hallucinations happen, decisions occasionally miss the mark. But the pace of progress is fast. What feels experimental today could become a default workflow tomorrow.
That’s why it’s time to get serious about your data. If your team hasn’t made it a priority yet, this is a good place to start. Clean, consistent, reliable data isn’t a nice-to-have anymore - it’s the baseline for modern, AI - powered marketing.
According to Gartner, poor data quality costs organisations an average of $12.9 million annually, cutting into efficiency and clouding decision - making. And with AI now driving more of the day-to-day execution - from content to targeting - those costs are likely to climb. Bad data doesn't just stay in the background any more; it gets amplified.
I wrote this article as a practical SQL query cookbook for marketers using SQL Server Management Studio (SSMS) from Microsoft. Whether you’re technical or just data - curious, these examples will help you solve common issues fast. And with ChatGPT - or any large language model (LLM) - writing and explaining the queries, you don’t need deep SQL knowledge to get started.
With clean data and the ability to work with it directly, you’ll move faster - and your AI will, too.
You Don’t Need to Be a Developer Anymore
Historically, data clean-up in SQL required technical support. That’s no longer true.
Thanks to ChatGPT, marketers can now:
- Describe a data issue in plain English
- Get a SQL query instantly
- Run or adapt it in SQL Server Management Studio
- Ask ChatGPT to explain or rewrite the query for a different structure
Even if you only understand basic SQL - how SELECT, WHERE, and JOIN work - you can start writing powerful queries and learning as you go.
Basic SQL + ChatGPT = a fast track to becoming a professional, SQL - literate marketer.
How to Use This Cookbook
- Open SQL Server Management Studio (SSMS)
- Copy a query below
- Adjust table or column names to match your schema
- Run a SELECT to preview changes before applying DELETE or UPDATE
- Use ChatGPT to adapt or explain any query
Note: All queries use SELECT so you can preview the changes first. Always confirm before converting to UPDATE or DELETE.
🍳 The Marketing Data Cleaning Query Cookbook
Name Formatting
1. Fix Name Capitalization (Proper Case)
SELECT
first_name,
last_name,
UPPER(LEFT(first_name, 1)) + LOWER(SUBSTRING(first_name, 2, LEN(first_name))) AS fixed_first_name,
UPPER(LEFT(last_name, 1)) + LOWER(SUBSTRING(last_name, 2, LEN(last_name))) AS fixed_last_name
FROM contacts
WHERE first_name IS NOT NULL AND last_name IS NOT NULL;
This turns:
- JOHN → John
- sMiTh → Smith
- anna → Anna
2. Flag Suspicious or Swapped Names
Flags names that look unusual in length (as a clue for errors like swapped fields or invalid entries).
SELECT *
FROM contacts
WHERE LEN(first_name) > 15 OR LEN(last_name) < 2;
This flags:
- first_name = JOHNSONANDJOHNSON, last_name = A
- first_name = ADMINACCOUNT, last_name = x
3. Build Full Name from First and Last Name
SELECT
first_name,
last_name,
UPPER(LEFT(first_name, 1)) + LOWER(SUBSTRING(first_name, 2, LEN(first_name))) + ' ' +
UPPER(LEFT(last_name, 1)) + LOWER(SUBSTRING(last_name, 2, LEN(last_name))) AS full_name
FROM contacts;
This turns:
- first_name = JOHN, last_name = SMITH → full_name = John Smith
- first_name = aNNa, last_name = bROWN → full_name = Anna Brown
4. Trim Extra Spaces
Remove leading/trailing spaces.
SELECT
LTRIM(RTRIM(first_name)) AS trimmed_first_name,
LTRIM(RTRIM(last_name)) AS trimmed_last_name
FROM contacts;
This turns:
- ' John ' → 'John'
- ' smith ' → 'smith'
Fuzzy matching and sound-based duplicates
Find similar-sounding duplicates using SOUNDEX + DIFFERENCE
Compares phonetically similar names. Example: Jon Smith
and John Smyth
WITH PossibleDupes AS (
SELECT
a.contact_id AS id1,
b.contact_id AS id2,
a.first_name AS name1,
b.first_name AS name2,
a.email AS email1,
b.email AS email2,
ROW_NUMBER() OVER (PARTITION BY a.first_name ORDER BY a.last_updated DESC) AS rn
FROM contacts a
JOIN contacts b
ON a.contact_id < b.contact_id
AND SOUNDEX(a.first_name) = SOUNDEX(b.first_name)
AND DIFFERENCE(a.last_name, b.last_name) >= 3
)
SELECT * FROM contacts
WHERE contact_id IN (
SELECT id2 FROM PossibleDupes WHERE rn > 1
);
This flags potential duplicates like:
- Jon Smith and John Smyth → same SOUNDEX and high DIFFERENCE score
- Sara Adams and Sarah Adamz → close pronunciation, potential duplicate
- Keeps only the most recent entry by last_updated and flags others for review
Caution: Fuzzy logic can create false positives. Always inspect the results before deciding on deduplication.
String unification
1. Add leading zeros to numbers
Ensures that numeric fields are always the same length (e.g., 5-digit ZIPs, 8-digit IDs) by padding with leading zeros.
SELECT
zip_code,
REPLICATE('0', 5 - LEN(zip_code)) + zip_code AS padded_zip
FROM leads
WHERE LEN(zip_code) < 5;
This turns:
- 123 → '00123'
- 7890 → '07890'
- 9 → '00009'
2. Standardize text using REPLICATE + UPPER
Standardizes values with consistent formatting — great for codes, campaign labels, or any inconsistent text input.
SELECT
campaign_code,
UPPER(REPLICATE('0', 5 - LEN(campaign_code)) + campaign_code) AS formatted_code
FROM campaigns
WHERE LEN(campaign_code) < 5
OR campaign_code COLLATE Latin1_General_CS_AS != UPPER(campaign_code);
This turns:
- abc → 00ABC
- x9 → 000X9
3. Standardize country names
Maps common country variants to a standard form.
SELECT
country,
CASE
WHEN country IN ('United States', 'US', 'U.S.') THEN 'USA'
ELSE country
END AS standardized_country
FROM contacts;
This turns:
- 'United States' → 'USA'
- 'US' → 'USA'
- 'U.S.' → 'USA'
4. Standardize Phone Numbers (e.g. US)
Formats 10-digit numbers for readability.
SELECT
phone,
'(' + SUBSTRING(phone, 1, 3) + ') ' + SUBSTRING(phone, 4, 3) + '-' + SUBSTRING(phone, 7, 4) AS formatted_phone
FROM contacts
WHERE LEN(phone) = 10 AND phone NOT LIKE '(%';
This turns:
- '2125551234' → '(212) 555-1234'
5. Classify free vs business email providers
SELECT
email,
CASE
WHEN email LIKE '%gmail.com%' OR email LIKE '%yahoo.com%' OR email LIKE '%hotmail.com%' THEN 'free'
ELSE 'business'
END AS email_type
FROM leads;
This flags:
- gmail.com, yahoo.com, hotmail.com emails as 'free'
Note: This is a preview-only query—it classifies but does not modify your data
6. Normalize job titles
SELECT
job_title,
CASE
WHEN LOWER(job_title) LIKE '%manager%'
OR LOWER(job_title) LIKE '%lead%'
OR LOWER(job_title) LIKE '%head%' THEN 'Manager'
ELSE 'Other'
END AS job_level
FROM contacts;
This classifies:
- 'Product Manager', 'Team Lead', 'Head of Growth' → 'Manager' level
Dealing with Missing Data
1. Find records with missing critical fields
SELECT *
FROM contacts
WHERE email IS NULL OR first_name IS NULL OR last_name IS NULL;
This finds:
- Contacts missing any of the key identity fields: email, first name, or last name
2. Fill missing values with defaults using ISNULL
Fill Missing Industry with 'Unknown'
SELECT
industry,
ISNULL(industry, 'Unknown') AS cleaned_industry
FROM leads;
Fill Missing UTM Medium with 'email'
SELECT
utm_medium,
ISNULL(utm_medium, 'email') AS cleaned_medium
FROM sessions;
Fill NULL Lead Scores with 0
SELECT
lead_score,
ISNULL(lead_score, 0) AS final_score
FROM leads;
This turns:
- NULL → 'Unknown'
- NULL → 'email'
- NULL → 0 (helps avoid failed scoring logic)
3. Use COALESCE to create fallback values
Combine full name from available fields.
SELECT
full_name,
first_name,
last_name,
COALESCE(full_name, first_name + ' ' + last_name, 'Anonymous') AS display_name
FROM contacts;
This turns:
- full_name = NULL, first_name = 'John', last_name = 'Smith' → 'John Smith'
- All fields NULL → 'Anonymous'
Fill missing company from backup field
SELECT
company,
employer,
COALESCE(company, employer, 'Unknown') AS known_company
FROM contacts;
Avoiding Duplicate Data
1. Select duplicate contacts by email
WITH RankedContacts AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY Create_Date DESC) AS rn
FROM contacts
)
SELECT * FROM RankedContacts
WHERE rn > 1;
This finds:
- Multiple records with the same email: [email protected] (3 entries).
2. Detect Duplicate Full Names
WITH RankedNames AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY created_at DESC) AS rn
FROM contacts
)
SELECT *
FROM RankedNames
WHERE rn > 1;
This flags:
- People with the same name entered multiple times.
3. Detect duplicate email + name combo
WITH ComboDupes AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY email, first_name, last_name
ORDER BY contact_id
) AS rn
FROM contacts
)
SELECT *
FROM ComboDupes
WHERE rn > 1;
This identifies:
- Repeat entries for the same person based on name and email.
Out-of-range values and inaccurate data
1. Find out-of-range values
Flags unrealistic values (age, lead score, conversion rate).
Example:conversion_rate > 1
or age < 18
SELECT *
FROM leads
WHERE age NOT BETWEEN 18 AND 100
OR lead_score NOT BETWEEN 0 AND 100
OR conversion_rate > 1;
This flags:
- age = 7 → flagged (too young)
- lead_score = 130 → flagged (should be between 0 and 100)
- conversion_rate = 1.2 → flagged (over 100% is not valid)
2. Detect contradictory values
Example:
- Email opened after unsubscribe
- UK phone number in US record
Lead score = 0 but multiple conversions
SELECT *
FROM contacts
WHERE (email_status = 'Unsubscribed' AND email_opened = 1)
OR (country = 'USA' AND phone LIKE '+44%')
OR (lead_score = 0 AND conversion_count > 0);
This flags:
- email_status = 'Unsubscribed' and email_opened = 1 → person opened email after opting out
- country = 'USA' and phone = '+447911123456' → UK number in US segment
- lead_score = 0 and conversion_count = 3 → score doesn't reflect activity
3. Detect invalid emails
Flags values without @
or domain.
SELECT *
FROM leads
WHERE email NOT LIKE '%@%.%' OR email IS NULL OR email = '';
This finds:
- ‘johnexample.com’
- NULL
- ““ (empty string)
4. Find internal or test contacts
SELECT *
FROM contacts
WHERE email LIKE '%test%' OR email LIKE '%internal%' OR name LIKE '%demo%';
This flags:
- internal accounts used for QA/testing
- [email protected]
5. Detect invalid dates
SELECT *
FROM events
WHERE event_date > GETDATE() OR event_date IS NULL;
This finds:
- Future dates
- Empty or null event dates
Splitting data of one column into more columns
1. Split Full Name into First and Last
SELECT
full_name,
LEFT(full_name, CHARINDEX(' ', full_name) - 1) AS first_name,
SUBSTRING(full_name, CHARINDEX(' ', full_name) + 1, LEN(full_name)) AS last_name
FROM contacts
WHERE full_name LIKE '% %';
This turns:
- 'John Smith' → first_name = 'John', last_name = 'Smith'
2. Split Location Field into City and Country
Assuming format: ”City, Country”
SELECT
location,
LEFT(location, CHARINDEX(',', location) - 1) AS city,
LTRIM(SUBSTRING(location, CHARINDEX(',', location) + 1, LEN(location))) AS country
FROM leads
WHERE location LIKE '%,%';
This turns:
- Paris, France → city = ‘Paris’, country =’France’
Bonus: Train GPT on Your Schema
For even more accurate and relevant SQL, create a custom GPT assistant trained on:
- A list of your CRM’s table and column names
- Common cleaning queries from your team
- Notes on formatting rules or naming patterns
Now ChatGPT doesn’t just write generic SQL. It writes your SQL.
Final Thoughts
Agentic AI is changing how marketers work. It’s fast, autonomous, and powerful—but it depends entirely on the quality of your data.
Thanks to SQL and ChatGPT, marketers can now clean that data themselves—without bottlenecks or technical debt. You don’t need to become an engineer. You just need:
- The right questions
- The right queries
- And the right assistant
Use this cookbook. Prompt ChatGPT. Learn as you go.
Clean data isn’t just an optimization anymore. It’s the key to making AI work the way it’s supposed to.