How to Turn Messy Data into Something You Can Trust

Why Data Cleaning Matters

You can’t build a solid argument on a shaky foundation — and raw data is often messy. Before you can analyze anything meaningfully, you have to clean it.

Even the best-designed survey can produce:

- Empty cells where people skipped questions
- Duplicate responses from the same person
- Typos or inconsistent spelling
- Mixed formats (e.g., “15 yrs” vs. “15”)

Just like tidying your room so you can find things, cleaning your data makes the patterns visible and your conclusions more trustworthy.

If you skip cleaning, you risk getting wrong answers — or worse, misleading ones.

What is “Dirty” Data?

“Dirty data” is any information that has problems: it may be incomplete, inconsistent, inaccurate, or just plain confusing. Even small mistakes in a dataset can lead to major misinterpretations, especially when you’re trying to tell a story, find a pattern, or make a decision.

Common problems in dirty data:

- Duplicates: the same response or person shows up twice
- Missing data: empty fields where a question was skipped
- Inconsistent entries: different spellings or formats for the same thing
- False or made-up answers: someone wrote “lol” for “age”
- Outliers or errors: someone put “200 hrs” of screen time for a week

What is “Clean” Data?

Clean data is:

- Organized
- Consistent
- Complete (or transparently incomplete)
- Formatted in a way that is ready for analysis
- Free of obvious bias, clutter, or confusion
- Easy to interpret

Think of cleaning data as the equivalent of proofreading your essay: it’s still your voice, but now people can actually read and understand it.

How to Clean Data

Cleaning data is like editing a video before posting it — it’s not about changing the truth, it’s about making it clear, usable, and fair.

Check for Duplicates

- Same person filled out your survey twice? Delete the extra.
- TikTok trend counted twice? Pick one version.

Removing repeats keeps them from skewing your analysis.
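
If your responses are in a spreadsheet or file, a few lines of Python can find repeats for you. This is only a sketch: the sample rows and the `respondent_id` field are made up, and it assumes your data has some field that identifies each person.

```python
# Hypothetical survey rows; "respondent_id" is an assumed field name.
responses = [
    {"respondent_id": "A1", "age": "15"},
    {"respondent_id": "B2", "age": "16"},
    {"respondent_id": "A1", "age": "15"},  # same person submitted twice
]

def drop_duplicates(rows, key):
    """Keep the first row for each key value and drop later repeats."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

deduped = drop_duplicates(responses, "respondent_id")
print(len(responses), "rows before,", len(deduped), "rows after")
```

Keeping the first copy is a choice, not a rule. If the two submissions differ, look at both before deciding which one to keep.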

Handle Missing Data

When an answer is missing, you have three options:

- Remove the row if it’s mostly blank or unreliable
- Leave it blank, but be transparent about it in your analysis
- Fill in with a placeholder or average, only if appropriate and clearly stated
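
The three ways of handling missing answers can be sketched in Python as follows. The sample rows, the field names, and the "more than half empty" cutoff are all assumptions for illustration; choose a cutoff that makes sense for your own survey.

```python
# Hypothetical survey rows; None marks a skipped question.
rows = [
    {"name": "Ana", "age": 15, "screen_hours": 20},
    {"name": "Ben", "age": None, "screen_hours": 35},
    {"name": None, "age": None, "screen_hours": None},  # entirely blank
]

def blank_share(row):
    """Fraction of fields in a row that are empty."""
    return sum(value is None for value in row.values()) / len(row)

# Option 1: remove rows that are mostly blank (assumed cutoff: more than half empty).
kept = [row for row in rows if blank_share(row) <= 0.5]

# Option 2: leave the remaining blanks alone, but count them so you can report them.
missing_ages = sum(row["age"] is None for row in kept)

# Option 3: fill blanks with the average of the known values, and flag each filled row.
known_ages = [row["age"] for row in kept if row["age"] is not None]
average_age = sum(known_ages) / len(known_ages)
filled = [
    {**row, "age": average_age, "age_was_filled": True}
    if row["age"] is None
    else {**row, "age_was_filled": False}
    for row in kept
]
```

The `age_was_filled` flag is what keeps option 3 honest: anyone reading your results can see which values were real answers and which were filled in.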

Standardize Formats

Make sure the same thing is always written the same way. Check:

- Spelling and capitalization
- Date formats (e.g., 7/4/2025 vs. July 4, 2025)
- Numbers vs. words (e.g., “ten” vs. “10”)
- Grades (“9th,” “Grade 9,” “Freshman”)
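
Here is one possible way to bring these formats into line with Python. The lookup tables are assumptions and only cover a few values, so extend them to match what actually shows up in your data. The date code assumes month-first dates (7/4/2025 means July 4) and English month names.

```python
from datetime import datetime

# Assumed lookup tables; add the spellings that appear in your own data.
NUMBER_WORDS = {"nine": 9, "ten": 10, "eleven": 11}
GRADE_LABELS = {"9th": 9, "grade 9": 9, "freshman": 9}
DATE_FORMATS = ["%m/%d/%Y", "%B %d, %Y"]  # 7/4/2025 and July 4, 2025

def clean_text(value):
    """Trim spaces and lowercase so 'Freshman ' and 'freshman' match."""
    return value.strip().lower()

def to_number(value):
    """Turn 'ten' or '10' into the number 10."""
    text = clean_text(value)
    return NUMBER_WORDS[text] if text in NUMBER_WORDS else int(text)

def to_grade(value):
    """Turn '9th', 'Grade 9', or 'Freshman' into the number 9."""
    return GRADE_LABELS[clean_text(value)]

def to_iso_date(value):
    """Try each known format and return the date as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {value!r}")

print(to_iso_date("7/4/2025"), to_iso_date("July 4, 2025"))
print(to_number("ten"), to_number("10"))
```

A value that is not in the tables raises an error instead of being guessed at. That is on purpose: it points you to entries you need to look at by hand.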

Outliers & Errors

Look for:

- Very high or low numbers that don’t make sense
- Responses that clearly break the question rules
- Impossible values, like an age of 700 or screen time of 1000 hours a week

Mention these values in your analysis as likely outliers.
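
A simple range check can point out values like these. In the Python sketch below, the limits and field names are assumptions; the only hard fact is that a week has 168 hours. The code flags suspicious rows instead of deleting them, so you can still mention them in your analysis.

```python
HOURS_IN_WEEK = 7 * 24  # 168, so weekly screen time cannot be higher

# Assumed plausible ranges; pick limits that fit your own questions.
LIMITS = {"age": (0, 120), "screen_hours_per_week": (0, HOURS_IN_WEEK)}

# Hypothetical survey rows.
rows = [
    {"age": 15, "screen_hours_per_week": 30},
    {"age": 700, "screen_hours_per_week": 25},
    {"age": 16, "screen_hours_per_week": 1000},
]

def find_problems(row):
    """Return the names of fields whose values fall outside the limits."""
    problems = []
    for field, (low, high) in LIMITS.items():
        if not low <= row[field] <= high:
            problems.append(field)
    return problems

# Pair each suspicious row's position with the fields that look wrong.
flagged = [
    (index, find_problems(row))
    for index, row in enumerate(rows)
    if find_problems(row)
]
print(flagged)
```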

Recode & Categorize

Reorganize open-ended responses into categories while preserving meaning.

- Don’t over-simplify
- Don’t twist the meaning
- Do group for clarity
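
Grouping open-ended answers can be sketched with simple keyword matching in Python. The categories, keywords, and sample answers below are invented for illustration, and keyword matching is a rough tool: read through the results, especially anything that lands in “other.”

```python
# Assumed categories and keywords, invented for illustration.
CATEGORIES = {
    "social media": ["tiktok", "instagram", "snapchat"],
    "gaming": ["game", "xbox", "minecraft"],
}

def categorize(answer):
    """Return the first category whose keyword appears in the answer, else 'other'."""
    text = answer.lower()
    for category, keywords in CATEGORIES.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"

answers = ["Mostly TikTok", "playing Minecraft with friends", "reading"]

# Keep the original wording next to the new category so the meaning is not lost.
recoded = [{"original": a, "category": categorize(a)} for a in answers]

for item in recoded:
    print(item["category"], "<-", item["original"])
```

Storing the original answer beside its category makes it easy to check that you grouped for clarity without twisting what the person said.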

Document Changes

Keep a record of:

- What you changed
- Why you changed it
- What might still be incomplete or biased

This builds transparency, credibility, and honesty into your work.
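
If you clean your data in code, you can keep this record as you go. The Python sketch below is one way to do it; the `log_change` helper and the sample entries are hypothetical.

```python
change_log = []

def log_change(what, why, caveat=""):
    """Record what changed, why, and what might still be incomplete or biased."""
    change_log.append({"what": what, "why": why, "caveat": caveat})

# Hypothetical entries showing the three things worth writing down.
log_change(
    what="Removed 1 duplicate response",
    why="Same respondent submitted the survey twice",
)
log_change(
    what="Left 1 blank age as missing",
    why="No reliable way to fill it in",
    caveat="Average age is based on fewer responses",
)

# Print the log so it can be pasted into your write-up.
for entry in change_log:
    line = f"- {entry['what']} (why: {entry['why']})"
    if entry["caveat"]:
        line += f" Note: {entry['caveat']}"
    print(line)
```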
