How to Turn Messy Data into Something You Can Trust

Why Data Cleaning Matters

You can’t build a solid argument on a shaky foundation — and raw data is often messy. Before you can analyze anything meaningfully, you have to clean it.

Even the best-designed survey can produce:

  • Empty cells where people skipped questions

  • Duplicate responses from the same person

  • Typos or inconsistent spelling

  • Mixed formats (e.g., “15 yrs” vs. “15”)

Just like tidying your room before you look for something, cleaning your data makes the patterns visible and your conclusions more trustworthy.

If you skip cleaning, you risk getting wrong answers — or worse, misleading ones.

What is “Dirty” Data?

“Dirty data” is any information that has problems because it’s incomplete, inconsistent, inaccurate, or just plain confusing. Even small mistakes in a dataset can lead to major misinterpretations, especially when you’re trying to tell a story, find a pattern, or make a decision.

Common problems in dirty data:

  • Duplicates: the same response or person shows up twice

  • Missing data: empty fields where a question was skipped

  • Inconsistent entries: different spellings or formats for the same thing

  • False or made-up answers: someone wrote “lol” for “age”

  • Outliers or errors: someone put “200 hrs” of screen time for a week (a week only has 168 hours)

What is “Clean” Data?

Clean data is:

  • Organized

  • Consistent

  • Complete (or transparently incomplete)

  • Formatted in a way that is ready for analysis

  • Free of obvious bias, clutter, or confusion

  • Easy to interpret

Think of cleaning data as the equivalent of proofreading your essay: it’s still your voice, but now people can actually read and understand it.

How to Clean Data

Cleaning data is like editing a video before posting it — it’s not about changing the truth, it’s about making it clear, usable, and fair.

  Remove duplicates

    • Same person filled out your survey twice? Delete the extra.

    • TikTok trend counted twice? Pick one version.

    Keep your analysis from being skewed by repeats.
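Removing repeats can be sketched in a few lines of Python. This is a minimal illustration, not the only way to do it: the rows, the `respondent_id` field, and the `drop_duplicates` helper are all made up for the example, and it assumes each person can be recognized by one ID.

```python
# Hypothetical survey rows; "respondent_id" is an assumed field name.
responses = [
    {"respondent_id": "A1", "screen_time_hrs": 14},
    {"respondent_id": "B2", "screen_time_hrs": 21},
    {"respondent_id": "A1", "screen_time_hrs": 14},  # same person, second submission
]

def drop_duplicates(rows, key):
    """Keep the first row for each value of `key`; return kept and removed rows."""
    seen = set()
    kept, removed = [], []
    for row in rows:
        if row[key] in seen:
            removed.append(row)  # set aside instead of silently discarding
        else:
            seen.add(row[key])
            kept.append(row)
    return kept, removed

kept, removed = drop_duplicates(responses, "respondent_id")
print(len(kept), "kept,", len(removed), "removed")
```

Returning the removed rows as well as the kept ones makes it easy to say exactly how many repeats you deleted.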

  Handle missing data

    • Remove the row if it’s mostly blank or unreliable

    • Leave it blank, but be transparent in your analysis

    • Fill in with a placeholder or average, only if appropriate and clearly stated
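Here is a rough sketch of those three choices for blank answers. The sample rows, the field names, and the “more than half blank” cutoff are assumptions chosen for illustration, so pick a rule that fits your own survey.

```python
from statistics import mean

# Hypothetical rows; None marks a skipped question.
rows = [
    {"name": "Ana", "age": 15, "screen_time_hrs": 20},
    {"name": "Ben", "age": None, "screen_time_hrs": 30},
    {"name": None, "age": None, "screen_time_hrs": None},
]

def mostly_blank(row, threshold=0.5):
    """True when more than `threshold` of the fields are empty (assumed cutoff)."""
    blanks = sum(1 for value in row.values() if value is None)
    return blanks / len(row) > threshold

# Option 1: remove rows that are mostly blank.
usable = [row for row in rows if not mostly_blank(row)]

# Option 2: leave the remaining blanks alone, but count them so you can report them.
missing_ages = sum(1 for row in usable if row["age"] is None)

# Option 3: fill with the average, and mark every value that was filled in.
known_ages = [row["age"] for row in usable if row["age"] is not None]
average_age = mean(known_ages)
filled = [
    {**row, "age": average_age, "age_was_filled": True}
    if row["age"] is None
    else {**row, "age_was_filled": False}
    for row in usable
]
```

The extra `age_was_filled` field is one way to keep a filled-in value “clearly stated”: anyone reading the data can tell real answers from estimates.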

  Make formats consistent

    • Spelling and capitalization

    • Date formats (e.g., 7/4/2025 vs. July 4, 2025)

    • Numbers vs. words (e.g., “ten” vs. “10”)

    • Grades (“9th,” “Grade 9,” “Freshman”)
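One way to make entries consistent is a small set of helper functions, one per kind of answer. The lookup tables and function names below are assumptions for this sketch, and the date helper assumes the month comes first in dates like 7/4/2025.

```python
from datetime import datetime

# Assumed lookup tables; extend them to match what actually appears in your data.
GRADE_LABELS = {"9th": 9, "grade 9": 9, "freshman": 9}
NUMBER_WORDS = {"ten": 10}

def clean_grade(text):
    """Map different spellings of the same grade to one number, or None if unknown."""
    return GRADE_LABELS.get(text.strip().lower())

def clean_number(text):
    """Turn "10" or "ten" into the integer 10; return None for anything else."""
    text = text.strip().lower()
    if text.isdecimal():
        return int(text)
    return NUMBER_WORDS.get(text)

def clean_date(text):
    """Try each assumed date format and return an ISO date string, or None."""
    for fmt in ("%m/%d/%Y", "%B %d, %Y"):  # month-first is an assumption
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(clean_grade("Freshman"), clean_number("ten"), clean_date("July 4, 2025"))
```

Each helper returns `None` when it does not recognize the input, so unknown answers stay visible for a human to check instead of being guessed at.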

  Check for outliers and errors

    • Very high or low numbers that don’t make sense

    • Responses that clearly break the question rules

    • Age 700? Screen time of 1000 hours a week?

    Note the value in your analysis as a likely outlier or error.
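A simple range check can catch values like these. The sketch below flags answers instead of deleting them; the age limits are an assumption for a typical survey, while the screen-time limit comes from the fact that a week has 7 × 24 = 168 hours.

```python
HOURS_IN_A_WEEK = 7 * 24  # 168, so no weekly total can exceed this

# Assumed plausible ranges; adjust them to your own questions.
PLAUSIBLE = {
    "age": (5, 120),
    "screen_time_hrs": (0, HOURS_IN_A_WEEK),
}

def flag_outliers(row):
    """Return the names of fields whose values fall outside the plausible range."""
    flagged = []
    for field, (low, high) in PLAUSIBLE.items():
        value = row.get(field)
        if value is not None and not (low <= value <= high):
            flagged.append(field)
    return flagged

# Hypothetical rows for illustration.
rows = [
    {"age": 15, "screen_time_hrs": 25},
    {"age": 700, "screen_time_hrs": 1000},
]
for row in rows:
    print(row, "->", flag_outliers(row))
```

Flagging leaves the decision to you: a flagged value might be a typo, a joke answer, or a real but unusual response.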

  Group open-ended responses

  • Reorganize open-ended responses into categories while preserving meaning.

    • Don’t over-simplify

    • Don’t twist the meaning

    • Do group for clarity
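Grouping written answers could look something like the sketch below. The categories, keywords, and sample answers are invented for a hypothetical question such as “What do you mostly use your phone for?”, and simple keyword matching is only a first pass that still needs a human read-through.

```python
# Assumed categories and keywords; build yours from the answers you actually got.
CATEGORIES = {
    "social media": ["tiktok", "instagram", "snapchat"],
    "school": ["homework", "class", "study"],
    "games": ["game", "gaming"],
}

def categorize(answer):
    """Return every category whose keywords appear in the answer, or ["other"]."""
    text = answer.lower()
    matches = [
        category
        for category, keywords in CATEGORIES.items()
        if any(keyword in text for keyword in keywords)
    ]
    return matches or ["other"]

answers = ["Mostly TikTok and homework", "talking to my grandma"]

# Keep the original wording next to the category so the meaning is never lost.
coded = [{"original": a, "categories": categorize(a)} for a in answers]
for item in coded:
    print(item)
```

Allowing more than one category per answer, and keeping an “other” bucket, helps avoid over-simplifying or forcing an answer into a group where it does not belong.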

  Document your cleaning

    • What you changed

    • Why you changed it

    • What might still be incomplete or biased

    This builds transparency, credibility, and honesty into your work.
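A cleaning log does not need to be fancy. As a sketch, the hypothetical `log_change` helper below records the three things listed above for each decision; the entries and their counts are made-up examples, not real results.

```python
change_log = []

def log_change(what, why, caveat=""):
    """Record one cleaning decision so it can be reported with the results."""
    change_log.append({"what": what, "why": why, "caveat": caveat})

# Made-up entries for illustration only.
log_change(
    what="Removed 1 duplicate response",
    why="Same respondent submitted the survey twice",
)
log_change(
    what="Left 3 skipped age answers blank",
    why="Too few answers to estimate a fair average",
    caveat="Age results describe only the students who answered",
)

for entry in change_log:
    print(f"- {entry['what']} (why: {entry['why']})")
    if entry["caveat"]:
        print(f"  still incomplete or biased: {entry['caveat']}")
```

Printing the log at the end of your analysis gives readers a plain list of every change and its reason.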
