Domain 1: Identify the Business Value of Generative AI Solutions (Module 9 of 11)

Data Quality: The Make-or-Break Factor for AI

Every AI system is only as good as the data behind it. Learn the data quality dimensions that determine whether AI helps or harms, and how to assess your organisation's readiness.

Why does data quality matter more with AI?

Simple explanation

“Garbage in, garbage out” has been true in computing for decades. With AI, it’s “garbage in, confidently wrong garbage out, at scale.”

Traditional software often crashes or throws errors when data is bad. AI doesn’t. It takes your messy, incomplete, outdated data and produces polished, professional-looking output that seems correct but isn’t. And it does it fast, across your entire organisation.

That’s why data quality isn’t a technical detail for your IT team. It’s a strategic priority for every leader deploying AI.

Data types: Structured, unstructured, and semi-structured

AI systems work with three types of data, each with different quality challenges:

Data types and their quality challenges
| Data type | What it looks like | Examples | Quality challenge |
|---|---|---|---|
| Structured data | Organised in rows and columns with defined formats | Databases, spreadsheets, CRM records, financial transactions | Missing values, duplicate records, inconsistent formats (dates, currencies) |
| Unstructured data | No predefined format; free-form content | Emails, documents, Teams chats, images, videos, meeting transcripts | Outdated content, contradictory versions, poor organisation, no metadata |
| Semi-structured data | Has some organisation but not rigid rows and columns | JSON files, XML data, tagged emails, SharePoint metadata | Inconsistent tagging, missing fields, schema variations across sources |
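
The structured-data challenges in the table (missing values, duplicate records, inconsistent formats) are the easiest to measure. The following is a minimal Python sketch, not a prescribed method: the records, field names, and helper functions are all hypothetical, invented only to show what such checks might look like.

```python
from collections import Counter
from datetime import datetime

# Hypothetical CRM rows; field names and values are invented for illustration.
records = [
    {"id": 1, "email": "ana@example.com", "industry": "Retail", "created": "2024-03-01"},
    {"id": 2, "email": "ben@example.com", "industry": None, "created": "01/04/2024"},
    {"id": 3, "email": "ANA@example.com", "industry": "Retail", "created": "2024-03-01"},
]

def missing_rate(rows, field):
    """Share of rows where the field is empty or absent."""
    missing = sum(1 for row in rows if not row.get(field))
    return missing / len(rows)

def duplicate_values(rows, field):
    """Values that appear more than once, ignoring case and surrounding spaces."""
    counts = Counter(str(row[field]).strip().lower() for row in rows if row.get(field))
    return sorted(value for value, n in counts.items() if n > 1)

def non_iso_dates(rows, field):
    """IDs of rows whose date does not parse as YYYY-MM-DD."""
    bad = []
    for row in rows:
        try:
            datetime.strptime(row[field], "%Y-%m-%d")
        except (ValueError, TypeError, KeyError):
            bad.append(row["id"])
    return bad

print("Missing industry:", missing_rate(records, "industry"))
print("Duplicate emails:", duplicate_values(records, "email"))
print("Non-ISO dates in rows:", non_iso_dates(records, "created"))
```

Unstructured content rarely allows checks this simple, which is part of why it is the hardest to quality-check.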
Exam tip: Why unstructured data matters most for gen AI

Most enterprise data is unstructured: documents, emails, chats, presentations. This is exactly the data that generative AI (especially Copilot) grounds on.

The exam may test whether you understand that:

  • Roughly 80% of enterprise data is unstructured (a commonly cited industry estimate), and it’s the hardest to quality-check
  • Copilot primarily grounds on unstructured data via Microsoft Graph (emails, documents, chats)
  • Poor unstructured data quality directly leads to poor AI responses

Five dimensions of data quality

Leaders should evaluate data across five key dimensions before deploying AI:

| Dimension | What it means | AI impact if poor | Check |
|---|---|---|---|
| Accuracy | Data reflects reality correctly | AI provides factually wrong answers with high confidence | Are product specs, prices, and policies current and verified? |
| Completeness | No critical gaps or missing fields | AI can’t answer questions about missing topics, or fills in gaps with fabrications | Are all departments, products, and regions represented in the data? |
| Timeliness | Data is current and regularly updated | AI gives outdated answers: last year’s pricing, old policies, former employees | When was each document last reviewed? Is there a refresh schedule? |
| Consistency | Same information is recorded the same way across sources | AI gets contradictory inputs and produces unpredictable responses | Does the HR policy in SharePoint match the version in the employee handbook? |
| Relevance | Data is appropriate for the AI use case | AI retrieves noise instead of signal; irrelevant content dilutes good answers | Is the indexed content actually useful for the questions users will ask? |
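
Two of these dimensions, timeliness and consistency, can be turned into simple numbers. The sketch below is one possible illustration in Python. It assumes a hypothetical document inventory with a title, a last-reviewed date, and the text, and a two-year staleness threshold chosen only for the example. A real audit would pull this information from the organisation's own content systems.

```python
import hashlib
from collections import defaultdict
from datetime import date

# Hypothetical document inventory; titles, dates, and text are invented.
documents = [
    {"title": "Employee Handbook", "last_reviewed": date(2021, 5, 1), "text": "Leave: 20 days"},
    {"title": "Employee Handbook", "last_reviewed": date(2024, 2, 1), "text": "Leave: 25 days"},
    {"title": "Price List", "last_reviewed": date(2024, 6, 1), "text": "Widget: 10 EUR"},
    {"title": "Travel Policy", "last_reviewed": date(2020, 1, 15), "text": "Economy only"},
]

def stale_share(docs, today, max_age_days=730):
    """Timeliness: fraction of documents not reviewed within max_age_days."""
    stale = [d for d in docs if (today - d["last_reviewed"]).days > max_age_days]
    return len(stale) / len(docs)

def conflicting_titles(docs):
    """Consistency: titles that exist in more than one distinct version."""
    versions = defaultdict(set)
    for d in docs:
        digest = hashlib.sha256(d["text"].encode("utf-8")).hexdigest()
        versions[d["title"]].add(digest)
    return sorted(title for title, digests in versions.items() if len(digests) > 1)

audit_date = date(2025, 1, 1)  # fixed date so the example is reproducible
print("Share of stale documents:", stale_share(documents, audit_date))
print("Titles with conflicting versions:", conflicting_titles(documents))
```

Accuracy and relevance are harder to automate and usually need a human reviewer who knows the subject matter.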

Representative datasets: Why they matter for fairness

A representative dataset reflects the full diversity of the population or scenarios the AI will encounter. If the training data or grounding data is skewed, the AI’s outputs will be biased.

| Problem | What happens | Real-world example |
|---|---|---|
| Underrepresentation | AI performs poorly for groups missing from the data | A hiring AI trained mostly on male resumes ranks female candidates lower |
| Historical bias | Data reflects past discrimination, and AI perpetuates it | A lending model trained on historical approvals denies loans to demographics that were historically discriminated against |
| Geographic skew | Data overrepresents certain regions or cultures | A customer support AI trained on US data gives incorrect answers about EU regulations |
| Temporal bias | Training data is outdated, reflecting old patterns | A market analysis AI recommends strategies based on pre-pandemic consumer behaviour |
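
A first-pass check for underrepresentation or geographic skew is to compare each group's share of the data with its share of the population the AI will serve. The Python sketch below illustrates the idea under stated assumptions: the region labels, the reference shares, and the 10-percentage-point tolerance are all hypothetical, and a real fairness review would go well beyond a share comparison.

```python
from collections import Counter

def representation_gaps(samples, reference_shares, tolerance=0.10):
    """Return groups whose share in the data differs from the reference
    share by more than the tolerance (positive = overrepresented)."""
    counts = Counter(samples)
    total = len(samples)
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = round(observed - expected, 2)
    return gaps

# Hypothetical figures: region of each training example vs. the real customer base.
training_regions = ["US"] * 80 + ["EU"] * 15 + ["APAC"] * 5
customer_base = {"US": 0.50, "EU": 0.30, "APAC": 0.20}

print(representation_gaps(training_regions, customer_base))
```

In this invented example the US is overrepresented and the EU and APAC are underrepresented, which mirrors the geographic skew row in the table.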
Why leaders, not just data scientists, need to care about representation

Representative datasets aren’t just a technical concern. They’re a governance and reputational risk:

  • Regulatory: The EU AI Act and similar regulations impose bias-related obligations, particularly on high-risk AI systems
  • Reputational: A biased AI in customer-facing applications can generate headlines
  • Legal: Discriminatory AI outputs can create liability

The board and C-suite need to ask: “Does our data represent all the people and scenarios this AI will encounter?” If the answer is no, the AI isn’t ready for deployment.

Real-world scenario: Dr. Patel audits data quality before AI deployment

Dr. Anisha Patel, Board Advisor, insists that her client’s organisation completes a data quality audit before rolling out Copilot to 3,000 employees. Here’s what the audit finds:

SharePoint:

  • 40% of documents haven’t been updated in over 2 years
  • Three versions of the employee handbook exist, with conflicting information
  • The old intranet site was migrated but never cleaned up, so 10,000 outdated pages are still indexed

CRM data:

  • 15% of customer records have no industry classification
  • Duplicate contact records across regions mean AI pulls conflicting account information

Email and Teams:

  • Teams channels created for past projects still contain outdated decisions and superseded plans
  • No archival policy means Copilot surfaces 4-year-old email threads as current context

Dr. Patel’s recommendation: Do not deploy Copilot organisation-wide until critical data hygiene is addressed. Start with a pilot in one department with clean data, and use the findings to build a data cleanup roadmap.

Dr. Patel's data preparation checklist for leaders

Before any AI deployment, ensure:

  1. Archive or delete outdated content: if it’s not current, it shouldn’t be in the AI’s reach
  2. Consolidate duplicate and conflicting documents into single sources of truth
  3. Review permissions: AI will surface anything users can access, so fix oversharing first
  4. Establish ownership: every key document should have an owner responsible for accuracy
  5. Create a refresh schedule: data that’s never updated becomes a liability, not an asset
  6. Test with real queries: ask the AI questions you know the answers to and verify it responds correctly
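
Step 6 can be run as a small, repeatable check rather than ad hoc spot tests. The following Python sketch shows one way this might look. The `fake_assistant` function is a stand-in invented for the example, not a real Copilot API, and the questions and expected facts are hypothetical. In practice the `ask` function would be replaced by however your organisation queries its AI tool, and a person would still review the replies.

```python
def run_known_answer_checks(ask, cases):
    """Ask each question and record whether the expected fact appears in the reply."""
    results = []
    for question, expected in cases:
        reply = ask(question)
        results.append((question, expected.lower() in reply.lower()))
    return results

def fake_assistant(question):
    """Hypothetical stand-in for a real assistant, returning canned replies."""
    canned = {
        "How many days of annual leave do employees get?": "Employees get 25 days of annual leave.",
        "What is the expense approval limit?": "The limit is 500 EUR.",
    }
    return canned.get(question, "I could not find that.")

# Questions whose correct answers are already known to the tester (invented here).
cases = [
    ("How many days of annual leave do employees get?", "25 days"),
    ("What is the expense approval limit?", "1,000 EUR"),
]

results = run_known_answer_checks(fake_assistant, cases)
pass_rate = sum(ok for _, ok in results) / len(results)

for question, ok in results:
    print("PASS" if ok else "FAIL", "-", question)
print("Pass rate:", pass_rate)
```

A failed check, like the second case here, points to a specific document or source that needs cleanup before wider rollout.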

Key flashcards

Question

What are the five dimensions of data quality?

Answer

Accuracy (reflects reality), Completeness (no critical gaps), Timeliness (data is current), Consistency (same info recorded the same way), and Relevance (appropriate for the AI use case).

Question

Why is data quality MORE critical with AI than with traditional software?

Answer

Traditional software crashes on bad data. AI produces polished, confident output regardless of data quality, making errors harder to detect. And it delivers wrong answers at scale across the organisation.

Question

What is a representative dataset and why does it matter?

Answer

A representative dataset reflects the full diversity of people and scenarios the AI will encounter. Non-representative data leads to biased AI outputs, which is a governance, reputational, and legal risk.

Knowledge check

Dr. Patel's audit finds three conflicting versions of the employee handbook in SharePoint. If Copilot is deployed now, what is the most likely outcome?

Dr. Patel is reviewing a company's hiring AI as part of a governance audit. She notices it consistently ranks candidates from certain universities higher than equally qualified candidates from other institutions. What data quality issue is this most likely caused by?

Next up: When Traditional Machine Learning Adds Value, which covers when old-school ML outperforms generative AI.