Domain 2: Implement Generative AI and Agentic Solutions

Evaluating AI Models & Apps

How do you know if your AI is actually good? Learn how to evaluate models for fabrications, relevance, quality, and safety, and how to build evaluation into your development workflow.

Why evaluation matters

Simple explanation

Evaluation is like a quality inspection for your AI: you wouldn’t ship a product without testing it, and you shouldn’t deploy an AI model without evaluating it.

AI models can fabricate information (hallucinate), give irrelevant answers, produce unsafe content, or simply be bad at the task. Evaluation tells you which of these problems exist and how severe they are, before your users find out.

Evaluation dimensions

Dimension | What It Measures | Evaluator | Score Range
Groundedness | Is the response based on provided context? | GroundednessEvaluator | 1-5
Relevance | Does it answer the actual question? | RelevanceEvaluator | 1-5
Coherence | Is the response logical and well-structured? | CoherenceEvaluator | 1-5
Fluency | Is the language natural and readable? | FluencyEvaluator | 1-5
Safety | Is the content free from harmful material? | ContentSafetyEvaluator | Pass/Fail
F1 Score | Does extraction output match expected fields? | F1ScoreEvaluator | 0-1
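
To make the 0-1 range of the F1 row concrete, here is a minimal sketch of a token-overlap F1 between a response and an expected answer. It assumes the common definition (harmonic mean of token precision and recall) with lowercase whitespace tokenization; the function name `token_f1` is hypothetical, and the real F1ScoreEvaluator may normalize text differently.

```python
from collections import Counter


def token_f1(response: str, ground_truth: str) -> float:
    """Token-overlap F1 between a response and an expected answer (0 to 1)."""
    # Assumption: lowercase + whitespace tokenization; real evaluators may differ.
    pred = response.lower().split()
    gold = ground_truth.lower().split()
    if not pred or not gold:
        return 0.0
    # Multiset intersection: a shared token counts at most as often as it
    # appears in both the response and the expected answer.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)  # share of response tokens that are expected
    recall = overlap / len(gold)     # share of expected tokens that were produced
    return 2 * precision * recall / (precision + recall)


print(token_f1("returned within 30 days", "returned within 30 days"))
print(token_f1("call support", "returned within 30 days"))
```

An exact match scores 1, no shared tokens scores 0, and partial overlap lands in between.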

Detecting fabrications

Fabrication (hallucination) detection is the most exam-relevant evaluation:

Type | What Happens | Example | Detection Method
Factual fabrication | Model invents facts | “The policy was updated in March 2025” (it wasn’t) | Groundedness evaluator against source docs
Citation fabrication | Model invents references | “According to Regulation 45.2.1” (doesn’t exist) | Provenance checking against index
Confident fabrication | Model states guesses as facts | “This will definitely work because…” | Calibration evaluation (certainty vs accuracy)
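
Provenance checking for citation fabrication can be sketched as a lookup of every cited reference against the set of references that actually exist in the index. This is an illustration only: the index contents, the regex, and the function name are assumptions, and a real system would query its own search index instead of a hard-coded set.

```python
import re
from typing import List, Set

# Hypothetical index of regulation IDs that really exist in the source corpus.
KNOWN_REGULATIONS: Set[str] = {"45.1", "45.2", "46.1.3"}

# Assumption: citations look like "Regulation 45.2.1" (dotted numeric IDs).
CITATION_PATTERN = re.compile(r"Regulation\s+(\d+(?:\.\d+)*)")


def find_fabricated_citations(response: str, known_ids: Set[str]) -> List[str]:
    """Return cited regulation IDs that are absent from the index."""
    cited = CITATION_PATTERN.findall(response)
    return [ref for ref in cited if ref not in known_ids]


response = "According to Regulation 45.2.1, refunds are capped. See also Regulation 45.2."
print(find_fabricated_citations(response, KNOWN_REGULATIONS))
```

Any ID returned by the function was cited by the model but cannot be traced to a source document, which is the signal a provenance check looks for.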

Exam tip: Groundedness vs relevance

These are different evaluation dimensions:

  • Groundedness = “Is the answer based on the retrieved data?” (factual accuracy)
  • Relevance = “Does the answer address what the user asked?” (topical accuracy)

A response can be grounded (all facts from source docs) but irrelevant (answering a different question than what was asked). Both matter.
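
The grounded-but-irrelevant case can be illustrated with a toy lexical proxy. This is not how the Foundry evaluators score responses (they return 1-5 ratings); it only shows that the two dimensions compare the response against different things. The shipping sentence in the context, the stopword list, and all function names are assumptions made for the example.

```python
import re

# Small, hypothetical stopword list so that only content words are compared.
STOPWORDS = {"the", "a", "an", "is", "are", "for", "of", "to", "in",
             "can", "be", "what", "within", "and"}


def content_words(text: str) -> set:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def overlap_ratio(response: str, reference: str) -> float:
    """Share of the response's content words that also appear in the reference."""
    resp = content_words(response)
    if not resp:
        return 0.0
    return len(resp & content_words(reference)) / len(resp)


context = ("Damaged goods can be returned within 30 days for a full refund. "
           "Standard shipping takes 5 business days.")
question = "What is the refund policy for damaged goods?"
response = "Standard shipping takes 5 business days."

groundedness_proxy = overlap_ratio(response, context)  # response vs retrieved context
relevance_proxy = overlap_ratio(response, question)    # response vs user question
print(groundedness_proxy, relevance_proxy)
```

Every word of the response comes from the context, yet none of it addresses the refund question, so the two proxies disagree.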

Evaluation workflows

When | Method | Purpose
Development | Manual evaluation with test datasets | Iterate on prompts and RAG configuration
CI/CD pipeline | Automated evaluators on every PR | Gate deployments on quality thresholds
Pre-launch | Red teaming with adversarial inputs | Find safety gaps before users do
Production | Continuous monitoring with sampling | Detect drift and emerging quality issues
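
For the production row, one possible shape for sampling and drift detection is sketched below. The 5% default rate, the 0.3 tolerance, and both function names are assumptions chosen for illustration; hashing the response ID is just one way to get a stable, repeatable sample.

```python
import hashlib


def in_sample(response_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of responses by hashing their ID."""
    digest = hashlib.sha256(response_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def drifted(baseline_mean: float, recent_scores: list, tolerance: float = 0.3) -> bool:
    """Flag drift when the recent mean falls more than `tolerance` below baseline."""
    if not recent_scores:
        return False
    recent_mean = sum(recent_scores) / len(recent_scores)
    return baseline_mean - recent_mean > tolerance


sampled = [f"resp-{i}" for i in range(1000) if in_sample(f"resp-{i}")]
print(len(sampled), "of 1000 responses selected for evaluation")
print(drifted(baseline_mean=4.2, recent_scores=[3.5, 3.6, 3.7]))
```

Because the sample depends only on the response ID, rerunning the job selects the same responses, which keeps day-to-day scores comparable.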

Building an evaluation dataset

Component | What It Contains | Example
Input | User question | “What is the refund policy for damaged goods?”
Context | Retrieved documents (for RAG) | Refund policy document, Section 3.2
Expected output | Ground truth answer | “Damaged goods can be returned within 30 days for a full refund”
Metadata | Difficulty, category, source | Difficulty: medium, Category: refund, Source: policy_v3.pdf
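
One common way to store such test cases is JSON Lines, with one case per line. The sketch below uses the example row from the table; the field names (`input`, `context`, `expected_output`, `metadata`) are a hypothetical schema, so match them to whatever column names your evaluation tooling expects.

```python
import json

# Hypothetical schema: field names are assumptions, not a required format.
REQUIRED_FIELDS = {"input", "context", "expected_output", "metadata"}

record = {
    "input": "What is the refund policy for damaged goods?",
    "context": "Refund policy document, Section 3.2",
    "expected_output": "Damaged goods can be returned within 30 days for a full refund",
    "metadata": {"difficulty": "medium", "category": "refund", "source": "policy_v3.pdf"},
}


def to_jsonl(records: list) -> str:
    """Serialize test cases as JSON Lines, rejecting rows with missing fields."""
    lines = []
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - set(rec)
        if missing:
            raise ValueError(f"record {i} is missing fields: {sorted(missing)}")
        lines.append(json.dumps(rec, ensure_ascii=False))
    return "\n".join(lines)


jsonl_text = to_jsonl([record])
print(jsonl_text)
```

Validating required fields at write time catches incomplete test cases before they silently skew evaluation scores.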

Real-world example: MediaForge's evaluation pipeline

MediaForge evaluates their content generation model before every deployment:

  • Test dataset: 200 content briefs with expected outputs
  • Evaluators: Coherence (threshold: 4.0), Fluency (4.0), Relevance (4.0), Safety (must pass)
  • CI/CD gate: If any evaluator falls below threshold, deployment is blocked
  • Red teaming: Monthly adversarial testing for prompt injection and brand-unsafe content generation
  • Production monitoring: Sample 5% of responses daily, run all evaluators

Result: they caught a coherence regression when updating from GPT-4o to GPT-4.1. The new model produced shorter responses that scored lower on completeness. They adjusted the prompt before deploying.
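
A CI/CD gate like MediaForge's can be sketched as a small check over aggregated evaluator scores. The thresholds come from the example above; the score keys, function names, and sample numbers are hypothetical, and how the mean scores are produced is left to your evaluation run.

```python
# Thresholds from the MediaForge example; the dictionary keys are hypothetical.
THRESHOLDS = {"coherence": 4.0, "fluency": 4.0, "relevance": 4.0}


def gate(mean_scores: dict, safety_passed: bool) -> list:
    """Return the failed checks; an empty list means deployment may proceed."""
    failures = [
        f"{name}: {mean_scores.get(name, 0.0):.2f} < {minimum:.1f}"
        for name, minimum in THRESHOLDS.items()
        # A missing score is treated as 0.0, so it blocks the deployment.
        if mean_scores.get(name, 0.0) < minimum
    ]
    if not safety_passed:
        failures.append("safety: failed")
    return failures


def exit_code(failures: list) -> int:
    """Non-zero exit code tells the pipeline to block the deployment."""
    return 1 if failures else 0


# Illustrative numbers only, not results from MediaForge.
failures = gate({"coherence": 3.7, "fluency": 4.4, "relevance": 4.2}, safety_passed=True)
for failure in failures:
    print("BLOCKED:", failure)
print("exit code:", exit_code(failures))
```

In a pipeline step, passing `exit_code(failures)` to `sys.exit` would fail the job whenever any evaluator falls below its threshold.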

Key terms

Question

What is a fabrication (hallucination) in AI?

Answer

When the model generates information that is not supported by the provided source data or is factually incorrect. Includes inventing facts, citations, and confidently stating guesses as certainties.

Question

What is the GroundednessEvaluator?

Answer

A Foundry evaluation tool that measures whether the model's response is factually based on the retrieved context documents. Scores 1-5, where 5 means fully grounded in provided data.

Question

What is an evaluation dataset?

Answer

A curated set of test cases containing input questions, expected outputs, and optionally retrieved context. Used to systematically measure model quality across multiple dimensions before deployment.

Question

What is red teaming for AI?

Answer

Adversarial testing where testers deliberately try to make the AI produce unsafe, incorrect, or unexpected outputs. Tests prompt injection, jailbreaks, edge cases, and bias. Run before launch and periodically in production.

Knowledge check

NeuralMed's patient chatbot scores 4.5 on fluency and 4.2 on coherence, but only 2.1 on groundedness. What does this tell you?

Atlas Financial wants to ensure no AI model deployment happens without passing quality checks. Where should they add evaluation?