Domain 2 β€” Module 2 of 11 18%
10 of 27 overall
Domain 2: Implement Generative AI and Agentic Solutions Free ⏱ ~14 min read

Building RAG Applications

Retrieval-Augmented Generation is the most important pattern in enterprise AI. Learn how to build RAG apps that ground model responses in your actual data β€” with citations, relevance, and accuracy.

What is RAG?

Simple explanation

RAG is like open-book exam for AI β€” instead of answering from memory (which might be wrong), the AI first looks up the answer in your company’s documents, then writes a response based on what it found.

Without RAG, an AI model can only use what it learned during training β€” which might be outdated or wrong for your specific domain. With RAG, the model searches your actual data before answering, so responses are grounded in facts you control.

The RAG flow

StepWhat HappensService Used
1. User queryUser asks β€œWhat’s our refund policy for damaged goods?”Your application
2. SearchQuery is sent to the search indexAzure AI Search
3. RetrieveTop relevant documents are returnedAzure AI Search
4. Augment promptRetrieved docs are injected into the system promptYour application
5. GenerateLLM generates a grounded response with citationsGPT-4o (Foundry)
6. ReturnResponse with answer + source referencesYour application

Building a RAG app β€” key decisions

DecisionOptionsRecommendation
Search typeKeyword, semantic, vector, hybridHybrid (best recall + precision)
Context windowHow many retrieved chunks to include3-5 chunks (balance relevance vs token cost)
System promptInstructions for grounding behaviour”Answer ONLY from provided context. Cite sources.”
Citation formatHow to reference sourcesInline references with document title and section
FallbackWhat to do when no relevant docs found”I don’t have information about that” (not hallucinate)
Exam tip: The grounding instruction in system prompts

The exam tests whether you know how to instruct the model to stay grounded. A common pattern:

β€œAnswer the user’s question using ONLY the information in the provided context. If the context doesn’t contain the answer, say β€˜I don’t have information about that.’ Always cite the source document.”

Without this instruction, the model may use its training data instead of your retrieved documents β€” defeating the purpose of RAG.

RAG quality factors

FactorWhat It AffectsHow to Improve
Chunking strategyWhether the right information is in a retrievable unitAlign chunks with natural document boundaries
Embedding qualityWhether similar content maps to similar vectorsUse latest embedding models, consistent pipeline
Search configurationWhether the most relevant chunks are returnedTune hybrid search weights, add semantic ranker
Prompt engineeringWhether the model uses context correctlyStrong grounding instructions, few-shot examples
Context window sizeBalance between relevance and noiseInclude top 3-5 chunks, not 20
Real-world example: NeuralMed's RAG patient chatbot

NeuralMed builds a patient information chatbot grounded in 10,000 medical articles:

  • Index: Azure AI Search with hybrid search (keyword for drug names + vector for symptoms)
  • Chunking: Paragraph-level, preserving article title and section as metadata
  • Context: Top 5 retrieved chunks injected into the prompt
  • Grounding prompt: β€œAnswer using ONLY the provided medical articles. Cite the article title. If unsure, direct the patient to consult their doctor.”
  • Fallback: β€œI don’t have specific information about that. Please consult your healthcare provider.”
  • Evaluation: Groundedness score monitored in CI/CD β€” must stay above 0.85

Common RAG pitfalls

PitfallSymptomFix
Over-chunkingModel gets 20 small fragments, none with enough contextUse larger chunks or include surrounding context
Under-chunkingEach chunk is an entire document β€” too much noiseSplit into paragraphs or sections
No grounding instructionModel uses training data instead of retrieved docsAdd explicit grounding instruction to system prompt
Stale indexResponses contain outdated informationMonitor indexer health, schedule regular refreshes
Wrong search typeNatural-language questions miss exact-term matchesUse hybrid search combining vector + keyword

Key terms

Question

What is RAG (Retrieval-Augmented Generation)?

Click or press Enter to reveal answer

Answer

An architecture pattern where a user's query first retrieves relevant documents from a search index, then those documents are injected into the LLM's prompt as context, producing a grounded response based on actual data.

Click to flip back

Question

What is grounding in the context of RAG?

Click or press Enter to reveal answer

Answer

Anchoring the model's response in retrieved source data rather than letting it generate from training data alone. Grounded responses are factually based on documents you control, reducing hallucinations.

Click to flip back

Question

What is context window in RAG?

Click or press Enter to reveal answer

Answer

The number of retrieved document chunks included in the model's prompt. Too few = missing information. Too many = noise and higher token cost. Typical: 3-5 chunks for most applications.

Click to flip back

Question

What is the grounding instruction?

Click or press Enter to reveal answer

Answer

A directive in the system prompt that tells the model to answer ONLY from provided context and to say 'I don't know' if the context doesn't contain the answer. Critical for preventing hallucinations in RAG.

Click to flip back

Knowledge check

Knowledge Check

Atlas Financial's compliance chatbot occasionally cites regulations that don't exist β€” fabricated references that look plausible. What is the most likely cause?

Knowledge Check

NeuralMed's RAG chatbot returns accurate information for common conditions but fails to answer questions about rare diseases. The articles exist in the search index. What should they investigate?