Building RAG Applications

What is RAG?

Simple explanation

RAG is like open-book exam for AI — instead of answering from memory (which might be wrong), the AI first looks up the answer in your company’s documents, then writes a response based on what it found.

Without RAG, an AI model can only use what it learned during training — which might be outdated or wrong for your specific domain. With RAG, the model searches your actual data before answering, so responses are grounded in facts you control.

The RAG flow

Step	What Happens	Service Used
1. User query	User asks “What’s our refund policy for damaged goods?”	Your application
2. Search	Query is sent to the search index	Azure AI Search
3. Retrieve	Top relevant documents are returned	Azure AI Search
4. Augment prompt	Retrieved docs are injected into the system prompt	Your application
5. Generate	LLM generates a grounded response with citations	GPT-4o (Foundry)
6. Return	Response with answer + source references	Your application

Building a RAG app — key decisions

Decision	Options	Recommendation
Search type	Keyword, semantic, vector, hybrid	Hybrid (best recall + precision)
Context window	How many retrieved chunks to include	3-5 chunks (balance relevance vs token cost)
System prompt	Instructions for grounding behaviour	”Answer ONLY from provided context. Cite sources.”
Citation format	How to reference sources	Inline references with document title and section
Fallback	What to do when no relevant docs found	”I don’t have information about that” (not hallucinate)

Exam tip: The grounding instruction in system prompts

The exam tests whether you know how to instruct the model to stay grounded. A common pattern:

“Answer the user’s question using ONLY the information in the provided context. If the context doesn’t contain the answer, say ‘I don’t have information about that.’ Always cite the source document.”

Without this instruction, the model may use its training data instead of your retrieved documents — defeating the purpose of RAG.

RAG quality factors

Factor	What It Affects	How to Improve
Chunking strategy	Whether the right information is in a retrievable unit	Align chunks with natural document boundaries
Embedding quality	Whether similar content maps to similar vectors	Use latest embedding models, consistent pipeline
Search configuration	Whether the most relevant chunks are returned	Tune hybrid search weights, add semantic ranker
Prompt engineering	Whether the model uses context correctly	Strong grounding instructions, few-shot examples
Context window size	Balance between relevance and noise	Include top 3-5 chunks, not 20

Real-world example: NeuralMed's RAG patient chatbot

NeuralMed builds a patient information chatbot grounded in 10,000 medical articles:

Index: Azure AI Search with hybrid search (keyword for drug names + vector for symptoms)
Chunking: Paragraph-level, preserving article title and section as metadata
Context: Top 5 retrieved chunks injected into the prompt
Grounding prompt: “Answer using ONLY the provided medical articles. Cite the article title. If unsure, direct the patient to consult their doctor.”
Fallback: “I don’t have specific information about that. Please consult your healthcare provider.”
Evaluation: Groundedness score monitored in CI/CD — must stay above 0.85

Common RAG pitfalls

Pitfall	Symptom	Fix
Over-chunking	Model gets 20 small fragments, none with enough context	Use larger chunks or include surrounding context
Under-chunking	Each chunk is an entire document — too much noise	Split into paragraphs or sections
No grounding instruction	Model uses training data instead of retrieved docs	Add explicit grounding instruction to system prompt
Stale index	Responses contain outdated information	Monitor indexer health, schedule regular refreshes
Wrong search type	Natural-language questions miss exact-term matches	Use hybrid search combining vector + keyword

Key terms

Question

What is RAG (Retrieval-Augmented Generation)?

Click or press Enter to reveal answer

Answer

An architecture pattern where a user's query first retrieves relevant documents from a search index, then those documents are injected into the LLM's prompt as context, producing a grounded response based on actual data.

Click to flip back

Question

What is grounding in the context of RAG?

Click or press Enter to reveal answer

Answer

Anchoring the model's response in retrieved source data rather than letting it generate from training data alone. Grounded responses are factually based on documents you control, reducing hallucinations.

Click to flip back

Question

What is context window in RAG?

Click or press Enter to reveal answer

Answer

The number of retrieved document chunks included in the model's prompt. Too few = missing information. Too many = noise and higher token cost. Typical: 3-5 chunks for most applications.

Click to flip back

Question

What is the grounding instruction?

Click or press Enter to reveal answer

Answer

A directive in the system prompt that tells the model to answer ONLY from provided context and to say 'I don't know' if the context doesn't contain the answer. Critical for preventing hallucinations in RAG.

Click to flip back

Knowledge check

Knowledge Check

Atlas Financial's compliance chatbot occasionally cites regulations that don't exist — fabricated references that look plausible. What is the most likely cause?

Knowledge Check

NeuralMed's RAG chatbot returns accurate information for common conditions but fails to answer questions about rare diseases. The articles exist in the search index. What should they investigate?