Domain 5: Implement Information Extraction Solutions (Module 1 of 3)

Ingestion, Indexing & Grounding Pipelines

RAG applications need data pipelines behind them. Learn how to ingest documents, configure semantic/hybrid/vector search, apply enrichment skills, and connect retrieval pipelines to agent workflows.

The data pipeline behind RAG

Simple explanation

If RAG is an open-book exam, the ingestion pipeline is the process of preparing, organising, and indexing all the books before the exam starts.

You take raw documents (PDFs, Word files, images, videos), break them into searchable chunks, add metadata, create embeddings for vector search, and load everything into a search index. When a user asks a question, the search index finds the right chunks in milliseconds.

This module covers the pipeline engineering: the plumbing that makes RAG work.

The ingestion pipeline

| Stage | What Happens | Service |
|---|---|---|
| Source | Raw documents in storage (Blob, Data Lake, SharePoint) | Azure Storage |
| Crack | Extract text from files (PDF parsing, OCR for images, audio transcription) | Azure AI Search indexer + Content Understanding |
| Chunk | Split into search-friendly segments | Chunking strategy (fixed, semantic, paragraph) |
| Enrich | Add metadata, extract entities, classify content | Built-in skills or custom skills |
| Embed | Generate vector representations | Embedding model (text-embedding-3-small) |
| Index | Store in searchable index with field mappings | Azure AI Search |
| Serve | Connect to agents, workflows, and applications | Foundry SDK, agent tools |
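
To make the Chunk stage more concrete, here is a minimal sketch of two of the strategies named in the table: fixed-size chunking with overlap and paragraph-level chunking. The function names, the character-based sizes, and the default values are illustrative assumptions, not settings taken from Azure AI Search.

```python
import re


def fixed_size_chunks(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def paragraph_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Split on blank lines, then pack whole paragraphs into chunks of up to max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the current chunk, start a new one
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, while paragraph packing never cuts a paragraph in half (a single paragraph longer than `max_chars` is kept whole in this sketch).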

Content types and ingestion methods

| Content Type | Cracking Method | Key Considerations |
|---|---|---|
| PDF documents | Built-in PDF parser + OCR for scanned pages | OCR quality depends on scan quality |
| Office documents | Built-in parsers (Word, Excel, PowerPoint) | Tables and charts need special handling |
| Images | OCR for text, Content Understanding for structure | Built-in OCR supports handwritten text; very poor handwriting may need review |
| Audio files | Speech-to-text transcription | Language and accent affect accuracy |
| Video files | Frame extraction + audio transcription | High storage and compute requirements |
| Web pages | HTML parsing, content extraction | Exclude navigation, ads, boilerplate |

Enrichment skills

| Skill Type | What It Does | Example |
|---|---|---|
| Built-in: Entity extraction | Identifies people, places, organisations | Tag documents with mentioned companies |
| Built-in: Language detection | Identifies document language | Route to correct language model |
| Built-in: Key phrase extraction | Extracts important phrases | Generate topic tags for filtering |
| Built-in: OCR | Reads text from images within documents | Extract text from embedded charts |
| Custom skill | Your own enrichment logic (API) | Industry-specific classification, PII detection |
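
As a rough illustration of how the built-in skills in the table are configured, the sketch below builds a skillset definition as a Python dictionary that could be serialised to JSON. The `@odata.type` values are the built-in skill identifiers; the helper name `build_skillset`, the skillset name, the source paths, and the `targetName` values are assumptions for illustration and should be checked against your own index and indexer configuration.

```python
import json


def build_skillset(name: str) -> dict:
    """Return a skillset body with OCR, language detection, key phrases, and entities."""
    return {
        "name": name,
        "description": "Illustrative skillset (paths and target names are assumptions)",
        "skills": [
            {
                # Reads text from images the indexer extracted from each document
                "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
                "context": "/document/normalized_images/*",
                "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
                "outputs": [{"name": "text", "targetName": "ocrText"}],
            },
            {
                "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
                "context": "/document",
                "inputs": [{"name": "text", "source": "/document/content"}],
                "outputs": [{"name": "languageCode", "targetName": "language"}],
            },
            {
                # Uses the detected language so phrases are extracted with the right model
                "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
                "context": "/document",
                "inputs": [
                    {"name": "text", "source": "/document/content"},
                    {"name": "languageCode", "source": "/document/language"},
                ],
                "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
            },
            {
                "@odata.type": "#Microsoft.Skills.Text.V3.EntityRecognitionSkill",
                "context": "/document",
                "categories": ["Organization", "Person", "Location"],
                "inputs": [
                    {"name": "text", "source": "/document/content"},
                    {"name": "languageCode", "source": "/document/language"},
                ],
                "outputs": [{"name": "organizations", "targetName": "organizations"}],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_skillset("demo-skillset"), indent=2))
```

Each skill reads from a path in the enriched document and writes its output under a new name; later skills (and the index field mappings) can then reference those outputs.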

Built-in vs custom enrichment skills

| Feature | Built-in Skills | Custom Skills |
|---|---|---|
| Setup | Configure in the indexer skillset | Write code, deploy as API, reference in skillset |
| Maintenance | Managed by Microsoft | You manage the code and infrastructure |
| Capabilities | General-purpose NLP enrichment | Any custom logic you need |
| Cost | Search pricing, plus Azure AI services charges for billable skills beyond the free daily allowance | Your compute costs + Search pricing |
| Best for | Standard metadata enrichment | Domain-specific classification, PII, business logic |
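
A custom skill is an API that receives a batch of records and returns one result per record. The sketch below shows only the request/response handling as a plain function, using the `values` / `recordId` / `data` / `errors` / `warnings` envelope of the custom Web API skill. The email-only redaction rule, the function names, and the output field names `redactedText` and `piiFound` are simplifying assumptions; a real PII detector would need far broader coverage, and the function would sit behind whatever HTTP host you deploy.

```python
import re

# Deliberately simple pattern: real PII detection covers much more than emails
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_pii(text: str) -> str:
    """Replace email addresses with a placeholder token."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def run_custom_skill(payload: dict) -> dict:
    """Process a skill request body and return one result per input record."""
    results = []
    for record in payload.get("values", []):
        record_id = record.get("recordId")
        text = record.get("data", {}).get("text")
        if not isinstance(text, str):
            # Report the problem on this record only; other records still succeed
            results.append({
                "recordId": record_id,
                "data": {},
                "errors": [{"message": "Missing 'text' input."}],
                "warnings": [],
            })
            continue
        redacted = redact_pii(text)
        results.append({
            "recordId": record_id,
            "data": {"redactedText": redacted, "piiFound": redacted != text},
            "errors": [],
            "warnings": [],
        })
    return {"values": results}
```

Echoing each `recordId` back is what lets the indexer match outputs to the documents that produced them.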

Connecting pipelines to agents

| Connection Method | How It Works | Best For |
|---|---|---|
| Direct index query | Agent tool calls Azure AI Search directly | Full control over search parameters |
| Foundry IQ | Upload to Foundry IQ, auto-indexed | Quick agent setup, managed pipeline |
| Custom retrieval function | Agent calls your function, which queries the index | Complex retrieval logic, multi-index queries |
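
The custom retrieval function option can be sketched as a function that queries several indexes and merges the ranked results. The merge below uses Reciprocal Rank Fusion (RRF), the same kind of rank fusion Azure AI Search applies to hybrid queries. The `search` callable is a hypothetical stand-in for your own index client, and the sketch assumes a document ID means the same document in every index.

```python
from typing import Callable

# Hypothetical signature: (index_name, query, top) -> document IDs, best match first
SearchFn = Callable[[str, str, int], list[str]]


def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each appearance adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def retrieve(query: str, index_names: list[str], search: SearchFn, top: int = 5) -> list[str]:
    """Query every index, fuse the rankings, and return the best `top` document IDs."""
    ranked_lists = [search(name, query, top) for name in index_names]
    return rrf_merge(ranked_lists)[:top]
```

RRF works on rank positions rather than raw scores, so it can combine results from indexes whose scoring scales are not comparable.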

Real-world example: NeuralMed's medical article pipeline

NeuralMed ingests 10,000+ medical articles for their patient chatbot:

  1. Source: PubMed articles in Blob Storage (PDF format)
  2. Crack: PDF parser extracts text + OCR for embedded figures
  3. Chunk: Paragraph-level chunking (medical context needs larger chunks)
  4. Enrich:
    • Built-in: entity extraction (drug names, conditions, treatments)
    • Custom skill: medical speciality classifier (cardiology, neurology, etc.)
    • Custom skill: PII detector (redacts patient info from case studies)
  5. Embed: text-embedding-3-small for vector search
  6. Index: Azure AI Search with hybrid search (keyword for drug names + vector for symptoms)
  7. Serve: Connected to patient chatbot agent as a knowledge tool

Pipeline runs weekly on new articles. Incremental indexing only processes changed documents.
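
The idea behind incremental indexing can be sketched with a high-water mark: remember the latest modification time already processed and pick up only documents changed after it. Blob indexers in Azure AI Search track last-modified timestamps for you, so this is a conceptual illustration only; `SourceDoc` and `select_changed` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class SourceDoc:
    doc_id: str
    last_modified: datetime


def select_changed(
    docs: list[SourceDoc], high_water_mark: Optional[datetime]
) -> tuple[list[SourceDoc], Optional[datetime]]:
    """Return documents modified after the mark, oldest first, plus the new mark."""
    changed = sorted(
        (d for d in docs if high_water_mark is None or d.last_modified > high_water_mark),
        key=lambda d: d.last_modified,
    )
    # Advance the mark only when something was actually processed
    new_mark = changed[-1].last_modified if changed else high_water_mark
    return changed, new_mark
```

On the first run the mark is `None`, so every document is processed; each later run passes in the mark returned by the previous one.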

Exam tip: OCR in the RAG pipeline

The exam specifically mentions OCR in RAG ingestion flows. Key points:

  • OCR is needed for scanned PDFs, image-based documents, and photos of forms
  • OCR quality directly affects RAG quality: garbage in, garbage out
  • Azure AI Search's built-in OCR skill handles common cases
  • For high-accuracy OCR (medical, legal), use Content Understanding's OCR capability

If a question mentions "scanned documents" in a RAG context, OCR is the answer.

Key terms

Question

What is document cracking?

Answer

The process of extracting raw text and metadata from source files (PDFs, Office docs, images). The first processing stage of an ingestion pipeline, once the source documents are in storage. Uses parsers for structured formats and OCR for image-based content.

Question

What are enrichment skills in Azure AI Search?

Answer

Processing steps that add metadata to indexed content during ingestion. Built-in skills include entity extraction, key phrases, and OCR. Custom skills are your own code (deployed as APIs) for domain-specific enrichment.

Question

What is incremental indexing?

Answer

An indexing strategy that only processes new or changed documents instead of re-indexing everything. Reduces cost and time for regularly updated content collections.

Knowledge check

Atlas Financial needs to index 50,000 scanned regulatory documents (image PDFs) for their compliance agent's knowledge base. Many documents contain handwritten annotations. What pipeline configuration is critical?

Kai's logistics platform indexes shipping documents from 15 countries in different languages. The search results need to include document language and key shipping terms as filterable metadata. Which enrichment approach should he use?