Domain 5: Implement Information Extraction Solutions (Module 2 of 3)

Extracting Content with Content Understanding

Content Understanding is Foundry's extraction powerhouse. Learn how to extract structured data from documents using multimodal pipelines that combine OCR, layout analysis, and field extraction.

From chaos to structure

Simple explanation

Content Understanding takes messy real-world documents (scanned invoices, handwritten forms, photographed receipts) and converts them into clean, structured data that your AI can work with.

It’s like a super-efficient data entry clerk who can read any document, understand its layout, and extract the specific fields you need, all in seconds, not hours.

Content Understanding pipeline

Stage | What It Does | Output
OCR | Reads all text from the document | Raw text with positions
Layout analysis | Identifies structure: tables, headings, sections | Structured layout map
Field extraction | Pulls specific values based on the document type | Named fields with values and confidence
Output formatting | Converts to structured JSON or clean Markdown | Ready for storage, APIs, or LLM consumption
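
The four stages can be pictured as a chain of small functions. The sketch below is a toy illustration only, not the Content Understanding API: the OCR output is simulated with hand-written word boxes, the function names and regex patterns are hypothetical, and confidence scores are left out to keep it short.

```python
import json
import re
from collections import defaultdict

# Simulated OCR output: (text, x, y) word boxes. A real OCR stage would
# produce these from an image; the values here are made up for illustration.
OCR_WORDS = [
    ("Invoice", 10, 10), ("Number:", 70, 10), ("12345", 140, 10),
    ("Total:", 10, 30), ("99.50", 60, 30),
]

def layout_lines(words):
    """Layout analysis (toy version): group words into lines by y position."""
    rows = defaultdict(list)
    for text, x, y in words:
        rows[y].append((x, text))
    # Lines are ordered top to bottom, words left to right within a line.
    return [" ".join(t for _, t in sorted(rows[y])) for y in sorted(rows)]

# Hypothetical field patterns for an invoice-like document.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice Number:\s*(\S+)"),
    "total": re.compile(r"Total:\s*([\d.]+)"),
}

def extract_fields(lines):
    """Field extraction (toy version): match named patterns against lines."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        for line in lines:
            match = pattern.search(line)
            if match:
                fields[name] = match.group(1)
                break
    return fields

def run_pipeline(words):
    """Chain the stages: OCR words -> layout -> fields -> JSON string."""
    lines = layout_lines(words)
    fields = extract_fields(lines)
    return json.dumps(fields, sort_keys=True)

result = run_pipeline(OCR_WORDS)
print(result)
```

Each function consumes the previous stage's output, which is the point of the table: text with positions becomes lines, lines become named fields, and fields become a serialised result.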

Two output modes

Structured JSON vs Markdown extraction
Feature | Structured JSON Output | Markdown Output
Format | JSON with named fields and values | Clean Markdown preserving document structure
Best for | Database storage, API consumption, form processing | LLM reasoning in RAG, agent knowledge, downstream AI
Example use | Invoice processing: extract 'invoice_number: 12345' | Convert contract to Markdown for compliance agent to reason about
Precision | High: specific fields with confidence scores | Comprehensive: full document content preserved
Flexibility | Need to define which fields to extract | All content preserved, model decides what's relevant
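
To make the contrast concrete, here is a minimal sketch that renders one extraction result both ways. The `document` dictionary, its field names, and the two helper functions are assumptions made for illustration; they do not reflect the service's actual response schema.

```python
import json

# Hypothetical extraction result for a small invoice (made-up values).
document = {
    "title": "Invoice 12345",
    "fields": {"invoice_number": "12345", "vendor": "Example Vendor", "total": 99.5},
    "line_items": [
        {"description": "Widget", "qty": 2, "amount": 80.0},
        {"description": "Shipping", "qty": 1, "amount": 19.5},
    ],
}

def to_structured_json(doc):
    """Structured mode: only the named fields, ready for a database or API."""
    return json.dumps(doc["fields"], sort_keys=True)

def to_markdown(doc):
    """Markdown mode: keep the whole document, including its table."""
    lines = [f"# {doc['title']}", ""]
    for name, value in doc["fields"].items():
        lines.append(f"- **{name}**: {value}")
    lines += ["", "| Description | Qty | Amount |", "| --- | --- | --- |"]
    for item in doc["line_items"]:
        lines.append(f"| {item['description']} | {item['qty']} | {item['amount']} |")
    return "\n".join(lines)

print(to_structured_json(document))
print(to_markdown(document))
```

The JSON string drops everything except the fields you asked for, while the Markdown string keeps the line-item table so a model can still reason about it.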

Analyzers for different document types

Analyzer | Document Types | Extracted Fields
Invoice | Invoices, bills | Invoice number, date, vendor, line items, total, tax
Receipt | Receipts | Merchant, date, items, subtotal, tax, total, tip
ID document | Passports, driver’s licenses | Name, DOB, document number, expiry, nationality
Business card | Business cards | Name, title, company, phone, email, address
General document | Any document | Tables, key-value pairs, paragraphs, headings
Custom | Your specific formats | Fields you define with training examples
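
One way to read the table is as a lookup with a fallback: use a specialised analyzer when the document type is known, a custom one for your own formats, and the general analyzer otherwise. The sketch below encodes that reading; the labels are hypothetical placeholders that mirror the table rows, not real service identifiers.

```python
# Hypothetical analyzer labels keyed by document type. They mirror the table
# rows above and are not real service identifiers.
ANALYZER_BY_TYPE = {
    "invoice": "invoice",
    "bill": "invoice",
    "receipt": "receipt",
    "passport": "id_document",
    "drivers_license": "id_document",
    "business_card": "business_card",
}

def choose_analyzer(document_type, custom_types=frozenset()):
    """Pick an analyzer label, preferring a custom one for known custom formats."""
    key = document_type.strip().lower().replace(" ", "_")
    if key in custom_types:
        return "custom"
    # Anything unrecognised falls back to the general document analyzer.
    return ANALYZER_BY_TYPE.get(key, "general_document")

print(choose_analyzer("Invoice"))
print(choose_analyzer("shipping manifest"))
```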

Producing grounded representations for agents

Content Understanding’s Markdown output is particularly powerful for AI applications:

Use Case | How It Works
RAG grounding | Convert documents to clean Markdown → chunk → index → retrieve for RAG
Agent knowledge | Extract Markdown → feed to agent as context for reasoning
Structured + reasoning | Extract specific fields as JSON + full Markdown for context
Downstream reasoning | Clean Markdown preserves tables, headings, and relationships for LLM understanding
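
The chunk, index, and retrieve steps of RAG grounding can be sketched in a few lines. This is a simplified stand-in: it splits on Markdown headings and ranks chunks by keyword overlap, whereas a production system would more likely use embeddings and a vector index. The sample text and all function names are made up for illustration.

```python
import re
from collections import defaultdict

def chunk_markdown(markdown):
    """Split Markdown into chunks, starting a new chunk at each heading."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

def build_index(chunks):
    """Map each lowercase word to the set of chunk ids that contain it."""
    index = defaultdict(set)
    for chunk_id, chunk in enumerate(chunks):
        for word in re.findall(r"[a-z0-9]+", chunk.lower()):
            index[word].add(chunk_id)
    return index

def retrieve(query, chunks, index, top_k=1):
    """Rank chunks by how many query words they contain (toy keyword search)."""
    scores = defaultdict(int)
    for word in re.findall(r"[a-z0-9]+", query.lower()):
        for chunk_id in index.get(word, ()):
            scores[chunk_id] += 1
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return [chunks[cid] for cid in ranked[:top_k]]

# Made-up Markdown, standing in for an extracted document.
SAMPLE = """# Service Agreement

## Delivery
Goods ship within 10 days.

## Penalties
Late delivery incurs a 2% penalty per week.
"""

chunks = chunk_markdown(SAMPLE)
index = build_index(chunks)
best = retrieve("penalty for late delivery", chunks, index)
print(best[0])
```

Because the Markdown keeps its headings, each chunk carries its own section title, which is what lets the retrieved text make sense to the model on its own.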

Real-world example: NeuralMed's medical record extraction

NeuralMed processes thousands of medical documents daily:

Lab reports (structured JSON):

  • Extract: test name, value, normal range, flag (high/low/normal)
  • Output: JSON directly into patient records database
  • Confidence threshold: 0.95; below that, flag for human review
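
The threshold rule for lab reports can be sketched as a small routing function. It assumes the check is applied per field and that each extracted field arrives with a confidence between 0 and 1; the `ExtractedField` type, the function name, and the sample values are hypothetical.

```python
from typing import NamedTuple

REVIEW_THRESHOLD = 0.95  # threshold from the NeuralMed example

class ExtractedField(NamedTuple):
    name: str
    value: str
    confidence: float

def route_fields(fields, threshold=REVIEW_THRESHOLD):
    """Split fields into auto-accepted values and names needing human review."""
    accepted, needs_review = {}, []
    for field in fields:
        if field.confidence >= threshold:
            accepted[field.name] = field.value
        else:
            # Below the threshold: hold back and flag for a person to check.
            needs_review.append(field.name)
    return accepted, needs_review

# Made-up lab report fields for illustration.
lab_fields = [
    ExtractedField("test_name", "Hemoglobin", 0.99),
    ExtractedField("value", "13.5 g/dL", 0.97),
    ExtractedField("flag", "normal", 0.82),
]
accepted, needs_review = route_fields(lab_fields)
print(accepted, needs_review)
```

Only the accepted fields would be written to the patient records database; the flagged ones wait for a reviewer.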

Clinical notes (Markdown):

  • Convert handwritten doctor notes to clean Markdown
  • Preserve structure: chief complaint, history, examination, assessment, plan
  • Markdown fed to diagnostic assistant agent for reasoning

Insurance forms (hybrid):

  • Structured fields: member ID, group number, dates (JSON for database)
  • Full form content: Markdown for compliance agent to verify coverage terms

Three document types, three extraction strategies, all through Content Understanding.

Exam tip: JSON vs Markdown output

The exam tests when to use each:

  • Need specific fields in a database? → Structured JSON (field extraction)
  • Need full content for an LLM to reason about? → Markdown output
  • Need both? → Extract JSON fields AND produce Markdown; they’re not mutually exclusive

Key rule: JSON for machines, Markdown for AI models.
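
The decision rule can be written as a tiny function, which also makes it clear that the two outputs are independent choices. The function name and the output labels are hypothetical.

```python
def choose_outputs(needs_named_fields, needs_llm_reasoning):
    """Return the output modes implied by the two questions in the exam tip."""
    outputs = []
    if needs_named_fields:
        outputs.append("structured_json")
    if needs_llm_reasoning:
        outputs.append("markdown")
    return outputs

print(choose_outputs(True, False))
print(choose_outputs(True, True))
```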

Key terms

Question

What is layout analysis in Content Understanding?

Answer

The process of understanding a document's structure: identifying tables, headings, sections, key-value pairs, and reading order. It enables accurate extraction even from complex multi-column layouts.

Question

What is Markdown output in Content Understanding?

Answer

A clean, structured Markdown representation of a document that preserves tables, headings, and content relationships. Ideal for feeding to LLMs in RAG and agent workflows because models reason well about Markdown.

Question

What is a custom analyzer in Content Understanding?

Answer

An analyzer trained on your specific document types and fields. You provide example documents with labelled fields, and Content Understanding learns to extract those fields from new documents of the same type.

Knowledge check

Atlas Financial receives 5,000 loan applications monthly as scanned PDFs. They need to extract the applicant name, loan amount, and employment status into their loan processing database. Which Content Understanding output should they use?

Kai's logistics agent needs to reason about shipping contracts to answer questions like 'What are the penalty clauses for late delivery?' The contracts are complex multi-page PDFs. Which Content Understanding output should feed the agent?