Domain 4: Manage Compliance by Using Microsoft Purview

Sensitive Information Types and Data Classification

Create and manage sensitive information types using keywords, keyword lists, and regular expressions to automatically identify and classify sensitive data.

The foundation of data protection

Simple explanation

Before you can protect sensitive data, you need to find it. Sensitive information types (SITs) are the search patterns that tell Microsoft Purview what to look for.

Think of SITs like the scanner at airport customs. You train it to recognise passport numbers, credit card numbers, and medical records. Once it knows what sensitive data looks like, it can flag it automatically, whether it's in an email, a SharePoint document, or a Teams message.

Microsoft provides 300+ built-in SITs. For industry-specific data (patient IDs, internal codes), you create custom SITs.

Built-in vs custom SITs

Microsoft provides 300+ built-in SITs covering common data types globally:

| Category | Examples | Detection method |
| --- | --- | --- |
| Financial | Credit card numbers, bank account numbers, SWIFT codes | Pattern (Luhn algorithm) + keywords |
| Health | Medical record numbers, drug names, ICD codes | Pattern + medical keyword lists |
| Identity | SSN, passport numbers, driver's licence | Country-specific patterns + keywords |
| IT | Azure storage keys, connection strings | Pattern matching |
| Regional | NZ IRD numbers, Australian TFN, UK NINO | Country-specific formats |
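
The Luhn algorithm mentioned for credit card numbers is a checksum that rejects most random digit strings. The sketch below is a plain Python illustration of that checksum, not Purview's own implementation; the function name `luhn_valid` is hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # a widely used test card number
print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
```

A 16-digit number that fails this check is unlikely to be a real card number, which is why a checksum combined with keywords produces fewer false positives than a digit pattern alone.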

When you need custom SITs

Elena needs to detect MedGuard Health-specific data that no built-in SIT covers:

| Data type | Format | Why custom |
| --- | --- | --- |
| Patient ID | MG- followed by 8 digits (e.g., MG-12345678) | Company-specific format |
| Internal drug codes | 3 letters + 4 digits (e.g., ASP1234) | Internal classification system |
| Referring doctor codes | DR- + 6 digits | Internal referral system |

Creating custom SITs

Custom SITs are created in the Microsoft Purview portal (purview.microsoft.com) under Information Protection > Classifiers > Sensitive info types.

Method 1: Keyword-based SIT

For simple text matching:

| Component | Example |
| --- | --- |
| Keyword list | "patient record", "medical history", "diagnosis report", "treatment plan" |
| Case sensitive | No (recommended for most scenarios) |
| Word match | Whole word (prevents false positives from partial matches) |
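
To see why whole-word, case-insensitive matching matters, here is a minimal Python sketch of the settings in the table. It only approximates the behaviour with regular expressions and is not Purview's matching engine; `find_keywords` is a hypothetical helper.

```python
import re

KEYWORDS = ["patient record", "medical history", "diagnosis report", "treatment plan"]

# \b anchors approximate whole-word matching; re.IGNORECASE turns off case sensitivity.
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def find_keywords(text: str) -> list[str]:
    """Return every keyword found in `text`, lower-cased, in order of appearance."""
    return [m.group(0).lower() for m in KEYWORD_RE.finditer(text)]


print(find_keywords("Attached is the Patient Record and treatment plan."))
print(find_keywords("Agenda: treatment planning session"))  # partial word, no match
```

The second call returns nothing because "treatment planning" only contains the keyword as part of a longer word, which is the kind of false positive the whole-word setting prevents.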

Method 2: Regular expression SIT

For structured data patterns:

| Component | Example |
| --- | --- |
| Primary pattern | MG-\d{8} (matches MG- followed by exactly 8 digits) |
| Supporting keywords | "patient", "record", "MedGuard" (within 300 characters) |
| Confidence levels | High: pattern + 2 keywords. Medium: pattern + 1 keyword. Low: pattern only. |
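
The interaction between the primary pattern, the proximity window, and the confidence level can be sketched in Python. This is an illustration under stated assumptions, not Purview's engine: the window is taken as 300 characters on either side of the match, the `\b` word boundaries are added here so that the pattern does not match the first 8 digits of a longer number, and `classify` is a hypothetical name.

```python
import re

PATIENT_ID = re.compile(r"\bMG-\d{8}\b")  # \b boundaries are an addition for this sketch
KEYWORDS = ("patient", "record", "medguard")
PROXIMITY = 300  # characters either side of the match (assumed window shape)


def classify(text: str) -> list[tuple[str, str]]:
    """Return (matched ID, confidence level) for each patient ID found in `text`."""
    results = []
    for match in PATIENT_ID.finditer(text):
        start = max(0, match.start() - PROXIMITY)
        window = text[start : match.end() + PROXIMITY].lower()
        # Count how many distinct supporting keywords appear near the match.
        hits = sum(1 for kw in KEYWORDS if re.search(rf"\b{kw}\b", window))
        level = "high" if hits >= 2 else "medium" if hits == 1 else "low"
        results.append((match.group(0), level))
    return results


print(classify("MedGuard patient MG-12345678 was admitted today."))
print(classify("Order reference MG-12345678"))
```

The first call has two supporting keywords near the ID and is rated high; the second has none and is rated low, even though the pattern match is identical.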

Method 3: Keyword dictionary

For large keyword lists (up to 1 MB post-compression):

  • Import from a file (one term per line)
  • Useful for lists of drug names, medical terms, internal project codes
  • More efficient than keyword lists for large volumes
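
As a rough illustration of why a dictionary scales better than an inline list, the sketch below loads terms (one per line) into a set so each lookup takes constant time regardless of dictionary size. It handles single-word terms only, and the sample terms and function names are hypothetical; it does not reflect how Purview stores or compresses dictionaries.

```python
import re


def load_dictionary(lines) -> frozenset[str]:
    """Build a lookup set from an iterable of lines, one term per line."""
    return frozenset(line.strip().lower() for line in lines if line.strip())


def dictionary_hits(text: str, terms: frozenset[str]) -> set[str]:
    """Return the dictionary terms that appear as whole words in `text`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {word for word in words if word in terms}


# Stand-in for the contents of an imported text file.
SAMPLE_FILE = "aspirin\nIbuprofen\n\nmetformin\n"
terms = load_dictionary(SAMPLE_FILE.splitlines())

print(dictionary_hits("Prescribed Aspirin and metformin daily.", terms))
```

Because the terms live in a file rather than inside the SIT definition, a list that changes regularly can be replaced without rewriting the pattern itself.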

Exam tip: Confidence levels and false positives

The exam tests your understanding of confidence levels and their impact on DLP:

  • High confidence: primary pattern plus multiple pieces of supporting evidence. Few false positives, but may miss some real data.
  • Medium confidence: primary pattern plus some supporting evidence. Balanced.
  • Low confidence: primary pattern alone. Catches more data but produces more false positives.

DLP policies can be configured to act on different confidence levels. For example: high confidence → block, medium confidence → warn, low confidence → log only. If the exam asks "Elena's DLP policy is blocking too many legitimate emails," the answer is likely to increase the required confidence level.
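
The effect of raising the required confidence level can be shown with a small Python sketch. The level names and actions mirror the example above; the data and the functions `action_for` and `flagged` are hypothetical and do not represent DLP policy syntax.

```python
LEVELS = {"low": 0, "medium": 1, "high": 2}
ACTIONS = {"high": "block", "medium": "warn", "low": "log only"}


def action_for(confidence: str) -> str:
    """Map a match's confidence level to the action in the example policy."""
    return ACTIONS[confidence]


def flagged(matches, minimum: str):
    """Keep only the matches at or above the required confidence level."""
    return [m for m in matches if LEVELS[m[1]] >= LEVELS[minimum]]


matches = [("MG-11111111", "low"), ("MG-22222222", "medium"), ("MG-33333333", "high")]

print(len(flagged(matches, "low")))   # every match is acted on
print(flagged(matches, "high"))       # only the strongest match remains
print(action_for("medium"))
```

Raising the minimum from low to high shrinks the set of matches the rule acts on, which is the trade-off behind the exam answer: fewer legitimate emails blocked, at the cost of possibly missing weakly supported matches.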

Exact Data Match (EDM)

For the highest accuracy, EDM-based SITs match against your actual data:

  1. Upload a hashed version of your sensitive data (e.g., actual patient IDs from your database)
  2. Purview matches content against the hashed data
  3. Zero false positives: it only flags data that exists in your database

Elena uses EDM for patient IDs. Instead of matching the pattern MG-\d{8} (which might match test data or random numbers), EDM matches only actual patient IDs from MedGuard's patient database.
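
The idea behind EDM can be sketched conceptually in Python: hash the real values once, then compare the hash of each candidate found in content against that set. This is a simplified illustration, not Purview's actual hashing scheme or upload process; the salt, the sample IDs, and the function names are all hypothetical.

```python
import hashlib
import re

SALT = b"example-salt"  # hypothetical; a real deployment would use a secret value


def hash_value(value: str) -> str:
    """Return a salted SHA-256 hash of a normalised value."""
    normalised = value.strip().upper().encode("utf-8")
    return hashlib.sha256(SALT + normalised).hexdigest()


# Step 1: only hashes of the real patient IDs are stored, never the IDs themselves.
HASHED_IDS = {hash_value(pid) for pid in ("MG-12345678", "MG-87654321")}

CANDIDATE = re.compile(r"\bMG-\d{8}\b")


def edm_matches(text: str) -> list[str]:
    """Steps 2 and 3: flag only candidates whose hash is in the stored set."""
    return [c for c in CANDIDATE.findall(text) if hash_value(c) in HASHED_IDS]


print(edm_matches("Real ID MG-12345678, test ID MG-00000000"))
```

Both IDs fit the regex pattern, but only the one present in the hashed data set is flagged, which is how EDM filters out the test data that a pattern-only SIT would catch.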

Deep dive: Trainable classifiers

Beyond pattern-based SITs, Microsoft Purview also offers trainable classifiers: machine learning models trained to recognise content types:

  • Pre-trained classifiers: resumes, source code, financial statements, legal documents
  • Custom trainable classifiers: trained with your own sample documents

Trainable classifiers work on content understanding (not just patterns) and are useful for unstructured data. The exam may ask about the difference: SITs match patterns, trainable classifiers match content types.

Key concepts to remember

Question

What three components make up a sensitive information type?

Answer

1. Primary pattern (regex, keyword list, or function). 2. Corroborative evidence (supporting keywords within proximity). 3. Confidence level (Low/Medium/High based on how many evidence elements match). Higher confidence = fewer false positives.

Question

What is the difference between a keyword list and a keyword dictionary in Purview?

Answer

Keyword lists are small, inline collections of terms defined directly in the SIT. Keyword dictionaries are large, file-based collections (up to 1 MB post-compression) imported from a text file. Use dictionaries for drug names, medical terms, or other large reference lists.

Question

What is Exact Data Match (EDM) and when should you use it?

Answer

EDM-based SITs match content against a hashed copy of your actual sensitive data (e.g., real patient IDs from your database). This eliminates false positives because it only matches data that exists in your records. Use for high-value data where false positives are unacceptable.

Knowledge check

Elena creates a custom SIT using a regex pattern to match MedGuard patient IDs (format: MG- followed by 8 digits). The DLP policy using this SIT generates many false positives from test documents containing similar patterns. What should Elena do to reduce false positives?

Dev needs to create a SIT that detects drug names for a pharmaceutical client. The client has a list of 15,000 drug names that changes quarterly. What is the most efficient approach?


Next up: Retention Labels and Data Lifecycle, keeping data for as long as you need it and disposing of it when you don't.