Domain 4: Deploy and Maintain Data Pipelines and Workloads

Git & Version Control

Apply Git best practices in Databricks: branching strategies, pull requests, conflict resolution, and notebook version control.

Git in Databricks

Simple explanation

Git is the "save game" system for your code.

Every change is tracked, so you can go back to any previous version. Multiple people can work on different features without stepping on each other's work. When the work is ready, the changes are reviewed in a pull request and merged into the main version.

Branching strategy

Branch        Purpose                           Who Uses It
main          Production-ready code             Deployments read from here
develop       Integration branch for features   Team merges features here
feature/xxx   Individual feature work           One developer per branch
hotfix/xxx    Emergency production fixes        Urgent patches

main ─────────────────────────────▶
  ↑                    ↑
  │ merge PR           │ merge PR
  │                    │
develop ──────────────────────────▶
  ↑         ↑
  │ merge   │ merge
  │         │
feature/a  feature/b
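
A naming convention like the one in the table is easier to keep when it is checked automatically. The following is a minimal sketch, assuming branch suffixes are lowercase letters, digits, dots, underscores, and hyphens; that character rule and the name `is_valid_branch` are illustrative assumptions, not part of any Databricks or Git API.

```python
import re

# Assumed convention: main, develop, feature/<name>, hotfix/<name>.
# The allowed characters in <name> are an illustrative choice.
BRANCH_PATTERN = re.compile(
    r"(main|develop|(feature|hotfix)/[a-z0-9][a-z0-9._-]*)"
)


def is_valid_branch(name: str) -> bool:
    """Return True if the branch name follows the assumed convention."""
    return BRANCH_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for branch in ["main", "feature/new-pipeline", "my-experiment", "feature/"]:
        print(branch, is_valid_branch(branch))
```

A check like this could run in a CI job or a local hook, so that misnamed branches are caught before a pull request is opened.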

Best practices for Databricks

  • One branch per feature: never develop directly on main
  • Use Git folders (Repos) in the workspace, with each developer working in their own branch
  • Never commit credentials: store them in a secret scope (for example, an Azure Key Vault-backed scope) and read them at runtime
  • Commit frequently with descriptive messages
  • Review code via pull requests before merging to develop or main
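
To illustrate the "never commit credentials" rule, here is a minimal sketch of reading a secret at runtime instead of hardcoding it. It uses the Databricks `dbutils.secrets.get(scope=..., key=...)` utility; the scope name `kv-backed-scope`, the key name `sp-client-secret`, and the helper `get_client_secret` are hypothetical placeholders.

```python
def get_client_secret(dbutils, scope="kv-backed-scope", key="sp-client-secret"):
    """Fetch a secret at runtime so its value never appears in Git.

    `dbutils` is passed in explicitly so the function can be tested outside
    a notebook. The scope and key names are hypothetical placeholders.
    """
    return dbutils.secrets.get(scope=scope, key=key)


# In a Databricks notebook, where `dbutils` is predefined:
# client_secret = get_client_secret(dbutils)
```

Only the scope and key names are committed to the repository; the secret value stays in the secret store.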

Pull requests and code review

A pull request (PR) is a request to merge your branch into another:

  1. Developer pushes changes to feature/new-pipeline
  2. Creates a PR to merge into develop
  3. Team reviews the code (logic, data quality, naming)
  4. Reviewer approves → merge completes
  5. Feature branch is deleted
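
The developer's side of this workflow can be sketched as a sequence of standard Git commands. The example below only builds and prints them by default (a dry run). It assumes the remote is named `origin`; the commit message and function names are illustrative. Opening, reviewing, and merging the PR happen in the Git provider and are not shown.

```python
import shlex
import subprocess


def feature_branch_commands(branch, base="develop", message="Add new pipeline"):
    """Return the Git commands for starting and publishing a feature branch."""
    return [
        ["git", "checkout", base],                # start from the integration branch
        ["git", "pull", "origin", base],          # make sure it is up to date
        ["git", "checkout", "-b", branch],        # create the feature branch
        ["git", "add", "--all"],                  # stage the changes
        ["git", "commit", "-m", message],         # commit with a descriptive message
        ["git", "push", "--set-upstream", "origin", branch],  # publish the branch
    ]


def run(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(feature_branch_commands("feature/new-pipeline"))
```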

What to review in data engineering PRs

Review Area    What to Check
Logic          Does the transformation produce correct results?
Data quality   Are there expectations/checks for bad data?
Schema         Are column types appropriate?
Performance    Will this scale with production data volumes?
Security       No hardcoded secrets? Proper permissions?
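
The security check in particular can be partly automated. The sketch below flags lines where a credential-like variable is assigned a string literal. The name pattern is a rough, illustrative heuristic and will miss many cases; it is not a substitute for a dedicated secret scanner or for human review.

```python
import re

# Heuristic (assumption): a name containing secret/password/token/api_key
# that is assigned a quoted string literal is treated as suspicious.
SECRET_PATTERN = re.compile(
    r"\b(\w*(?:secret|password|token|api_key)\w*)\s*=\s*[\"'][^\"']+[\"']",
    re.IGNORECASE,
)


def find_hardcoded_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line number, variable name) for each suspicious assignment."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = SECRET_PATTERN.search(line)
        if match:
            findings.append((lineno, match.group(1)))
    return findings


if __name__ == "__main__":
    sample = (
        'client_secret = "abc123"\n'
        'secret = dbutils.secrets.get(scope="s", key="k")\n'
    )
    print(find_hardcoded_secrets(sample))
```

In the sample, only the first line is flagged: the second line reads the value from a secret scope rather than embedding it.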

Conflict resolution

Conflicts occur when two developers change the same lines of the same file on different branches:

Developer A: changes line 15 of pipeline.py
Developer B: also changes line 15 of pipeline.py

Resolution steps:

  1. Pull the latest changes from the target branch
  2. Git marks conflicting sections with <<<<<<<, =======, and >>>>>>>
  3. Manually choose which changes to keep
  4. Commit the resolved file
  5. Push and update the PR
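
To make the markers concrete, the sketch below parses a conflicted file and keeps one side of every conflict. In a normal merge, the section between `<<<<<<<` and `=======` is the current branch ("ours") and the section between `=======` and `>>>>>>>` is the incoming branch ("theirs"). The function name and the keep-one-side policy are illustrative assumptions; real resolutions often combine both sides by hand.

```python
def resolve_conflicts(text: str, keep: str = "ours") -> str:
    """Resolve every conflict block by keeping one side ("ours" or "theirs")."""
    if keep not in ("ours", "theirs"):
        raise ValueError("keep must be 'ours' or 'theirs'")

    resolved = []
    state = "normal"  # one of: "normal", "ours", "theirs"
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            state = "ours"          # start of the current branch's version
        elif line.startswith("=======") and state == "ours":
            state = "theirs"        # start of the incoming branch's version
        elif line.startswith(">>>>>>>") and state == "theirs":
            state = "normal"        # end of the conflict block
        elif state == "normal" or state == keep:
            resolved.append(line)   # keep unconflicted lines and the chosen side
    return "\n".join(resolved)


if __name__ == "__main__":
    conflicted = "\n".join([
        "import pipeline",
        "<<<<<<< HEAD",
        "batch_size = 500",
        "=======",
        "batch_size = 1000",
        ">>>>>>> develop",
        "run()",
    ])
    print(resolve_conflicts(conflicted, keep="theirs"))
```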

Prevention: keep feature branches short-lived and merge frequently.

Question

What branching strategy should you use for Databricks projects?

Answer

main (production), develop (integration), feature/xxx (individual features), hotfix/xxx (urgent fixes). One branch per feature, merge via pull requests, never commit directly to main.

Question

What should you check during a data engineering code review?

Answer

Logic correctness, data quality expectations, column types/schema, performance at scale, and security (no hardcoded secrets, proper permissions).

Question

How do you prevent Git conflicts in a team?

Answer

Keep feature branches short-lived, merge frequently, communicate about shared files, and use clear file ownership. Pull latest changes before starting new work.

Knowledge check

Tomás accidentally committed a service principal client secret to a notebook in NovaPay's Git repo. What should he do FIRST?


Next up: Testing & Databricks Asset Bundles (testing strategies and modern deployment with Asset Bundles).