Git & Version Control

Git in Databricks

Simple explanation

Git is the “save game” system for your code.

Git records every change, so you can return to any previous version. Several people can work on different features at the same time without overwriting each other’s work. When a change is ready, it is reviewed in a pull request and then merged into the main version.

Branching strategy

Branch	Purpose	Who Uses It
main	Production-ready code	Deployments read from here
develop	Integration branch for features	Team merges features here
feature/xxx	Individual feature work	One developer per branch
hotfix/xxx	Emergency production fixes	Urgent patches

main ─────────────────────────────▶
  ↑                    ↑
  │ merge PR          │ merge PR
  │                    │
develop ──────────────────────────▶
  ↑         ↑
  │ merge   │ merge
  │         │
feature/a  feature/b
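
The naming convention in the table can be checked mechanically, for example in a CI step or a pre-push hook. The following is a minimal sketch rather than a standard tool: the exact regular expressions (lowercase letters, digits, dots, underscores and hyphens after the prefix) and the hotfix merge target are assumptions, since the table only fixes the prefixes.

```python
import re
from typing import Optional

# Patterns derived from the branch table. The allowed characters after
# the prefix are an assumption made for this sketch.
BRANCH_PATTERNS = {
    "main": re.compile(r"^main$"),
    "develop": re.compile(r"^develop$"),
    "feature": re.compile(r"^feature/[a-z0-9][a-z0-9._-]*$"),
    "hotfix": re.compile(r"^hotfix/[a-z0-9][a-z0-9._-]*$"),
}

# Where each branch type merges. feature -> develop and develop -> main
# follow the diagram; hotfix -> main is an assumption, since the diagram
# does not show hotfix branches.
MERGE_TARGET = {"feature": "develop", "develop": "main", "hotfix": "main"}


def classify_branch(name: str) -> Optional[str]:
    """Return the branch type for a name, or None if it breaks the convention."""
    for kind, pattern in BRANCH_PATTERNS.items():
        if pattern.match(name):
            return kind
    return None


if __name__ == "__main__":
    for branch in ["feature/new-pipeline", "hotfix/fix-null-ids", "my-experiment"]:
        kind = classify_branch(branch)
        target = MERGE_TARGET.get(kind) if kind else None
        print(branch, "->", kind, "merges into", target)
```

A name such as my-experiment is classified as None, which a CI check could treat as a failure.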

Best practices for Databricks

One branch per feature: never develop directly on main
Use Git folders (formerly Repos) in the workspace, with each developer working in their own branch
Never commit credentials: store them in a secret scope (for example, one backed by Azure Key Vault) and read them at runtime
Commit frequently with descriptive messages
Review code via pull requests before merging to develop/main
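
To illustrate the credentials rule, here is a minimal sketch of reading a secret at runtime instead of writing it into a notebook. dbutils.secrets.get(scope=..., key=...) is the Databricks secrets utility. The scope name novapay-kv and the key name sp-client-secret are hypothetical, and dbutils is passed in as a parameter only so the function can be exercised outside a notebook.

```python
# Anti-pattern: a literal secret in code ends up in Git history.
# client_secret = "s3cr3t-value"   # never do this


def read_client_secret(dbutils, scope: str = "novapay-kv",
                       key: str = "sp-client-secret") -> str:
    """Fetch a secret from a Databricks secret scope at runtime.

    `dbutils` is the object Databricks provides in notebooks. The default
    scope and key names are hypothetical placeholders.
    """
    return dbutils.secrets.get(scope=scope, key=key)
```

In a notebook the call would be read_client_secret(dbutils). Only the scope and key names are committed to Git, never the secret value.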

Pull requests and code review

A pull request (PR) is a request to merge your branch into another:

The developer pushes changes to feature/new-pipeline
The developer opens a PR to merge into develop
The team reviews the code (logic, data quality, naming)
A reviewer approves and the merge completes
The feature branch is deleted
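
From a local clone, the Git side of these steps might look like the commands below. This is a sketch under assumptions: the remote is called origin, the commit message is a placeholder, and the PR itself is opened, reviewed and merged in the Git provider's web interface (or through the Git folders UI in Databricks), so it does not appear here. The functions only build the command lists and do not run anything.

```python
def feature_branch_commands(branch: str, base: str = "develop",
                            message: str = "Describe the change") -> list[list[str]]:
    """Git commands to start a feature branch and publish it for a PR."""
    return [
        ["git", "checkout", base],
        ["git", "pull", "origin", base],          # start from the latest base
        ["git", "checkout", "-b", branch],        # create the feature branch
        # ... edit files here, then stage and commit them ...
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        ["git", "push", "--set-upstream", "origin", branch],
    ]


def cleanup_commands(branch: str, base: str = "develop") -> list[list[str]]:
    """Git commands to remove the feature branch after the PR is merged."""
    return [
        ["git", "checkout", base],
        ["git", "pull", "origin", base],
        ["git", "branch", "-d", branch],             # delete the local branch
        # Skip the next command if the Git provider already deleted it.
        ["git", "push", "origin", "--delete", branch],
    ]


if __name__ == "__main__":
    for cmd in feature_branch_commands("feature/new-pipeline"):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) to execute it. Printing the commands first makes it easy to verify them before anything touches the repository.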

What to review in data engineering PRs

Review Area	What to Check
Logic	Does the transformation produce correct results?
Data quality	Are there expectations/checks for bad data?
Schema	Are column types appropriate?
Performance	Will this scale with production data volumes?
Security	No hardcoded secrets? Proper permissions?
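
Part of the Security row can be supported by a simple automated check. The sketch below flags lines that assign a string literal to a variable whose name suggests a credential. The pattern and the four-character minimum are assumptions chosen for illustration. A heuristic like this misses many cases and does not replace a dedicated secret scanner or a human reviewer.

```python
import re

# Heuristic: a name containing secret/password/token/api_key that is
# assigned a quoted literal of at least four characters.
SECRET_ASSIGNMENT = re.compile(
    r"\b\w*(secret|password|token|api_?key)\w*\s*=\s*[\"'][^\"']{4,}[\"']",
    re.IGNORECASE,
)


def find_hardcoded_secrets(source: str) -> list[int]:
    """Return 1-based line numbers that look like hardcoded credentials."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if SECRET_ASSIGNMENT.search(line)
    ]


# Hypothetical notebook content used only to exercise the check.
EXAMPLE_SOURCE = (
    'client_secret = "abc123"\n'
    'safe = dbutils.secrets.get(scope="kv", key="client-secret")\n'
)

if __name__ == "__main__":
    print(find_hardcoded_secrets(EXAMPLE_SOURCE))
```

The first line of the example is flagged because a literal is assigned to client_secret. The second is not, because the value is fetched from a secret scope at runtime.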

Conflict resolution

Conflicts occur when two branches change the same lines of a file, so Git cannot decide automatically which version to keep:

Developer A: changes line 15 of pipeline.py
Developer B: also changes line 15 of pipeline.py

Resolution steps:

Pull the latest changes from the target branch into your feature branch
Git marks each conflicting section with <<<<<<<, ======= and >>>>>>>
Edit the file to keep the changes you want and delete the markers
Stage and commit the resolved file
Push and update the PR
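
To make the marker format concrete, the sketch below parses a conflicted file into its two sides. Lines between <<<<<<< and ======= come from your current branch, and lines between ======= and >>>>>>> come from the branch being merged in. The sample content and table names are hypothetical, and the parser assumes Git's default conflict style.

```python
def parse_conflicts(text: str) -> list[dict]:
    """Split each conflict block into its 'ours' and 'theirs' lines."""
    conflicts = []
    current = None  # the block being collected, or None outside a conflict
    side = None     # which side the following lines belong to
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            current = {"ours": [], "theirs": []}
            side = "ours"
        elif line.startswith("=======") and current is not None:
            side = "theirs"
        elif line.startswith(">>>>>>>") and current is not None:
            conflicts.append(current)
            current, side = None, None
        elif current is not None:
            current[side].append(line)
    return conflicts


# Hypothetical pipeline.py after merging develop into a feature branch.
SAMPLE = """\
df = spark.read.table("raw.payments")
<<<<<<< HEAD
df = df.filter("amount > 0")
=======
df = df.filter("amount >= 0")
>>>>>>> develop
df.write.saveAsTable("clean.payments")
"""

if __name__ == "__main__":
    for block in parse_conflicts(SAMPLE):
        print("ours:  ", block["ours"])
        print("theirs:", block["theirs"])
```

An empty result from parse_conflicts is a quick way to confirm that no markers remain before committing the resolved file.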

Prevention: keep feature branches short-lived and merge frequently.

Question

What branching strategy should you use for Databricks projects?

Answer

main (production), develop (integration), feature/xxx (individual features), hotfix/xxx (urgent fixes). One branch per feature, merge via pull requests, never commit directly to main.

Question

What should you check during a data engineering code review?

Answer

Logic correctness, data quality expectations, column types/schema, performance at scale, and security (no hardcoded secrets, proper permissions).

Question

How do you prevent Git conflicts in a team?

Answer

Keep feature branches short-lived, merge frequently, communicate about shared files, and use clear file ownership. Pull latest changes before starting new work.

Knowledge check

Tomás accidentally committed a service principal client secret to a notebook in NovaPay's Git repo. What should he do FIRST?

Next up: Testing & Databricks Asset Bundles — testing strategies and modern deployment with Asset Bundles.