Git & GitHub Basics
Version control for researchers who use Dropbox
Git Is Not Dropbox
Most researchers already have some form of version control: a final_draft folder, files named analysis_v3_REAL_final.R, or Dropbox keeping a rolling window of past edits (180 days on some plans). Git does something fundamentally different.
| Dropbox | Git |
|---|---|
| Syncs files automatically | You choose exactly when to save a snapshot |
| Versions are timestamps | Every snapshot has a message: why you made it |
| “final_v2_REAL_final.docx” | A readable history of decisions |
| Hard to undo a specific change | Easy to go back to any committed snapshot |
| Conflict-prone with collaborators | Designed for multiple people, same project |
| Keeps everything | You ignore what doesn’t matter |
The key insight: Dropbox records when, Git records why. For research, the why is what matters.
What a Commit Really Is
A commit is a snapshot of your entire project at a specific moment, paired with a human-readable message explaining what changed and why.
commit 1: "Add raw BACI trade data for 2022"
commit 2: "Clean data: filter to Germany, remove missing values"
commit 3: "Add gravity regression with distance and GDP controls"
commit 4: "Fix: switch from OLS to PPML for zero trade flows"
commit 5: "Robustness: add country-pair fixed effects, results hold"
This isn’t just file history. It’s a research log. Six months later, when a referee asks why you switched estimators, you have a precise record: commit 4, with the message and the exact code change.
Each commit records:
- Which files changed, and exactly what changed line by line
- Who made the change and when
- The message you wrote explaining the decision
A commit message should answer the question “why did I do this?” not just “what is in this file.” "Fix bug" is useless. "Fix: PPML failed on zero flows — switch from log(value+1) to fepois" is useful.
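To see what a commit records in practice, here is a minimal sketch in a throwaway repository. The file name 03_gravity.R, the R lines inside it, and the author identity are placeholders chosen for illustration, not part of any real project.

```shell
set -eu
# Scratch repository in a temporary directory, so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Researcher"       # placeholder identity
git config user.email "researcher@example.org"  # placeholder identity

# Hypothetical analysis script: the first version uses OLS.
echo 'model <- lm(log(value + 1) ~ log(dist), data = trade)' > 03_gravity.R
git add 03_gravity.R
git commit -q -m "Add gravity regression with distance control"

# The second version switches estimator; the message records why.
echo 'model <- fepois(value ~ log(dist), data = trade)' > 03_gravity.R
git add 03_gravity.R
git commit -q -m "Fix: switch from OLS to PPML for zero trade flows"

# One line per commit: abbreviated hash and message.
git --no-pager log --oneline
# Full record of the latest commit: author, date, message, line-by-line diff.
git --no-pager show HEAD
```

The output of the last command contains the three things listed above: the changed lines, the author and date, and the message.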
The Three States
Every file in a Git project can be in one of three states. Understanding this removes most of the confusion people have with Git.

1. Working Directory — files as they are on your hard drive right now. You edit them normally. Git is watching but not yet recording.
2. Staging Area (also called “the index”) — files you’ve explicitly marked as “include these in the next commit.” It’s a holding area. This is where you curate what goes into a snapshot.
3. Repository — the permanent record. Commits live here. This is the history that never changes (unless you explicitly rewrite it, which you rarely need to do).
The workflow is: edit files → stage the changes you want → commit with a message.
You almost never need to think about staging explicitly when using Claude Code — you can just say “commit everything” and Claude handles it.
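If you want to watch a file move through the three states, the sketch below does it in a scratch repository. The file names and identity are placeholders; the two-letter codes printed by git status --short show which state each file is in.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Researcher"       # placeholder identity
git config user.email "researcher@example.org"  # placeholder identity

echo "x <- 1" > 01_clean.R
git add 01_clean.R
git commit -q -m "Add cleaning script"

# 1. Working directory: edit one file, create another. Nothing is recorded yet.
echo "x <- 2" > 01_clean.R
echo "scratch" > notes.txt
git status --short   # 01_clean.R is modified but unstaged; notes.txt is untracked (??)

# 2. Staging area: mark only the file that belongs in the next snapshot.
git add 01_clean.R
git status --short   # 01_clean.R is now staged; notes.txt is still untracked

# 3. Repository: the staged change becomes a permanent commit.
git commit -q -m "Clean data: update x after checking the codebook"
git status --short   # only notes.txt remains, because it was never staged
```

Staging is what lets notes.txt stay out of the snapshot even though it sits in the same folder.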
Letting Claude Code Handle Git
The main point of this guide is not to teach you git commands. Claude Code can handle Git entirely through plain-English instructions.
Starting a project:
“Initialize a Git repository for this project”
“Create a .gitignore for an R project with large data files”
Saving your work:
“Commit everything we’ve done today with an appropriate message”
“Make a commit: we switched the estimator from OLS to PPML and results held”
Checking what changed:
“What has changed since the last commit?”
“Show me the diff for 03_gravity.R”
Recovering from mistakes:
“Undo the last commit but keep my changes”
“I edited the wrong file — how do I restore 02_merge.R to what it was in the last commit?”
Collaborating:
“Push this to GitHub”
“I got a merge conflict in 03_gravity.R — help me resolve it”
Claude Code knows Git well. You do not need to memorize commands. The goal is to understand what Git is doing conceptually so you can ask for the right thing.
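For readers who want a sense of what sits behind the "checking what changed" requests, here is a rough sketch. These are standard Git commands, but they are one plausible translation of the plain-English requests, not a record of what Claude Code actually runs. File contents and identity are placeholders.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Researcher"       # placeholder identity
git config user.email "researcher@example.org"  # placeholder identity

echo 'est <- "ols"' > 03_gravity.R
git add 03_gravity.R
git commit -q -m "Add gravity regression"

# An uncommitted edit made after the last commit.
echo 'est <- "ppml"' > 03_gravity.R

# "What has changed since the last commit?"
git status --short
changed=$(git diff --name-only)
echo "Changed files: $changed"

# "Show me the diff for 03_gravity.R"
git --no-pager diff -- 03_gravity.R
```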
Good vs. Bad Commit Messages
A commit message has two purposes: it tells your future self what you did, and it tells collaborators why. Most bad commit messages fail on the why.
Bad:
update analysis
fix
wip
changes
cleaned data
Good:
Clean BACI data: drop observations with missing importer, filter to 2010–2022
Add PPML specification — OLS gave implausible distance elasticity (+0.3)
Fix merge: was accidentally dropping landlocked countries via distance join
Robustness: results hold when restricting to HS2-level product aggregation
Add country-pair FE following suggestion from referee 2
The good messages are complete sentences that explain the decision. They give context. They use past tense or imperative mood consistently. They are specific about what changed and why.
Ask Claude: “Write a commit message for what we just did.” It will write a good one. You can edit it if needed.
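When one line is not enough for the why, a commit can carry a short subject plus a longer body. The sketch below shows one way to do that from the command line, using two -m flags; the file and identity are placeholders.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Researcher"       # placeholder identity
git config user.email "researcher@example.org"  # placeholder identity

echo 'est <- "ppml"' > 03_gravity.R
git add 03_gravity.R

# The first -m is the subject line; the second becomes a separate body paragraph.
git commit -q \
  -m "Add PPML specification for zero trade flows" \
  -m "OLS gave an implausible distance elasticity (+0.3), so the estimator changes to PPML, which handles zero trade flows."

subject=$(git log -1 --format=%s)   # %s = subject line only
body=$(git log -1 --format=%b)      # %b = body only
echo "Subject: $subject"
echo "Body:    $body"
```

Short log views such as git log --oneline show only the subject, so it should make sense on its own.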
Branches in Brief
A branch is a parallel version of your project where you can try something without affecting the main version. When you’re satisfied, you merge it back.
For solo research, you rarely need branches until your project grows. The most common use case is trying a new specification or restructuring your code while keeping a stable version to fall back on. Ask Claude: “Create a branch called robustness so I can try this without breaking the main analysis.”
For collaboration, branches become essential — each co-author works on their own branch and you merge when ready. GitHub’s pull request workflow is built around this. If you’re new to collaboration, tell Claude: “Set up a simple branching workflow for two co-authors” and it will walk you through it.
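The solo use case above (try something on a robustness branch, then merge it back) might look like the following sketch. The branch name comes from the example request; everything else (file contents, identity) is a placeholder.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Researcher"       # placeholder identity
git config user.email "researcher@example.org"  # placeholder identity

echo 'spec <- "baseline"' > 03_gravity.R
git add 03_gravity.R
git commit -q -m "Add baseline gravity specification"

# Remember the default branch name (main or master, depending on Git settings).
main=$(git rev-parse --abbrev-ref HEAD)

# Create the branch and switch to it (git switch -c robustness in Git 2.23+).
git checkout -q -b robustness
echo 'spec_fe <- "country-pair FE"' >> 03_gravity.R
git add 03_gravity.R
git commit -q -m "Robustness: add country-pair fixed effects"

# Back on the main branch, the file still has only the baseline line.
git checkout -q "$main"
lines_before=$(grep -c 'spec' 03_gravity.R)

# Satisfied with the experiment: merge it back.
git merge -q robustness
lines_after=$(grep -c 'spec' 03_gravity.R)
echo "Lines on $main before merge: $lines_before, after merge: $lines_after"
```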
GitHub: The Published Version
Git is the version control system that runs on your machine. GitHub is a website that hosts your Git repository online.
Think of it as: Git is the filing system; GitHub is the filing cabinet that your collaborators (or, if you make it public, everyone) can see.
GitHub gives you:
- Backup — your history is off your laptop
- Collaboration — co-authors can push and pull changes
- Reproducibility — share a link to your exact code at a specific commit
- Portfolio — your public repositories are visible to the community
Getting started:
- Create a free account at github.com
- Install Git: brew install git (Mac), or it comes pre-installed with WSL (Windows; see the Windows setup guide)
- Tell Claude: “Help me push this project to a new GitHub repository”
Claude will handle authentication, creating the remote, and pushing. You’ll need to authorize with GitHub once via the browser.
Private repositories are free on GitHub. If you’re working with sensitive data or unpublished results, make your repository private until you’re ready to share.
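The mechanics of "creating the remote and pushing" can be sketched without a GitHub account by using a local bare repository as a stand-in for the hosted one. This is an assumption made so the example runs offline; with GitHub, the remote would be the repository URL shown on its page, and authentication would happen on the first push.

```shell
set -eu
work=$(mktemp -d)
remote="$(mktemp -d)/project.git"

# Stand-in for the empty repository GitHub would host.
git init -q --bare "$remote"

cd "$work"
git init -q
git config user.name "Example Researcher"       # placeholder identity
git config user.email "researcher@example.org"  # placeholder identity
echo 'est <- "ppml"' > 03_gravity.R
git add 03_gravity.R
git commit -q -m "Add gravity regression"

# "origin" is the conventional name for the main remote.
git remote add origin "$remote"
# -u links the local branch to the remote one, so later pushes need no arguments.
git push -q -u origin HEAD

branch=$(git rev-parse --abbrev-ref HEAD)
local_commit=$(git rev-parse HEAD)
remote_commit=$(git --git-dir="$remote" rev-parse "refs/heads/$branch")
echo "local:  $local_commit"
echo "remote: $remote_commit"
```

When the two hashes match, the remote holds the same history as your laptop, which is the backup property listed above.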
.gitignore — What Not to Track
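A .gitignore file tells Git which files to ignore entirely. Ignored files won’t be tracked, won’t appear as “untracked files,” and won’t be pushed to GitHub. The rules apply only to files Git is not already tracking: a file you committed earlier stays tracked until you explicitly remove it from the repository.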
You want to ignore:
- Large data files that belong in a separate data repository or are too big for GitHub (100MB limit per file)
- Generated outputs that can be recreated from code (figures, compiled files)
- Credentials and API keys — these must never go on GitHub
- System files that clutter the repo (.DS_Store on Mac, Thumbs.db on Windows)
- R/Stata session files that are user-specific (.Rhistory, .RData)
A realistic .gitignore for an economics project in R:
# Data files (too large, often under license)
data/
*.dta
*.xlsx
# Generated outputs (reproducible from code)
output/figures/*.pdf
output/figures/*.png
output/tables/*.tex
# R session files
.Rhistory
.RData
.Rproj.user/
# Python artifacts
__pycache__/
*.pyc
.ipynb_checkpoints/
venv/
# Credentials (NEVER commit these)
.env
credentials.json
*.pem
*.key
api_keys.R
# System files
.DS_Store
Thumbs.db
# LaTeX build artifacts
*.aux
*.log
*.synctex.gz
Never commit API keys, passwords, or credentials to Git. Even in private repositories. Once a secret is in Git history, it’s difficult to remove completely. Use a .env file for secrets and add it to .gitignore immediately.
Ask Claude: “Create a .gitignore for this R project with BACI data and LaTeX output” and it will generate an appropriate one.
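Whoever writes the .gitignore, it is worth checking that it does what you expect before the first commit. The sketch below uses a shortened version of the patterns above and asks Git which rule matches each file; the data file name and the key value are placeholders.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q

# A few of the patterns from the example above.
printf '%s\n' 'data/' '*.dta' '.env' '.Rhistory' > .gitignore

mkdir data
echo "raw"                 > data/baci_2022.csv   # hypothetical data file
echo "API_KEY=placeholder" > .env                 # placeholder, not a real key
echo "x <- 1"              > 01_clean.R

# For each ignored path, print the .gitignore line and pattern responsible.
git check-ignore -v data/baci_2022.csv .env

# Only .gitignore and 01_clean.R should be offered for tracking.
git status --short
```

If a file that should be ignored still shows up in git status, it was probably committed before the rule existed; git rm --cached followed by the file name stops tracking it without deleting it from disk.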
“I Messed Up” — Recovery Scenarios
Git makes mistakes recoverable. Here are the three situations researchers most commonly encounter.
Situation 1: I edited a file and want to undo all my changes.
“I changed 03_gravity.R but the regressions got worse — restore it to the last commit”
Claude will run git checkout -- 03_gravity.R or the equivalent (git restore 03_gravity.R in Git 2.23 and later). Your uncommitted edits are discarded for good, and the file returns to its last committed state.
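A minimal sketch of Situation 1 in a scratch repository, with placeholder file contents and identity. Note that the discarded edits cannot be recovered afterwards, so it is worth looking at the diff one last time first.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Researcher"       # placeholder identity
git config user.email "researcher@example.org"  # placeholder identity

echo 'est <- "ppml"' > 03_gravity.R
git add 03_gravity.R
git commit -q -m "Add PPML specification"

# An experiment that made things worse.
echo 'est <- "experimental"' > 03_gravity.R

# Last look at what is about to be thrown away.
git --no-pager diff -- 03_gravity.R

# Restore the file from the last commit (nothing was staged in between).
git checkout -- 03_gravity.R
cat 03_gravity.R
```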
Situation 2: I committed something I shouldn’t have.
“I accidentally committed a file with API keys — undo that commit”
If you haven’t pushed to GitHub yet: Claude can undo the commit cleanly. If you have pushed: it’s more complex — Claude will walk you through it and remind you to rotate the keys.
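For the not-yet-pushed case, a clean undo might look like the sketch below. It is one reasonable sequence under the assumption that the bad commit is the most recent one; the key value is a placeholder. If the commit had already been pushed, none of this makes the key safe again, and it should still be rotated.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Researcher"       # placeholder identity
git config user.email "researcher@example.org"  # placeholder identity

echo 'est <- "ols"' > 03_gravity.R
git add 03_gravity.R
git commit -q -m "Add gravity regression"

# The mistake: a secrets file is swept into a commit along with real work.
echo 'est <- "ppml"' > 03_gravity.R
echo "API_KEY=placeholder-not-a-real-key" > .env
git add -A
git commit -q -m "Switch to PPML"

# Undo the commit but keep every change staged.
git reset -q --soft HEAD~1
# Take .env out of the staging area; the file itself stays on disk.
git rm -q --cached .env
# Make sure it cannot be staged by accident again.
echo ".env" > .gitignore
git add .gitignore
git commit -q -m "Fix: switch from OLS to PPML; ignore .env"

# The history now lists 03_gravity.R and .gitignore, but never .env.
git --no-pager log --name-only --format='%h %s'
```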
Situation 3: Something broke and I don’t know when.
“The regression started giving wrong results somewhere in the last week — help me figure out when”
Claude can use git bisect or review the diff between commits to find when the change happened. This is where a clean commit history pays off — if every commit is a logical unit of work with a good message, you can bisect the problem quickly.
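To illustrate how bisecting works, the sketch below builds a five-commit history in which a hypothetical sign error enters at the fourth commit, then lets git bisect run find it. The check script is an assumption standing in for whatever test tells you the results are right; it must exit 0 for a good commit and non-zero for a bad one.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Researcher"       # placeholder identity
git config user.email "researcher@example.org"  # placeholder identity

# Five commits; the elasticity sign flips (the "bug") from revision 4 onward.
for i in 1 2 3 4 5; do
  if [ "$i" -ge 4 ]; then sign="+"; else sign="-"; fi
  echo "elasticity <- ${sign}0.9  # revision $i" > 03_gravity.R
  git add 03_gravity.R
  git commit -q -m "Revision $i of gravity script"
done

# Hypothetical check, kept outside the repository: good means a negative sign.
check=$(mktemp)
cat > "$check" <<'EOF'
grep -qF -e '-0.9' 03_gravity.R
EOF

first=$(git rev-list --max-parents=0 HEAD)   # the very first commit
git bisect start HEAD "$first" >/dev/null    # bad = HEAD, good = first commit
git bisect run sh "$check" >/dev/null        # Git checks out midpoints and runs the check

culprit=$(git rev-parse refs/bisect/bad)     # first commit where the check fails
git --no-pager log -1 --format='%h %s' "$culprit"
git bisect reset >/dev/null 2>&1             # return to where you started
```

With five commits the search takes two checks; the number of checks grows roughly with the logarithm of the history length, which is why small, well-described commits make this fast.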
When something goes wrong, don’t panic and don’t start deleting files. Tell Claude what happened and ask for help. Git rarely loses committed work: it almost always has what you need somewhere in the history.
Further Reading
- MIT Missing Semester — Version Control (Git) — 80-minute lecture (video + notes). The best free Git tutorial we know of.
- GitHub’s own Git guide — clear and comprehensive
- The Plain Person’s Guide to Plain Text Social Science — Kieran Healy’s guide on reproducible research workflows including Git for social scientists
- Software Carpentry: Version Control with Git — free workshop materials designed for researchers