These terms are often used interchangeably, but they are different.
Repeatability: The same person obtains the exact same results when re-running the analysis on the same data.
Reproducibility: A different person or group obtains the exact same results using the same data. If we can’t reproduce a study, how can we replicate it?
Replicability: Repeating a study by independently performing another study on new data.
Repeatability vs Reproducibility vs Replicability
Reproducibility
A different analyst/researcher re-performs the analysis with the
same code and
same data and
obtains the same result
⚠️ If your results are not repeatable then they will not be reproducible!
Reproducibility
Barriers to doing reproducible work:
Poor documentation
Manual steps
Non-transferable tools
Incorrect training
Time
Reproducible Workflow
Reproducible Research
In academia, incentives often prioritize publication. But many results are difficult to reproduce, so there’s a push to publish code, data, and the tools needed to re-run analyses.
In computational sciences and data analysis, what is reproducibility?
Definition: The data and code used to make a finding are available and presented so that an independent researcher can (relatively) straightforwardly recreate the result.
This still seldom happens. Two examples from Tim Vines (DataSeer.ai):
Data availability declines rapidly with article age (reported ~17% lower odds per year in one analysis).
Reanalyses using the program STRUCTURE found a substantial fraction of published results could not be reproduced (reported ~30% in one study).
Scientific articles often include detailed methods, but they are typically insufficient to reproduce a computational analysis.
Roger Peng and Stephanie Hicks wrote: “Reproducibility is typically thwarted by a lack of availability of the original data and computer code.”
Scientists owe it to themselves and their community to keep an explicit record of all steps in a computational analysis.
Reproducible Research Do’s
Start with a good question: make it focused and something you care about.
Teach your computer to do the work from beginning to end (automation > manual steps).
Use version control.
Track your software environment (toolchain + package versions).
Set a random seed for any random generation/sampling (e.g., train/test splits).
Think about the entire pipeline (raw data -> cleaning -> analysis -> output).
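As an illustration of the random-seed item above, here is a minimal sketch of a seeded train/test split using only the Python standard library. The function name train_test_split, the default seed of 42, and the toy data are assumptions made for this example, not part of any course code.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows with a seeded generator, then split it."""
    rng = random.Random(seed)  # local generator: global random state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))  # hypothetical toy dataset
train_a, test_a = train_test_split(data, seed=42)
train_b, test_b = train_test_split(data, seed=42)

# The same seed gives the same split on every run.
print(train_a == train_b and test_a == test_b)  # True
```

Passing the seed as an argument, rather than hard-coding it deep in the script, keeps it visible and easy to record alongside the results.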
Reproducible Research Don’ts
Do not do things by hand. This includes:
Editing spreadsheets to “clean” them (e.g., removing outliers, ad hoc QA/QC)
Manually editing tables or figures
Downloading data by clicking around in a web browser
Splitting data and moving it around manually
If something truly must be done by hand, document it explicitly.
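One way to replace hand-editing a spreadsheet is to express the cleaning rule in code, so the rule itself is recorded. The sketch below assumes a hypothetical temperature dataset in which -999 marks a missing reading; the column names and the plausible range are invented for illustration.

```python
import csv
import io

# Hypothetical raw data; in practice this would be read from a file.
RAW_CSV = """site,temperature_c
A,12.5
B,13.1
C,-999
D,11.8
"""


def clean_rows(rows, min_temp=-50.0, max_temp=60.0):
    """Split rows into those inside and outside a documented plausible range."""
    kept, dropped = [], []
    for row in rows:
        temp = float(row["temperature_c"])
        if min_temp <= temp <= max_temp:
            kept.append(row)
        else:
            dropped.append(row)
    return kept, dropped


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
kept, dropped = clean_rows(rows)
print(f"kept {len(kept)} rows, dropped {len(dropped)}")  # kept 3 rows, dropped 1
```

Returning the dropped rows as well as the kept ones makes it possible to log exactly what was removed and why.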
Avoid point-and-click or highly interactive tools when possible.
They often leave no trace of the steps.
If you must use them, write down the exact sequence of actions.
Save the data and code that generated the output, rather than the output alone.
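A possible way to keep the output tied to the data, code, and environment that produced it is to write a small manifest next to each result. This is only a sketch: the manifest fields, the function names, and the example inputs are assumptions, and a real project might rely on a lock file or a container image instead.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def sha256_of(data: bytes) -> str:
    """Hex digest used to fingerprint the input data and the script."""
    return hashlib.sha256(data).hexdigest()


def package_versions(names):
    """Installed version of each named package, or None if it is not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_manifest(input_bytes, script_text, seed, packages=()):
    """Collect what is needed to trace an output back to its inputs."""
    return {
        "input_sha256": sha256_of(input_bytes),
        "script_sha256": sha256_of(script_text.encode("utf-8")),
        "python_version": platform.python_version(),
        "platform": sys.platform,
        "packages": package_versions(packages),
        "seed": seed,
    }


# Hypothetical inputs; in practice these are read from the data file and script.
manifest = build_manifest(b"site,temp\nA,12.5\n", "print('analysis')", seed=42)
print(json.dumps(manifest, indent=2))
```

Saving this JSON file alongside every figure or table gives a later reader a quick check that they are re-running the same inputs.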
Reproducibility Challenges
Data size
Build tools into your code to manage large datasets (chunking, efficient formats, parallelism).
Store data in smaller chunks and write code that pulls and combines files automatically.
Write metadata and use tools that support data organization.
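The idea of storing data in smaller chunks and combining them automatically could look like the sketch below. The part_000.csv naming scheme, the two-column layout, and the chunk size are assumptions chosen for the example; it writes to a temporary directory so it can be run anywhere.

```python
import csv
import tempfile
from pathlib import Path


def write_chunks(directory, rows, chunk_size):
    """Write rows to numbered CSV files holding at most chunk_size rows each."""
    for i in range(0, len(rows), chunk_size):
        path = directory / f"part_{i // chunk_size:03d}.csv"
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["id", "value"])
            writer.writeheader()
            writer.writerows(rows[i:i + chunk_size])


def iter_rows(directory):
    """Yield rows from every chunk file, in sorted (deterministic) file order."""
    for path in sorted(directory.glob("part_*.csv")):
        with path.open(newline="") as f:
            yield from csv.DictReader(f)


rows = [{"id": i, "value": i * i} for i in range(10)]  # hypothetical data
with tempfile.TemporaryDirectory() as tmp:
    chunk_dir = Path(tmp)
    write_chunks(chunk_dir, rows, chunk_size=4)
    n_files = len(list(chunk_dir.glob("part_*.csv")))
    # Rows are streamed one at a time, so the full dataset never sits in memory.
    total = sum(int(r["value"]) for r in iter_rows(chunk_dir))

print(n_files, total)  # 3 285
```

Sorting the file list matters: directory listing order is not guaranteed, and an unsorted read can silently change row order between machines.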
Data complexity
Use smaller “toy” subsets to regularly check reproducibility.
Be explicit about training/validation/test sets.
Use diagnostic visualizations.
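A toy-subset check might be sketched as follows: run the pipeline twice on a small fixed slice and compare fingerprints of the results. The analysis function here is a stand-in (a bootstrap mean), and the subset size and seed are arbitrary assumptions.

```python
import hashlib
import json
import random


def analysis(values, seed):
    """Stand-in for a real pipeline: bootstrap estimate of the mean."""
    rng = random.Random(seed)
    means = []
    for _ in range(100):
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    return {"bootstrap_mean": round(sum(means) / len(means), 6)}


def fingerprint(result):
    """Hash a JSON-serialisable result so two runs can be compared cheaply."""
    encoded = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


full_data = list(range(1000))  # hypothetical full dataset
toy = full_data[:50]           # fixed slice, so the subset itself is reproducible

first = fingerprint(analysis(toy, seed=1))
second = fingerprint(analysis(toy, seed=1))
print(first == second)  # True
```

Because the toy run is fast, a check like this can be repeated after every change to the code rather than only at the end of a project.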
Workflow complexity
Use README files (and keep them updated).
What is version control?
Version control is the management of changes to documents and code. Changes are identified by a revision (e.g., “revision 1”, then “revision 2”, …). Each revision is associated with a timestamp and the person making the change. Revisions can be compared, restored, and sometimes merged.
On GitHub, create a new empty repo (do not add a README if you already have one locally). Then, in your local project folder, add the GitHub repo as a remote named origin and push:
git remote add origin <URL of your GitHub repo>
git push -u origin main
(If your default branch is master instead of main, run git push -u origin master. If main doesn’t exist yet locally, create or rename it with git branch -M main, then run the push command above.)
Removing a mistakenly staged/tracked file
If you accidentally added a file you don’t want to track (example: class-notes.docx):
git rm --cached class-notes.docx
This removes it from Git tracking but not from your computer.
Then prevent it from being tracked again using .gitignore
# ignore all .a files
*.a

# but do track lib.a, even though you're ignoring .a files above
!lib.a

# only ignore the TODO file in the current directory, not subdir/TODO
/TODO

# ignore all files in any directory named build
build/

# ignore doc/notes.txt, but not doc/server/arch.txt
doc/*.txt

# ignore all .pdf files in the doc/ directory and any of its subdirectories
doc/**/*.pdf
Branches, Forks, Pull Requests, Merge Conflicts
A typical flow is: branch (or fork + branch) → pull request → merge → resolve conflicts (if needed)
These concepts make collaboration (mostly) painless:
Branches: work in parallel without breaking main
Forks: work on a copy of a repo when you don’t have write access
Pull Requests: propose + review changes before merging
Merge conflicts: what happens when Git can’t auto-combine edits
Branch vs Fork
Branch = a new line of work inside the same repository
Fork = your own copy of the entire repository under your account
Rule of thumb:
Working in a shared class/team repo → branch
Contributing to a repo you can’t write to → fork
Branch
Repo: course-repo
You create: student/meredith-lab2
You push to the same repo
PR: student/meredith-lab2 → main
Best for: teams/classes with shared access
Fork
Upstream repo: org/course-repo
Your fork: yourname/course-repo
You work in your fork (often on a branch)
PR: yourname:branch → org:main
Best for: open-source external projects with no write access
Branches: what problem do they solve?
Without branches:
Everyone edits main
Work collides
It’s hard to experiment safely
With branches:
main stays stable
Each feature/bugfix happens on its own branch
Changes are merged back only when ready
Branches are easy in Git: creating/switching is fast.
A good pull request includes only relevant files (no accidental large data, secrets, etc.)
Example Pull Request terminal commands
# Start from main and get the latest changes
git switch main
git pull

# Create a new branch for your lab work
git switch -c student/meredith-lab2

# Do your work (edit files in VS Code or any editor)
# (example files you might create/edit)
# - train.py
# - requirements.txt
# - README.md

# Check what changed
git status
git diff

# Stage and commit
git add train.py requirements.txt README.md
git commit -m "Lab 2: add reproducible model training script"

# Push the branch to GitHub
git push -u origin student/meredith-lab2

# Open a Pull Request on GitHub:
# student/meredith-lab2 --> main

# After review changes are requested:
# Make edits, then repeat add/commit/push
git add .
git commit -m "Address PR feedback"
git push
Merging: what does it mean?
Merging integrates two lines of work by combining their histories:
Fast-forward merge: main simply moves forward (no divergence)
3-way merge: Git creates a new merge commit that joins two lines of work
Either way, the goal is the same: integrate branch work into main.
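To make the difference between the two merge types concrete, here is a toy model of a commit graph in Python. It is not how Git is implemented; the dictionary of parent links and the commit names A, B, C are assumptions for illustration. The rule it encodes is that a merge can fast-forward when the tip of main is an ancestor of the branch tip.

```python
def ancestors(commit, parents):
    """Return the set containing commit and every commit reachable from it."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def merge_kind(main_tip, branch_tip, parents):
    """Classify what merging branch_tip into main_tip would require."""
    if branch_tip in ancestors(main_tip, parents):
        return "already up to date"  # main already contains the branch
    if main_tip in ancestors(branch_tip, parents):
        return "fast-forward"        # main has not moved since the branch began
    return "3-way merge"             # both sides have commits the other lacks


# Each commit maps to its list of parent commits (hypothetical histories).
linear = {"A": [], "B": ["A"], "C": ["B"]}    # branch: A <- B <- C, main still at A
diverged = {"A": [], "B": ["A"], "C": ["A"]}  # main moved to B, branch moved to C

print(merge_kind("A", "C", linear))    # fast-forward
print(merge_kind("B", "C", diverged))  # 3-way merge
```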
Merge conflicts: what are they?
A merge conflict happens when:
two branches edited the same lines in the same file, and
Git can’t determine how to combine them safely
Important:
Conflicts are normal in collaboration
They’re not “errors” so much as “decisions Git asks humans to make”
When do conflicts happen most?
Long-lived branches (you drift far from main)
Many people editing the same file
Moving/renaming files while someone else edits them
Preventing conflicts (best practices)
Pull often (or merge main into your branch regularly)
Keep PRs small and merge them sooner
Avoid huge “format everything” commits mixed with logic changes
Communicate: “I’m editing slides/week2.qmd today”
What a conflict looks like
Git inserts markers like this into a file:
<<<<<<< HEAD
This is the version from main
=======
This is the version from your branch
>>>>>>> student/meredith-lab2
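Leftover conflict markers are easy to commit by accident. The sketch below scans text for marker blocks and reports their line ranges; the function name find_conflicts and the sample text are assumptions, and the check is deliberately simple (it only looks at how lines start).

```python
def find_conflicts(text):
    """Return (start_line, end_line) pairs, 1-indexed, for each conflict block."""
    conflicts, start = [], None
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith("<<<<<<< "):
            start = number  # opening marker: remember where the block began
        elif line.startswith(">>>>>>> ") and start is not None:
            conflicts.append((start, number))  # closing marker ends the block
            start = None
    return conflicts


sample = "\n".join([
    "intro line",
    "<<<<<<< HEAD",
    "This is the version from main",
    "=======",
    "This is the version from your branch",
    ">>>>>>> student/meredith-lab2",
    "closing line",
])

print(find_conflicts(sample))  # [(2, 6)]
```

To resolve a conflict, edit the file so it contains the text you want, delete all three marker lines, then stage and commit the file.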
Resources
Git everyday commands: run man giteveryday in a terminal