class: center, middle, title-slide

.title[
# Version Control and Reproducible Research
]

.subtitle[
## JSC 370: Data Science II
]

---

# Part I: Introduction

---

## What is version control

<div style="text-align: center;">
<table>
<col width="40%">
<col width="40%">
<tr>
<td style="text-align: left;">
[I]s the <strong>management of changes</strong> to documents [...] <strong>Changes are usually identified</strong> by a number or letter code, termed the "revision number", "revision level", or simply "revision". For example, an initial set of files is "revision 1". When the first change is made, the resulting set is "revision 2", and so on. <strong>Each revision is associated with a timestamp and the person making the change</strong>. Revisions can be <strong>compared</strong>, <strong>restored</strong>, and with some types of files, <strong>merged</strong>. -- <a href="https://en.wikipedia.org/w/index.php?title=Version_control&oldid=948839536" target="_blank">Wikipedia</a>
</td>
<td>
<img src="https://upload.wikimedia.org/wikipedia/commons/a/af/Revision_controlled_project_visualization-2010-24-02.svg" alt="Diagram of version control" width="35%">
</td>
</tr>
</table>
</div>

---

## Why do we care

Have you ever:

- Made a **change to code**, realised it was a **mistake** and wanted to **revert** back?
- **Lost code** or had a backup that was too old?
- Had to **maintain multiple versions** of a product?
- Wanted to see the **difference between** two (or more) **versions** of your code?
- Wanted to prove that a particular **change broke or fixed** a piece of code?
- Wanted to **review the history** of some code?

---

## Why do we care (cont'd)

- Wanted to submit a **change** to **someone else's code**?
- Wanted to **share your code**, or let other people work on your code?
- Wanted to see **how much work** is being done, and where, when and by whom?
- Wanted to **experiment** with a new feature **without interfering** with working code?
In these cases, and no doubt others, a version control system should make your life easier.

-- [Stack Overflow](https://stackoverflow.com/a/1408464/2097171) (by [si618](https://stackoverflow.com/users/44540/si618))

---

## Why do we care (cont'd)

<div style="text-align: center;">
<figure>
<img style="width: 600px;vertical-align: middle;" hspace="20px" src="fig/git-flow.png" alt="Workflow">
</figure>
<figcaption><b>Source: Jenny Bryan</b></figcaption>
</div>

---

## Reproducible Research

- In computational sciences and data analysis, what is reproducible research?
    - The data and code used to make a finding are available, and they are presented in such a way that it is (relatively) straightforward for an independent researcher to recreate the finding.
- This actually seldom happens. Consider two interesting articles by Tim Vines and colleagues:
    - *The Availability of Research Data Declines Rapidly with Article Age*: “of 516 articles published between 2 and 22 years ago…the odds of a data set being extant fell by 17% per year.”
    - *Recommendations for utilizing and reporting population genetic analyses: the reproducibility of genetic clustering using the program structure*: “we reanalysed data sets gathered from papers using the software package ‘structure’… 30% of analyses were unable to reproduce the same number of population clusters.”
- Scientific articles have fairly detailed methods sections, but those are typically insufficient to actually reproduce an analysis. Roger Peng and Stephanie Hicks [write](https://www.annualreviews.org/doi/pdf/10.1146/annurev-publhealth-012420-105110): "Reproducibility is typically thwarted by a lack of availability of the original data and computer code."
- Scientists owe it to themselves and their community to have an explicit record of all the steps in an analysis done at a computer.

---

## Reproducible Research Do's

- Start with a good question: make sure it is focused and it is something you're interested in.
- Teach your computer to do the work from beginning to end!
- Use version control.
- Keep track of your software environment, from what is in your toolchain (software: Python, R, Tableau) to version numbers.
- Set your seed for any random number generation or sampling! This is needed, for example, when splitting your data into training and test sets.
- Think about the entire pipeline.

---

## Reproducible Research Don'ts

- Don't do things by hand!
    - Editing spreadsheets to clean them up
    - Removing outliers
    - QA/QC
    - Validating
    - Editing tables or figures
    - Downloading data from a website by clicking links in a web browser
    - Splitting data and moving it around
    - If anything is done by hand because there is no other way, document it!
- Avoid point-and-click software or other interactive software
    - This type of work is not easily reproduced because there is no trace of the steps. If you have to use it, write down the steps!
- Don't just save output. Save the data and code that generated the output, rather than the output itself.

---

## Reproducible Research Challenges

- Data size
    - Build tools into your code that help with this, for example parallel processing
    - Store data in smaller chunks and write code that pulls the data files automatically, combining them when needed for analysis
    - Write metadata and use tools that help with data organization
- Data complexity
    - Try to incorporate smaller snippets of data in your workflow to check reproducibility
    - Training, validation sets
    - Diagnostic visualizations
- Workflow complexities
    - Use README files!
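---

## Reproducible Research Challenges (cont'd)

The "smaller chunks" idea can be sketched in a few lines of shell. This is only an illustration under stated assumptions: the `chunks/part-*.csv` layout, the file names, and the toy values are all hypothetical, and every chunk is assumed to share the same header row.

```sh
set -e

# Hypothetical scratch directory holding two toy chunks (made-up values)
workdir="$(mktemp -d)"
mkdir -p "$workdir/chunks"
printf 'id,value\n1,10\n2,20\n' > "$workdir/chunks/part-01.csv"
printf 'id,value\n3,30\n' > "$workdir/chunks/part-02.csv"

combined="$workdir/combined.csv"

# Take the header row once, from the first chunk in sorted order
first_chunk="$(ls "$workdir"/chunks/part-*.csv | head -n 1)"
head -n 1 "$first_chunk" > "$combined"

# Append the data rows of every chunk, skipping each chunk's header
for chunk in "$workdir"/chunks/part-*.csv; do
  tail -n +2 "$chunk" >> "$combined"
done

# Report how many lines the combined file has (header included)
wc -l < "$combined"
```

Because the combining step is a script rather than a manual copy-paste, it can be kept under version control and re-run whenever a chunk changes.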
---

## Git: The stupid content tracker

<div style="text-align: center;">
<figure>
<a href="https://commons.wikimedia.org/wiki/File:Git-logo.svg" target="_blank"><img style="width: 200px;vertical-align: middle;" src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Git-logo.svg/500px-Git-logo.svg.png" hspace="20px" alt="Git logo"></a>
<a href="https://en.wikipedia.org/wiki/Linus_Torvalds" target="_blank"><img style="width: 200px;vertical-align: middle;" hspace="20px" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/LinuxCon_Europe_Linus_Torvalds_03_%28cropped%29.jpg/345px-LinuxCon_Europe_Linus_Torvalds_03_%28cropped%29.jpg" alt="Linus Torvalds"></a>
</figure>
<figcaption><b>Git logo and Linus Torvalds, creator of git</b></figcaption>
</div>

- During this class (and perhaps, the entire program) we will be using [Git](https://git-scm.com).
- Git is used by [most developers](https://insights.stackoverflow.com/survey/2018#work-_-version-control) who answered the 2018 Stack Overflow survey.
- A great reference about the tool can be found [here](https://git-scm.com/book).
- More on what's stupid about git [here](https://en.wikipedia.org/wiki/Git#Naming).

---

## How can I use Git

There are several ways to include Git in your work pipeline. A few are:

- Through the command line
- Through one of the available Git GUIs:
    - RStudio [(link)](https://jennybc.github.io/2014-05-12-ubc/ubc-r/session03_git.html)
    - Git-Cola [(link)](https://git-cola.github.io/)
    - GitHub Desktop [(link)](https://desktop.github.com/)

More alternatives [here](https://git-scm.com/download/gui).

---

## A common workflow

<div style="text-align: center;">
<figure>
<img style="width: 600px;vertical-align: middle;" hspace="20px" src="fig/git.svg" alt="Git workflow">
</figure>
<figcaption><b>A common git workflow</b></figcaption>
</div>

---

## A common workflow

1. Start the session by pulling (possible) updates: `git pull`

2. Make changes

    a. (optional) Add untracked (possibly new) files: `git add [target file]`

    b.
(optional) Stage tracked files that were modified: `git add [target file]`

    c. (optional) Revert changes on a file: `git checkout [target file]`

3. Move changes to the staging area (optional): `git add [target file]`

4. Commit:

    a. If all changes are already staged: `git commit -m "Your comments go here."`

    b. If modifications to tracked files are not staged: `git commit -a -m "Your comments go here."`

5. Upload the commit to the remote repo: `git push`.

---

# Part II: Hands-on local git repo

---

## Hands-on 0: Introduce yourself

Set up your git install with `git config`. Start by telling git who you are:

```sh
$ git config --global user.name "Meredith Franklin"
$ git config --global user.email "mfranklin@email.com"
```

If you have already set up git previously, you can check your settings:

```sh
$ git config --list
```

(to get out of the list in terminal, press q)

Try it yourself (5 minutes) (more on how to configure git <a href="https://git-scm.com/book/en/v2/Customizing-Git-Git-Configuration" target="_blank">here</a>)

---

## Hands-on 1: Remote repository

We will start by working on our very first project. To do so, you are required to start using Git and GitHub so you can share your code with your team. For this exercise, you need to:

a. Create a new repository on GitHub (you can try `JSC370`). Make sure to include a README.md (checkbox).

b. Go to the local directory where you want to store the files for this repo.

c. Clone the repository (in GitHub copy the repo link): `git clone https://github.com/...`.

d. Back in the terminal, edit the README.md. You can use nano in the terminal or open it in another app such as RStudio or SublimeText.

e. Add the edited README.md file to the tree using the `git add` command, and check the status.

f. Make the first commit using the `git commit` command, adding a message, e.g.

```sh
$ git commit -m "My first commit ever!"
```

And use `git log` to see the history.

Note: We are assuming that you have already [installed git on your system](https://git-scm.com).
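---

## Hands-on 1: Remote repository

Steps (b) to (f) can also be rehearsed offline. The sketch below is a hedged stand-in, not the GitHub flow itself: it assumes a local bare repository (`remote.git`, a hypothetical name) plays the role of the GitHub repo, and unlike step (a) that "remote" starts out empty, so the README.md is created locally. The name and email are placeholders.

```sh
set -e

# Hypothetical scratch area so nothing touches your real projects
workdir="$(mktemp -d)"

# A bare repository standing in for the GitHub repo of step (a)
git init --quiet --bare "$workdir/remote.git"

# (b)-(c) Go to a local directory and clone the stand-in remote
cd "$workdir"
git clone --quiet "$workdir/remote.git" JSC370
cd JSC370

# Identity for this repository only (placeholder values)
git config user.name "Example Student"
git config user.email "student@example.com"

# (d) Create and edit the README.md
echo "# JSC370" > README.md

# (e) Add the file to the tree and check the status
git add README.md
git status --short

# (f) Commit and look at the history
git commit --quiet -m "My first commit ever!"
git log --oneline

# Upload the commit; HEAD names the current branch explicitly
git push --quiet origin HEAD
```

With a real GitHub repository, only the `git clone` URL changes; the add, commit, and push steps stay the same.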
---

## Hands-on 1: Remote repository

The following code is fully executable (copy-pastable). It sets up the repository locally with `git init` rather than cloning it from GitHub.

```sh
# (a) Creating the folder for the project (and getting in there)
mkdir ~/JSC370
cd ~/JSC370

# (b) Initializing git
git init

# (c) Creating the README file
echo An empty line > README.md

# (d) Adding the file to the tree and checking the status
git add README.md
git status

# (e) Committing and checking out the history
git commit -m "My first commit ever!"
git log
```

---

## Hands-on 1: Remote repository

If you add the wrong file to the tree, you can remove it using `git rm --cached`. For example, imagine that you added the file `class-notes.docx` (which you are not supposed to track); you can remove it using

```sh
$ git rm --cached class-notes.docx
```

This will remove the file from the tree **but not from your computer**. You can go further and ask git to avoid adding docx files using the [.gitignore file](https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository#_ignoring).

---

## Hands-on 1: Remote repository

<div style="text-align: center;">
<figure>
<img style="width: 1000px;vertical-align: middle;" hspace="20px" src="fig/git1a.png" alt="New GitHub repo">
</figure>
<!-- <figcaption><b>A common git workflow</b></figcaption> -->
</div>

---

## Hands-on 1: Remote repository

<div style="text-align: center;">
<figure>
<img style="width: 1000px;vertical-align: middle;" hspace="20px" src="fig/git1.png" alt="New GitHub repo">
</figure>
<!-- <figcaption><b>A common git workflow</b></figcaption> -->
</div>

---

## Example for .gitignore

Example extracted directly from Pro-Git [(link)](https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository#_ignoring).
<pre style="font-size: 12pt;">
# ignore all .a files
*.a

# but do track lib.a, even though you're ignoring .a files above
!lib.a

# only ignore the TODO file in the current directory, not subdir/TODO
/TODO

# ignore all files in any directory named build
build/

# ignore doc/notes.txt, but not doc/server/arch.txt
doc/*.txt

# ignore all .pdf files in the doc/ directory and any of its subdirectories
doc/**/*.pdf
</pre>

---

# Resources

- Git's everyday commands: type `man giteveryday` in your terminal/command line. See also the very nice [cheatsheet](https://github.github.com/training-kit/).
- My personal choice for nightstand book: The Pro-git book (free online) [(link)](https://git-scm.com/book)
- GitHub's website of resources [(link)](https://try.github.io/)
- The "Happy Git with R" book [(link)](https://happygitwithr.com/)
- Roger Peng's Mastering Software Development Book, Section 3.9: Version control and GitHub [(link)](https://bookdown.org/rdpeng/RProgDA/version-control-and-github.html)
- Git exercises by Wojciech Frącz and Jacek Dajda [(link)](https://gitexercises.fracz.com/)
- Check out GitHub's Training YouTube Channel [(link)](https://www.youtube.com/user/GitHubGuides)