At the nexus of learning and innovation

Why use version control in your research project? And how, with git?

If you are not ready for this yet, you will be at some point: when you have felt the horror of losing days, if not weeks, of work. But versioning is useful for less dramatic purposes as well, not only for backup. For instance, it helps you understand why you made certain changes in the past, or why you ran an analysis the way you did. Or you simply want to undo a change in your analysis, keep an audit trail of it, or share it with others.

Versioning is different from backing-up

Versioning is not the same as backup. Of course, you are backing up your files, regularly (which means: daily at least). On a Mac, you probably use Time Machine, on Windows a similar product. And of course, the backup is made to a drive that is different from your hard drive. If possible, you hold all your critical files on Dropbox, Box, or other cloud services so that you have a copy of all your critical files, up-to-date, even if your house gets flooded or burns down.

Versioning is different from backup and synchronization (e.g., to Dropbox) in that it is based on the meaning of changes to files, rather than on regularity (on save, hourly, daily). You create a new version of your project whenever you have reached a goal or solved a problem (fixed something). For instance, when you have run an analysis of some data, save the (SPSS, R...) code file as a new version to the repository, and with the save add a comment describing what this analysis does.

             Version = meaningful state + description. 

This has two purposes:

  1. You yourself understand what the saved file does, even weeks or months later;
  2. Others can more easily understand what you did. This is helpful should you want to share your analysis (not only your data), and/or make it auditable or, more generally, reproducible, which is good scientific practice.

(1) has the additional benefit that writing the description forces you to be clear, for yourself, about what you just did.

When you restore from a backup, you are going back in time: I want to restore the last version, or to yesterday's (last week's) version. When you restore from a versioning system, you go back to a decision: I want to restore my former version, or an alternative version (a different branch),  because that one turns out to be the better solution or approach. Backup is a special case of versioning. 

There are numerous tools for versioning. The best known one is git, often used together with GitHub, a web-based hosting service for git repositories. (The two are not alternatives: you can use GitHub as the remote repository for your local git repository.)

Versioning of data and analysis files with git and GitHub

This is where versioning systems work best: with files that have directly readable content, such as plain text, markup (HTML, XML, although readability can deteriorate quickly in markup), markdown text, and data in tab-delimited or comma-delimited format. Examples are:

  • SPSS analysis files 
  • R scripts and R Notebooks
  • Any computer program, from Java to NetLogo 
  • A video/audio transcription and/or coding file, such as ELAN's EAF files (XML) 
  • Memos and such written in plain text (including markdown)

The reason that these formats work best for versioning systems is that versioning works with diffs: differences between the lines of two versions of a file. The more meaning each line carries, the more useful the diffs are.
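To make the idea of a diff concrete, here is a minimal sketch using the standard `diff` tool on two versions of a small R script. The file names and the script content are made up for this illustration; git shows changes between versions in a very similar format.

```shell
# Two versions of a small, made-up R script (hypothetical names and content).
workdir=$(mktemp -d)
cd "$workdir"

printf '%s\n' \
  'data <- read.csv("scores.csv")' \
  'model <- lm(score ~ age, data = data)' \
  'summary(model)' > analysis_v1.R

printf '%s\n' \
  'data <- read.csv("scores.csv")' \
  'model <- lm(score ~ age + gender, data = data)' \
  'summary(model)' > analysis_v2.R

# diff exits with status 1 when the files differ, so "|| true" keeps
# the script going even if it runs under "set -e".
changes=$(diff -u analysis_v1.R analysis_v2.R || true)
printf '%s\n' "$changes"
```

In the output, the line starting with `-` is the old version of the model line and the line starting with `+` is the new one. Because each line of the script is one meaningful step of the analysis, the diff reads almost like a description of what changed.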

Versioning with git

We introduce git here because it is widely used, very fast, and free. There are also lots of introductions available, such as this video tutorial. I shall confine this introduction to the absolute basics and only explain the main commands as entered on the command line. There are numerous git desktop tools available, but it helps to understand git at the command line before you decide whether a desktop application suits your needs better.

We follow here the example from this simple guide:

Installing git: on Windows, on OSX, on Linux. You should immediately set a username and email address, because git records them with every commit and because you will want to copy repositories to remote servers (your computer will crash horribly at the most inconvenient time).

>git config --global user.name "<your name>" 
>git config --global user.email "<your email>"

What you minimally need to know about git is that it works on the level of sets of files and folders. It is best to think of all your project files as sitting in a tree (a folder with optional subfolders), and you let git version the whole folder/tree, or a set of files and folders located in that tree.

You also need to know that git distinguishes three areas a file can be in at any point in time:

[Working directory] ---add---> [Index (Stage)] ---commit--->[HEAD] 

HEAD points to the last commit you've made. (All other versions are stored there as well, backward in time from the HEAD position.) The Index/Stage is an intermediate area that allows fine-grained control over what gets committed, i.e., what goes into the next version. This is useful in particular in collaborative settings: nothing gets shared with others that is not committed. For instance, you may not want your (shared) commit to include files with information about your local installation (they would be of no use to others), or files that contain passwords. If you use git only as a personal versioning system, you don't have to think about this: for files that are already tracked, git allows you to go directly from changed to committed, skipping the staging step (with 'git commit -a').
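The three areas can be tried out safely in a throwaway repository. The following is a minimal sketch, not a recommended workflow: the file names, their content, and the "Demo User" identity are assumptions made up for the illustration, and the identity is set for this one repository only.

```shell
# Throwaway repository in a temporary folder; all names are made up.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Demo User"           # applies to this repository only
git config user.email "demo@example.com"

echo 'summary(model)' > analysis.R
echo 'password = secret' > local-settings.txt

# Working directory -> Index: stage only the file that belongs in the version.
git add analysis.R
git status --short    # "A  analysis.R" is staged, "?? local-settings.txt" is untracked

# Index -> HEAD: the commit contains only what was staged.
git commit --quiet -m "Add summary of regression model"

# Shortcut for files git already tracks: -a stages and commits them in one step.
echo 'plot(model)' >> analysis.R
git commit --quiet -a -m "Add diagnostic plots"

committed_files=$(git ls-files)    # files that are part of the versions
remaining=$(git status --short)    # what is still outside of version control
```

After both commits, only analysis.R is part of the versioned project. The file with the password never left the working directory, because it was never added to the Index.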

Let's say you have all your project files in a folder 'my-project'. To put this folder (and its subfolders) under versioning, you need to go through three steps, all entered on the command line from within that folder:

1. Create a repository:

>git init

This will place a folder .git into my-project, where all the versions will be saved. (The leading dot marks the folder as hidden on most systems. To see such files and folders, you may have to enable "view hidden files" or a similar option in your file browser.)

2. Adding one or more files (or sub-directories) to the stage:

>git add <file name> 
>git add *

'*' adds all files (and sub-directories) from the project directory to the Index. (In most Unix shells, '*' skips files whose names start with a dot; 'git add .' includes those as well.) All files which are added to the Index are tracked. Files which are in the working directory, but not added to the Index, are untracked. (Use 'git status' at any time to see what's tracked, untracked, modified, and staged.)

3. Commit the changes:

>git commit -m "<the commit message>"

It is very important that you write informative commit messages because they help you, and potentially others, to understand the nature of the changes you are committing. 

Git will return a commit ID, which you can use as a point of reference to this newly created version. You can look at the history of commits with 'git log' (and see how useful those commit messages are). 
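Put together, the three steps can be sketched as follows. This is a self-contained illustration under a few assumptions: the 'my-project' folder is created in a temporary location, analysis.R and its content are hypothetical, and the "Demo User" identity is set only for this repository.

```shell
# A throwaway stand-in for the 'my-project' folder.
project="$(mktemp -d)/my-project"
mkdir -p "$project"
cd "$project"

echo 'data <- read.csv("scores.csv")' > analysis.R   # hypothetical analysis file

git init --quiet                                 # 1. create the repository (.git folder)
git config user.name "Demo User"                 # identity for this repository only
git config user.email "demo@example.com"

git add analysis.R                               # 2. add the file to the stage
git commit --quiet -m "Load raw scores data"     # 3. commit with a message

# A second round of changes becomes a second version.
echo 'summary(data)' >> analysis.R
git add analysis.R
git commit --quiet -m "Add descriptive summary of scores"

# One line per version: abbreviated commit ID followed by the commit message.
history=$(git log --oneline)
printf '%s\n' "$history"
```

The log lists the newest version first. The commit IDs will differ on every machine, but the messages are exactly what you typed, which is why it pays to make them informative.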

If your project folder is on a back-up regimen or synced with a cloud service, and if all you need is versioning for your own use, you don't need to bother with remote servers. But as soon as you want to create a clone of your project (including all versions) for safety reasons, and/or share your work with others, you need to know how to move changes to a remote server. This will often be GitHub, but does not have to be. Any server with git server installed will do. The basic commands are:

4. Connect your local repository to a remote server:

>git remote add origin <server> 
>git remote add origin https://github.com/<username>/<project-name> (for GitHub)

(You will need to have an account on the remote git server before this can work. See the blog on GitHub (forthcoming).)

5. Push your changes to the remote server:

>git push origin master

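This copies the commits on your local master branch, up to HEAD, to the master branch on the server (don't worry about what a branch is right now). On newer setups the default branch may be called main instead of master; 'git status' tells you the name of the branch you are on.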
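Steps 4 and 5 can be rehearsed without any account or network connection. In the sketch below, a "bare" repository in a temporary folder plays the role of the remote server; this is an assumption made purely for the illustration, as are the file name and the "Demo User" identity. With GitHub, you would put the repository URL in place of the local path.

```shell
# A bare repository in a temp folder stands in for the remote server.
remote="$(mktemp -d)/my-project.git"
git init --quiet --bare "$remote"

# The local project, again in a throwaway folder.
project="$(mktemp -d)/my-project"
mkdir -p "$project"
cd "$project"
git init --quiet
git symbolic-ref HEAD refs/heads/master    # make sure the branch is called master
git config user.name "Demo User"           # identity for this repository only
git config user.email "demo@example.com"

echo 'summary(model)' > analysis.R
git add analysis.R
git commit --quiet -m "Add regression summary"

git remote add origin "$remote"      # 4. connect the local repository to the remote
git push --quiet origin master       # 5. push the committed versions

# Both sides should now point at the same commit.
local_id=$(git rev-parse HEAD)
remote_id=$(git --git-dir="$remote" rev-parse master)
printf 'local:  %s\nremote: %s\n' "$local_id" "$remote_id"
```

If the two IDs match, the remote holds a complete copy of your history up to the latest commit.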

Should you want to restore the working directory content to the most recent saved version, presumably because the version in the working directory is not working, use checkout:

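Reverting a file to the last committed version, or switching the whole working directory to a specific commit:

>git checkout -- <filename>
>git checkout <commit>

(After checking out an older commit, 'git checkout master' brings you back to the latest version.)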
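Here is a minimal sketch of both uses of checkout in a throwaway repository. File names, content, and the "Demo User" identity are made up for the illustration. Note that the commit ID is given without a preceding `--`, because git treats everything after `--` as a file name.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Demo User"           # identity for this repository only
git config user.email "demo@example.com"
branch=$(git symbolic-ref --short HEAD)    # name of the current branch (master or main)

echo 'model <- lm(score ~ age, data = data)' > analysis.R
git add analysis.R
git commit --quiet -m "Fit model with age as predictor"
first_commit=$(git rev-parse HEAD)

echo 'model <- lm(score ~ age + gender, data = data)' > analysis.R
git commit --quiet -a -m "Add gender as predictor"

# An uncommitted experiment that turns out not to work ...
echo 'model <- broken(' > analysis.R
# ... is discarded by restoring the file from the last commit.
git checkout -- analysis.R
restored_latest=$(cat analysis.R)

# Go back to the decision made in the first commit (whole working directory).
git checkout --quiet "$first_commit"
at_first=$(cat analysis.R)

# Return to the most recent version on the branch.
git checkout --quiet "$branch"
back_on_branch=$(cat analysis.R)
```

Checking out a commit ID directly puts git into what it calls a "detached HEAD" state, which is fine for looking around; checking out the branch again, as in the last step, gets you back to where you were.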

Of course, you can also check out files and commits from a remote server, but we will cover this when we look at collaboration workflows in the context of GitHub. Speaking of future topics:

Looking forward: Versioning other types of files

What if you want to put other types of documents under version control? That's really not a problem at all for anything written in markdown.

Future blogs will say more on how to version 

  • Rich text documents, such as Word files (and why to use non-linear writing tools, such as Scrivener), and 
  • Media files. (Quick tip: The cheap approach here is to keep such files in a backed-up folder along with an index document that you use for describing the status of each version. Other than during video/audio editing, these won't change often, unlike the analysis files.)

We will also encounter some typical workflows

  • Video coding with ELAN
  • SPSS or R data analysis
  • Qualitative analysis with NVivo

Stay tuned! 

 

Comments

Submitted by Ling Wu on Wed, 07/05/2017, 15:56

Thanks for this very clear and nice introduction to versioning. For your future articles, I am very interested in non-linear writing, would you be briefly talking about the differences between linear and non-linear writing too? And what types/genre of research would non-linear writing suitable for (if there is such thing?)  Thank you!

Submitted by Sanri LR on Thu, 07/20/2017, 17:53

I think this post might have been "inspired" by my misfortune. I religiously back everything up (a few times over) but because I didn't use versioning I still "lost" some of my data analysis or rather it became useless due to me going off on a tangent that I couldn't undo. Playing around with versioning made me realise it really is like time travelling back to previous - maybe better - decisions. Like hitting a reset button and being able to start over from certain points!