Version Control with Git

Overview
Questions:
  • What is version control and why should I use it?

  • How do I get set up to use Git?

  • Where does Git store information?

  • How do I record changes in Git?

  • How do I check the status of my version control repository?

  • How do I record notes about what changes I made and why?

  • How can I identify old versions of files?

  • How do I review my changes?

  • How can I recover old versions of files?

Objectives:
  • Understand the benefits of an automated version control system.

  • Understand the basics of how automated version control systems work.

  • Configure git the first time it is used on a computer.

  • Understand the meaning of the --global configuration flag.

  • Create a local Git repository.

  • Describe the purpose of the .git directory.

  • Go through the modify-add-commit cycle for one or more files.

  • Explain where information is stored at each stage of that cycle.

  • Distinguish between descriptive and non-descriptive commit messages.

  • Explain what the HEAD of a repository is and how to use it.

  • Identify and use Git commit numbers.

  • Compare various versions of tracked files.

  • Restore old versions of files.

Time estimation: 65 minutes
Last modification: Oct 18, 2022
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.

Version control is a way of tracking the change history of a project, and git is one of the most popular systems for doing that! This tutorial will guide you through the basics of using git for version control.

Agenda

In this tutorial, you will learn how to create a git repo, and begin working with it.

  1. Basics
  2. Setting up git
  3. Create a Repo
    1. Places to Create Git Repositories
    2. Correcting git init Mistakes
  4. Tracking Changes
  5. Let’s put us to the test
  6. History Exploring

Basics

We’ll start by exploring how version control can be used to keep track of what one person did and when. Even if you aren’t collaborating with other people, automated version control is much better than this situation:

Cartoon titled 'final'.doc, showing a grad student and their advisor going through multiple revisions. The first named final.doc, then final_rev.2.doc, final_rev.6.comments.doc, a long filename with the revision number 18, until a final filename, revision 22, with special characters indicating frustration where the file name includes the text 'why did I come to grad school'.
Figure 1: Piled Higher and Deeper by Jorge Cham, http://www.phdcomics.com

We’ve all been in this situation before: it seems unnecessary to have multiple nearly-identical versions of the same document. Some word processors let us deal with this a little better, such as Microsoft Word’s Track Changes, Google Docs’ version history, or LibreOffice’s Recording and Displaying Changes.

Version control systems start with a base version of the document and then record changes you make each step of the way. You can think of it as a recording of your progress: you can rewind to start at the base document and play back each change you made, eventually arriving at your more recent version.

Changes Are Saved Sequentially, graphic shows three documents with text being added in each new revision.

Once you think of changes as separate from the document itself, you can then think about “playing back” different sets of changes on the base document, ultimately resulting in different versions of that document. For example, two users can make independent sets of changes on the same document.

Different Versions Can be Saved, showing a document splitting into two, with different changes.

Unless multiple users make changes to the same section of the document - a conflict - you can incorporate two sets of changes into the same base document.

Multiple Versions Can be Merged, shows two documents with different changes merging into a final document with both changes.

A version control system is a tool that keeps track of these changes for us, effectively creating different versions of our files. It allows us to decide which changes will be made to the next version (each record of these changes is called a commit) and keeps useful metadata about them. The complete history of commits for a particular project and their metadata make up a repository. Repositories can be kept in sync across different computers, facilitating collaboration among different people.

Automated version control systems are nothing new. Tools like RCS, CVS, or Subversion have been around since the early 1980s and are used by many large companies. However, many of these are now considered legacy systems (i.e., outdated) due to various limitations in their capabilities. More modern systems, such as Git and Mercurial, are distributed, meaning that they do not need a centralized server to host the repository. These modern systems also include powerful merging tools that make it possible for multiple authors to work on the same files concurrently.

Question: Paper Writing
  • Imagine you drafted an excellent paragraph for a paper you are writing, but later ruin it. How would you retrieve the excellent version of your paragraph? Is it even possible?

  • Imagine you have 5 co-authors. How would you manage the changes and comments they make to your paper? If you use LibreOffice Writer or Microsoft Word, what happens if you accept changes made using the Track Changes option? Do you have a history of those changes?

  • Recovering the excellent version is only possible if you created a copy of the old version of the paper. The danger of losing good versions often leads to the problematic workflow illustrated in the PhD Comics cartoon at the top of this page.

  • Collaborative writing with traditional word processors is cumbersome. Either every collaborator has to work on a document sequentially (slowing down the process of writing), or you have to send out a version to all collaborators and manually merge their comments into your document. The ‘track changes’ or ‘record changes’ option can highlight changes for you and simplifies merging, but as soon as you accept changes you will lose their history. You will then no longer know who suggested that change, why it was suggested, or when it was merged into the rest of the document. Even online word processors like Google Docs or Microsoft Office Online do not fully resolve these problems.

Before diving into the tutorial, we need to open RStudio (tool: interactive_tool_rstudio). If you do not know how, or have never used RStudio, please follow the dedicated tutorial.

Hands-on: Launch RStudio

Depending on which server you are using, you may be able to run RStudio directly in Galaxy. If that is not available, RStudio Cloud can be an alternative.

Launch RStudio in Galaxy

Currently RStudio in Galaxy is only available on UseGalaxy.eu and UseGalaxy.org

  1. Open the RStudio tool to launch RStudio
  2. Click Execute
  3. The tool will start running and will stay running permanently
  4. Click on the “User” menu at the top and go to “Active InteractiveTools” and locate the RStudio instance you started.

Launch RStudio Cloud if not available on Galaxy

If RStudio is not available on the Galaxy instance:

  1. Register for RStudio Cloud, or login if you already have an account
  2. Create a new project

Hands-on: Installing git

The R Console and other interactive tools like RStudio are great for prototyping code and exploring data, but sooner or later we will want to use our program in a pipeline or run it in a shell script to process thousands of data files. This is one of those cases and, in order to do that, we will use the terminal provided by RStudio itself. We go to “Tools”, pick the “Shell…” option, and we are good to go. Our workspace is the terminal window that just opened on the left.

Fortunately, miniconda is already installed. Miniconda is a minimal installer for conda, a package manager that simplifies installation. We can and will use it to install every essential package for our tutorial. However, it is critically important that we create a new environment, separate from the existing base environment, and install our packages there.

Input: Environment and Packages
$ conda create -n name_of_your_env nano git
$ conda activate name_of_your_env
Software | Version | Manual | Available for | Description
git | 2.35.3 | git Manual | Linux, MacOS | Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.
GNU nano | 2.9.8 | Nano manual | Linux, MacOS | GNU nano is a small and friendly text editor.

Setting up git

When we use Git on a new computer for the first time, we need to configure a few things. Below are a few examples of configurations we will set as we get started with Git:

  • our name and email address,
  • what our preferred text editor is,
  • and that we want to use these settings globally (i.e. for every project).

On a command line, Git commands are written as git verb options, where verb is what we actually want to do and options is additional optional information which may be needed for the verb. So here is how Sherlock sets up his new laptop:

Input: Setting up with bash
$ git config --global user.name "Sherlock Holmes"
$ git config --global user.email "sherlock@baker.street"

Please use your own name and email address instead of Sherlock’s. This user name and email will be associated with your subsequent Git activity, which means that any changes pushed to GitHub, BitBucket, GitLab or another Git host server after this lesson will include this information.

For this lesson, we will be interacting with GitHub and so the email address used should be the same as the one used when setting up your GitHub account. If you are concerned about privacy, please review GitHub’s instructions for keeping your email address private.

If you choose to use a private email address with GitHub, then use that same email address for the user.email value, e.g. username@users.noreply.github.com replacing username with your GitHub one.

As with other keys, when you hit Enter (or Return on Macs) on your keyboard, your computer encodes this input as a character. Different operating systems use different character(s) to represent the end of a line. (You may also hear these referred to as newlines or line breaks.) Because Git uses these characters to compare files, it may cause unexpected issues when editing a file on different machines. Though it is beyond the scope of this lesson, you can read more about this issue in the Pro Git book.

You can change the way Git recognizes and encodes line endings using the core.autocrlf setting of git config. The following settings are recommended:

Input: Settings with bash

On macOS and Linux:

$ git config --global core.autocrlf input

And on Windows:

$ git config --global core.autocrlf true
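
To make the idea of a line ending more concrete, here is a minimal sketch that writes the same line twice, once ending in LF and once ending in CRLF, and prints the bytes with od. The file names lf.txt and crlf.txt and the scratch directory are made up for this illustration and are not part of the lesson.

```shell
# Work in a throwaway directory so nothing of yours is touched.
workdir="$(mktemp -d)"
cd "$workdir"

# Same text, two different line endings.
printf 'No alibi\n'   > lf.txt     # LF, typical on macOS and Linux
printf 'No alibi\r\n' > crlf.txt   # CRLF, typical on Windows

# od -c prints each byte; compare the end of both dumps.
od -c lf.txt
od -c crlf.txt

# The CRLF file is one byte longer because of the extra \r.
lf_size="$(wc -c < lf.txt | tr -d ' ')"
crlf_size="$(wc -c < crlf.txt | tr -d ' ')"
echo "lf.txt: $lf_size bytes, crlf.txt: $crlf_size bytes"
```

That single extra byte per line is why Git can report a file as changed when only the line endings differ.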

Sherlock also has to set his favorite text editor, following this table:

Editor | Configuration command
Atom | $ git config --global core.editor "atom --wait"
nano | $ git config --global core.editor "nano -w"
BBEdit (Mac, with command line tools) | $ git config --global core.editor "bbedit -w"
Sublime Text (Mac) | $ git config --global core.editor "/Applications/Sublime\ Text.app/Contents/SharedSupport/bin/subl -n -w"
Sublime Text (Win, 32-bit install) | $ git config --global core.editor "'c:/program files (x86)/sublime text 3/sublime_text.exe' -w"
Sublime Text (Win, 64-bit install) | $ git config --global core.editor "'c:/program files/sublime text 3/sublime_text.exe' -w"
Notepad (Win) | $ git config --global core.editor "c:/Windows/System32/notepad.exe"
Notepad++ (Win, 32-bit install) | $ git config --global core.editor "'c:/program files (x86)/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin"
Notepad++ (Win, 64-bit install) | $ git config --global core.editor "'c:/program files/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin"
Kate (Linux) | $ git config --global core.editor "kate"
Gedit (Linux) | $ git config --global core.editor "gedit --wait --new-window"
Scratch (Linux) | $ git config --global core.editor "scratch-text-editor"
Emacs | $ git config --global core.editor "emacs"
Vim | $ git config --global core.editor "vim"
VS Code | $ git config --global core.editor "code --wait"

It is possible to reconfigure the text editor for Git whenever you want to change it.

Note that Vim is the default editor for many programs. If you haven’t used Vim before and wish to exit a session without saving your changes, press Esc then type :q! and hit Enter (or Return on Macs). If you want to save your changes and quit, press Esc then type :wq and hit Enter (or Return on Macs).

Git (2.28+) allows configuration of the name of the branch created when you initialize any new repository. Sherlock decides to use that feature to set it to main so it matches the cloud service he will eventually use.

Input: Configure the name of the created branch
$ git config --global init.defaultBranch main

Source file changes are associated with a “branch.” For new learners in this lesson, it’s enough to know that branches exist, and this lesson uses one branch. By default, Git will create a branch called master when you create a new repository with git init (as explained in the next section). This term evokes the racist practice of human slavery and the software development community has moved to adopt more inclusive language.

In 2020, most Git code hosting services transitioned to using main as the default branch. As an example, any new repository that is opened in GitHub and GitLab defaults to main. However, Git has not yet made the same change. As a result, local repositories must be manually configured to have the same main branch name as most cloud services.

For versions of Git prior to 2.28, the change can be made on an individual repository level. The command for this is in the next section. Note that if this value is unset in your local Git configuration, the init.defaultBranch value defaults to master.
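
If you are not sure which case applies to you, a small check like the one below may help. It is only a sketch: it assumes git --version prints the version number as its third word (for example "git version 2.35.3"), and the variable names are chosen for this illustration.

```shell
# The third field of "git --version" is the version number.
version="$(git --version | awk '{print $3}')"
major="$(echo "$version" | cut -d. -f1)"
minor="$(echo "$version" | cut -d. -f2)"

# init.defaultBranch was introduced in Git 2.28.
if [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 28 ]; }; then
    supports_default_branch=yes
    echo "Git $version: init.defaultBranch is available"
else
    supports_default_branch=no
    echo "Git $version: rename the branch in each repository instead"
fi
```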

The five commands we just ran above only need to be run once: the flag --global tells Git to use the settings for every project, in your user account, on this computer.

You can check your settings at any time:

Input: Checking your settings
$ git config --list

You can change your configuration as many times as you want: use the same commands to choose another editor or update your email address.
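
To illustrate what --global means in practice, the sketch below sets a name globally and then overrides it inside one repository. It assumes a scratch HOME directory so that your real configuration is left alone, and the name "Dr. Watson" is used purely as an example value.

```shell
# Assumption: point HOME at a scratch directory so this demo
# never touches your real ~/.gitconfig.
export HOME="$(mktemp -d)"
unset XDG_CONFIG_HOME

# A global setting applies to every repository of this user account.
git config --global user.name "Sherlock Holmes"

# A setting made inside a repository without --global is local
# to that repository and takes precedence there.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Dr. Watson"

local_name="$(git config user.name)"             # value in effect inside this repository
global_name="$(git config --global user.name)"   # value stored for the user account
echo "inside repo: $local_name"
echo "global:      $global_name"
```

In everyday use you would normally set your name once with --global and only override it locally when a project calls for a different identity.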

In some networks you need to use a proxy. If this is the case, you may also need to tell Git about the proxy:

Input: Git and proxy
$ git config --global http.proxy proxy-url
$ git config --global https.proxy proxy-url
Input: To disable the proxy, use
$ git config --global --unset http.proxy
$ git config --global --unset https.proxy

Always remember that if you forget the subcommands or options of a git command, you can access the relevant list of options by typing git <command> -h or access the corresponding Git manual by typing git <command> --help, e.g.:

Input: Access help
$ git config -h
$ git config --help

While viewing the manual, remember the : is a prompt waiting for commands and you can press Q to exit the manual.

More generally, you can get the list of available git commands and further resources of the Git manual by typing:

Input: Access available commands
$ git help

Create a Repo

Once Git is configured, we can start using it.

We will continue with the story of Sherlock who is investigating a crime and is collecting information about suspects.

Cartoon of sherlock examining the git logo with his magnifying glass.

First, let’s create a directory for our work and then move into that directory:

Input: Create a workspace
$ mkdir suspects
$ cd suspects

Then we tell Git to make suspects a repository – a place where Git can store versions of our files:

Input: Turn our workspace into a repository
$ git init

It is important to note that git init will create a repository that includes subdirectories and their files—there is no need to create separate repositories nested within the suspects repository, whether subdirectories are present from the beginning or added later. Also, note that the creation of the suspects directory and its initialization as a repository are completely separate processes.

If we use ls to show the directory’s contents, it appears that nothing has changed:

Input: Show directory content
$ ls

But if we add the -a flag to show everything, we can see that Git has created a hidden directory within suspects called .git:

Input: Show everything in our directory
$ ls -a
Output
.	..	.git

Git uses this special subdirectory to store all the information about the project, including all files and sub-directories located within the project’s directory. If we ever delete the .git subdirectory, we will lose the project’s history.
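
If you are curious about what is inside .git, the following sketch creates a fresh repository in a scratch directory and looks for a few of the entries Git puts there. The scratch directory is an assumption of this example; the one-line descriptions are simplified.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q

# The repository data lives entirely in the hidden .git directory.
ls -a
ls .git

# A few entries present in a freshly initialised repository:
[ -f .git/HEAD ]    && echo "HEAD: records the current branch"
[ -d .git/objects ] && echo "objects: stores file contents and commits"
[ -d .git/refs ]    && echo "refs: stores branch and tag pointers"
```

You should never need to edit these files by hand; Git commands manage them for you.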

Next, we will change the default branch to be called main. This might be the default branch depending on your settings and version of git.

Input: Rename the branch
$ git checkout -b main
Output
Switched to a new branch 'main'

We can check that everything is set up correctly by asking Git to tell us the status of our project:

Input: Check
$ git status
Output
On branch main

No commits yet

nothing to commit (create/copy files and use "git add" to track)

If you are using a different version of git, the exact wording of the output might be slightly different.
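
Because the wording of git status varies between versions, a script-friendly way to confirm the branch name can be useful. This is a minimal sketch in a scratch repository; git symbolic-ref is not used elsewhere in this lesson and is shown here only as one possible way to read the current branch before any commit exists.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git checkout -q -b main

# symbolic-ref reports the branch that HEAD points to,
# even before the first commit has been made.
branch="$(git symbolic-ref --short HEAD)"
echo "current branch: $branch"
```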

Places to Create Git Repositories

Along with tracking information about suspects (the project we have already created), Sherlock would also like to track information about specific clues. So, Sherlock creates a clues project inside his suspects project with the following sequence of commands:

Input: Create a project within a project
$ cd ~/Desktop   # return to Desktop directory
$ cd suspects     # go into suspects directory, which is already a Git repository
$ ls -a          # ensure the .git subdirectory is still present in the suspects directory
$ mkdir clues    # make a subdirectory suspects/clues
$ cd clues       # go into clues subdirectory
$ git init       # make the clues subdirectory a Git repository
$ ls -a          # ensure the .git subdirectory is present indicating we have created a new Git repository
Question: Tracking in a subdirectory

Is the git init command, run inside the clues subdirectory, required for tracking files stored in the clues subdirectory?

No. Sherlock does not need to make the clues subdirectory a Git repository because the suspects repository will track all files, sub-directories, and subdirectory files under the suspects directory. Thus, in order to track all information about clues, Sherlock only needed to add the clues subdirectory to the suspects directory.

Additionally, Git repositories can interfere with each other if they are “nested”: the outer repository will try to version-control the inner repository. Therefore, it’s best to create each new Git repository in a separate directory. To be sure that there is no conflicting repository in the directory, check the output of git status. If it looks like the following, you are good to go to create a new repository as shown above:

Input: Before creating a new repository
$ git status
Output
fatal: Not a git repository (or any of the parent directories): .git
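
The same check can be scripted. The sketch below is one possible approach, assuming scratch directories created with mktemp: git rev-parse --show-toplevel prints the root of the repository that contains the current directory and fails when there is none. GIT_CEILING_DIRECTORIES is set only so that the demonstration does not depend on where your temporary directory lives.

```shell
# Build the situation from the question: a repository with a clues subdirectory.
outer="$(cd "$(mktemp -d)" && pwd -P)"
cd "$outer"
git init -q
mkdir clues
cd clues

# --show-toplevel prints the root of the repository containing this directory.
toplevel="$(git rev-parse --show-toplevel)"
echo "clues is already covered by the repository at: $toplevel"

# In a directory outside any repository the same command fails.
# GIT_CEILING_DIRECTORIES stops Git from searching above the scratch area.
elsewhere="$(cd "$(mktemp -d)" && pwd -P)"
cd "$elsewhere"
if GIT_CEILING_DIRECTORIES="$(dirname "$elsewhere")" \
        git rev-parse --show-toplevel >/dev/null 2>&1; then
    inside_repo=yes
else
    inside_repo=no
fi
echo "inside a repository here? $inside_repo"
```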

Correcting git init Mistakes

Dr. Watson explains to Sherlock how a nested repository is redundant and may cause confusion down the road. Sherlock would like to remove the nested repository. How can Sherlock undo his last git init in the clues subdirectory?

USE rm WITH CAUTION!

Removing files from a Git repository needs to be done with caution. But we have not learned yet how to tell Git to track a particular file; we will learn this in the next section. Files that are not tracked by Git can easily be removed like any other “ordinary” files with

Input: Remove files
$ rm filename

Similarly a directory can be removed using rm -r dirname or rm -rf dirname. If the files or folder being removed in this fashion are tracked by Git, then their removal becomes another change that we will need to track, as we will see later.

Git keeps all of its files in the .git directory. To recover from this little mistake, Sherlock can just remove the .git folder in the clues subdirectory by running the following command from inside the suspects directory:

Input: Remove the `.git` folder
$ rm -rf clues/.git

But be careful! Running this command in the wrong directory will remove the entire Git history of a project you might want to keep. Therefore, always check your current directory using the command pwd.
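
As a cautious variant, the removal can be wrapped in a check that both the outer and the nested .git directories are where we expect them to be. This is only a sketch that recreates the mistake in a scratch area (an assumption of this example), so it is safe to run as written.

```shell
# Recreate the mistake in a scratch area: suspects with a nested clues repository.
scratch="$(mktemp -d)"
mkdir -p "$scratch/suspects/clues"
cd "$scratch/suspects"
git init -q
(cd clues && git init -q)

# Check where we are and what we are about to delete before running rm.
pwd
if [ -d clues/.git ] && [ -d .git ]; then
    rm -rf clues/.git
    echo "removed nested repository in clues"
else
    echo "unexpected layout, nothing removed"
fi
```

Afterwards the clues directory itself is still there, and the suspects repository keeps its own history.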

Tracking Changes

First let’s make sure we’re still in the right directory. You should be in the suspects directory.

Input: Check directory
$ cd ~/suspects

Let’s create a file called colonel.txt that contains some notes about Colonel Smith’s connection to the case. We’ll use nano to edit the file; you can use whatever editor you like. In particular, this does not have to be the core.editor you set globally earlier. But remember, the bash command to create or edit a new file will depend on the editor you choose (it might not be nano). For a refresher on text editors, check out “Which Editor?” in The Unix Shell lesson.

Input: Edit file with nano in bash
$ nano colonel.txt

Type the text below into the colonel.txt file:

No alibi for the night of murder.

Let’s first verify that the file was properly created by running the list command (ls):

Input: Check new file
$ ls
Output
colonel.txt

colonel.txt contains a single line, which we can see by running:

Input: Inspect the new file
$ cat colonel.txt
Output
No alibi for the night of murder.

If we check the status of our project again, Git tells us that it’s noticed the new file:

Input: Check
$ git status
Output
On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	colonel.txt

nothing added to commit but untracked files present (use "git add" to track)

The “untracked files” message means that there’s a file in the directory that Git isn’t keeping track of. We can tell Git to track a file using git add:

Input: Track file
$ git add colonel.txt

and then check that the right thing happened:

Input: Check
$ git status
Output
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

	new file:   colonel.txt

Git now knows that it’s supposed to keep track of colonel.txt, but it hasn’t recorded these changes as a commit yet. To get it to do that, we need to run one more command:

Input: Record changes as a commit with a descriptive title
$ git commit -m "Start notes on colonel as a suspect"
Output
[main (root-commit) f22b25e] Start notes on colonel as a suspect
 1 file changed, 1 insertion(+)
 create mode 100644 colonel.txt

When we run git commit, Git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a commit (or revision) and its short identifier is f22b25e. Your commit may have another identifier.
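
The three stages of this cycle can also be observed in compact form with git status --short. The following is a minimal sketch in a scratch repository; the identity is set locally only so that the commit works in a fresh environment, and echo stands in for editing the file with nano.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
# Identity set locally so the commit works in a scratch setup.
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

echo "No alibi for the night of murder." > colonel.txt
untracked="$(git status --short)"    # "??" marks an untracked file

git add colonel.txt
staged="$(git status --short)"       # "A" marks a new file in the staging area

git commit -q -m "Start notes on colonel as a suspect"
committed="$(git status --short)"    # empty: nothing left to commit

printf 'untracked: [%s]\nstaged:    [%s]\ncommitted: [%s]\n' \
    "$untracked" "$staged" "$committed"
```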

We use the -m flag (for “message”) to record a short, descriptive, and specific comment that will help us remember later on what we did and why. If we just run git commit without the -m option, Git will launch nano (or whatever other editor we configured as core.editor) so that we can write a longer message.

Good commit messages start with a brief (<50 characters) statement about the changes made in the commit. Generally, the message should complete the sentence “If applied, this commit will”. If you want to go into more detail, add a blank line between the summary line and your additional notes. Use this additional space to explain why you made changes and/or what their impact will be.
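
One way to write such a two-part message without opening an editor is to pass -m twice: Git joins the values as separate paragraphs. The sketch below runs in a scratch repository; the body text is invented for this illustration.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

echo "No alibi for the night of murder." > colonel.txt
git add colonel.txt

# Each -m becomes its own paragraph: the first is the summary line,
# the second is the body, separated from it by a blank line.
summary="Start notes on colonel as a suspect"
body="He cannot account for his whereabouts, so he stays on the list."
git commit -q -m "$summary" -m "$body"

echo "summary length: ${#summary} characters"
git log -1 --format='%s'   # summary line only
git log -1 --format='%b'   # body only
```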

Input: If we run `git status` now:
$ git status
Output
On branch main
nothing to commit, working directory clean

it tells us everything is up to date. If we want to know what we’ve done recently, we can ask Git to show us the project’s history using git log:

Input: Access project's history
$ git log
Output
commit f22b25e3233b4645dabd0d81e651fe074bd8e73b
Author: Sherlock Holmes <sherlock@baker.street>
Date:   Thu Aug 22 09:51:46 2013 -0400

    Start notes on colonel as a suspect

git log lists all commits made to a repository in reverse chronological order. The listing for each commit includes the commit’s full identifier (which starts with the same characters as the short identifier printed by the git commit command earlier), the commit’s author, when it was created, and the log message Git was given when the commit was created.

If we run ls at this point, we will still see just one file called colonel.txt. That’s because Git saves information about files’ history in the special .git directory mentioned earlier so that our filesystem doesn’t become cluttered (and so that we can’t accidentally edit or delete an old version).

Now suppose Sherlock adds more information to the file. (Again, we’ll edit with nano and then cat the file to show its contents; you may use a different editor, and don’t need to cat.)

Input: Edit with nano
$ nano colonel.txt
$ cat colonel.txt
Output
No alibi for the night of murder.
No clear motive. Seems high unlikely.

When we run git status now, it tells us that a file it already knows about has been modified:

Input: Find out about the current status
$ git status
Output
On branch main
Changes not staged for commit:
 (use "git add <file>..." to update what will be committed)
 (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   colonel.txt

no changes added to commit (use "git add" and/or "git commit -a")

The last line is the key phrase: “no changes added to commit”. We have changed this file, but we haven’t told Git we will want to save those changes (which we do with git add) nor have we saved them (which we do with git commit). So let’s do that now. It is good practice to always review our changes before saving them. We do this using git diff. This shows us the differences between the current state of the file and the most recently saved version:

Input: Review the changes
$ git diff
Output
diff --git a/colonel.txt b/colonel.txt
index df0654a..315bf3a 100644
--- a/colonel.txt
+++ b/colonel.txt
@@ -1 +1,2 @@
 No alibi for the night of murder.
+No clear motive. Seems high unlikely.

The output is cryptic because it is actually a series of commands for tools like editors and patch telling them how to reconstruct one file given the other. If we break it down into pieces:

  1. The first line tells us that Git is producing output similar to the Unix diff command comparing the old and new versions of the file.
  2. The second line tells exactly which versions of the file Git is comparing; df0654a and 315bf3a are unique computer-generated labels for those versions.
  3. The third and fourth lines once again show the name of the file being changed.
  4. The remaining lines are the most interesting: they show us the actual differences and the lines on which they occur. In particular, the + marker in the first column shows where we added a line.
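
To see where the labels on the index line come from, the sketch below rebuilds the situation in a scratch repository and asks Git for the full identifiers of both versions of the file. It is an illustration only: git rev-parse HEAD:colonel.txt and git hash-object are not used elsewhere in this lesson, and the identifiers you get may differ from the ones printed above.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

echo "No alibi for the night of murder." > colonel.txt
git add colonel.txt
git commit -q -m "Start notes on colonel as a suspect"
echo "No clear motive. Seems high unlikely." >> colonel.txt

# Identifier of the file content stored in the last commit ...
old_id="$(git rev-parse HEAD:colonel.txt)"
# ... and of the content currently in the working directory.
new_id="$(git hash-object colonel.txt)"

# The "index" line of the diff shows shortened forms of both.
index_line="$(git diff | grep '^index ')"
echo "$index_line"
echo "old: $old_id"
echo "new: $new_id"
```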

After reviewing our change, it’s time to commit it:

Input: Commit after checking changes
$ git commit -m "Add concerns about existence of motive for colonel"
Output
On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   colonel.txt

no changes added to commit (use "git add" and/or "git commit -a")

Whoops: Git won’t commit because we didn’t use git add first. Let’s fix that:

Input: First `add` then `commit`
$ git add colonel.txt
$ git commit -m "Add concerns about existence of motive for colonel"
Output
[main 34961b1] Add concerns about existence of motive for colonel
 1 file changed, 1 insertion(+)

Git insists that we add files to the set we want to commit before actually committing anything. This allows us to commit our changes in stages and capture changes in logical portions rather than only large batches. For example, suppose we’re adding a few citations to relevant research to our thesis. We might want to commit those additions, and the corresponding bibliography entries, but not commit some of our work drafting the conclusion (which we haven’t finished yet).

To allow for this, Git has a special staging area where it keeps track of things that have been added to the current changeset but not yet committed.

If you think of Git as taking snapshots of changes over the life of a project, git add specifies what will go in a snapshot (putting things in the staging area), and git commit then actually takes the snapshot, and makes a permanent record of it (as a commit). If you don’t have anything staged when you type git commit, Git will prompt you to use git commit -a or git commit --all, which is kind of like gathering everyone to take a group photo! However, it’s almost always better to explicitly add things to the staging area, because you might commit changes you forgot you made. (Going back to the group photo simile, you might get an extra with incomplete makeup walking on the stage for the picture because you used -a!) Try to stage things manually, or you might find yourself searching for “git undo commit” more than you would like!

The Git Staging Area cartoon, a document is shown going into the staging area via "git add", and then into the repository via "git commit".
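
The thesis example can be sketched as follows. The file names citations.txt and conclusion.txt are hypothetical, and the whole thing runs in a scratch repository: both files are changed, but only the finished one is staged and committed.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

# Two tracked files (hypothetical names echoing the thesis example).
echo "draft" > citations.txt
echo "draft" > conclusion.txt
git add citations.txt conclusion.txt
git commit -q -m "Add first drafts"

# Change both, but stage and commit only the finished one.
echo "new reference" >> citations.txt
echo "unfinished thought" >> conclusion.txt
git add citations.txt
git commit -q -m "Add new reference to citations"

# conclusion.txt is still modified and waits for a later commit.
remaining="$(git status --short)"
echo "$remaining"
```

Using git commit -a in the second commit would have swept the unfinished conclusion into the snapshot as well.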

Let’s watch as our changes to a file move from our editor to the staging area and into long-term storage. First, we’ll add another line to the file:

Input: Add and review new line with `nano` and `cat`
$ nano colonel.txt
$ cat colonel.txt
Output
No alibi for the night of murder.
No clear motive. Seems high unlikely.
Fingerprints on victims glasses.
Input: Review changes
$ git diff
Output
diff --git a/colonel.txt b/colonel.txt
index 315bf3a..b36abfd 100644
--- a/colonel.txt
+++ b/colonel.txt
@@ -1,2 +1,3 @@
 No alibi for the night of murder.
 No clear motive. Seems high unlikely.
+Fingerprints on victims glasses.

So far, so good: we’ve added one line to the end of the file (shown with a + in the first column). Now let’s put that change in the staging area and see what git diff reports:

Input: See changes
$ git add colonel.txt
$ git diff

But there is no output: as far as Git can tell, there’s no difference between what it’s been asked to save permanently and what’s currently in the directory. However, if we do this:

Input: See what is in the staging area
$ git diff --staged
Output
diff --git a/colonel.txt b/colonel.txt
index 315bf3a..b36abfd 100644
--- a/colonel.txt
+++ b/colonel.txt
@@ -1,2 +1,3 @@
 No alibi for the night of murder.
 No clear motive. Seems high unlikely.
+Fingerprints on victims glasses.

it shows us the difference between the last committed change and what’s in the staging area. Let’s save our changes:

Input: Save changes
$ git commit -m "Make notes about colonel's fingerprints"
Output
[main 005937f] Make notes about colonel's fingerprints
 1 file changed, 1 insertion(+)
Input: Check new status
$ git status
Output
On branch main
nothing to commit, working directory clean
Input: and look at the history of what we've done so far:
$ git log
Output
commit 005937fbe2a98fb83f0ade869025dc2636b4dad5 (HEAD -> main)
Author: Sherlock Holmes <sherlock@baker.street>
Date:   Thu Aug 22 10:14:07 2013 -0400

    Make notes about colonel's fingerprints

commit 34961b159c27df3b475cfe4415d94a6d1fcd064d
Author: Sherlock Holmes <sherlock@baker.street>
Date:   Thu Aug 22 10:07:21 2013 -0400

    Add concerns about existence of motive for colonel

commit f22b25e3233b4645dabd0d81e651fe074bd8e73b
Author: Sherlock Holmes <sherlock@baker.street>
Date:   Thu Aug 22 09:51:46 2013 -0400

    Start notes on colonel as a suspect
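
To recap the difference between git diff and git diff --staged, here is a small sketch in a scratch repository. It relies on the --quiet option, which suppresses the output and reports through the exit status instead: 0 when there are no differences, 1 when there are.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

echo "No alibi for the night of murder." > colonel.txt
git add colonel.txt
git commit -q -m "Start notes on colonel as a suspect"

# Make a change and stage it.
echo "Fingerprints on victims glasses." >> colonel.txt
git add colonel.txt

# Exit status 0 means no differences, 1 means differences were found.
git diff --quiet;          unstaged=$?   # working directory vs staging area
git diff --staged --quiet; staged=$?     # staging area vs last commit
echo "git diff: $unstaged, git diff --staged: $staged"
```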

Sometimes, for example with text documents, a line-wise diff is too coarse. In that case the --color-words option of git diff is useful, because it highlights the changed words using colors.
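
Colors do not survive in a plain-text transcript, so the sketch below uses the related --word-diff option of git diff instead, which marks removed words as [-...-] and added words as {+...+}. The scratch repository, the identity and the corrected wording of the line are assumptions made for this example.

```shell
# Throwaway repository; directory and identity are placeholders for this sketch.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

echo "No clear motive. Seems high unlikely." > colonel.txt
git add colonel.txt
git commit -q -m "Add concerns about existence of motive for colonel"

# Change a single word in the working copy.
echo "No clear motive. Seems highly unlikely." > colonel.txt

# Word-level diff: only the changed word is marked, not the whole line.
word_diff=$(git diff --word-diff colonel.txt)
echo "$word_diff"
```

On a terminal, git diff --color-words colonel.txt shows the same word-level change using colors instead of markers.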

When the output of git log is too long to fit on your screen, Git uses a program to split it into pages the size of your screen. When this “pager” is called, you will notice that the last line on your screen is a :, instead of your usual prompt.

  • To get out of the pager, press Q.
  • To move to the next page, press Spacebar.
  • To search for some_word in all pages, press / and type some_word. Navigate through matches by pressing N.

To avoid having git log cover your entire terminal screen, you can limit the number of commits that Git lists by using -N, where N is the number of commits that you want to view. For example, if you only want information from the last commit you can use:

Input: See specific number of commits
$ git log -1
Output
commit 005937fbe2a98fb83f0ade869025dc2636b4dad5 (HEAD -> main)
Author: Sherlock Holmes <sherlock@baker.street>
Date:   Thu Aug 22 10:14:07 2013 -0400

    Make notes about colonel's fingerprints

You can also reduce the quantity of information using the --oneline option:

Input: See only basic information
$ git log --oneline
Output
005937f (HEAD -> main) Make notes about colonel's fingerprints
34961b1 Add concerns about existence of motive for colonel
f22b25e Start notes on colonel as a suspect

You can also combine the --oneline option with others. One useful combination adds --graph to display the commit history as a text-based graph and to indicate which commits are associated with the current HEAD, the current branch main, or other Git references:

Input: Extra options
$ git log --oneline --graph
Output
* 005937f (HEAD -> main) Make notes about colonel's fingerprints
* 34961b1 Add concerns about existence of motive for colonel
* f22b25e Start notes on colonel as a suspect
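
The options for shortening git log output can be tried out on a small scratch history. This is a minimal sketch: the temporary repository, the identity and the file contents (each commit message is simply appended to the file) are assumptions made for the example.

```shell
# Throwaway repository; directory and identity are placeholders for this sketch.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

# Three commits, one per message (file content is placeholder text).
for msg in "Start notes on colonel as a suspect" \
           "Add concerns about existence of motive for colonel" \
           "Make notes about colonel's fingerprints"; do
  echo "$msg" >> colonel.txt
  git add colonel.txt
  git commit -q -m "$msg"
done

# --oneline prints one line per commit; -1 limits the listing to the last commit.
all_commits=$(git log --oneline | wc -l | tr -d ' ')
last_subject=$(git log -1 --format=%s)

echo "commits listed by --oneline: $all_commits"
echo "subject shown by -1:         $last_subject"

# --no-pager keeps the listing from opening the pager when run in a terminal.
git --no-pager log --oneline --graph
```
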

Two important facts you should know about directories in Git.

  1. Git does not track directories on their own, only files within them. Try it for yourself:

    Input: Test directory tracking
    $ mkdir mysteries
    $ git status
    $ git add mysteries
    $ git status
    

    Note, our newly created empty directory mysteries does not appear in the list of untracked files even if we explicitly add it (via git add) to our repository. This is the reason why you will sometimes see .gitkeep files in otherwise empty directories. Unlike .gitignore, these files are not special and their sole purpose is to populate a directory so that Git adds it to the repository. In fact, you can name such files anything you like.

  2. If you create a directory in your Git repository and populate it with files, you can add all files in the directory at once by:

     git add <directory-with-files>
    

    Try it for yourself:

    Input: Add multiple files
    $ touch mysteries/belgravia mysteries/reichenbachfall
    $ git status
    $ git add mysteries
    $ git status
    

    Before moving on, we will commit these changes.

    Input: Commit the new changes
    $ git commit -m "Add some initial thoughts on mysteries"
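
The behaviour of empty directories described above can be checked with the short sketch below. It assumes a scratch repository in a temporary directory and uses git status --porcelain, a compact machine-readable form of git status, so that the result fits on one line.

```shell
# Throwaway repository; directory and identity are placeholders for this sketch.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

# An empty directory is invisible to Git, even after git add.
mkdir mysteries
git add mysteries
empty_dir_status=$(git status --porcelain)

# A placeholder file (any name works) makes the directory trackable.
touch mysteries/.gitkeep
git add mysteries
placeholder_status=$(git status --porcelain)

echo "status with empty directory:  ${empty_dir_status:-nothing to report}"
echo "status with placeholder file: $placeholder_status"
```
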
    

To recap, when we want to add changes to our repository, we first need to add the changed files to the staging area (git add) and then commit the staged changes to the repository (git commit):

The Git Commit Workflow.

Let’s put us to the test

Question: Choosing a Commit Message

Which of the following commit messages would be most appropriate for the last commit made to colonel.txt?

  1. “Changes”
  2. “Added line ‘Fingerprints on victims glasses.’ to colonel.txt”
  3. “Make notes about colonel’s fingerprints”

Answer 1 is not descriptive enough, and the purpose of the commit is unclear. Answer 2 is redundant, because git diff already shows what changed in this commit. Answer 3 is good: short, descriptive, and imperative.

Question: Committing Changes to Git

Which command(s) below would save the changes of myfile.txt to my local Git repository?

  1. $ git commit -m "my recent changes"
    
  2. $ git init myfile.txt
    $ git commit -m "my recent changes"
    
  3. $ git add myfile.txt
    $ git commit -m "my recent changes"
    
  4. $ git commit -m myfile.txt "my recent changes"
    
  1. Would only create a commit if files have already been staged.
  2. Would try to create a new repository.
  3. Is correct: first add the file to the staging area, then commit.
  4. Would try to commit a file “my recent changes” with the message myfile.txt.
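
Options 1 and 3 can be compared directly in a scratch repository. The sketch below is only an illustration: the temporary directory, the identity, the file contents and the variable names are assumptions made for this example.

```shell
# Throwaway repository; directory and identity are placeholders for this sketch.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

echo "first draft" > myfile.txt
git add myfile.txt
git commit -q -m "Add myfile.txt"

echo "my recent changes" >> myfile.txt

# Option 1: commit without staging. Nothing is staged, so Git refuses.
if git commit -q -m "my recent changes" > /dev/null 2>&1; then
  without_add="committed"
else
  without_add="nothing committed"
fi

# Option 3: stage the file first, then commit.
git add myfile.txt
git commit -q -m "my recent changes"
commit_count=$(git log --oneline | wc -l | tr -d ' ')

echo "commit without git add: $without_add"
echo "commits after add + commit: $commit_count"
```
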
Question: Committing Multiple Files

The staging area can hold changes from any number of files that you want to commit as a single snapshot.

  1. Add some text to colonel.txt noting your suspicion about Judge Brown.
  2. Create a new file judge.txt with your initial thoughts about the judge as a suspect.
  3. Add changes from both files to the staging area, and commit those changes.

The output below from cat colonel.txt reflects only content added during this exercise. Your output may vary.

First we make our changes to the colonel.txt and judge.txt files:

Input: Edit `colonel.txt`
$ nano colonel.txt
$ cat colonel.txt
Output
Maybe judge Brown should also be considerable as a suspect.
Input: Create and edit `judge.txt`
$ nano judge.txt
$ cat judge.txt
Output
Judge seems like a nice guy, but has a shady past.

Now you can add both files to the staging area. We can do that in one line:

Input: Add both files
$ git add colonel.txt judge.txt
Input: Or with multiple commands:
$ git add colonel.txt
$ git add judge.txt

Now the files are ready to commit. You can check that using git status. If you are ready to commit use:

Input: Commit changes
$ git commit -m "Write plans to start a base on judge"
Output
[main cc127c2] Write plans to start a base on judge
 2 files changed, 2 insertions(+)
 create mode 100644 judge.txt
Question: `bio` Repository
  1. Create a new Git repository on your computer called bio.
  2. Write a three-line biography for yourself in a file called me.txt, commit your changes
  3. Modify one line, add a fourth line
  4. Display the differences between its updated state and its original state.

If needed, move out of the suspects folder:

Input: Change directory
$ cd ..

Create a new folder called bio and ‘move’ into it:

Input: Create folder
$ mkdir bio
$ cd bio
Input: Initialise git:
$ git init

Create your biography file me.txt using nano or another text editor. Once in place, add and commit it to the repository:

Input: Create file and edit it
$ nano me.txt
$ git add me.txt
$ git commit -m "Add biography file"

Modify the file as described (modify one line, add a fourth line). To display the differences between its updated state and its original state, use git diff:

Input: Display the differences
$ git diff me.txt

History Exploring

As we saw earlier, we can refer to commits by their identifiers. You can refer to the most recent commit of the working directory by using the identifier HEAD.

We’ve been adding one line at a time to colonel.txt, so it’s easy to track our progress by looking at the file. Let’s do that using HEAD. Before we start, let’s make a change to colonel.txt, adding yet another line.

Input: Edit `colonel.txt`
$ nano colonel.txt
$ cat colonel.txt
Output
No alibi for the night of murder.
No clear motive. Seems high unlikely.
Fingerprints on victims glasses.
Maybe judge Brown should also be considerable as a suspect.

Now, let’s see what we get.

Input: Display the changes
$ git diff HEAD colonel.txt
Output
diff --git a/colonel.txt b/colonel.txt
index b36abfd..0848c8d 100644
--- a/colonel.txt
+++ b/colonel.txt
@@ -1,3 +1,4 @@
 No alibi for the night of murder.
 No clear motive. Seems high unlikely.
 Fingerprints on victims glasses.
+Maybe judge Brown should also be considerable as a suspect.

which is the same as what you would get if you leave out HEAD (try it). The real benefit comes when you refer to previous commits. We do that by adding ~1 (where “~” is “tilde”, pronounced [til-duh]) to refer to the commit one before HEAD.

Input: Refer to previous commits
$ git diff HEAD~1 colonel.txt

If we want to see the differences between older commits we can use git diff again, but with the notation HEAD~1, HEAD~2, and so on, to refer to them:

Input: Refer to previous commits
$ git diff HEAD~3 colonel.txt
Output
diff --git a/colonel.txt b/colonel.txt
index df0654a..b36abfd 100644
--- a/colonel.txt
+++ b/colonel.txt
@@ -1 +1,4 @@
 No alibi for the night of murder.
+No clear motive. Seems high unlikely.
+Fingerprints on victims glasses.
+Maybe judge Brown should also be considerable as a suspect.

We could also use git show which shows us what changes we made at an older commit as well as the commit message, rather than the differences between a commit and our working directory that we see by using git diff.

Input: `git show` and `git diff` differences
$ git show HEAD~3 colonel.txt
Output
commit f22b25e3233b4645dabd0d81e651fe074bd8e73b
Author: Sherlock Holmes <sherlock@baker.street>
Date:   Thu Aug 22 09:51:46 2013 -0400

    Start notes on colonel as a suspect

diff --git a/colonel.txt b/colonel.txt
new file mode 100644
index 0000000..df0654a
--- /dev/null
+++ b/colonel.txt
@@ -0,0 +1 @@
+No alibi for the night of murder.

In this way, we can build up a chain of commits. The most recent end of the chain is referred to as HEAD; we can refer to previous commits using the ~ notation, so HEAD~1 means “the previous commit”, while HEAD~123 goes back 123 commits from where we are now.
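
The ~ notation can be explored on a short chain of commits. In the sketch below, the scratch repository, the identity and the commit messages "Note 1" to "Note 3" are placeholders chosen for this example, not part of the tutorial's history.

```shell
# Throwaway repository; directory and identity are placeholders for this sketch.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

# Build a chain of three commits, adding one line per commit.
n=0
for line in "No alibi for the night of murder." \
            "No clear motive. Seems high unlikely." \
            "Fingerprints on victims glasses."; do
  n=$((n + 1))
  echo "$line" >> colonel.txt
  git add colonel.txt
  git commit -q -m "Note $n"
done

# HEAD is the newest commit; each ~1 steps one commit further back.
head_subject=$(git log -1 --format=%s HEAD)
parent_subject=$(git log -1 --format=%s HEAD~1)
oldest_subject=$(git log -1 --format=%s HEAD~2)

# Count the lines added to colonel.txt since HEAD~2 (skipping the "+++" header).
added_since_oldest=$(git diff HEAD~2 colonel.txt | grep -c '^+[^+]')

echo "HEAD:   $head_subject"
echo "HEAD~1: $parent_subject"
echo "HEAD~2: $oldest_subject"
echo "lines added since HEAD~2: $added_since_oldest"
```
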

We can also refer to commits using those long strings of digits and letters that git log displays. These are unique IDs for the changes, and “unique” really does mean unique: every change to any set of files on any computer has a unique 40-character identifier. Our first commit was given the ID f22b25e3233b4645dabd0d81e651fe074bd8e73b, so let’s try this:

Input: Display specific commit
$ git diff f22b25e3233b4645dabd0d81e651fe074bd8e73b colonel.txt
Output
diff --git a/colonel.txt b/colonel.txt
index df0654a..93a3e13 100644
--- a/colonel.txt
+++ b/colonel.txt
@@ -1 +1,4 @@
 No alibi for the night of murder.
+No clear motive. Seems high unlikely.
+Fingerprints on victims glasses.
+Maybe judge Brown should also be considerable as a suspect.

That’s the right answer, but typing out random 40-character strings is annoying, so Git lets us use just the first few characters (typically seven for normal-sized projects):

Input: Shorter alternative
$ git diff f22b25e colonel.txt
Output
diff --git a/colonel.txt b/colonel.txt
index df0654a..93a3e13 100644
--- a/colonel.txt
+++ b/colonel.txt
@@ -1 +1,4 @@
 No alibi for the night of murder.
+No clear motive. Seems high unlikely.
+Fingerprints on victims glasses.
+Maybe judge Brown should also be considerable as a suspect.
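
To see that a shortened identifier and the full one name the same commit, git rev-parse can translate between them. This is a minimal sketch in a scratch repository; the identity and variable names are assumptions made for the example, and the actual identifiers will differ on every run.

```shell
# Throwaway repository; directory and identity are placeholders for this sketch.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

echo "No alibi for the night of murder." > colonel.txt
git add colonel.txt
git commit -q -m "Start notes on colonel as a suspect"

full_id=$(git rev-parse HEAD)            # full identifier of the latest commit
short_id=$(git rev-parse --short HEAD)   # abbreviated form chosen by Git
resolved=$(git rev-parse "$short_id")    # expand the short form again

echo "full:     $full_id"
echo "short:    $short_id"
echo "resolved: $resolved"
```
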

All right! So we can save changes to files and see what we’ve changed. Now, how can we restore older versions of things? Let’s suppose we change our mind about the last update to colonel.txt (the “ill-considered change”).

git status now tells us that the file has been changed, but those changes haven’t been staged:

Input: Check
$ git status
Output
On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   colonel.txt

no changes added to commit (use "git add" and/or "git commit -a")

We can put things back the way they were by using git checkout:

Input: Restore with `git checkout`
$ git checkout HEAD colonel.txt
$ cat colonel.txt
Output
No alibi for the night of murder.
No clear motive. Seems high unlikely.
Fingerprints on victims glasses.

As you might guess from its name, git checkout checks out (i.e., restores) an old version of a file. In this case, we’re telling Git that we want to recover the version of the file recorded in HEAD, which is the last saved commit. If we want to go back even further, we can use a commit identifier instead:

Input: Restore specific commit
$ git checkout f22b25e colonel.txt
Input: Check
$ cat colonel.txt
Output
No alibi for the night of murder.
Input: Check
$ git status
Output
On branch main
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

    modified:   colonel.txt

Notice that the changes are currently in the staging area. Again, we can put things back the way they were by using git checkout:

Input: Restore
$ git checkout HEAD colonel.txt
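
The round trip above (check out an old version of the file, notice that it is staged, then return to HEAD) can be sketched in a scratch repository. The temporary directory, the identity and the second commit message are assumptions made for this example.

```shell
# Throwaway repository; directory and identity are placeholders for this sketch.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

echo "No alibi for the night of murder." > colonel.txt
git add colonel.txt
git commit -q -m "Start notes on colonel as a suspect"
first_id=$(git rev-parse --short HEAD)

echo "No clear motive. Seems high unlikely." >> colonel.txt
git add colonel.txt
git commit -q -m "Add concerns about existence of motive for colonel"

# Check out the file as it was in the first commit.
git checkout -q "$first_id" colonel.txt
old_content=$(cat colonel.txt)
staged=$(git diff --staged --name-only)   # the old version is already staged

# Put the file back the way it is recorded in HEAD.
git checkout -q HEAD colonel.txt
restored_lines=$(wc -l < colonel.txt | tr -d ' ')
status_after=$(git status --porcelain)

echo "old version:        $old_content"
echo "staged after that:  $staged"
echo "lines after HEAD:   $restored_lines"
echo "status after HEAD:  ${status_after:-clean}"
```
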
Warning: Don't Lose Your HEAD

Above we used

$ git checkout f22b25e colonel.txt

to revert colonel.txt to its state after the commit f22b25e. But be careful! The checkout command has other important functions, and Git will misunderstand your intentions if you do not type the command accurately, for example if you forget colonel.txt in the previous command:

Input: Error recipe
$ git checkout f22b25e
Output
Note: checking out 'f22b25e'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

git checkout -b <new-branch-name>

HEAD is now at f22b25e Start notes on colonel as a suspect

The “detached HEAD” is like “look, but don’t touch” here, so you shouldn’t make any changes in this state. After investigating your repo’s past state, reattach your HEAD with git checkout main.
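
Detaching and reattaching HEAD can be rehearsed safely in a scratch repository. In this sketch the branch name is read from the repository rather than assumed to be main, and git symbolic-ref is used to test whether HEAD currently points at a branch; the temporary directory and identity are placeholders.

```shell
# Throwaway repository; directory and identity are placeholders for this sketch.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

echo "No alibi for the night of murder." > colonel.txt
git add colonel.txt
git commit -q -m "Start notes on colonel as a suspect"
first_id=$(git rev-parse --short HEAD)

echo "No clear motive. Seems high unlikely." >> colonel.txt
git add colonel.txt
git commit -q -m "Add concerns about existence of motive for colonel"

# Remember which branch we are on (main in the tutorial).
branch=$(git symbolic-ref --short HEAD)

# Forgetting the filename checks out the whole commit and detaches HEAD.
git checkout -q "$first_id"
if git symbolic-ref -q HEAD > /dev/null; then
  state="on a branch"
else
  state="detached"
fi

# Reattach HEAD by checking out the branch again.
git checkout -q "$branch"
reattached=$(git symbolic-ref --short HEAD)

echo "after checkout of $first_id: $state"
echo "after checkout of $branch: on $reattached"
```
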

It’s important to remember that we must use the commit number that identifies the state of the repository before the change we’re trying to undo. A common mistake is to use the number of the commit in which we made the change we’re trying to discard. In the example below, we want to retrieve the state from before the most recent commit (HEAD~1), which is commit f22b25e:

Git Checkout.

So, to put it all together, here’s how Git works in cartoon form:

https://figshare.com/articles/How_Git_works_a_cartoon/1328266.

If you read the output of git status carefully, you’ll see that it includes this hint:

(use "git checkout -- <file>..." to discard changes in working directory)

As it says, git checkout without a version identifier discards the changes in the working directory: it restores files to the version in the staging area, which is the same as the version saved in HEAD when nothing has been staged. The double dash -- is needed to separate the names of the files being recovered from the command itself: without it, Git could mistake the name of the file for a commit identifier.

The fact that files can be reverted one by one tends to change the way people organize their work. If everything is in one large document, it’s hard (but not impossible) to undo changes to the introduction without also undoing changes made later to the conclusion. If the introduction and conclusion are stored in separate files, on the other hand, moving backward and forward in time becomes much easier.

Question: Recovering Older Versions of a File

Jennifer has made changes to the Python script that she has been working on for weeks, and the modifications she made this morning “broke” the script and it no longer runs. She has spent about an hour trying to fix it, with no luck…

Luckily, she has been keeping track of her project’s versions using Git! Which commands below will let her recover the last committed version of her Python script called data_cruncher.py?

  1. $ git checkout HEAD

  2. $ git checkout HEAD data_cruncher.py

  3. $ git checkout HEAD~1 data_cruncher.py

  4. $ git checkout <unique ID of last commit> data_cruncher.py

  5. Both 2 and 4

The answer is (5)-Both 2 and 4.

The checkout command restores files from the repository, overwriting the files in your working directory. Answers 2 and 4 both restore the latest version in the repository of the file data_cruncher.py. Answer 2 uses HEAD to indicate the latest, whereas answer 4 uses the unique ID of the last commit, which is what HEAD means.

Answer 3 gets the version of data_cruncher.py from the commit before HEAD, which is NOT what we wanted.

Answer 1 does not recover the file. Without a filename, git checkout HEAD asks Git to switch to the commit that is already checked out, so nothing is restored and the modifications to data_cruncher.py stay in the working directory. Be careful with this form in general: as discussed above, git checkout with an older commit identifier and no filename can leave you in a detached HEAD state, and you don’t want to be there.

Question: Reverting a Commit

Jennifer is collaborating with colleagues on her Python script. She realizes her last commit to the project’s repository contained an error, and wants to undo it. Jennifer wants to undo correctly so everyone in the project’s repository gets the correct change. The command git revert [erroneous commit ID] will create a new commit that reverses the erroneous commit.

The command git revert is different from git checkout [commit ID] because git checkout returns the files not yet committed within the local repository to a previous state, whereas git revert reverses changes committed to the local and project repositories.

Below are the right steps and explanations for Jennifer to use git revert. What is the missing command?

  1. ________ # Look at the git history of the project to find the commit ID

  2. Copy the ID (the first few characters of the ID, e.g. 0b1d055).

  3. git revert [commit ID]

  4. Type in the new commit message.

  5. Save and close

The command git log lists project history with commit IDs.

The command git show HEAD shows changes made at the latest commit, and lists the commit ID; however, Jennifer should double-check it is the correct commit, and no one else has committed changes to the repository.
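
The effect of git revert can be sketched in a scratch repository. The file contents and commit messages below are invented for this example, and --no-edit is used so that Git accepts its default revert message instead of opening an editor, which replaces steps 4 and 5 in a script.

```shell
# Throwaway repository; directory and identity are placeholders for this sketch.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

echo 'print("crunching data")' > data_cruncher.py
git add data_cruncher.py
git commit -q -m "Add data cruncher"

echo 'this line is an error' >> data_cruncher.py
git add data_cruncher.py
git commit -q -m "Introduce an error"

# Find the ID of the erroneous commit (here it is simply the latest one).
bad_id=$(git rev-parse --short HEAD)

# Create a new commit that reverses it; the history keeps both commits.
git revert --no-edit "$bad_id" > /dev/null

content_after=$(cat data_cruncher.py)
commit_count=$(git log --oneline | wc -l | tr -d ' ')
latest_subject=$(git log -1 --format=%s)

echo "file content:  $content_after"
echo "commits:       $commit_count"
echo "latest commit: $latest_subject"
```
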

Question: Understanding Workflow and History

What is the output of the last command in

$ cd suspects
$ echo "judge has unresolved military issues" > judge.txt
$ git add judge.txt
$ echo "judge has enemies in the city" >> judge.txt
$ git commit -m "Comment on judge as a suspect"
$ git checkout HEAD judge.txt
$ cat judge.txt #this will print the contents of judge.txt to the screen
  1. judge has enemies in the city
    
  2. judge has unresolved military issues
    
  3. judge has unresolved military issues
    judge has enemies in the city
    
  4. Error because you have changed judge.txt without committing the changes
    

The answer is 2.

The command git add judge.txt places the current version of judge.txt into the staging area. The changes to the file from the second echo command are only applied to the working copy, not the version in the staging area.

So, when git commit -m "Comment on judge as a suspect" is executed, the version of judge.txt committed to the repository is the one from the staging area and has only one line.

At this time, the working copy still has the second line (and git status will show that the file is modified). However, git checkout HEAD judge.txt replaces the working copy with the most recently committed version of judge.txt.

So, cat judge.txt will output

 judge has unresolved military issues
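
The sequence from this question can be replayed in a scratch repository to confirm the reasoning. The sketch below assumes a fresh temporary repository in place of the suspects directory; the identity is a placeholder.

```shell
# Throwaway repository standing in for the suspects directory.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

echo "judge has unresolved military issues" > judge.txt
git add judge.txt                                    # stage the one-line version
echo "judge has enemies in the city" >> judge.txt    # working copy only
git commit -q -m "Comment on judge as a suspect"     # commits the staged version
git checkout -q HEAD judge.txt                       # overwrite the working copy

result=$(cat judge.txt)
echo "$result"
```
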
Question: Checking Understanding of `git diff`

Consider this command: git diff HEAD~9 colonel.txt. What do you predict this command will do if you execute it? What happens when you do execute it? Why?

Try another command, git diff [ID] colonel.txt, where [ID] is replaced with the unique identifier for your most recent commit. What do you think will happen, and what does happen?

Question: Getting Rid of Staged Changes

git checkout can be used to restore a previous commit when unstaged changes have been made, but will it also work for changes that have been staged but not committed? Make a change to colonel.txt, add that change, and use git checkout to see if you can remove your change.
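
One way to explore this question is the sketch below, which compares git checkout -- colonel.txt with git checkout HEAD colonel.txt after a change has been staged. The scratch repository, the identity and the text of the change are assumptions made for this example.

```shell
# Throwaway repository; directory and identity are placeholders for this sketch.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

echo "No alibi for the night of murder." > colonel.txt
git add colonel.txt
git commit -q -m "Start notes on colonel as a suspect"

# Make a change and stage it.
echo "An ill-considered change" >> colonel.txt
git add colonel.txt

# Without a commit identifier, the file is restored from the staging area,
# so the staged change is still there.
git checkout -q -- colonel.txt
lines_after_plain=$(wc -l < colonel.txt | tr -d ' ')

# With HEAD, both the staging area and the working copy go back to the last commit.
git checkout -q HEAD colonel.txt
lines_after_head=$(wc -l < colonel.txt | tr -d ' ')
staged_after_head=$(git diff --staged --name-only)

echo "lines after 'git checkout -- colonel.txt':   $lines_after_plain"
echo "lines after 'git checkout HEAD colonel.txt': $lines_after_head"
echo "still staged: ${staged_after_head:-nothing}"
```
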

Question: Explore and Summarize Histories

Exploring history is an important part of Git, and often it is a challenge to find the right commit ID, especially if the commit is from several months ago.

Imagine the suspects project has more than 50 files. You would like to find a commit that modifies some specific text in colonel.txt. When you type git log, a very long list appears. How can you narrow down the search?

Recall that the git diff command allows us to explore one specific file, e.g., git diff colonel.txt. We can apply a similar idea here.

$ git log colonel.txt

Unfortunately some of these commit messages are very ambiguous, e.g., update files. How can you search through these files?

Both git diff and git log are very useful and they summarize a different part of the history for you. Is it possible to combine both? Let’s try the following:

$ git log --patch colonel.txt

You should get a long list of output, and you should be able to see both commit messages and the difference between each commit.
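
A small scratch history shows how restricting git log to one file and adding --patch narrows the search. The repository, the identity, the file contents and the deliberately vague "update files" messages are assumptions made for this example; the -- before the filename only marks it explicitly as a path.

```shell
# Throwaway repository; directory and identity are placeholders for this sketch.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Sherlock Holmes"
git config user.email "sherlock@baker.street"

echo "No alibi for the night of murder." > colonel.txt
git add colonel.txt
git commit -q -m "Start notes on colonel as a suspect"

echo "Judge seems like a nice guy, but has a shady past." > judge.txt
git add judge.txt
git commit -q -m "update files"

echo "Fingerprints on victims glasses." >> colonel.txt
git add colonel.txt
git commit -q -m "update files"

# Only commits that touched colonel.txt are listed.
colonel_commits=$(git log --oneline -- colonel.txt | wc -l | tr -d ' ')

# --patch adds each commit's diff, so the text itself can be searched.
patch_log=$(git log --patch -- colonel.txt)
matches=$(echo "$patch_log" | grep -c '^+Fingerprints')

echo "commits touching colonel.txt: $colonel_commits"
echo "commits adding a line that starts with 'Fingerprints': $matches"
```
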

Question: What does the following command do?

$ git log --patch HEAD~9 *.txt
Key points
  • Version control is like an unlimited ‘undo’.

  • Version control also allows many people to work in parallel.

  • Use git config with the --global option to configure a user name, email address, editor, and other preferences once per machine.

  • git init initializes a repository.

  • Git stores all of its repository data in the .git directory.

  • git status shows the status of a repository.

  • Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded).

  • git add puts files in the staging area.

  • git commit saves the staged content as a new commit in the local repository.

  • Write a commit message that accurately describes your changes.

  • git diff displays differences between commits.

  • git checkout recovers old versions of files.

Frequently Asked Questions

Have questions about this tutorial? Check out the FAQ page for the Foundations of Data Science topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum


Citing this Tutorial

  1. Sofoklis Keisaris, 2022 Version Control with Git (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/data-science/tutorials/bash-git/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012


@misc{data-science-bash-git,
author = "Sofoklis Keisaris",
title = "Version Control with Git (Galaxy Training Materials)",
year = "2022",
month = "10",
day = "18",
url = "\url{https://training.galaxyproject.org/training-material/topics/data-science/tutorials/bash-git/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                   

Funding

These individuals or organisations provided funding support for the development of this resource

This project (2020-1-NL01-KA203-064717) is funded with the support of the Erasmus+ programme of the European Union. Their funding has supported a large number of tutorials within the GTN across a wide array of topics.

Congratulations on successfully completing this tutorial!