Galaxy Database schema

Overview
Questions:
  • Running a production Galaxy server, you some times end up in with a situation, where you manually need to interact with the Galaxy database: how do you do that

  • How to extract usage information, which can not be gathered using the given report tools

  • How to move from MySQL to PostgreSQL

  • Is there ever a need to manually change the contents of a table

Objectives:
  • Learn some of the design concepts of the Galaxy database

  • Extract information from the Galaxy database

  • Get to know SchemaSpy

Time estimation: 2 hours
Supporting Materials:
Last modification: Sep 28, 2022
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License The GTN Framework is licensed under MIT

Galaxy Database Schema

Requirements

For the hands-on examples you need access to a Galaxy server and access to its PostgreSQL database. You can set-up this yourself, or use the Galaxy Docker Image provided by Björn Grüning (https://github.com/bgruening/docker-galaxy-stable). During this tutorial, we will work with the Galaxy Docker Image.

Setting up Docker and using the Galaxy Docker Image: (please do this before the tutorial, preferably when you are still at home using a fast internet connection)

Follow the instruction to install the Docker engine: https://docs.docker.com/engine/installation/
Execute: docker run -d -p 8080:80 bgruening/galaxy-stable (this will download the Galaxy Docker Image, when executed the first time, and start it) Test the Docker image in your web browser: http://localhost:8080 (see first paragraph of Björn’s introduction, for special cases when using the Docker Toolbox on Mac and Windows). Quit by: docker kill NAME (you get the name with: docker ps )

Introduction

Database versus Object Model

The session description is database centric and we’ll be focusing on the relational database that backs Galaxy servers. But that’s only half the picture of the this data. The other is the object model which is the object-oriented view of this same data. The object model is used by the code to manipulate and access the database. The translation between the two worlds is handled by an object-relational mapping implemented with SQLAlchemy (https://www.sqlalchemy.org).

Today we are covering the database and how to access it with SQL. We aren’t going to cover the corresponding object model or object relational mapping.

Database Platform

The default out-of-the-box Galaxy installation uses SQLite (https://www.sqlite.org). SQLite is a lightweight database management system (DBMS) that can be packaged inside Galaxy and does not require any additional steps at initial setup time.

However, SQLite is not the recommended DBMS for running a Galaxy server. The recommended production DMBS for Galaxy is PostgreSQL (https://www.postgresql.org). PostgreSQL offers a full set of DBMS features and robust support for multiple simultaneous users.

This workshop will be entirely based in PostgreSQL (also referred to as Postgres).

What is in (and not in) the Galaxy database?

The Galaxy database contains management information about your server. The database tracks users, groups, jobs, histories, datasets, workflows and so on.

What’s not in the database is the data. Datasets are stored outside the database. The database does keep metadata – information about the datasets such as data type. The tools themselves are not stored in the database either.

Understanding the Database Schema

ER diagrams and SchemaSpy

Entity-relationship diagrams are a way to understand tables and the relationships between them inside a relational database. SchemaSpy (http://schemaspy.org) is a free (and remarkable tool) for generating ER diagrams. Here’s a diagram of the Galaxy database (generated by SchemaSpy on June 6 2020; 6Mb).

The “Tables” tab is a good place to start learning the structure of the database. Each table represents a different type of thing, and often that thing is itself a relationship. For example, each record in the dataset table has information about a specific dataset, while records in the history_dataset_association table have information about what histories that dataset is in.

Each SchemaSpy table’s page shows the attributes in that table, as well as any constraints on those attributes, and the relationships between that table and other tables.

Also see the “Run SchemaSpy in this container” section below for how to install and then run SchemaSpy yourself.

Database conventions

The Galaxy database uses a number of naming and design conventions. Understanding these can make navigating the database much easier.

id attributes

Every table has an id column that uniquely identifies each row. (The id column is the primary key in database terminology.) Beyond uniquely identifying a row in the table, ID values have no meaning. ID values are unique within a table, but not across the database.

Relationships between tables, and _id columns

Relationships between tables are implemented by exporting id columns from one table into another. Imported ids are called foreign keys in database nomenclature, and are uniformly named table_the_id_came_from_id

There are a few notable exceptions to this rule. If the ID is from a table that is prefixed with galaxy_, for example, galaxy_user or galaxy_session, the galaxy_ will be dropped from the column name. For example, galaxy_user.id becomes user_id in the over 50 tables it is imported into

Relationship tables

As mentioned previously, some tables, such as history_dataset_association represent relationships between things, rather than things themselves. In this case history_dataset_association describes relationships between datasets and histories.

Relationship table names typically contain the names of tables they are relating, suffixed with _association.

Why are nulls allowed in almost every column? We have no idea. In practice, they aren’t nulls in most of those columns.

Why aren’t there comments, on anything? PostgreSQL supports comments to table definitions, but there are none shown in the SchemaSpy report. Why? The table definitions are actually generated by SQLAlchemy, the object-relational mapping software used by Galaxy, and SQLAlchemy does not support it.

There is nothing in the database that results from direct manipulation of the table definitions through DDL. Everything comes in through SQLAlchemy.

Start Docker and Galaxy

Hands-on
  1. Start the Galaxy Docker Image - this time as an interactive session
      docker run -i -t -p 8080:80 bgruening/galaxy-stable /bin/bash
  1. Start Galaxy and its PostgreSQL server
      startup > log 2>&1 &
  1. Follow the startup process
      tail -f log

Important tables

Hands-on
  1. Connect to the PostgreSQL database (change to user galaxy first)
       su galaxy
       psql -d galaxy -U galaxy
  1. List all tables
      \dt

Enter q to exit the view results page, and space to see the next results page.

Table “galaxy_user”

Hands-on: Hands-on
       select * from galaxy_user;

As described in Björn’s introduction, an Admin user is already pre-set (email: ‘admin@galaxy.org’, password: ‘admin’). Now let’s add (i.e. register) a new user via the Galaxy website. And check the database:

Hands-on: Hands-on
       select * from galaxy_user;

Table “job”

Hands-on: Hands-on
       select * from job;

Run a few jobs on the galaxy website (e.g upload file a simple table and add column with ‘Iterate’ no and yes) and check the database again:

Hands-on: Hands-on
       select * from job \x\g\x

Table “job_parameter”

Hands-on: Hands-on
      select * from job_parameter;

Table “history”

Hands-on: Hands-on
      select * from history;

Give your current history a name and check the database again.

Table “dataset”

Hands-on: Hands-on
      select * from dataset;

Table “history_dataset_association”

Hands-on: Hands-on
      select * from history_dataset_association;

More (hands-on) Examples, not covered by the reports app

Have a look at the reports up (which is also provided in the Docker Image):

http://admin:admin@localhost:8080/reports/

Depending on your local needs, some queries are missing, like:

Jobs per tool per year / jobs per tool since 2015

You can add the numbers per month from the reports, or:

Hands-on: Hands-on
      select j.id, j.create_time from job j limit 5;
       select j.id, j.create_time from job j
           where j.create_time >= '2015-12-31'
           and j.create_time < '2016-12-31';
       select j.id,j.create_time from job j
          where EXTRACT(year FROM j.create_time) = 2016
          and j.tool_id='upload1';`

…and now include the user

       select count(j.id) from job j, galaxy_user u
          where j.user_id = u.id
          and u.email = 'hansrudolf.hotz@fmi.ch'
          and EXTRACT(year FROM j.create_time) = 2016
          and j.tool_id='upload1';
       select u.email, count(*) from job j, galaxy_user u
          where j.user_id = u.id
          and EXTRACT(year FROM j.create_time) = 2016
          and j.tool_id='upload1'
          GROUP BY u.email;

Jobs per tool of a certain version

Imagine the current version of a tool is working fine, however a previous version had a bug: now you wanna warn all the users who have used the broken version, without alerting users who never used the broken one.

The following example is from the development server at the FMI

Hands-on
      select distinct(j.tool_version) from job j
          where j.tool_id = 'qAlign';
       select j.user_id from job j
          where j.tool_id = 'qAlign'
          and j.tool_version = '1.0.4quasr';
      select u.email, j.create_time from job j, galaxy_user u
          where j.user_id = u.id
          and j.tool_id = 'qAlign'
          and j.tool_version = '1.0.4quasr';

All users running a job using a certain parameter

Hands-on
      select jp.name, jp.value  from job_parameter jp
          where name = 'iterate'`
      select u.email, jp.name, jp.value
          from job_parameter jp, job j, galaxy_user u
          where jp.name = 'iterate'
          and j.tool_id = 'addValue'
          and jp.job_id = j.id
          and j.user_id = u.id;

Close PostgreSQL client and quit docker

Hands-on

Close the PostgreSQL client

      \q

Quit the interactive docker (change back to root first)

      exit
      exit

Other Topics

How to move from MySQL to PostgreSQL

Slides: https://docs.google.com/presentation/d/1N3BDNQT3s7eQEO3BO89TQbTYwKp92fxHDRWQwC-T1kA

https://wiki.galaxyproject.org/Community/Log/2015/MySQL2PostgreSQL

Is there ever a need to manually change the contents of a table

Slides: https://docs.google.com/presentation/d/1l4DD0IaJjuvk1zAT1Sjv26bLyrSOg3VUm7rD-TQl_Zs

Run SchemaSpy in this container

To run SchemaSpy in your container you’ll need to get it, and also install some required software packages.

Hands-on: Schema Spy
  wget https://github.com/schemaspy/schemaspy/releases/download/v6.1.0/schemaspy-6.1.0.jar
  apt-get update
  apt-get install libpostgresql-jdbc-java
  apt-get install graphviz

To run SchemaSpy:

  java -jar schemaspy-6.1.0.jar -t pgsql -db galaxy -u galaxy -host localhost -s public -dp /usr/share/java/postgresql-jdbc4-9.2.jar -o SpyOut

The SpyOut directory will contain the generated reports and diagrams, anchored at index.html.

Conclusion

There is a lot of information stored in the Galaxy database. Use this information for trouble shooting when necessary and use it as a source for extended user statistics.

Key points
  • Be careful, when you interact with the Galaxy database. And make sure you always have a backup!

Frequently Asked Questions

Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Galaxy Server administration topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Click here to load Google feedback frame

Citing this Tutorial

  1. Hans-Rudolf Hotz, Björn Grüning, 2022 Galaxy Database schema (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/admin/tutorials/database-schema/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012


@misc{admin-database-schema,
author = "Hans-Rudolf Hotz and Björn Grüning",
title = "Galaxy Database schema (Galaxy Training Materials)",
year = "2022",
month = "09",
day = "28"
url = "\url{https://training.galaxyproject.org/training-material/topics/admin/tutorials/database-schema/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                   

Congratulations on successfully completing this tutorial!