Connecting Galaxy to a compute cluster

name: inverse
layout: true
class: center, middle, inverse

</span></div>

</span></div>

---

# Connecting Galaxy to a compute cluster

<div class="contributors-line">
		Authors: 
<a href="/training-material/hall-of-fame/natefoo/" class="contributor-badge contributor-natefoo"><img src="https://avatars.githubusercontent.com/natefoo?s=27" alt="Avatar">Nate Coraor</a>

<a href="/training-material/hall-of-fame/bgruening/" class="contributor-badge contributor-bgruening"><img src="/training-material/assets/images/orcid.png" alt="orcid logo"/><img src="https://avatars.githubusercontent.com/bgruening?s=27" alt="Avatar">Björn Grüning</a>

<a href="/training-material/hall-of-fame/nsoranzo/" class="contributor-badge contributor-nsoranzo"><img src="/training-material/assets/images/orcid.png" alt="orcid logo"/><img src="https://avatars.githubusercontent.com/nsoranzo?s=27" alt="Avatar">Nicola Soranzo</a>

<a href="/training-material/hall-of-fame/hexylena/" class="contributor-badge contributor-hexylena"><img src="/training-material/assets/images/orcid.png" alt="orcid logo"/><img src="https://avatars.githubusercontent.com/hexylena?s=27" alt="Avatar">Helena Rasche</a>

</div>

<div class="footnote" style="bottom: 8em;"><i class="far fa-calendar" aria-hidden="true"></i><span class="visually-hidden">last_modification</span> Updated: Apr 6, 2021</div>

<div class="footnote" style="bottom: 6em;">

<i class="far fa-play-circle" aria-hidden="true"></i><span class="visually-hidden">video-slides</span> <a href="/training-material/videos/watch.html?v=/admin/tutorials/connect-to-compute-cluster/slides">Video slides</a> |

<i class="fas fa-file-alt" aria-hidden="true"></i><span class="visually-hidden">text-document</span><a href="slides-plain.html"> Plain-text slides</a>
</div>

<div class="footnote" style="bottom: 2em;">
    <strong>Tip: </strong>press <kbd>P</kbd> to view the presenter notes
    | <i class="fa fa-arrows" aria-hidden="true"></i><span class="visually-hidden">arrow-keys</span> Use arrow keys to move between slides

</div>

???
Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.

Press `P` again to switch presenter notes off

Press `C` to create a new window where the same presentation will be displayed.
This window is linked to the main window. Changing slides on one will cause the
slide to change on the other.

Useful when presenting.

---

### <i class="far fa-question-circle" aria-hidden="true"></i><span class="visually-hidden">question</span> Questions

- How to connect Galaxy to a compute cluster?

- How can I configure job dependent resources, like cores, memory for my DRM?

---

### <i class="fas fa-bullseye" aria-hidden="true"></i><span class="visually-hidden">objectives</span> Objectives

- Understand all components of the Galaxy job running stack

- Understand how the `job_conf.xml` file controls Galaxy's jobs subsystem

- Know how to map tools to job destinations

- The various ways in which tools can be mapped to destinations, both statically and dynamically

---

# Galaxy Job Configuration

- Configured in `config/job_conf.xml`
- XML format with macro support
- Major components:
  - **Plugins**: distributed resource manager (DRM) modules to load
  - **Handlers**: job handler processes managing the lifecycle of jobs
  - **Destinations**: where to send jobs, and what parameters to run those jobs with
  - **Tool** to destination/handler mappings
  - **Resource** selection mappings: give users job execution options on the tool form
  - **Limits**: job runtime limits, e.g. the max number of concurrent jobs

???

- The job_conf file is a very powerul galaxy configuration piece critical to smooth cluster operation.
- Written in XML it connects your server with the available cluster resources.
- You can configure it in myriad ways.
- Study the advanced sample provided with codebase once you get a basic understanding.
- There are several major components of the job conf file.
- Plugins, handlers, destinations, tools, resources, and limits.
- We'll go into detail on each of these in the tutorial.

---

# Why cluster?

Running jobs on the Galaxy server negatively impacts Galaxy UI performance

Even adding one other host helps

Can restart Galaxy without interrupting jobs

???

- Galaxy itself is not resource hungry, but the jobs often are.
- Offloading the jobs to different machines is a more sustainable and reliable setup.
- This can prevent user jobs from making Galaxy unresponsive.

---

# Plugins

Correspond to job runner plugins in [lib/galaxy/jobs/runners](https://github.com/galaxyproject/galaxy/tree/dev/lib/galaxy/jobs/runners)

.left[Plugins for:]
- local
- Slurm (DRMAA subclass)
- DRMAA: SGE, PBS Pro, LSF, Torque
- HTCondor
- Torque: Using the `pbs_python` library
- Pulsar: Galaxy's own remote job management system
- Command Line Interface (CLI) via SSH
- Kubernetes
- Go-Docker
- Chronos

???

- Galaxy supports plugins for various job runners covering most of the popular DRMs.
- The Galaxy community also maintains its own job management system called Pulsar.
- If the scheduler you use is missing, talk to us!

---

# Cluster library stack (DRMAA)

![Cluster library stack](../../images/cluster_lib_stack.png)

???

- The cluster library stack we use in this tutorial will use DRMAA.
- DRMAA is an interface that many distributed resource managers provide.
- Galaxy can use DRMAA to interact with these in an agnostic manner.
- However, there are more underlying technologies that you are going to depend on.
- You don't need to have an in-depth understanding to run cluster deployment correctly.

---

# Handlers

Define which Galaxy processes are job handlers

- `id` attribute should match the `--server-name` param value of a process
- Dedicated handlers can be reserved, e.g. for small, high throughput jobs
- The list of plugins that are loaded by a job handler can be limited using `<plugin>` subelements (e.g. when the DRMAA plugin needs to be loaded with different library paths)
- Not defined in `job_conf.xml` when using *job handler mules*
- Defines how jobs are assigned to individual processes (use `db-skip-locked` !)

???

- Handlers are the Galaxy processes which interact with the cluster.
- You can define dedicated handlers for different types of jobs, or to interact with different clusters.
- Additionally, handlers definition in the job configuration controls how jobs are assigned to individual processes.
- There are many options for the assignment process, all are discussed in the advanced sample job configuration.
- db-skip-locked is the best choice for most cases, it enables handlers to grab multiple jobs to work on at once.

---

# Destinations

Define *how* jobs should be run

- Which plugin? (Slurm, Condor, Pulsar, etc?)
- In a Docker container? Which one?
- **DRM params** (queue, cores, memory, walltime)?
- Environment (variables e.g. `$PATH`, source an env file, run a command)?

???

- The destination section of the job configuration file is a map that defines which jobs go where.
- Jobs from any destination, can be processed by any plugin.
- Every job will find a path through this configuration and eventually get dispatched to the matching runner.
- These destinations can specify things like environment variables or resource requirements.

---

# The default job configuration

.left[`config/job_conf.xml.sample_basic`:]
```xml
<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner" workers="4"/>
    </plugins>
    <destinations>
        <destination id="local" runner="local"/>
    </destinations>
</job_conf>
```

???

- This is the default job configuration.
- It uses a local runner with 4 workers, or processes to process jobs.
- As a result if you restart Galaxy, jobs will be lost.

---

# Job Config - Tags

Both destinations and handlers can be grouped by **tags**

- Allows random selection from multiple resources
- Allows concurrency limits at the destination group level

???

- Tags can be applied to both destinations and handlers.
- This permits selecting randomly amongst the handlers or destinations.
- Tags can help the load distribution.

---

# Job Environment

`<env>` tag in destinations: configure the job exec environment

| tag syntax | function |
| ---- | ---- |
| `<env id="NAME">VALUE</env>` | Set `$NAME` to `VALUE` |
| `<env file="/path/to/file" />` | Source shell file at `/path/to/file` |
| `<env exec="CMD" />` | Execute `CMD` |

Source and command execution will be handled on the remote destination, don't need to work on the Galaxy server

???

- You can specify environment variables on the destination.
- Galaxy will ensure these are executed in the same environment and ahead of the job.

---

# Limits

Available limits

- Walltime (if not available with your DRM)
- Output size (if *any* tool output grows larger than this limit)
- Concurrency: Number of "active" (queued or running) jobs

???

- Configuration of job limits is best acommplished using both the DRM provided limits and the ones from Galaxy.
- Walltime is best set in your DRM, while output size is only possible through Galaxy.
- We recommend you set these at the DRM level which is better equipped to terminate misbehaving jobs.
- The most important limit however is usually concurrency.

---

# Concurrency Limits

Available limits

- Number of active jobs per registered user
- Number of active jobs per unregistered user
- Number of active jobs per registered user in a specified destination or destination tag
- Number of total active jobs in a specified destination or destination tag

???

- Using concurrency limits lets you ensure quality of service for everyone.
- By limiting jobs per user, you can prevent a single user from overwhelming the server, and ensure everyone can run jobs.
- Additionally with concurrency limits you can balance your instance between internal and external users.

---

# Shared Filesystem

Most job plugins require a **shared filesystem** between the Galaxy server and compute.

The exception is **Pulsar**. More on this in *Using heterogeneous compute resources*

???

- Most DRMs require a shared filesystem to ensure datasets are available to the jobs.
- Galaxy's Pulsar does not, and can be used in situations where no shared filesystem is available.

---

# Shared Filesystem

Our simple example works because of two important principles:

1. Some things are located *at the same path* on Galaxy server and node(s)
  - Galaxy application (`/srv/galaxy/server`)
  - Tool dependencies
2. Some things *are the same* on Galaxy server and node(s)
  - Job working directory
  - Input and output datasets

The first can be worked around with symlinks or Pulsar embedded (later)

The second can be worked around with Pulsar REST/MQ (with a performance/throughput penalty)

???

- For the DRMs which require a shared filesystem there are additional requirements.
- First, Galaxy and the tool dependencies are at the same location on the head and compute nodes.
- Job directories must be in a shared location on both head and compute nodes.
- This is mentioned in more detail in the tutorial.

---

# Multiprocessing

Some tools can greatly improve performance by using multiple cores

Galaxy automatically sets `$GALAXY_SLOTS` to the CPU/core count you specify when submitting, for example, 4:
- Slurm: `sbatch --ntasks=4`
- SGE: `qsub -pe threads 4`
- Torque/PBS Pro: `qsub -l nodes=1:ppn=4`
- Condor: `request_cpus: 4`

Tool configs: Consume `\${GALAXY_SLOTS:-4}`

???

- For multiprocessing to be available both the tool and the Galaxy tool wrapper need to support it.
- You need to understand what tools are being run and set destinations for them with the appropriate specification.
- You'll need to check for presence of GALAXY_SLOTS in the tool wrappers and tool macros to see if the tool supports multiple threads.

---

# Memory requirements

For **Slurm only**, Galaxy will set `$GALAXY_MEMORY_MB` and `$GALAXY_MEMORY_MB_PER_SLOT` as integers.

**Other DRMs:** Please PR the [appropriate code](https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/jobs/runners/util/job_script/MEMORY_STATEMENT.sh).

For Java tools, be sure to set `-Xmx`, e.g.:

```xml
<destination id="foo" ...>
    <env id="_JAVA_OPTIONS">-Xmx4096m</env>
</destination>
```

???

- Memory requirements can be set as well.
- For some tools, you'll need to additionally provide environment variables to specify memory limits.
- This is different per DRM.

---

# Run jobs as the "real" user

If your Galaxy users == System users:
- Submit jobs to cluster as the actual user
- Configurable callout scripts before/after job to change ownership
- Probably requires limited sudo for Galaxy user

See: [Cluster documentation](https://wiki.galaxyproject.org/Admin/Config/Performance/Cluster)

???

- If you galaxy users map to the system users you can have Galaxy run the jobs with the account of those users.
- This allows proper resource accounting, but comes at some additional configuration complexities.

---

## Job Config - Mapping Tools to Destinations

Problem: Tool A uses single core, Tool B uses multiple
- Both submit to the same cluster
- Need different submit parameters (`--ntasks=1` vs. `--ntasks=4` in Slurm)

???

- Mapping tools to destinations is the heart of the job configuration.
- This permits you to define which tools go to which destinations, and what resources they need.

---

## Job Config - Mapping Tools to Destinations

Solution:

```xml
    <destinations default="single">
        <destination id="single" runner="slurm" />
        <destination id="multi" runner="slurm">
            <param id="nativeSpecification">--ntasks=4</param>
        </destination>
    </destinations>
    <tools>
        <tool id="hisat2" destination="multi"/>
    </tools>
```

???

- Here is an example mapping the hisat2 tool to a definition named multi.
- The multi destination specifies that 4 cores should be allocated for each job, and uses the slurm plugin.

---

# The Dynamic Job Runner

For when basic tool-to-destination mapping isn't enough

???

- However this static mapping sometimes isn't sufficient.
- Here a dynamic mapping can be used instead.
- Galaxy has several different methods for accomplishing this.

---

## The Dynamic Job Runner

A special built-in job runner plugin

Map jobs to destinations on more than just tool IDs

.left[Two types:]
- Dynamic Tool Destinations
- Python function

See: [Dynamic Destination Mapping](https://docs.galaxyproject.org/en/master/admin/jobs.html#dynamic-destination-mapping)

???

- There are two built in ways to do this: dynamic tool destinations, and custom Python functions.
- We will cover both of these in the tutorial.

---

## Dynamic Tool Destinations

.left[Configurable mappings without programming:]
- YAML format config file `tool_destinations.yml`
- Map based on tool ID plus:
  - Input dataset size(s)
  - Input dataset number of records
  - User
- Maps to static destinations defined in job config

???

- The Dynamic Tool Destinations are written as a yaml file.
- You can easily write rules based on file input sizes or number of inputs or user information.
- This can be used to determine memory and cpu allocations.

---

## Arbitrary Python Functions

.left[Programmable mappings:]
- Written as Python function in `lib/galaxy/jobs/rules/`
- Map based on:
  - Tool ID
  - User email or username
  - Inputs
  - Tool Parameters
  - Defined "helper" functions based on DB contents
  - Anything else discoverable
    - Cluster queue depth?
    - ...?
- Can dynamically modify destinations in job config (i.e. `sbatch` params)

???

- If Dynamic Tool Destinations are insufficiently flexible, then custom Python functions can be written.
- These can use any arbitrary information you want.
- They have full access to submitter information, job parameters, and any other resource you might want.
- They can dynamically modify destination parameters during runtime.
- If you need flexibility, these are what you want.

---
### <i class="fas fa-key" aria-hidden="true"></i><span class="visually-hidden">keypoints</span> Key points

- Galaxy supports a variety of different DRMs.

- Dynamic Tool Destinations are a convenient way to map

- Job resource parameters can allow you to give your users control over job resource requirements, if they are knowledgeable about the tools and compute resources available to them.

---

## Thank You!

This material is the result of a collaborative work. Thanks to the [Galaxy Training Network](https://training.galaxyproject.org) and all the contributors!

</div>

</div>

<a rel="license" href="https://creativecommons.org/licenses/by/4.0/">
This material is licensed under the Creative Commons Attribution 4.0 International License</a>.