Galaxy Troubleshooting

name: inverse
layout: true
class: center, middle, inverse

</span></div>

</span></div>

---

# Galaxy Troubleshooting

<div class="contributors-line">
		Authors: 
<a href="/training-material/hall-of-fame/martenson/" class="contributor-badge contributor-martenson"><img src="https://avatars.githubusercontent.com/martenson?s=27" alt="Avatar">Martin Čech</a>

<a href="/training-material/hall-of-fame/natefoo/" class="contributor-badge contributor-natefoo"><img src="https://avatars.githubusercontent.com/natefoo?s=27" alt="Avatar">Nate Coraor</a>

</div>

<div class="footnote" style="bottom: 8em;"><i class="far fa-calendar" aria-hidden="true"></i><span class="visually-hidden">last_modification</span> Updated: Jul 20, 2022</div>

<div class="footnote" style="bottom: 6em;">

<i class="fas fa-file-alt" aria-hidden="true"></i><span class="visually-hidden">text-document</span><a href="slides-plain.html"> Plain-text slides</a>
</div>

<div class="footnote" style="bottom: 2em;">
    <strong>Tip: </strong>press <kbd>P</kbd> to view the presenter notes
    | <i class="fa fa-arrows" aria-hidden="true"></i><span class="visually-hidden">arrow-keys</span> Use arrow keys to move between slides

</div>

???
Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.

Press `P` again to switch presenter notes off

Press `C` to create a new window where the same presentation will be displayed.
This window is linked to the main window. Changing slides on one will cause the
slide to change on the other.

Useful when presenting.

---

In system administration...
# Everything always goes wrong

---
# When things go wrong, where do you begin?

## LOGS

.left[Check **all** the logs]
- gunicorn/gravity logs
- handler logs (if not all-in-one)
- Pulsar logs
- nginx error and access logs
- syslog/messages
- authlog
- browser console log

---

# Most common problems

- Startup problems
- Web/UI problems
- Tool failures
- Job execution problems
- General performance problems
- Dependency issues
- Other stuff

---
# Startup problems

Where do you begin? In the...

## LOGS

Specifically, gravity/gunicorn and job handler logs.

---

# Database migration

.reduce90[
```
Exception: Your database has version '144' but this code expects version '165'. Please
backup your database and then migrate the database schema by running 'sh manage_db.sh upgrade'.
```
]

Use Ansible! This is permanently solved for you.

Otherwise, upgrade as instructed.

.left[If you believe this message is in error, you can:]
- Check DB table `migrate_version`, column `version`.
- Check folder `lib/galaxy/model/migrate/versions/` - the latest migration should match the DB.
- Clean the `*.pyc` files in migrate versions folder to make sure there is no remnant from other code revisions.
  - See [makepyc.py in galaxyproject.galaxy](https://github.com/galaxyproject/ansible-galaxy/blob/master/files/makepyc.py).

---
class: left
# Database migration

**Downgrading** - Perform in this order:
  1. Downgrade the DB with `manage_db.sh`
  2. Downgrade Galaxy with `git checkout`
  3. Clean `*.pyc`

---
# Stuck in a restart loop

```
[uWSGI] getting YAML configuration from /srv/galaxy/config/galaxy.yml
*** Starting uWSGI 2.0.18 (64bit) on [Wed Mar  4 16:19:27 2020] ***
... bunch of stuff ...
[uWSGI] getting YAML configuration from /srv/galaxy/config/galaxy.yml
*** Starting uWSGI 2.0.18 (64bit) on [Wed Mar  4 16:19:27 2020] ***
... bunch of stuff ...
[uWSGI] getting YAML configuration from /srv/galaxy/config/galaxy.yml
*** Starting uWSGI 2.0.18 (64bit) on [Wed Mar  4 16:19:27 2020] ***
...
```

1. Open the log *without* following
2. Scroll to the end
3. Search/scroll back to the last startup attempt
4. The lines *directly preceding* should offer some insight

---
# Stuck in a restart loop

The reason that Galaxy is dying might not be in the log. If it's not, then

If using **systemd** then it should be in `journalctl -u galaxy`

If using something else, it's wherever `stderr` is being redirected to (supervisor?)

Or maybe the system is killing it... (`/var/log/{messages,syslog}`)

---
# Web/UI Problems

Where do you begin? In the...

## LOGS

Specifically: uWSGI, nginx, and browser console logs

---
# 502 Bad Gateway

Your reverse proxy has failed to connect to the Galaxy socket

- Check that Galaxy is running / not in a reboot loop
- Check that your proxy and Galaxy/uWSGI socket options match
- Check proxy server logs (e.g. `/var/log/nginx`)
- Check uWSGI logs for `bind(): Address already in use [core/socket.c line 769]`
- Check that something is listening on the correct port (`sudo ss -tlpn 'sport = :port'` (demo!) or `sudo lsof -i :port` (demo!))
  - If it's uWSGI, make sure it's not an old one!

---
# 504 Gateway Timeout

Or timeout messages in the Galaxy UI

In uWSGI logs:
```
*** uWSGI listen queue of socket "0.0.0.0:8080" (fd: 6) full !!! (101/100) ***
*** uWSGI listen queue of socket "0.0.0.0:8080" (fd: 6) full !!! (101/100) ***
*** uWSGI listen queue of socket "0.0.0.0:8080" (fd: 6) full !!! (101/100) ***
*** uWSGI listen queue of socket "0.0.0.0:8080" (fd: 6) full !!! (101/100) ***
```

Galaxy is not processing requests in a timely manner

---
# 504 Gateway Timeout

Use `uwsgitop` (demo!)

- `uwsgitop` state busy?

```console
uwsgi-2.0.17.1 - Fri Feb  1 17:13:59 2019 - req: 0 - RPS: 0 - lq: 0 - tx: 0
node: localhost - cwd: /srv/galaxy/server - uid: 999 - gid: 999 - masterpid: 1925
 WID    %       PID     REQ     RPS     EXC     SIG     STATUS  AVG     RSS     VSZ     TX      ReSpwn  HC      RunT    LastSpwn
 1      12.7    3453    59962   1       13      0       busy    89ms    0       0       536.0M  1       0       450m    09:35:36
```

Process state **D**?

```console
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
galaxy    3440 17.2  5.2 1696480 863536 ?      D    Nov10 167:46 uwsgi --yaml ...
```

Kernel uninterruptable sleep. Probably IO. Check filesystems, disks.

Restart Galaxy

.reduce70[.footnote[A later slide will cover how to investigate kernel uninterruptable sleep processes]]

---
# 504 Gateway Timeout or slow UI

Check load

```console
$ uptime
 16:08:15 up 3 days,  8:34,  8 users,  load average: 0.12, 0.23, 0.24
```

Averages are 1, 5, 15 minutes.

Average should be <= number of cores (`lscpu`, beware SMT)

More on troubleshooting load later

---
# Where do you begin? In the...

## LOGS*

*often I now begin in Grafana

![Grafana Load Graph](../../images/troubleshooting/grafana-load-graph.png "Grafana Load Graph")

---
# 504 Gateway Timeout or slow UI

- Investigate load
  - Web server(s)
  - Database server
- Investigate memory usage, swapping
- Investigate iowait

---
# Galaxy UI is slow

- [Tutorial from @mvdbeek](https://docs.galaxyproject.org/en/master/dev/finding_and_improving_slow_code.html#profiling)
- Use Galaxy [heartbeat](https://github.com/galaxyproject/galaxy/blob/8c2066f07ac7bb48c59cf8378e5a852e6fb82a4e/config/galaxy.yml.sample#L1036) (demo!)
- Use `py-spy`

---
# Blank page or no CSS/JavaScript

Serving of static content is broken.

- Check browser console for 404 errors.
- Check proxy error log for permission errors.
- Verify that your proxy static configuration is correct.
- If you have recently upgraded Galaxy or changed the GUI in some way, you will need to rebuild the client

---
# Tool failures

Tools can fail for a variety of reasons, some valid, some invalid.

Some made up examples follow.

---
# Tool missing from Galaxy

- Restart Galaxy and watch the log for
```console
Loaded tool id: toolshed.g2.bx.psu.edu/repos/iuc/sickle/sickle/1.33, version: 1.33 into tool panel....
```

- After startup, check `integrated_tool_panel.xml` for
```xml
<tool id="toolshed.g2.bx.psu.edu/repos/iuc/sickle/sickle/1.33" />
```

- If it is TS tool check `shed_tool_conf.xml` for
```xml
<tool file="toolshed.g2.bx.psu.edu/repos/iuc/sickle/43e081d32f90/sickle/sickle.xml" guid="toolshed.g2.bx.psu.edu/repos/iuc/sickle/sickle/1.33">
...
</tool>
```

- Multiple job handlers? Sometimes they don't all get the update.

---
# Tool errors

Tool stdout/stderr is available in UI under "i" icon on history dataset

.left[Debugging on the filesystem:]
1. Set `cleanup_job` to `onsuccess`
2. Cause a job failure
3. Go to job working directory (find in logs or `/srv/galaxy/jobs/<hash>/<job_id>`)
4. Poke around, try running things (`srun --pty bash` considered useful)

Familiarize yourself with the places Galaxy keeps things (demo?)

---
# Tool errors - stderr

**Galaxy considers *any* output to standard error (stderr) to be an error when no tool profile version is set by a tool.**

Why would it do this????!?!!!11

In the old days, tools were bad about setting the exit code on failure, so it could not be trusted. Galaxy had no functionality to inspect output to decide on failure.

---
class: left
# Tool errors

So what if tool stderr contains:
```console
Congratulations, running SuperAwesome tool was successful!
```

What happens: job fails (!!?!?)

Solutions:
- If the wrapped tool uses proper exit codes, use `<tool profile="16.04">` or later to ignore stderr
- Using current tool development best practices ensures this is the case

---
class: left
# Tool errors

Tool output contains:
```console
Warning: Discarded 10000 lines of /input/dataset_1.dat because they looked funny
```

Maybe a problem, maybe not.

Solutions:
- Check tool input(s) and parameters (maybe requires some biological knowledge)
- Verify input is not corrupt
- User education

---
class: left
# Tool errors - memory errors

Tool output contains one of:
```console
MemoryError                 # Python
what():  std::bad_alloc     # C++
Segmentation Fault          # C - but could be other problems too
Killed                      # Linux OOM Killer
```

Solutions:
- Change input sizes or params
  - Map/reduce?
- Decrease the amount of memory the tool needs
- Increase the amount of memory available to the job
  - Request more memory from cluster scheduler
  - Use job resubmission to automatically rerun with a larger memory allocation
- Cross your fingers and rerun the job

---
class: left
# Tool errors - system errors

Tool output contains:
```console
open(): /input/dataset_1.dat: No such file or directory
```

Solutions:
- Verify that `/input/dataset_1.dat` exists.
  - On node the job ran on
  - As the user the job ran as
- Fix the filesystem error (NFS?) and rerun the job
- See NFS caching errors slide later

---
class: left
# Tool errors - dependency problems

Tool output contains:
```console
sh: command not found: samtools
```

Solutions:
- Verify that `tool_dependency_dir` is accessible *on the cluster, as the user running Galaxy jobs*
- Verify that tool dependencies are properly installed: Galaxy UI "Admin > Manage dependencies"
- Use BioContainers (Docker/Singularity)

---
# Manage Dependencies

An incredibly useful feature to hit unruly tool dependencies with a large hammer.

![Manage Dependencies](../../images/troubleshooting/manage-dependencies.png "Manage Dependencies")

(demo!)

---
# An aside - dependency problems on geriatric Galaxies

Galaxy Tool Shed dependencies are dead.<sup>1</sup> If you don't know what Tool Shed dependencies are (lucky you) skip this slide.

Don't attempt to fix Galaxy Tool Shed dependencies, just update tool and reinstall with Conda.

Put Conda first in [dependency_resolvers_conf.xml](https://github.com/galaxyproject/galaxy/blob/dev/config/dependency_resolvers_conf.xml.sample) if you still need to support both.

.reduce70[.footnote[<sup>1</sup> The Tool Shed should only be used for Galaxy Tool configuration files and wrapper scripts]]

---
class: left
# Tool errors - dependency problems

Tool output contains:
```console
foo: error while loading shared libraries: libdeepthought.so.42: cannot open shared object file: No such file or directory
```

`foo` was compiled against `libdeepthought.so.42` but it's not on the runtime linker path

Solutions:
- Reinstall tool dependencies ("Admin > Manage dependencies"). If it still fails<sup>2</sup>:
  - Determine conda package providing `libdeepthought.so`
  - Downgrade it to the version that provides `libdeepthought.so.42`
- Use BioContainers (Docker/Singularity)

.center[.reduce70[.footnote[<sup>2</sup> First, Google the error because someone named Björn has probably already found and fixed the problem.]]]

???
Galaxy ensures that the correct version(s) of tool(s) immediate dependencies are controlled, but dependencies of dependencies are up to conda. Occasionally, upgrades to these second level dependencies break existing conda packages, requiring package authors to update the first level dependencies to correct the issue.

---
class: left
# Tool errors - Empty green history item

1. The tool is not correctly detecting error conditions: inspect stdout/stderr
2. The tool correctly produced an empty dataset for the given params, inputs

Solutions:
- Fix the tool wrapper to detect errors
- User education

---
# Summary of types of tool failures

- Input/parameter problem (user or tool wrapper author problem)
- Tool wrapper bug (tool wrapper author problem)
- Tool bug (tool wrapper author or tool developer problem)
- Resource problem (sysadmin problem)

Everything else: always the sysadmin's problem

---
# One last word on tool errors

All IUC/devteam tools in the Tool Shed have tests

Use these tests (automateable with [Ephemeris](https://github.com/galaxyproject/ephemeris)!) to verify that the tool works in the basic case

---
# Job execution problems

---
class: smaller
# Jobs aren't running: always gray

Corresponds to `job.state = 'new'` or `'queued'` in database

.left[Check the Galaxy server log for errors. Successful job lifecycle:]

.reduce50[
```console
galaxy.tools DEBUG 2019-02-01 14:30:20,469 [p:6590,w:1,m:0] [uWSGIWorker1Core2] Validated and populated state for tool request (11.587 ms)
galaxy.tools.actions INFO 2019-02-01 14:30:20,500 [p:6590,w:1,m:0] [uWSGIWorker1Core2] Handled output named output1 for tool testing (2.334 ms)
galaxy.tools.actions INFO 2019-02-01 14:30:20,507 [p:6590,w:1,m:0] [uWSGIWorker1Core2] Added output datasets to history (6.210 ms)
galaxy.tools.actions INFO 2019-02-01 14:30:20,513 [p:6590,w:1,m:0] [uWSGIWorker1Core2] Setup for job Job[unflushed,tool_id=testing] complete, ready to flush (5.408 ms)
galaxy.tools.actions INFO 2019-02-01 14:30:20,558 [p:6590,w:1,m:0] [uWSGIWorker1Core2] Flushed transaction for job Job[id=21,tool_id=testing] (44.550 ms)
galaxy.web.stack.transport DEBUG 2019-02-01 14:30:20,559 [p:6590,w:1,m:0] [uWSGIWorker1Core2] Sending message to farm job-handlers: {"__classname__": "JobHandlerMessage", "params": {"task": "setup", "job_id": 21}, "target": "job_handler"}
galaxy.tools.execute DEBUG 2019-02-01 14:30:20,559 [p:6590,w:1,m:0] [uWSGIWorker1Core2] Tool [testing] created job [21] (73.833 ms)
galaxy.web.stack.transport DEBUG 2019-02-01 14:30:20,560 [p:6593,w:0,m:1] [UWSGIFarmMessageTransport.dispatcher_thread] Received message: {"__classname__": "JobHandlerMessage", "params": {"task": "setup", "job_id": 21}, "target": "job_handler"}
galaxy.tools.execute DEBUG 2019-02-01 14:30:20,562 [p:6590,w:1,m:0] [uWSGIWorker1Core2] Executed 1 job(s) for tool testing request: (91.901 ms)
galaxy.web.stack.transport DEBUG 2019-02-01 14:30:20,580 [p:6593,w:0,m:1] [UWSGIFarmMessageTransport.dispatcher_thread] Released lock
galaxy.web.stack.transport DEBUG 2019-02-01 14:30:20,580 [p:6594,w:0,m:2] [UWSGIFarmMessageTransport.dispatcher_thread] Acquired message lock, waiting for new message
galaxy.jobs.rules.map_resources INFO 2019-02-01 14:30:21,526 [p:6593,w:0,m:1] [JobHandlerQueue.monitor_thread] returning destination: slurm
galaxy.jobs.mapper DEBUG 2019-02-01 14:30:21,527 [p:6593,w:0,m:1] [JobHandlerQueue.monitor_thread] (21) Mapped job to destination id: slurm
galaxy.jobs.handler DEBUG 2019-02-01 14:30:21,532 [p:6593,w:0,m:1] [JobHandlerQueue.monitor_thread] (21) Dispatching to slurm runner
galaxy.jobs DEBUG 2019-02-01 14:30:21,539 [p:6593,w:0,m:1] [JobHandlerQueue.monitor_thread] (21) Persisting job destination (destination id: slurm)
galaxy.jobs DEBUG 2019-02-01 14:30:21,565 [p:6593,w:0,m:1] [JobHandlerQueue.monitor_thread] (21) Working directory for job is: /srv/galaxy/jobs/000/21
galaxy.jobs.runners DEBUG 2019-02-01 14:30:21,579 [p:6593,w:0,m:1] [JobHandlerQueue.monitor_thread] Job [21] queued (44.601 ms)
galaxy.jobs.handler INFO 2019-02-01 14:30:21,585 [p:6593,w:0,m:1] [JobHandlerQueue.monitor_thread] (21) Job dispatched
galaxy.jobs.command_factory INFO 2019-02-01 14:30:21,714 [p:6593,w:0,m:1] [SlurmRunner.work_thread-2] Built script [/srv/galaxy/jobs/000/21/tool_script.sh] for tool command [echo "Running with '${GALAXY_SLOTS:-1}' threads" > "/data/000/dataset_21.dat"]
galaxy.jobs.runners DEBUG 2019-02-01 14:30:21,791 [p:6593,w:0,m:1] [SlurmRunner.work_thread-2] (21) command is: rm -rf working; mkdir -p working; cd working; /srv/galaxy/jobs/000/21/tool_script.sh; return_code=$?; cd '/srv/galaxy/jobs/000/21';
[ "$GALAXY_VIRTUAL_ENV" = "None" ] && GALAXY_VIRTUAL_ENV="$_GALAXY_VIRTUAL_ENV"; _galaxy_setup_environment True
python "/srv/galaxy/jobs/000/21/set_metadata_FVs2H3.py" "/srv/galaxy/jobs/000/21/registry.xml" "/srv/galaxy/jobs/000/21/working/galaxy.json" "/srv/galaxy/jobs/000/21/metadata_in_HistoryDatasetAssociation_21_LeYyR2,/srv/galaxy/jobs/000/21/metadata_kwds_HistoryDatasetAssociation_21_2yXqbX,/srv/galaxy/jobs/000/21/metadata_out_HistoryDatasetAssociation_21_WwKZCq,/srv/galaxy/jobs/000/21/metadata_results_HistoryDatasetAssociation_21_nfKGx4,/data/000/dataset_21.dat,/srv/galaxy/jobs/000/21/metadata_override_HistoryDatasetAssociation_21_xsOoSa" 5242880; sh -c "exit $return_code"
galaxy.jobs.runners.drmaa DEBUG 2019-02-01 14:30:21,821 [p:6593,w:0,m:1] [SlurmRunner.work_thread-2] (21) submitting file /srv/galaxy/jobs/000/21/galaxy_21.sh
galaxy.jobs.runners.drmaa INFO 2019-02-01 14:30:21,843 [p:6593,w:0,m:1] [SlurmRunner.work_thread-2] (21) queued as 19
galaxy.jobs DEBUG 2019-02-01 14:30:21,843 [p:6593,w:0,m:1] [SlurmRunner.work_thread-2] (21) Persisting job destination (destination id: slurm)
galaxy.jobs.runners.drmaa DEBUG 2019-02-01 14:30:22,158 [p:6593,w:0,m:1] [SlurmRunner.monitor_thread] (21/19) state change: job is queued and active
galaxy.jobs.runners.drmaa DEBUG 2019-02-01 14:30:23,171 [p:6593,w:0,m:1] [SlurmRunner.monitor_thread] (21/19) state change: job is running
```
]

---
class: smaller, left
# Successful job lifecycle

Note `[p:PPPP,w:W,m:M]` in log messages:

```console
galaxy.tools.execute DEBUG 2019-02-01 14:30:20,559 [p:6590,w:1,m:0] [uWSGIWorker1Core2] Tool [testing] created job [21] (73.833 ms)
galaxy.jobs.mapper DEBUG 2019-02-01 14:30:21,527 [p:6593,w:0,m:1] [JobHandlerQueue.monitor_thread] (21) Mapped job to destination id: slurm
```

- `[p:6590,w:1,m:0]`: PID 6590, web worker, not a mule (0)
- `[p:6593,w:0,m:1]`: PID 6593, not a web worker (0), mule 1