Network Analysis with Heinz

name: inverse
layout: true
class: center, middle, inverse

</span></div>

</span></div>

---

# Network Analysis with Heinz

<div class="contributors-line">
		Authors: 
<a href="/training-material/hall-of-fame/sanneabeln/" class="contributor-badge contributor-sanneabeln"><img src="https://avatars.githubusercontent.com/sanneabeln?s=27" alt="Avatar">Sanne Abeln</a>

<a href="/training-material/hall-of-fame/cicozhang/" class="contributor-badge contributor-cicozhang"><img src="https://avatars.githubusercontent.com/cicozhang?s=27" alt="Avatar">Chao (Cico) Zhang</a>

</div>

<div class="footnote" style="bottom: 8em;"><i class="far fa-calendar" aria-hidden="true"></i><span class="visually-hidden">last_modification</span> Updated: Jul 26, 2021</div>

<div class="footnote" style="bottom: 6em;">

<i class="fas fa-file-alt" aria-hidden="true"></i><span class="visually-hidden">text-document</span><a href="slides-plain.html"> Plain-text slides</a>
</div>

<div class="footnote" style="bottom: 2em;">
    <strong>Tip: </strong>press <kbd>P</kbd> to view the presenter notes
    | <i class="fa fa-arrows" aria-hidden="true"></i><span class="visually-hidden">arrow-keys</span> Use arrow keys to move between slides

</div>

???
Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.

Press `P` again to switch presenter notes off

Press `C` to create a new window where the same presentation will be displayed.
This window is linked to the main window. Changing slides on one will cause the
slide to change on the other.

Useful when presenting.

---

## Requirements

Before diving into this slide deck, we recommend you to have a look at:

- [Introduction to Galaxy Analyses](/training-material/topics/introduction)

- [Sequence analysis](/training-material/topics/sequence-analysis)
    - Quality Control: [<i class="fab fa-slideshare" aria-hidden="true"></i><span class="visually-hidden">slides</span> slides](/training-material/topics/sequence-analysis/tutorials/quality-control/slides.html) - [<i class="fas fa-laptop" aria-hidden="true"></i><span class="visually-hidden">tutorial</span> hands-on](/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html)
    - Mapping: [<i class="fab fa-slideshare" aria-hidden="true"></i><span class="visually-hidden">slides</span> slides](/training-material/topics/sequence-analysis/tutorials/mapping/slides.html) - [<i class="fas fa-laptop" aria-hidden="true"></i><span class="visually-hidden">tutorial</span> hands-on](/training-material/topics/sequence-analysis/tutorials/mapping/tutorial.html)

---

### <i class="far fa-question-circle" aria-hidden="true"></i><span class="visually-hidden">question</span> Questions

- What is metatranscriptomics

- Which pathways are potentially contributing to dental caries?

---

### <i class="fas fa-bullseye" aria-hidden="true"></i><span class="visually-hidden">objectives</span> Objectives

- Learn some background knowledge about microbiome

- Understand metatranscriptomics

- Familiarize yourself with KEGG terminology

- Understand how network(pathway) analysis works

---

# Metagenomics / metatranscriptomics

- Sequence all DNA/RNA in a sample from a specific microbial environment
- Get an overview of the entire microbial community

![sample collection from environment to sequence data. DNA is isolated, manipulated, ligated together with a BAC vector, before a library is constructued and the expression of different proteins analysed in a live sample.](../../images/heinz/sampleextraction.png)

???
There are several advantages in sequencing all DNA (metagenomics) or RNA (metatranscriptomics) in a sample to consider the microbial content.

Many bacterial species are not easily grown on a petri-dish, and may therefore not be observed with classic  analysis techniques.

Metatranscriptomics can also reveal the functional state of the microbiome, suggesting which functional pathways are switched on.

---

# The human microbiome

- Over 100 trillion microbes
- Outnumber our human cells 1.3 to 1.
- Include bacteria, fungi, protozoa, viruses, in and outside our body.
- The number of genes in all the microbes in one person's microbiome is ~10-200 times the number of genes in the human genome.
- The microbiome may weigh as much as 200 grams

.image-50[ ![image of a human with lines leading from various points on the body to pie charts of bacteria distribution](../../../metagenomics/images/human_microbiome.png)]

.footnote[[Sender et al 2016, PLoS ONE](https://doi.org/10.1371/journal.pbio.1002533)]

???

[*] this figure is difficult to estimate, as we need to consider when two orthologous genes can be considered different. A rough estimate would be 500~10000 species of bacteria with a median of 2000 genes.

---

# Environments related to health

.pull-left[
- Gut / bowel / intestine
- Oral Cavity
- Nasal Cavity
- Skin

![Picture of someone's mouth, showing their teeth and into their oral cavity and throat.](../../images/heinz/oralcavity.jpg)
]

.pull-right[
![Nature cover "our other genome"](../../images/heinz/othergenome.png)
]
---

# What would we find?

.pull-left[
- Oral Cavity
- DNA from:
  - bacteria
  - viruses
  - human
  - archaea
  - spider (?!)
]

.pull-right[
![Picture of someone's mouth, showing their teeth and into their oral cavity and throat.](../../images/heinz/oralcavity.jpg)
]

---

## Some bacterial species are difficult to culture

![petri dish with bacterial cultures](../../images/heinz/petridish.png)

- Such bacteria may not be found in current reference databases
- The only way to obtain each species may be through direct sequencing of environmental samples
- Purpose: find novel species, genomes, and genes

---

# Metagenomics vs 16S

.pull-left[
**Full shotgun sequencing of sample:**
 - sequence all available DNA/ RNA
 - may need to sequence very deep
 - functional composition

**Sequence ribosomal regions (16S)**
 - need amplicons
 - requires less sequencing
 - taxonomic composition

]

.pull-right[

.image-75[ ![shotgun sequencing illustrated with jigsaw puzzles, severla individual puzzles are mixed together.](../../../metagenomics/images/puzzles-shotgun.png) ]

.image-75[ ![amplicon sequencing illustrated with jigsaw puzzles, all of the corner pieces of each puzzle are circled.](../../../metagenomics/images/puzzles-amplicon.png)]

]
???
For the workflow considered here we assume RNAseq of the microbiome.

Puzzle analogy: think of each organism as a jigsaw puzzle

- shotgun: we considers all pieces (DNA) from all puzzles (organisms); we throw the pieces in a big pile, and need to reconstruct each individual puzzle from this. It is more complex, but you get full details (functional information) about the picture on each puzzle.

- amplicon: we only look at the corner pieces, these are easy to spot, and all puzzles have them. It will not tell us all the details (e.g. functional information) of the picture on the puzzle, but might be enough to tell us whether it is a landscape or a portrait or abstract art for instance (taxonomy).

---

# 16S ribosomal DNA

![overview of the bacterial genome](../../images/heinz/16Sgene.png)

- Multiple copies of **16S, 23S** and **5S**
- 16S: 9 hypervariable & 9 highly conserved regions

???
16S sequencing is still very popular to estimate the diversity of a microbiome. In this work we are also interested in the functional content.

---

# Gum disease (priodontitis)

![journal article "Metatranscriptomics of the human oral microbiome during health and disease"](../../images/heinz/abstract.png)

???
For reference only

---

# Microbiome

![oral microbiome](../../images/heinz/oralmicrobiome.png)

???

When we work with metagenomics data there are two main aims
1) profiling of the species content of the microbiome
2) a functional analysis of the microbiome.

---

# MetaModules

![metamodules publication](../../images/heinz/metamodules.png)

???
Here we consider the “metaModules” workflow designed by Ali May et a.

---

# Exploring the networks of microbial function that are involved in disease

![A gigantic blob of nodes and lines](../../images/heinz/network.png)

???
The key idea is to explore functional  networks of the microbiome that are involved in disease

---

# Oral disease

.pull-left[
- **Dental caries**
  The most common chronic disease of children (25%), and adolescents (59%). Bacteria feed on food debris. Bacteria produce acids leading to demineralization and destruction of the teeth.

- **Periodontal disease**
  Bacteria living in tissues surrounding and supporting the teeth; An over-aggressive immune response can subsequently lead to loosening and finally loss of teeth.

]

.pull-right[

.image-75[ ![dental caries](../../images/heinz/caries.jpg)]

.image-75[ ![periodontal disease](../../images/heinz/periodontaldisease.jpg)]

]

---

# Healthy vs Disease

![comparing microbiomes from healthy and disease samples. Both are sequenced, and then pie charts showing different microbiomes.](../../images/heinz/healthyvsdisease.png)

???
The first part of the workflow entail RNA sequencing of oral microbiome samples.

From this we get the total RNA present, from all species, including ribosomal RNA.

This can be used to estimate both a taxonomic profile, and a functional profile.

With the workflow presented here, we go one step further, and consider which subnetworks in the metabolic pathways are most deregulated in the microbiome samples showing disease (caries).

---

## Using knowledge to pinpoint the key players

![The healthy and diseased mouths are sequenced, mapped to produce gene counts, these are binned for KO counts, and differential expression analysis produces the p-values of KO Fold changes. The p-values are converted to KO weights, and these are mapped to KO pathways to produce a network with weighted KOs, and maximum scoring subnetworks are extracted.](../../images/heinz/workflow.png)

???

So, now to the experimental design and analysis steps. We used a previously published metatranscriptomic dataset of microbial RNA reads from healthy and caries disease samples.

We did some preprocessing to filter out non-coding RNA etc, and mapped the remaining reads to a KEGG database of 3 million genes.

We calculated the gene counts in healthy and diseased samples and aggregated these gene counts in KOs.

This was followed b differential expression analysis at the KO level, from which we derived fold changes and P-values for KOs.

We use the P-values to obtain scores for each KO, use the scores to annotate the network and finally run an algorithm in the network to identify the maximum-scoring regions.

---

# Some (KEGG database) terminology

![KEGG species view](../../images/heinz/kegg1.png)

???
But let’s talk briefly about how we define these molecular functions first and get familiar with the terminology. In the database we use, KEGG, we have species and corresponding molecular pathways, like in this case S. Mutans and glycolysis.

---

# Some (KEGG database) terminology

![KEGG species view](../../images/heinz/kegg1-2.png)

???
Of course, each species has its own specific metabolic network - even though we may not have the information on all specific species in a sample.

---

# Some (KEGG database) terminology

![KEGG KO groups showing pathways from Species 1 and species 2. Different portions of the pathway are highlighted](../../images/heinz/kegg2.png)

???
Now, as orthologous genes in the different species perform the same function, we can collapse the metabolic network of all known bacterial species in to a single network. Here is node represents a KO: a KEGG Orthologous Group, containing orthologous genes performing the same metabolic function in the metabolic network. We use this network of KOs to analyse the microbial sample.

---

# Functional signatures of microbiome-related disease

The most common objective is to identify differentially expressed pathways by performing comparative metagenomic/metatranscriptomic analyses.

![funtional signatures of healthy and disease samples](../../images/heinz/functionalsignatures.png)

???
To understand the functional differences between two conditions, for instance in microbiome-related disease and health, what is commonly done is to look at differentially expressed pathways between the two conditions.

The figure on left shows the pathways that are significantly over- or under-abundant in diseased and healthy patients.

On the right is a figure from another study where researchers listed some important pathways and the KOs in these pathways that are up- (blue) or downregulated (red) in disease.

---

![pathway analysis of a network image](../../images/heinz/globalapproach.png)

???
We can now visualise the RNAseq data on our collapsed microbial metabolic network of KOs. The red nodes represent KOs that are overexpressed in disease, and the blue node those that are underexpressed in disease as compared to healthy samples.

It is immediately clear that the data contains many overexpressed genes, and it is difficult to immediately pinpoint the most deregulated part of the network. For this purpose, we use the Heinz methods to extract de most deregulated subnetwork.

---

# Identifying signatures of disease

- Where are the most important regions in this large network?
- How can we find such regions?

![the large messy graph from before is shown in the dark, with a torch shining a light on a subset of that graph.](../../images/heinz/identification.png)

---

# Step 1: Construct a global network

![369 pathway maps are reduced into 6642 KOs and 35405 interactions in a giant messy graph.](../../images/heinz/globalnetwork.png)

???

The first step is to construct the global KO network, in the pratical this global network is already provided. As are the processed RNAseq values per KO.

---

# Step 2: Annotate the network with weights

![A network of KOs are shown with weights on each node.](../../images/heinz/network_heinz.png)

???

Next we need to convert the p-values provided by deseq for under and over- expression into weights. Where a positive weight means the KO is significantly deregulated.

---

# Step 3: Calculate the maximum-scoring connected subnetwork

.pull-left[

- Problem: given network with weights find the maximum-scoring subnetwork

- NP-hard problem.

- Solutions are feasible in biological networks.

- An exact solver exists: Heinz [Dittrich et al., 2008].

]

.pull-right[

![maximum scoring subnetwork](../../images/heinz/network_heinz2.png)
]

???

Subsequently, the Heinz algorithm can find the subnetwork that is most deregulated.

---

# Differential Expression Analysis

![differential expression analysis plot](../../images/heinz/dea.png)

Each dot's (function's) change in expression corresponds to a P-value

---

# P-value distribution when the null hypothesis is true

![distribution of p-values](../../images/heinz/histogram.png)

???
First, we need to convert the p-values to weight. Here we think about a p-value distribution on a set that has no signal. In this case we expect a uniform distribution of p-values between 0 and 1. Note that if you generate many p-values for a data set, it is always helpful to check the distribution to see if there is any signal in the data.

---

# P-value distribution: signal vs noise

![Beta uniform mixture (BUM) model](../../images/heinz/bum.png "P-value histogram of 7105 KO fold changes")

<span style="color:blue">Noise (blue)</span> and <span style="color:red">signal (red)</span> in an expected P-value distribution
Beta-uniform mixture model (mixing a beta and a uniform distribution )

???
If there is a significant part of the data deregulated, we expect a peak on the left hand side of the distribution, i.e. an abundance of very low p-value as compared to the uniform distribution. This figure is very helpful, as we can estimate the number of true positives in the data: the part of the peak that rises above the uniform distribution.

---

# P-value to weight transformation

![A graph comparing p-value to density with four regions labelled, A, B, C, and D. A is the region with p less than 0.1 and density greater than 0.5. C is the region directly below, with low p-value but low density. D is the region of higher p-value (0.1-1.0) but low density, and B is the final region of higher p-value and higher density. ](../../images/heinz/pvalue-to-weight.png)

$$ FDR = \frac{C}{A+C}$$

- **A:** True positives, **C:** False positives
- False discovery rate (**FDR**) -> a threshold P-value

???
FP = false positives, TP = true positives

From the true positives, false positives and false negatives we can calculate the FDR (false discovery rate). Note that this is generally true for p-value distribution, and not specific for expression data.

---

# False discovery rate (FDR) sets the  threshold P-value

![the same graph as before, caption repeats. A graph comparing p-value to density with four regions labelled, A, B, C, and D. A is the region with p less than 0.1 and density greater than 0.5. C is the region directly below, with low p-value but low density. D is the region of higher p-value (0.1-1.0) but low density, and B is the final region of higher p-value and higher density.](../../images/heinz/pvalue-to-weight.png "p-value histogram of 7105 KO fold")

- The threshold P-value (τ) -> KOs with + and - weights
- FDR determines the number of positive-scoring nodes and therefore the size of the final subnetwork.
- In this dataset, an **FDR=0.0007 corresponded to τ = 10^5**

???
From this we can calculate the FDR, and we can use an FDR threshold to determine which p-value should correspond the a zero weight.

---

# Result: Weight-annotated network

![weight-annotated network image](../../images/heinz/network_weight_annotated.png)

<span style="color:red">red:</span> weight &gt; 0, <span style="color:grey">grey:</span> weight &lt; 0

???
Now we display the transformed weight onto the global KO network. Note that it is still difficult to spot the most deregulated subnetwork by eye.

---

# Finding the subnetwork

![using Heinz to find subnetworks](../../images/heinz/find_subnetwork.png)

.footnote[Dittrich,M.T. et al. (2008) Identifying functional modules in protein–protein interaction networks: an integrated exact approach. Bioinformatics, 24, I223–I231]

???
Lastly, we use Heinz to find this deregulated subnetwork.

---

# Result: gum disease subnetwork

7 pathways, 39 KOs (32 up- and 7 downregulated in disease)

![results subnetworks](../../images/heinz/results.png)

.footnote[ Dinkla,K. et al. (2014) eXamine: exploring annotated modules in networks. BMCBioinformatics, 15, 201.]

???

Here we display the results not for a caries dataset, but a gum-disease data set. We will look at the caries data set in the practical.

---

# Butanoate metabolism in gum disease

.pull-left[

.image-50[![results subnetworks](../../images/heinz/results.png)]

<span style="color:green">Green:</span> Butanoate metabolism

.image-75[ ![butanoate metabolic pathway](../../images/heinz/butanoate.png)]

]

.pull-right[
<span style="font-size: small">
*Niederman et al., 1997:*
...**butyric acid concentrations** associated significantly with **disease severity**. Taken together, these data suggest that **butyric acid plays a mediating role in periodontal disease pathogenesis.**
</span>

<span style="font-size: small">
*Chang et al., 2013:*
…**butyrate** generated by periodontal pathogens may be involved in the pathogenesis of **periodontal diseases** via the induction of ROS production and the impairment of cell growth, cell cycle progression and expression of cell cycle-related genes in GFs.
</span>
]

???
Some of the pathways found in the deregulated subnetwork have previously been associated with gum-disease, but are here found by simply analysing high-throughput data.

---

# Sulfur metabolism in gum disease

.pull-left[

.image-50[ ![results subnetworks](../../images/heinz/results.png)]

<span style="color:coral">Coral:</span> Sulfur metabolism

.image-50[ ![sulfur metabolic pathway](../../images/heinz/sulfur.png)]

]

.pull-right[

<span style="font-size: small">
*Langendijk et al 2001*: **Sulfate-reducing bacteria** (SRB) may be etiologically involved in destructive **periodontal diseases**. These strictly anaerobic bacteria utilize fermentation products for energy conservation by reduction of sulfate to sulfide. This toxic product can accumulate in periodontal pockets in concentrations causing cellular destruction. SRB depend on an actively degrading microbiota to produce a reduced environment, fermentation products and sulfate. The detection frequency of these bacteria is **strongly increased in periodontitis** compared with healthy sites in the oral cavity.
</span>

![publication on role of sulfure in periodontitis](../../images/heinz/sulfur_publication.png)

]

---

# Potentially interesting pathways

Also some potentially interesting novel pathways have been found.

![A large graph shown mostly in dark grey, several sub-graphs are highlighted in colour.](../../images/heinz/novelpathways.png)

---

# Real biology vs. 500 random networks

![500 networks with shuffled weights were produced, a graph shows the subnetworks from networks with shuffled weights having low cumulative subnetwork scores, while the original subnetwork has a high score.](../../images/heinz/random-vs-real.png)

???
Lastly, it is possible to check how significant the deregulated subnetwork itself is. This goes beyond the scope of the practical work.

---

# Practical

- Work on caries data set for a network analysis
- Get a feel of running each tool, mentally map the parameters/options back to the method mentioned in this presentation.
- Save all the steps into a workflow and rerun it.

---
### <i class="fas fa-key" aria-hidden="true"></i><span class="visually-hidden">keypoints</span> Key points

- Analyzing differential expression for (meta)transcriptomics data

- Validating the p-value distribution of the differential expression analysis

- Finding the most differentially expressed subnetwork in a gene functional network using Heinz

- Interpreting the pinpointed subnetwork

---

## Thank You!

This material is the result of a collaborative work. Thanks to the [Galaxy Training Network](https://training.galaxyproject.org) and all the contributors!

<a href="/training-material/hall-of-fame/cicozhang/" class="contributor-badge contributor-cicozhang"><img src="https://avatars.githubusercontent.com/cicozhang?s=27" alt="Avatar">Chao (Cico) Zhang</a>

</div>

</div>

<a rel="license" href="https://creativecommons.org/licenses/by/4.0/">
This material is licensed under the Creative Commons Attribution 4.0 International License</a>.