View markdown source on GitHub

Network Analysis with Heinz

Contributors

Questions

Objectives

last_modification Last modification: Jul 26, 2021

Metagenomics / metatranscriptomics

sample collection from environment to sequence data. DNA is isolated, manipulated, ligated together with a BAC vector, before a library is constructued and the expression of different proteins analysed in a live sample.

Speaker Notes There are several advantages in sequencing all DNA (metagenomics) or RNA (metatranscriptomics) in a sample to consider the microbial content.

Many bacterial species are not easily grown on a petri-dish, and may therefore not be observed with classic analysis techniques.

Metatranscriptomics can also reveal the functional state of the microbiome, suggesting which functional pathways are switched on.


The human microbiome

.image-50[ image of a human with lines leading from various points on the body to pie charts of bacteria distribution]

.footnote[Sender et al 2016, PLoS ONE]

Speaker Notes

[*] this figure is difficult to estimate, as we need to consider when two orthologous genes can be considered different. A rough estimate would be 500~10000 species of bacteria with a median of 2000 genes.


Environments related to health

.pull-left[

Picture of someone's mouth, showing their teeth and into their oral cavity and throat. ]

.pull-right[ Nature cover "our other genome" ] —

What would we find?

.pull-left[

.pull-right[ Picture of someone's mouth, showing their teeth and into their oral cavity and throat. ]


Some bacterial species are difficult to culture

petri dish with bacterial cultures


Metagenomics vs 16S

.pull-left[ Full shotgun sequencing of sample:

Sequence ribosomal regions (16S)

]

.pull-right[

.image-75[ shotgun sequencing illustrated with jigsaw puzzles, severla individual puzzles are mixed together. ]

.image-75[ amplicon sequencing illustrated with jigsaw puzzles, all of the corner pieces of each puzzle are circled.]

] Speaker Notes For the workflow considered here we assume RNAseq of the microbiome.

Puzzle analogy: think of each organism as a jigsaw puzzle


16S ribosomal DNA

overview of the bacterial genome

Speaker Notes 16S sequencing is still very popular to estimate the diversity of a microbiome. In this work we are also interested in the functional content.


Gum disease (priodontitis)

journal article "Metatranscriptomics of the human oral microbiome during health and disease"

Speaker Notes For reference only


Microbiome

oral microbiome

Speaker Notes

When we work with metagenomics data there are two main aims 1) profiling of the species content of the microbiome 2) a functional analysis of the microbiome.


MetaModules

metamodules publication

Speaker Notes Here we consider the “metaModules” workflow designed by Ali May et a.


Exploring the networks of microbial function that are involved in disease

A gigantic blob of nodes and lines

Speaker Notes The key idea is to explore functional networks of the microbiome that are involved in disease


Oral disease

.pull-left[

]

.pull-right[

.image-75[ dental caries]

.image-75[ periodontal disease]

]


Healthy vs Disease

comparing microbiomes from healthy and disease samples. Both are sequenced, and then pie charts showing different microbiomes.

Speaker Notes The first part of the workflow entail RNA sequencing of oral microbiome samples.

From this we get the total RNA present, from all species, including ribosomal RNA.

This can be used to estimate both a taxonomic profile, and a functional profile.

With the workflow presented here, we go one step further, and consider which subnetworks in the metabolic pathways are most deregulated in the microbiome samples showing disease (caries).


Using knowledge to pinpoint the key players

The healthy and diseased mouths are sequenced, mapped to produce gene counts, these are binned for KO counts, and differential expression analysis produces the p-values of KO Fold changes. The p-values are converted to KO weights, and these are mapped to KO pathways to produce a network with weighted KOs, and maximum scoring subnetworks are extracted.

Speaker Notes

So, now to the experimental design and analysis steps. We used a previously published metatranscriptomic dataset of microbial RNA reads from healthy and caries disease samples.

We did some preprocessing to filter out non-coding RNA etc, and mapped the remaining reads to a KEGG database of 3 million genes.

We calculated the gene counts in healthy and diseased samples and aggregated these gene counts in KOs.

This was followed b differential expression analysis at the KO level, from which we derived fold changes and P-values for KOs.

We use the P-values to obtain scores for each KO, use the scores to annotate the network and finally run an algorithm in the network to identify the maximum-scoring regions.


Some (KEGG database) terminology

KEGG species view

Speaker Notes But let’s talk briefly about how we define these molecular functions first and get familiar with the terminology. In the database we use, KEGG, we have species and corresponding molecular pathways, like in this case S. Mutans and glycolysis.


Some (KEGG database) terminology

KEGG species view

Speaker Notes Of course, each species has its own specific metabolic network - even though we may not have the information on all specific species in a sample.


Some (KEGG database) terminology

KEGG KO groups showing pathways from Species 1 and species 2. Different portions of the pathway are highlighted

Speaker Notes Now, as orthologous genes in the different species perform the same function, we can collapse the metabolic network of all known bacterial species in to a single network. Here is node represents a KO: a KEGG Orthologous Group, containing orthologous genes performing the same metabolic function in the metabolic network. We use this network of KOs to analyse the microbial sample.


Functional signatures of microbiome-related disease

The most common objective is to identify differentially expressed pathways by performing comparative metagenomic/metatranscriptomic analyses.

funtional signatures of healthy and disease samples

Speaker Notes To understand the functional differences between two conditions, for instance in microbiome-related disease and health, what is commonly done is to look at differentially expressed pathways between the two conditions.

The figure on left shows the pathways that are significantly over- or under-abundant in diseased and healthy patients.

On the right is a figure from another study where researchers listed some important pathways and the KOs in these pathways that are up- (blue) or downregulated (red) in disease.


pathway analysis of a network image

Speaker Notes We can now visualise the RNAseq data on our collapsed microbial metabolic network of KOs. The red nodes represent KOs that are overexpressed in disease, and the blue node those that are underexpressed in disease as compared to healthy samples.

It is immediately clear that the data contains many overexpressed genes, and it is difficult to immediately pinpoint the most deregulated part of the network. For this purpose, we use the Heinz methods to extract de most deregulated subnetwork.


Identifying signatures of disease

the large messy graph from before is shown in the dark, with a torch shining a light on a subset of that graph.


Step 1: Construct a global network

369 pathway maps are reduced into 6642 KOs and 35405 interactions in a giant messy graph.

Speaker Notes

The first step is to construct the global KO network, in the pratical this global network is already provided. As are the processed RNAseq values per KO.


Step 2: Annotate the network with weights

A network of KOs are shown with weights on each node.

Speaker Notes

Next we need to convert the p-values provided by deseq for under and over- expression into weights. Where a positive weight means the KO is significantly deregulated.


Step 3: Calculate the maximum-scoring connected subnetwork

.pull-left[

]

.pull-right[

maximum scoring subnetwork ]

Speaker Notes

Subsequently, the Heinz algorithm can find the subnetwork that is most deregulated.


Differential Expression Analysis

differential expression analysis plot

Each dot’s (function’s) change in expression corresponds to a P-value


P-value distribution when the null hypothesis is true

distribution of p-values

Speaker Notes First, we need to convert the p-values to weight. Here we think about a p-value distribution on a set that has no signal. In this case we expect a uniform distribution of p-values between 0 and 1. Note that if you generate many p-values for a data set, it is always helpful to check the distribution to see if there is any signal in the data.


P-value distribution: signal vs noise

Beta uniform mixture (BUM) model

Noise (blue) and signal (red) in an expected P-value distribution Beta-uniform mixture model (mixing a beta and a uniform distribution )

Speaker Notes If there is a significant part of the data deregulated, we expect a peak on the left hand side of the distribution, i.e. an abundance of very low p-value as compared to the uniform distribution. This figure is very helpful, as we can estimate the number of true positives in the data: the part of the peak that rises above the uniform distribution.


P-value to weight transformation

A graph comparing p-value to density with four regions labelled, A, B, C, and D. A is the region with p less than 0.1 and density greater than 0.5. C is the region directly below, with low p-value but low density. D is the region of higher p-value (0.1-1.0) but low density, and B is the final region of higher p-value and higher density.

\[FDR = \frac{C}{A+C}\]

Speaker Notes FP = false positives, TP = true positives

From the true positives, false positives and false negatives we can calculate the FDR (false discovery rate). Note that this is generally true for p-value distribution, and not specific for expression data.


False discovery rate (FDR) sets the threshold P-value

the same graph as before, caption repeats. A graph comparing p-value to density with four regions labelled, A, B, C, and D. A is the region with p less than 0.1 and density greater than 0.5. C is the region directly below, with low p-value but low density. D is the region of higher p-value (0.1-1.0) but low density, and B is the final region of higher p-value and higher density.

Speaker Notes From this we can calculate the FDR, and we can use an FDR threshold to determine which p-value should correspond the a zero weight.


Result: Weight-annotated network

weight-annotated network image

red: weight > 0, grey: weight < 0

Speaker Notes Now we display the transformed weight onto the global KO network. Note that it is still difficult to spot the most deregulated subnetwork by eye.


Finding the subnetwork

using Heinz to find subnetworks

.footnote[Dittrich,M.T. et al. (2008) Identifying functional modules in protein–protein interaction networks: an integrated exact approach. Bioinformatics, 24, I223–I231]

Speaker Notes Lastly, we use Heinz to find this deregulated subnetwork.


Result: gum disease subnetwork

7 pathways, 39 KOs (32 up- and 7 downregulated in disease)

results subnetworks

.footnote[ Dinkla,K. et al. (2014) eXamine: exploring annotated modules in networks. BMCBioinformatics, 15, 201.]

Speaker Notes

Here we display the results not for a caries dataset, but a gum-disease data set. We will look at the caries data set in the practical.


Butanoate metabolism in gum disease

.pull-left[

.image-50[results subnetworks]

Green: Butanoate metabolism

.image-75[ butanoate metabolic pathway]

]

.pull-right[ Niederman et al., 1997:butyric acid concentrations associated significantly with disease severity. Taken together, these data suggest that butyric acid plays a mediating role in periodontal disease pathogenesis.

Chang et al., 2013:butyrate generated by periodontal pathogens may be involved in the pathogenesis of periodontal diseases via the induction of ROS production and the impairment of cell growth, cell cycle progression and expression of cell cycle-related genes in GFs. ]

Speaker Notes Some of the pathways found in the deregulated subnetwork have previously been associated with gum-disease, but are here found by simply analysing high-throughput data.


Sulfur metabolism in gum disease

.pull-left[

.image-50[ results subnetworks]

Coral: Sulfur metabolism

.image-50[ sulfur metabolic pathway]

]

.pull-right[

Langendijk et al 2001: Sulfate-reducing bacteria (SRB) may be etiologically involved in destructive periodontal diseases. These strictly anaerobic bacteria utilize fermentation products for energy conservation by reduction of sulfate to sulfide. This toxic product can accumulate in periodontal pockets in concentrations causing cellular destruction. SRB depend on an actively degrading microbiota to produce a reduced environment, fermentation products and sulfate. The detection frequency of these bacteria is strongly increased in periodontitis compared with healthy sites in the oral cavity.

publication on role of sulfure in periodontitis

]


Potentially interesting pathways

Also some potentially interesting novel pathways have been found.

A large graph shown mostly in dark grey, several sub-graphs are highlighted in colour.


Real biology vs. 500 random networks

500 networks with shuffled weights were produced, a graph shows the subnetworks from networks with shuffled weights having low cumulative subnetwork scores, while the original subnetwork has a high score.

Speaker Notes Lastly, it is possible to check how significant the deregulated subnetwork itself is. This goes beyond the scope of the practical work.


Practical


Key Points

Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors! Galaxy Training Network This material is licensed under the Creative Commons Attribution 4.0 International License.