MEBS

Valerie De Anda

19/Oct/2021

About MEBS

Multigenomic Entropy-Based Score, or MEBS, allows the user to synthesize genomic information into a single informative value. This entropy score can be used to infer the likelihood that microbial taxa perform specific metabolic-biogeochemical pathways.

Authors

MEBS is designed, created and maintained by:

Valerie De Anda at the University of Texas at Austin
Cesar Augusto Hernandez at the Cellular Physiology Institute of Universidad Nacional Autonoma de Mexico.
Bruno Contreras Moreira at the Laboratory of Computational Biology at Estacion Experimental de Aula Dei (EEAD/CSIC)

_{Figure 1. MEBS authors}

MEBS in a nutshell

Relative entropy

How can we organize and simplify complex genomic data to better understand the metabolisms of diverse microbial taxa? This was the major research question behind the development of MEBS. It works by synthesizing multiple sources of data on the metabolism of interest (e.g., genes, microbial taxa, and elemental reactions) under the mathematical framework of the Kullback-Leibler divergence (Equation 1), also known as relative entropy H’. An H’ value is assigned to each protein-coding gene (pcg) based on how often the gene is encoded in the genomes of metabolically similar microbial taxa. H’ values close to 0 indicate that a given pcg is widely distributed among microorganisms (e.g., ABC transporters) and is therefore non-informative of the target metabolism. H’ values near or greater than 1 indicate that a given pcg is unique to a group of metabolically similar microbial taxa, whereas negative H’ values indicate that a given pcg is not expected to be involved in the target metabolism. MEBS stores the unique pcgs and their H’ values in an internal database, which can then be cross-referenced with genomic and metagenomic input data to obtain a single H’ score for a given metabolic pathway. High H’ scores suggest that microbial taxa can perform the pathways involved in the metabolism of interest.

_{Equation 1. Kullback-Leibler divergence, also known as relative entropy H’, measures the difference between probabilities P and Q (see Equation (1) below). In this context, P(i) represents the frequency of protein domain i in the genomes of a given metabolism (observed frequency), while Q(i) represents its frequency in a non-redundant set of genomes (expected frequency). The script to compute the entropy (entropy.pl) requires the list of the genomes of interest and the tabular output file obtained from scanning a set of non-redundant genomes against a set of protein families involved in the metabolism of interest. The obtained values of H΄ (in bits) capture to what extent a given Pfam domain informs about the metabolism of interest. In this case, domains with H΄ values close to or greater than 1 correspond to the most informative Pfam domains (for example, those enriched among S-based genomes), whereas low H΄ values (close to 0) indicate non-informative ones. Negative values correspond to domains observed less often than expected.}
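As a rough illustration of the per-domain value described in the caption, the sketch below computes the pointwise Kullback-Leibler term P(i) * log2(P(i) / Q(i)) for a single domain. A summed Kullback-Leibler divergence cannot be negative, so a per-domain term is assumed here because the text reports negative values. The function name and the example frequencies are hypothetical, and the exact formula and zero handling used by entropy.pl are an assumption, not taken from the script.

```python
import math

def relative_entropy(p_obs: float, q_exp: float) -> float:
    """Pointwise KL term in bits: P(i) * log2(P(i) / Q(i)).

    p_obs: frequency of the domain in the genomes of interest (P).
    q_exp: frequency of the domain in the non-redundant genome set (Q).
    Assumed formula; entropy.pl may treat zeros or pseudocounts differently.
    """
    if p_obs == 0.0:
        return 0.0  # usual convention: 0 * log(0) is taken as 0
    if q_exp == 0.0:
        raise ValueError("expected frequency Q must be > 0 when P > 0")
    return p_obs * math.log2(p_obs / q_exp)

# Hypothetical frequencies, for illustration only
enriched = relative_entropy(0.90, 0.10)    # common in target genomes, rare elsewhere
ubiquitous = relative_entropy(0.95, 0.95)  # present almost everywhere
depleted = relative_entropy(0.05, 0.50)    # rarer than expected in target genomes
print(round(enriched, 3), round(ubiquitous, 3), round(depleted, 3))
```

Under this assumption, the enriched domain scores above 1, the ubiquitous one scores 0, and the depleted one is negative, which matches the qualitative reading given in the caption.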

Built-in functions

Currently, the MEBS software has a built-in function that allows users to accurately and quickly evaluate the likelihood that up to thousands of microbial taxa (genomes) or communities (metagenomes) can perform the metabolic reactions involved in C, N, O, S, and Fe cycling. However, a user can also take a more advanced, step-wise approach to examine additional metabolic reactions (e.g., As cycling). In addition, MEBS includes several tools to help users interpret their results: H’ values of specific metabolic pathways can be mapped on visual reconstructions of microbial phylogenies (Fig. 2C); microbial taxa or entire communities can be sorted by their metabolic function (e.g., nutrient assimilation) (Fig. 2D); or communities filling certain metabolic niches (e.g., heavy metal degradation) can be mapped at global scales (Fig. 2E).

_{Figure 2. Overview of Multigenomic Entropy-Based Score (MEBS) software. A) Iceberg analogy schematic, showing that only a small proportion of metabolic information is typically used to draw ecologically relevant conclusions. Most of the microbial ecology “omic” studies have focused on either: (i) developing a broad description of the metabolic pathways within specific environments, (ii) analyzing the relative abundance of marker genes involved in key metabolic processes, or (iii) discovering differentially abundant, shared or unique, genes, proteins, or metabolic pathways. Approaches that balance biological interpretative power and computational efficiency are needed to evaluate integratively omic-derived data. B) Schematic of MEBS computational workflow. Omic-derived data of genomes, metagenomes, or MAGs (Metagenome Assembled Genomes) are the first input of the MEBS pipeline. The black circles represent the simple steps that users must perform before running MEBS. C) The entropy-based scores derived from MEBS capture the metabolic machinery of biogeochemical cycles (e.g., C, N, O, S, and Fe), which can be mapped on phylogenic trees of microbial taxa. D) Entropy scores can be used by a number of built-in machine learning analyses, such as classification and unsupervised clustering. The summary of protein features in single values may be useful for the discovery of metabolic combinations that are poorly identified by other standard methods. E) Entropy scores for massively distributed genomic datasets can be assembled to examine biogeochemical cycles (e.g., the sulfur cycle) across regional and global domains. F) Entropy scores can be integrated into time-series analysis to monitor fluctuations in the microbial-metabolic biogeochemical cycles in experimental studies under normal or perturbed conditions.}

Applications

MEBS provides an open-access, reproducible, entropy-based platform to efficiently analyze microbial genomic datasets and decode a plethora of metabolic-chemical pathways. The broad application of MEBS stretches across scientific disciplines. MEBS can be used to study microbial responses to ongoing threats of anthropogenic and climate-linked change associated with sea level rise, ocean acidification, and land use by providing new information on shifts in microbial-mediated biogeochemical cycling and bottom-up effects on entire ecosystems. MEBS can identify global environments with microbial taxa that carry the necessary pcgs for the degradation of hydrocarbons, heavy metals, and other pollutants, which lends itself to new bioremediation practices. MEBS can also be used in translational medicine to develop new tests that differentiate groups of infectious, antibiotic-resistant bacteria in the human lung. In the future, I see MEBS software, along with other metabolic decoder approaches, integrated into sequencer instruments (e.g., MinION), allowing real-time analysis of complex metabolic pathways in nature.

_{Figure 3. MEBS current and future applications}

MEBS main components

This software has been tested on Linux and MacOSX systems. Currently MEBS is composed of 3 main scripts; some of them depend on other scripts found in the MEBS scripts directory.

mebs.pl: Main script that runs the hmmsearch searches and computes entropy scores
mebs_vis.py: Visualization script that generates several human-readable files for metabolic analysis and downstream applications
mebs_clust.py: Computes clustering analysis based on a presence/absence matrix

System requirements and installation:

Hmmsearch

To use mebs.pl you only need to have perl and hmmsearch installed.

To install hmmsearch, see the HMMER documentation webpage to obtain and compile HMMER from source (http://hmmer.org/documentation.html), or install it with one of the following package managers:

brew install hmmer               # OS/X, HomeBrew
port install hmmer               # OS/X, MacPorts
apt install hmmer                # Linux (Ubuntu, Debian...)
dnf install hmmer                # Linux (Fedora)
yum install hmmer                # Linux (older Fedora)
conda install -c bioconda hmmer  # Anaconda

Python3

For the visualization and clustering scripts you will need Python >= 3.6.2 and several Python modules:

numpy
matplotlib
pandas
seaborn

To install Python 3 and the above modules:

sudo apt-get install python3             # Linux (Ubuntu, Debian..)
sudo pip3 install -U pip                 # Linux (Ubuntu, Debian..)
sudo -H pip3 install --upgrade pandas    # Linux (Ubuntu, Debian..)
sudo -H pip3 install --upgrade numpy     # Linux (Ubuntu, Debian..)
sudo -H pip3 install --upgrade scipy     # Linux (Ubuntu, Debian..)
sudo -H pip3 install --upgrade seaborn   # Linux (Ubuntu, Debian..)
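After installing, it can help to confirm that the interpreter and modules are actually visible to Python. The snippet below is a small sketch for that purpose; the helper names are hypothetical and it only checks that the modules listed above can be located, not their versions.

```python
import importlib.util
import sys

# Modules listed in this section as requirements for mebs_vis.py / mebs_clust.py
REQUIRED_MODULES = ["numpy", "matplotlib", "pandas", "seaborn"]
MIN_PYTHON = (3, 6, 2)

def missing_modules(names):
    """Return the module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

def python_is_recent_enough(minimum=MIN_PYTHON):
    """True if the running interpreter is at least the minimum version."""
    return sys.version_info[:3] >= minimum

if not python_is_recent_enough():
    print("Python is older than", ".".join(map(str, MIN_PYTHON)))
missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
```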

Conda option

Install conda or miniconda, depending on your needs. See the conda webpage for more info.

Once conda is installed, activate conda and use the yml file provided with mebs.

For Linux users:

conda activate
conda env create -f mebs_env.yml
conda activate mebs_env

Download mebs

git clone https://github.com/valdeanda/mebs.git
cd mebs

Before you start

Main options

Two options are required for MEBS:

Input directory containing FASTA files of either metagenomes, genomes or MAGs
Type of data, either “genomic” or “metagenomic”. If you have MAGs, use the option “genomic”.

By default, mebs uses a False Discovery Rate (FDR) of 0.01. The FDR is the expected proportion of results called significant that are actually false positives, so a smaller FDR is a stricter threshold. The FDR thresholds of each cycle are pre-computed, and you can find the specific values in the config file. If you want to be very strict about which entropy scores are considered significant, you can use a smaller FDR, e.g. 0.0001. Be aware that the FDR was benchmarked with data from known cultured organisms. For example, organisms with Sulfur Entropy Scores >6 are very likely to grow on sulfur compounds. See more details of the sulfur score benchmarking in the original publication of MEBS, figure 5, panel D (De Anda et al., 2017).

perl mebs.pl -h

  Program to compute MEBS for a set of genomic/metagenomic FASTA files in input folder.
  Version: v1.0

  usage: mebs.pl [options]

   -help    Brief help message

   -input   Folder containing FASTA peptide files (.faa)             (required)

   -type    Nature of input sequences, either 'genomic' or 'metagenomic'  (required)

   -fdr     Score cycles with False Discovery Rate 0.1 0.01 0.001 0.0001  (optional, default=0.01)

   -cycles  Show currently supported biogeochemical cycles/pathways

   -comp    Compute the metabolic completeness of default cycles.         (optional)
            Required option  for mebs_output.py

Considerations

To run MEBS, make sure you are inside the mebs directory: the mebs.pl script depends on several files within that directory and will not run from anywhere else.
The only input data for MEBS are FASTA files with amino acid sequences and the .faa extension. If your files have the .fa or .fasta extension, it won’t run.
hmmsearch must be previously installed and available in the path. See the installation and requirements section above. Type the following to make sure everything is up and running.

hmmsearch  -h
# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.2.1 (June 2018); http://hmmer.org/
# Copyright (C) 2018 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmsearch [options] <hmmfile> <seqdb>
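The considerations above can also be checked programmatically before launching a long run. The following is a minimal sketch, assuming the checks described in this section (hmmsearch on the PATH, mebs.pl in the current directory, input files ending in .faa); the function names are hypothetical and not part of MEBS.

```python
import shutil
from pathlib import Path

def hmmsearch_on_path():
    """True if an executable named hmmsearch is found on PATH."""
    return shutil.which("hmmsearch") is not None

def check_input_folder(folder):
    """Return a list of problems with the input folder (empty list = looks fine)."""
    folder = Path(folder)
    if not folder.is_dir():
        return [f"{folder} is not a directory"]
    problems = []
    names = sorted(p.name for p in folder.iterdir() if p.is_file())
    if not any(name.endswith(".faa") for name in names):
        problems.append("no .faa files found")
    for name in names:
        if name.endswith((".fa", ".fasta")):
            problems.append(f"{name}: mebs.pl expects the .faa extension")
    return problems

if __name__ == "__main__":
    if not hmmsearch_on_path():
        print("hmmsearch not found on PATH")
    if not Path("mebs.pl").is_file():
        print("mebs.pl not found: run this from inside the mebs directory")
    for problem in check_input_folder("gen_test"):
        print(problem)
```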

Example input

To make sure MEBS is running correctly, please try it first with the example folders provided with genomic and metagenomic fasta files:

Running mebs in the example genomic directory

perl mebs.pl -input gen_test/ -type genomic

Running mebs in the example metagenomic directory

perl mebs.pl -input met_test/ -type metagenomic

Your genomic output should look like this

        # mebs.pl -input gen_test -type genomic -fdr 0.01 -comp 0

    sulfur  carbon  oxygen  iron    nitrogen    markers
Archaeoglobus_profundus_DSM_5631.faa    12.024* 24.795* 2.036   0.911   6.928   171.000*
Enterococcus_durans.faa -0.194  0.349   1.433   0.337   2.652   156.000*
Candidatus_Nitrosopumilus_sp_AR2.faa    3.256   5.377   3.229   2.624   9.807   167.000*
Bradyrhizobium_oligotrophicum_S58.faa   1.211   6.366   8.871*  7.482   19.970* 163.000*
Synechococcus_sp_KORDI-52.faa   2.425   4.877   9.746*  3.460   13.434  161.000*
Dechloromonas_aromatica_RCB.faa 3.941   10.369  8.004*  10.100* 17.793* 162.000*
Desulfohalobium_retbaense_DSM_5692.faa  11.511* 9.170   3.521   4.749   9.052   167.000*
Methanosarcina_barkeriMS.faa    5.930   79.606* 5.469   0.208   10.704  177.000*
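For downstream work it is often convenient to load this table into a data structure. The sketch below parses the layout shown above with the standard library only. It assumes tab-separated columns, a leading '#' comment line, and a header whose first cell is empty; the asterisk is assumed here to flag scores that pass the selected FDR threshold. The function name is hypothetical, and the two data rows are copied from the example output.

```python
def parse_mebs_table(text):
    """Parse mebs.pl tabular output into {genome: {cycle: (score, flagged)}}.

    Assumes tab-separated fields, '#' comment lines, and a header row whose
    first (genome) cell is empty. 'flagged' is True when the value ends in '*'.
    """
    rows = [line.split("\t") for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    header, records = rows[0], rows[1:]
    cycles = [name.strip() for name in header[1:]]
    table = {}
    for fields in records:
        genome = fields[0].strip()
        table[genome] = {}
        for cycle, raw in zip(cycles, fields[1:]):
            value = raw.strip()
            table[genome][cycle] = (float(value.rstrip("*")), value.endswith("*"))
    return table

sample = (
    "# mebs.pl -input gen_test -type genomic -fdr 0.01 -comp 0\n"
    "\tsulfur\tcarbon\toxygen\tiron\tnitrogen\tmarkers\n"
    "Archaeoglobus_profundus_DSM_5631.faa\t12.024*\t24.795*\t2.036\t0.911\t6.928\t171.000*\n"
    "Enterococcus_durans.faa\t-0.194\t0.349\t1.433\t0.337\t2.652\t156.000*\n"
)

table = parse_mebs_table(sample)
flagged_sulfur = [g for g, scores in table.items() if scores["sulfur"][1]]
print(flagged_sulfur)
```

This parser expects numeric cells only, so it would need extra handling for files that contain NA values.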

Computational time

To run MEBS, you don’t need a lot of computational power, and it only takes a few minutes to run on the genomic dataset. Once the searches are done, the next time you run MEBS on the same input directory will take only a few seconds.

On a MacBook Pro (3.3 GHz Dual-Core Intel Core i7, 16 GB RAM):

First time running mebs in the input directory

real    1m41.583s
user    2m24.582s
sys   0m8.679s

Second time running mebs in the same input directory, takes just a few seconds to run

real    0m2.728s
user    0m1.781s
sys   0m0.608s

Running MEBS: Searches and Scores

Run the following example to analyze the metabolism and completeness of genomes or MAGs

perl mebs.pl -input /user_dir/path/  -type genomic -comp > user_data.tsv

Your output should look like this. So now let’s extract some information from this file.

        sulfur  carbon  oxygen  iron    nitrogen        markers <sulfur comp>   sulfur_1        sulfur_2        sulfur_3        s
Archaeoglobus_profundus_DSM_5631.faa    12.024* 24.795* 2.036   0.911   6.928   171.000*        59.8    100.0   66.7    80.0    2
Enterococcus_durans.faa -0.194  0.349   1.433   0.337   2.652   156.000*        33.2    50.0    33.3    20.0    25.0    0.0     3
Candidatus_Nitrosopumilus_sp_AR2.faa    3.256   5.377   3.229   2.624   9.807   167.000*        46.4    75.0    100.0   60.0    1
Bradyrhizobium_oligotrophicum_S58.faa   1.211   6.366   8.871*  7.482   19.970* 163.000*        67.9    75.0    0.0     60.0    7
Synechococcus_sp_KORDI-52.faa   2.425   4.877   9.746*  3.460   13.434  161.000*        54.0    75.0    66.7    60.0    37.5    0
Dechloromonas_aromatica_RCB.faa 3.941   10.369  8.004*  10.100* 17.793* 162.000*        71.1    75.0    0.0     60.0    75.0    6
Desulfohalobium_retbaense_DSM_5692.faa  11.511* 9.170   3.521   4.749   9.052   167.000*        65.6    100.0   66.7    80.0    3
Methanosarcina_barkeriMS.faa    5.930   79.606* 5.469   0.208   10.704  177.000*        59.6    50.0    33.3    60.0    25.0    3

After running this command, mebs will generate two files per cycle for each input *.faa file. Go to the folder where your input files are located (in this example, the gen_test/ directory) and you will see these two file types:

cycle.hmmsearch.tab
cycle.score

As an example, let’s explore the sulfur entropy score files for the genome Desulfohalobium_retbaense_DSM_5692.faa.

less  Desulfohalobium_retbaense_DSM_5692.faa.sulfur.score

# /Users/val/src/Github_repos/mebs/scripts/pfam_score.pl call:
# -input gen_test/Desulfohalobium_retbaense_DSM_5692.faa.sulfur.hmmsearch.tab -size real -bzip 0 -entropyfile cycles/sulfur/entropies.tab -minentropy -9 -random 0 -keggmap  -pathway

# total HMMs with assigned entropy in cycles/sulfur/entropies.tab : 112

# Pfam  entropy #matched_peptides
PF00037 0.088   46
PF00005 -0.001  45
PF12838 0.123   43
PF13187 0.166   38
PF00528 0.035   27
PF00890 0.000   25
PF07992 0.000   24
PF13183 0.124   22
PF00534 0.063   17
PF12831 0.017   13
PF13439 0.103   12
PF12800 0.363   12
PF00501 -0.035  12
PF14697 0.345   11
PF01370 0.041   10
PF00009 -0.001  10
PF04069 -0.249  9
PF13247 0.615   9
...
...
...
# Pfam entropy score: 11.511

This file shows the pre-computed entropy of each pfam found in your input file and the number of peptides in your input that matched it. In this case, domains with H΄ values close to or greater than 1 correspond to the most informative Pfam domains (enriched among S-based genomes), whereas low H΄ values (close to 0) indicate non-informative ones. Negative values correspond to domains observed less often than expected. At the bottom of this table the calculated entropy score is shown, which is also in your output .tsv file. This is the sum of the entropy values for the pfams shown. Be aware that the score does not consider the abundance of the pfams; it only takes into consideration the pre-determined entropy values of the pfams that are present.
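To make the relationship between the per-Pfam rows and the final score concrete, the sketch below parses a .score file of the layout shown above and sums the entropy column once per Pfam, ignoring the matched-peptide counts. The function name is hypothetical. The three rows are taken from the listing above, but the toy file and its final score line are made up for illustration, so the total is not the real score of that genome.

```python
def parse_score_file(text):
    """Return ({pfam: (entropy, matched_peptides)}, reported_score or None)."""
    entries, reported = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# Pfam entropy score:"):
            reported = float(line.split(":", 1)[1])
        elif line.startswith("PF"):
            pfam, entropy, matched = line.split()[:3]
            entries[pfam] = (float(entropy), int(matched))
    return entries, reported

# Toy excerpt: rows copied from the example, score line invented to match them
toy = """# Pfam  entropy #matched_peptides
PF00037 0.088   46
PF00005 -0.001  45
PF13247 0.615   9
# Pfam entropy score: 0.702
"""

entries, reported = parse_score_file(toy)
# Each Pfam contributes its entropy once, regardless of how many peptides matched
total = sum(entropy for entropy, _matched in entries.values())
print(round(total, 3), reported)
```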

To understand the significance of the pfam entropy score, view this table with pre-computed entropy scores for a non-redundant set of microorganisms (described in the original publication of MEBS). In the online table, click a cycle to sort by lowest or highest scores. The file containing the normalized scores for the same dataset is found in the data2vis/ directory.

The entropy scores can be analyzed and visualized as shown below. Please go to the next module for visualizing and parsing your results from mebs.pl.

_{Figure 4. Visualization options for the entropy scores derived from mebs.pl main script}

Running MEBS: Parsing and visualization

The mebs_vis.py script uses the output generated by mebs.pl and creates an output directory containing several graphs and files for downstream analysis.

python3 mebs_vis.py -h
usage: mebs_vis.py [-h] [-o OUTDIR] [-im_format {png,pdf,ps,eps,svg,tif,jpg}]
                   [--im_res dpi]
                   filename

 Parse mebs.pl output and creates several files and figures:
-File to map mebs normalized values to itol         => itol_mebs.txt
-File with the metabolic completeness with names    => input+completenes.tab
-File to be the output of F_MEBS_cluster.py -s none => input+2_cluster_mebs.txt
-Heatmap with normalized mebs values                => inputmebs_heatmap.png
-Heatmap with metabolic completness of S and C      => input+comp_heatmap.png
-Barplot with normalized mebs values                => input+barplot.png
-Genomic completeness based on marker genes         => input+genomic_completeness.tab
-Normalized mebs scores                             => input+norm_mebs.tab

positional arguments:
  filename              Input file derived from mebs.pl using -comp option.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTDIR, --outdir OUTDIR
                        Output folder [<filename>_mebs_vis]
  -im_format {png,pdf,ps,eps,svg,tif,jpg}, -f {png,pdf,ps,eps,svg,tif,jpg}
                        Output format for images [png].
  --im_res dpi, -r dpi  Output resolution for images in dot per inch (dpi)
                        [dpi].

Example:
$  python3 mebs_vis.py gen_test.tsv

The easiest way to run the script is the following:

python3 mebs_vis.py gen_test.tsv

This will generate the following output

[OUTPUT]   Output not specified.
[OUTPUT]   Storing result in : gen_test.tsv_mebs_vis
[OUTPUT]   ... gen_test.tsv_mebs_vis -> does not exists: creating
No handles with labels found to put in legend.
[END]   Done........................
Please check the following files in folder 'gen_test.tsv_mebs_vis'  :
 0. Original data without asteriks: gen_test.tsv_mebs_vis/gen_test.tsv.noa
 1. Heatmap displaying the metabolic completeness of N,Fe,S and CH4 pathways based on pfams: gen_test.tsvcomp_heatmap.png
 2. Barplot with normalized MEBS score values: gen_test.tsv_barplot.png
 3. Heatmap with normalized MEBS score values: gen_test.tsv_mebs_heatmap.png
 4. Dotplot with normalized MEBS score values: gen_test.tsv_mebs_dotplot.png
 5. Completeness file with description of the columns: gen_test.tsv_completenes.tab
 6. Mapping file to itol with normalized MEBS scores: gen_test.tsv_itol_mebs.txt
 7. Mapping file to itol with pfam metabolic completeness: gen_test.tsv_itol_mebs_comp.txt
 8. File to be used as the input of F_MEBS_cluster.py -s none option gen_test.tsv_2_cluster_mebs.txt
 9. Genomic completeness based on single copy marker genes: gen_test.tsv _genomic_completenes.tab
 10. Normalized mebs scores  gen_test.tsv_norm_mebs.tab
  If you have a tree file loaded in  itol, you can drag directly the _itol.txt files into your tree
 and customize the colors of the pathways and the scores as in the following example
 https://itol.embl.de/tree/97981518041461538630153

Note: the -fdr option is not supported by mebs_vis.py at this time. If you run mebs.pl specifying an -fdr value, the resulting .tsv will contain NA values and is not compatible with mebs_vis.py: the seaborn.clustermap() function will throw a ValueError: The condensed distance matrix must contain only finite values.
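One way to catch this problem before calling mebs_vis.py is to scan the .tsv for cells that are not finite numbers. The sketch below does that with the standard library. It assumes the same tab-separated layout as the mebs.pl output shown earlier (header row with an empty first cell, optional trailing asterisks); the function name and the sample rows are hypothetical.

```python
import math

def non_finite_cells(text):
    """List (row_label, column_index, raw_value) for cells that are not finite numbers."""
    bad = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        if not fields[0].strip():
            continue  # header row: first cell is empty
        for index, raw in enumerate(fields[1:], start=1):
            value = raw.strip().rstrip("*")
            try:
                number = float(value)
            except ValueError:
                bad.append((fields[0], index, raw))  # e.g. "NA"
                continue
            if not math.isfinite(number):
                bad.append((fields[0], index, raw))  # e.g. "nan" or "inf"
    return bad

# Hypothetical rows, for illustration only
sample_tsv = (
    "\tsulfur\tcarbon\n"
    "genome_A.faa\t12.024*\t24.795*\n"
    "genome_B.faa\tNA\t0.349\n"
)
for row, column, raw in non_finite_cells(sample_tsv):
    print(f"{row}: column {column} holds {raw!r}")
```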

Normalized entropy scores

In the mebs_vis output directory you will see 11 files (4 png figures and 7 text files):

gen_test.tsv.noa
gen_test.tsv_2_cluster_mebs.txt
gen_test.tsv_barplot.png
gen_test.tsv_comp_heatmap.png
gen_test.tsv_genomic_completenes.tab
gen_test.tsv_itol_mebs.txt
gen_test.tsv_itol_mebs_comp.txt
gen_test.tsv_mebs_dotplot.png
gen_test.tsv_mebs_heatmap.png
gen_test.tsv_norm_mebs.tab
gen_test.tsv_pfam_completenes.tab

Let’s explore the files:

The files containing the normalized entropy scores are:

gen_test.tsv_norm_mebs.tab (raw normalized entropy scores)
gen_test.tsv_itol_mebs.txt (file to drag to itol to plot the normalized entropy scores)

The above-mentioned files are used as input to generate the following figures:

gen_test.tsv_mebs_dotplot.png
gen_test.tsv_barplot.png
gen_test.tsv_mebs_heatmap.png
mebs_itol.jpeg. [Be aware] This file is not produced by the mebs_vis.py script; it has to be generated by the user by dragging file number 6 in the list above (gen_test.tsv_itol_mebs.txt) into itol

_{Figure 5. Dotplot of the normalized scores for 8 example genomes}

_{Figure 6. Barplot of the normalized scores for 8 example genomes}

_{Figure 7. Heatmap of the normalized scores for 8 example genomes}

_{Figure 8. Itol visualization using the normalized scores, see interactive version here}

Metabolic pathway completeness

When the -comp option is used, MEBS uses the predetermined mapping files for S, N, C, O and Fe located in the cycles directory, named pfa2kegg.tab.
For example, for the sulfur pathways the mapping file looks like this:

PFAM    KO      PATHWAY PATHWAY NAME
PF00890 K00394  1       aprAB
PF02910 K00394  1       aprAB
PF12139 K00395  1       aprAB
PF12838         1       aprAB
PF01087         2       apt/sat
PF01747         2       apt/sat
PF14306         2       apt/sat
PF01077 K11180  3       dsrABC
PF03460 K11180  3       dsrABC
PF04358         3       dsrABC
PF00037 K11181  3       dsrABC
PF04358         3       dsrABC
PF02872 K17224  4       Sox system
PF02872 K17224  4       Sox system
PF00034 0       4       Sox system
PF13442 0       4       Sox system

This mapping file explains the completeness columns in the .tsv output of mebs.pl. For example, “sulfur_1” corresponds to the pfams analyzed for “aprAB”. If your table shows 100% for sulfur_1, all 4 pfams considered under pathway 1 (aka sulfur_1) are present, and thus aprAB is considered present.
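The completeness percentage can be pictured as the share of a pathway's Pfams found in a genome. The sketch below illustrates that idea for the first two sulfur pathways, using the Pfam identifiers from the mapping file above. It assumes completeness is 100 × (pathway Pfams present) / (pathway Pfams listed), counting each Pfam once; how mebs.pl treats repeated rows in the mapping file is not covered here, and the names and the example genome are hypothetical.

```python
# Pfams per pathway, copied from the sulfur mapping file shown above
SULFUR_PATHWAYS = {
    "sulfur_1": {"PF00890", "PF02910", "PF12139", "PF12838"},  # aprAB
    "sulfur_2": {"PF01087", "PF01747", "PF14306"},             # apt/sat
}

def pathway_completeness(present_pfams, pathways=SULFUR_PATHWAYS):
    """Percent of each pathway's Pfams found in present_pfams (assumed definition)."""
    return {
        name: round(100.0 * len(pfams & present_pfams) / len(pfams), 1)
        for name, pfams in pathways.items()
    }

# Hypothetical genome carrying two aprAB Pfams and one apt/sat Pfam
genome_pfams = {"PF00890", "PF02910", "PF01087", "PF00005"}
print(pathway_completeness(genome_pfams))
```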

The output files below contain, in tabular format, the pathway names with their corresponding mapped labels. In other words, they hold the presence/absence profile of the pfams belonging to either pathways or marker genes for each of the benchmarked cycles:

gen_test.tsv_pfam_completenes.tab
gen_test.tsv_itol_mebs_comp.txt (same file to be used to map into itol)

This profile file is used as input to generate the following heatmap.

_{Figure 9. Heatmap displaying the profile abundances of specific protein domains specific to biogeochemical cycles.}

Genomic completeness

The new module incorporated in MEBS allows you to assess the completeness of your genomes or MAGs. The single-copy marker genes used in this analysis are found in the mapping file inside the markers directory. MEBS will compute genomic completeness based on single-copy marker genes found in either bacteria, archaea, or both. Currently, MEBS uses a set of marker genes previously described in Gutierrez-Preciado et al., 2018 and a second set of marker genes incorporated in miComplete.

As a comparison, I computed the genomic completeness of the genomes found in gen_test with three different programs: MEBS, miComplete and checkM.

As you can see in the graph below, the MEBS completeness values are comparable to the ones obtained with the other two widely used programs. This provides another method for quality control analysis of genomes and MAGs. MEBS further offers the advantage of computing both genomic completeness and metabolism. Additionally, because it is based on the completeness of single-copy marker genes in archaea or bacteria, it provides a preliminary taxonomic classification ahead of downstream analyses.

_{Figure 10. Comparison of genomic completeness computed with three different methods: mebs, miComplete and checkM}

Running MEBS: Protein clustering

Protein clustering reduces the dimensionality of your data, which is crucial in the era of large omics datasets. Through this module, you can analyze thousands of genomes by clustering them based on their predicted proteins, thereby collapsing these genomes into metabolic guilds, clusters, or groups. These guilds indicate which organisms perform specific functions or have increased fitness in specific niches, and highlight potential resource partitioning within the interwoven communities. All of this complements the previously described mebs analyses (normalized entropy scores and metabolic pathway completeness) of the individual genomes.

How to use mebs_clust.py with the Pfam v34.0 (October 2021, 19,179 entries)

See the MEBS release PfamV34, which contains the latest Pfam database and the corresponding files to obtain the presence/absence profile.

Warning! This is a heavy file; make sure you have enough disk space before downloading it. The compressed file is 274M and the uncompressed file is 1.85Gb.

First create a directory called pfam in the cycles directory, and cd into that directory

cd cycles/ && mkdir pfam && cd pfam

Inside the pfam directory, download the current pfamv34 MEBS release

wget https://github.com/valdeanda/mebs/releases/download/Pfamv34_18102021/PfamV34_4MEBS_clustering.tar

# Uncompress the data. This might take a while due to the size of the database

tar -xvzf PfamV34_4MEBS_clustering.tar
x PfamV34_mapping.tab
x config.txt
x entropies.tab
x my_Pfam.pfam.hmm

## If you have curiosity, here is a command you can run to extract all the pfams that are currently available in the Pfamv34 database
grep ACC my_Pfam.pfam.hmm|  cut -f 4 -d " " | cut -f 1 -d "." |sed '/^PF/!d' > pfamv34list.txt
less  pfamv34list.txt |wc
rm PfamV34_4MEBS_clustering.tar ._config.txt
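The grep/cut/sed pipeline above can also be written in Python, which may be easier to adapt. This is a sketch under the assumption that each profile in the HMM file carries a line of the form "ACC   PF00001.24", which is what the pipeline relies on; the function name and the sample lines are hypothetical.

```python
def pfam_accessions(lines):
    """Collect Pfam accessions (version suffix removed) from HMM file lines."""
    accessions = []
    for line in lines:
        fields = line.split()
        # Keep only ACC lines whose value looks like a Pfam accession
        if len(fields) >= 2 and fields[0] == "ACC" and fields[1].startswith("PF"):
            accessions.append(fields[1].split(".", 1)[0])
    return accessions

# Hypothetical excerpt of an HMM file, for illustration only
sample_lines = [
    "NAME  7tm_1",
    "ACC   PF00001.24",
    "DESC  7 transmembrane receptor (rhodopsin family)",
    "NAME  ABC_tran",
    "ACC   PF00005.30",
]
print(pfam_accessions(sample_lines))

# With the real file, pass the open handle instead:
# with open("my_Pfam.pfam.hmm") as handle:
#     print(len(pfam_accessions(handle)))
```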

Copy the config file into the config directory. This config file already contains path information required for mebs.pl

cp config.txt ../../config
# Go to the main mebs path
cd ../../

Run mebs.pl as described previously and generate the output file, which will contain the presence/absence info of all your pfam domains. This process will take a while, depending on the number of input genomes. For example, using gen_test, it takes ~30 min for MEBS to search all the pfams in the PfamV34 database. The second time you run the same command, it will only take about 1 min to generate the profile.

perl mebs.pl -input gen_test/ -type genomic -comp > pfamv34_profile.tsv
# mebs.pl -input gen_test/ -type genomic -fdr 0.01 -comp 1

Extract only the columns containing the information of the pfams.

 cut -f 1,83-19262 pfamv34_profile.tsv  > input4clustering.tab

Run mebs_clust.py with the input file generated with the above command.

The main script mebs_clust.py computes a clustering analysis based on a presence/absence matrix. In this case we use a presence/absence matrix of pfam domains, but you can use any type of presence/absence data. If no argument is given, the script will compute the clustering using the Jaccard distance metric and the Ward method. For abundance data, the Bray-Curtis distance is recommended.
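To give a feel for the default metric, the sketch below computes the Jaccard distance between two presence/absence vectors in plain Python: the number of domains present in exactly one genome divided by the number present in at least one. It illustrates the general definition only and is not the code used inside mebs_clust.py; the function name and the two toy genomes are hypothetical.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length presence/absence vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Domains present in at least one of the two genomes
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        return 0.0  # two empty profiles are treated as identical
    # Domains present in exactly one of the two genomes
    differ = sum(1 for x, y in zip(a, b) if bool(x) != bool(y))
    return differ / either

# Toy profiles over five Pfam domains (1 = present, 0 = absent)
genome_a = [1, 1, 0, 1, 0]
genome_b = [1, 0, 0, 1, 1]
print(jaccard_distance(genome_a, genome_b))
```

Domains absent from both genomes do not affect the distance, which is why this metric suits sparse presence/absence profiles.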

python3 mebs_clust.py  -h
usage: mebs_clust.py [-h] [--outdir OUTDIR]
                     [-im_format {png,pdf,ps,eps,svg,tif,jpg}] [--im_res dpi]
                     [--cutoff CUTOFF]
                     [--method {single,complete,average,weighted,centroid,median,ward}]
                     [--distance {braycurtis,canberra,chebyshev,cityblock,correlation,euclidean,jaccard,mahalanobis}]
                     [--nolegend]
                     filename

positional arguments:
  filename              Input file pfam profile.

optional arguments:
  -h, --help            show this help message and exit
  --outdir OUTDIR, -o OUTDIR
                        Output folder [<filename>_mebs_clust]
  -im_format {png,pdf,ps,eps,svg,tif,jpg}, -f {png,pdf,ps,eps,svg,tif,jpg}
                        Output format for images [pdf].
  --im_res dpi, -r dpi  Output resolution for images in dot per inch (dpi)
                        [dpi].
  --cutoff CUTOFF, -co CUTOFF
                        Cutoff threashold default 0.5
  --method {single,complete,average,weighted,centroid,median,ward}, -m {single,complete,average,weighted,centroid,median,ward}
                        methods for calculating the distance between the newly
                        formed clusters [ward].
  --distance {braycurtis,canberra,chebyshev,cityblock,correlation,euclidean,jaccard,mahalanobis}, -d {braycurtis,canberra,chebyshev,cityblock,correlation,euclidean,jaccard,mahalanobis}
                        The distance metric to use default [jaccard].
  --nolegend, -nl       If specified, the legend is not shown

Example:
$  python3 mebs_clust.py  pfamprofile.tsv

Run the script with basic options

python3 mebs_clust.py input4clustering.tab
[OUTPUT]   Output not specified.
[OUTPUT]   Storing result in : input4clustering.tab_mebs_clust
[OUTPUT]   ... input4clustering.tab_mebs_clust -> does not exists: creating
[INFO] Output path: input4clustering.tab_mebs_clust
[END] mebs_cust.py done 9.......................
[END] Please check your output files in folder 'input4clustering.tab_mebs_clust' :

This will generate a directory containing the information of the clustering.

Individual files of each cluster containing the genomes belonging to that cluster
Figure with the cluster and the specific cut-off used.

_{Figure 11. Clustering dendrogram using as input the genomes in the gen_test directory and mebs_clust.py default settings}

Citing papers

Overall cycles through time

In De Anda et al., 2018, we integrated the main entropy scores into a time-series analysis spanning a 2-year period to monitor fluctuations in the microbial-metabolic biogeochemical cycles of microbial mats under perturbed conditions.

_{Figure 12. Biogeochemical cycling within microbial mats across space and time using MEBS. (A) Dynamics of the main cycles within microbial mat samples during the two-year period of study, each captured by MEBS as a single value in bits}

Protein clustering

In Langwig-De Anda et al. 2021 we compared the protein content (>17 000 protein domains) of over 1 500 Desulfobacterota, Myxococcota, and SAR324 genomes, including 402 new MAGs reconstructed from hydrothermal vent and coastal bay sediments, to understand their metabolic and ecological capabilities.

_{Figure 13. Dendrogram of genomic groups (labeled A–H) based on their protein content similarity (determined using Jaccard distance and Ward variance minimization). Genomic groups include 1559 Desulfobacterota, Myxococcota, and SAR324 genomes that were clustered based on their protein family content using a set of 17 395 pfam domains}

Mapping specific genes into a phylogenetic tree

Screening and selecting specific set of pfam domains to map them into a phylogeny

_{Figure 14. The presence of methanol methyltransferase MtaB (PF12176) and trimethylamine methyltransferase MttB (PF06253) is shown in the outer circles. The annotation was conducted with MEBS by first annotating all the PFAM domains in all the genomes and then selecting the ones of interest. The mapping itol file was obtained with mebs_vis.py. From De Anda et al, 2021}