About MEBS
Multigenomic Entropy-Based Score, or MEBS, allows the user to synthesize genomic information into a single informative value. This entropy score can be used to infer the likelihood that microbial taxa perform specific metabolic-biogeochemical pathways.
MEBS in a nutshell
Relative entropy
How can we organize and simplify complex genomic data to better understand the metabolisms of diverse microbial taxa? This was the major research question behind the development of MEBS. It works by synthesizing multiple sources of data on the metabolism of interest (e.g., genes, microbial taxa, and elemental reactions) under the mathematical framework of the Kullback-Leibler divergence (Equation 1), also known as relative entropy H’. An H’ value is assigned to each protein-coding gene (pcg) based on how often the gene is encoded in the genomes of metabolically similar microbial taxa. H’ values close to 0 indicate that a given pcg is widely distributed among microorganisms (e.g., ABC transporters) and is therefore non-informative of the target metabolism. H’ values near or greater than 1 indicate that a given pcg is unique to a group of metabolically similar microbial taxa, whereas negative H’ values indicate that a given pcg is not expected to be involved in the target metabolism. MEBS stores the unique pcgs and their H’ values in an internal database, which can then be cross-referenced with genomic and metagenomic input data to obtain a single H’ score for a given metabolic pathway. High H’ scores suggest that microbial taxa can perform the pathways involved in the metabolism of interest.
Equation 1. The Kullback-Leibler divergence, also known as relative entropy H’, measures the difference between probabilities P and Q. In this context, P(i) represents the frequency of protein domain i in the genomes of a given metabolism (observed frequency), while Q(i) represents its frequency in a non-redundant set of genomes (expected frequency). The script that computes the entropy (entropy.pl) requires the list of the genomes of interest and the tabular output file obtained from scanning a set of non-redundant genomes against a set of protein families involved in the metabolism of interest. The resulting H’ values (in bits) capture to what extent a given Pfam domain informs about the metabolism of interest. Domains with H’ values close to or greater than 1 are the most informative Pfam domains (for example, those enriched among S-based genomes), whereas H’ values close to 0 indicate non-informative ones. Negative values correspond to domains observed less often than expected.
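As a worked illustration of the definition above, the following sketch evaluates the per-domain term of the textbook Kullback-Leibler divergence, P(i) * log2(P(i) / Q(i)), for a few made-up frequencies. The function name and the example numbers are hypothetical and are not taken from entropy.pl, whose implementation may differ in detail.

```python
import math


def relative_entropy_term(p_obs: float, q_exp: float) -> float:
    """Per-domain term of the KL divergence in bits: P * log2(P / Q).

    p_obs: frequency of the domain in the genomes of interest (observed).
    q_exp: frequency of the domain in the non-redundant genome set (expected).
    """
    if p_obs == 0.0:
        return 0.0  # usual convention: 0 * log(0) is taken as 0
    if q_exp <= 0.0:
        raise ValueError("expected frequency must be positive")
    return p_obs * math.log2(p_obs / q_exp)


# Toy frequencies chosen only to show the three regimes described above.
examples = {
    "enriched in target genomes": (1.0, 0.25),
    "widely distributed": (0.9, 0.9),
    "less frequent than expected": (0.25, 0.5),
}

for label, (p, q) in examples.items():
    print(f"{label}: H' = {relative_entropy_term(p, q):+.3f} bits")
```

With these toy inputs the enriched domain gets a value of 2 bits, the widely distributed domain gets 0, and the depleted domain gets a negative value, which matches the interpretation given in the text.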
Built-in functions
Currently, the MEBS software has a built-in function that allows users to accurately and quickly evaluate the likelihood that up to thousands of microbial taxa (genomes) or communities (metagenomes) can perform the metabolic reactions involved in C, N, O, S, and Fe cycling. However, a user can also take a more advanced, step-wise approach to examine additional metabolic reactions (e.g., As cycling). In addition, MEBS includes several tools to help users interpret their results: H’ values of specific metabolic pathways can be mapped on visual reconstructions of microbial phylogenies (Fig. 1C); microbial taxa or entire communities can be sorted by their metabolic function (e.g., nutrient assimilation) (Fig. 1D); or communities filling certain metabolic niches (e.g., heavy metal degradation) can be mapped at global scales (Fig. 1E).
MEBS computational workflow. Omic-derived data of genomes, metagenomes, or MAGs (Metagenome Assembled Genomes) are the first input of the MEBS pipeline. The black circles represent the simple steps that users must perform before running MEBS. C) The entropy-based scores derived from MEBS capture the metabolic machinery of biogeochemical cycles (e.g., C, N, O, S, and Fe), which can be mapped on phylogenetic trees of microbial taxa. D) Entropy scores can be used by a number of built-in machine learning analyses, such as classification and unsupervised clustering. Summarizing protein features in single values may be useful for discovering metabolic combinations that are poorly identified by other standard methods. E) Entropy scores for massively distributed genomic datasets can be assembled to examine biogeochemical cycles (e.g., the sulfur cycle) across regional and global domains. F) Entropy scores can be integrated into time-series analysis to monitor fluctuations in the microbial-metabolic biogeochemical cycles in experimental studies under normal or perturbed conditions.
Applications
MEBS provides an open-access, reproducible, entropy-based platform to efficiently analyze microbial genomic datasets and decode a plethora of metabolic-chemical pathways. The broad application of MEBS stretches across scientific disciplines. MEBS can be used to study microbial responses to ongoing threats of anthropogenic and climate-linked change associated with sea level rise, ocean acidification, and land use by providing new information on shifts in microbial-mediated biogeochemical cycling and bottom-up effects on entire ecosystems. MEBS can identify global environments with microbial taxa that carry the necessary pcgs for the degradation of hydrocarbons, heavy metals, and other pollutants, which lends itself to new bioremediation practices. MEBS can also be used in translational medicine to develop new tests that differentiate groups of infectious, antibiotic-resistant bacteria in the human lung. In the future, I see MEBS software, along with other metabolic decoder approaches, integrated into sequencer instruments (e.g., MinION), allowing real-time analysis of complex metabolic pathways in nature.
Figure 3. MEBS current and future applications
MEBS main components
This software has been tested on Linux and macOS systems. Currently, MEBS is composed of 3 main scripts, some of which depend on other scripts found in the MEBS scripts directory:
- mebs.pl: Main script that computes hmmsearches and entropy scores
- mebs_vis.py: Visualization script that generates several human-readable files for metabolic analysis and downstream applications
- mebs_clust.py: Computes clustering analysis based on a presence/absence matrix
System requirements and installation:
Hmmsearch
To use mebs.pl, you only need to have perl and hmmsearch installed.
To install hmmsearch, see the documentation webpage to obtain and compile HMMER from source (http://hmmer.org/documentation.html), or use one of the package managers below:
brew install hmmer # OS/X, HomeBrew
port install hmmer # OS/X, MacPorts
apt install hmmer # Linux (Ubuntu, Debian...)
dnf install hmmer # Linux (Fedora)
yum install hmmer # Linux (older Fedora)
conda install -c bioconda hmmer # Anaconda
Python3
For the visualization and clustering scripts, you will need python >= 3.6.2 and several python modules:
- numpy
- matplotlib
- pandas
- seaborn
To install python3 and the above modules:
sudo apt-get install python3 # Linux (Ubuntu, Debian..)
sudo pip3 install -U pip # Linux (Ubuntu, Debian..)
sudo -H pip3 install --upgrade pandas # Linux (Ubuntu, Debian..)
sudo -H pip3 install --upgrade numpy # Linux (Ubuntu, Debian..)
sudo -H pip3 install --upgrade scipy # Linux (Ubuntu, Debian..)
sudo -H pip3 install --upgrade seaborn # Linux (Ubuntu, Debian..)
Conda option
Install conda or miniconda, depending on your needs. See the conda webpage for more info.
Once conda is installed, activate conda and use the yml file provided with mebs.
For linux users:
conda activate
conda env create -f mebs_env.yml
conda activate mebs_env
Download mebs
git clone https://github.com/valdeanda/mebs.git
cd mebs
Before you start
Main options
Two options are required for MEBS:
- Input directory containing fasta files of either metagenomes, genomes or MAGs
- Type of data. If you have MAGs, use the option “genomic”.
By default, mebs uses a False Discovery Rate (FDR) of 0.01. The FDR is the expected proportion of ‘significant’ results that are actually false positives, so a smaller FDR is a stricter criterion. The FDRs of each cycle are pre-computed, and you can find the specific values in the config file. If you want to be very strict about which Entropy Scores are considered significant, you can use a smaller FDR, e.g., 0.0001. Be aware that the FDR was benchmarked with data from known cultured organisms. For example, organisms with Sulfur Entropy Scores >6 are very likely to grow on sulfur compounds. See more details of the Sulfur score benchmarking in the original publication of MEBS, figure 5, panel D (De Anda et al., 2017).
perl mebs.pl -h
Program to compute MEBS for a set of genomic/metagenomic FASTA files in input folder.
Version: v1.0
usage: mebs.pl [options]
-help Brief help message
-input Folder containing FASTA peptide files (.faa) (required)
-type Nature of input sequences, either 'genomic' or 'metagenomic' (required)
-fdr Score cycles with False Discovery Rate 0.1 0.01 0.001 0.0001 (optional, default=0.01)
-cycles Show currently supported biogeochemical cycles/pathways
-comp Compute the metabolic completeness of default cycles. (optional)
Required option for mebs_output.py
Considerations
Before running MEBS, keep the following in mind:
- Make sure you are inside the mebs directory. The mebs.pl script depends on several files within this directory, and it won’t run from anywhere else.
- The only input data for MEBS are fasta files of amino acid sequences with the .faa extension. If your files have the .fa or .fasta extension, it won’t run.
- hmmsearch must already be installed and available in your path. See the installation and requirements section above.
Type the following to make sure hmmsearch is up and running:
hmmsearch -h
# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.2.1 (June 2018); http://hmmer.org/
# Copyright (C) 2018 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmsearch [options] <hmmfile> <seqdb>
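The considerations above can be checked before launching a run. The sketch below is a hypothetical helper (it is not part of MEBS) that looks for hmmsearch on the path, checks that mebs.pl is in the current directory, and lists which files in an input folder carry the .faa extension and which carry .fa or .fasta and would therefore need renaming.

```python
import shutil
from pathlib import Path


def preflight(input_dir: str) -> dict:
    """Collect the pre-run checks described above without running MEBS."""
    folder = Path(input_dir)
    files = [p for p in folder.iterdir() if p.is_file()] if folder.is_dir() else []
    return {
        # hmmsearch must be reachable through PATH
        "hmmsearch_on_path": shutil.which("hmmsearch") is not None,
        # mebs.pl is expected in the current working directory
        "mebs_pl_in_cwd": Path("mebs.pl").is_file(),
        # only .faa files are accepted as input
        "faa_files": sorted(p.name for p in files if p.suffix == ".faa"),
        # these extensions are mentioned above as not accepted
        "other_fasta": sorted(p.name for p in files if p.suffix in {".fa", ".fasta"}),
    }


if __name__ == "__main__":
    for check, result in preflight("gen_test").items():
        print(f"{check}: {result}")
```

Any file reported under other_fasta would have to be renamed to .faa (and contain peptide sequences) before mebs.pl picks it up.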
Example input
To make sure MEBS is running correctly, please try it first with the example folders provided with genomic and metagenomic fasta files:
Running mebs in the example genomic directory
perl mebs.pl -input gen_test/ -type genomic
Running mebs in the example metagenomic directory
perl mebs.pl -input met_test/ -type metagenomic
Your genomic output should look like this
# mebs.pl -input gen_test -type genomic -fdr 0.01 -comp 0
sulfur carbon oxygen iron nitrogen markers
Archaeoglobus_profundus_DSM_5631.faa 12.024* 24.795* 2.036 0.911 6.928 171.000*
Enterococcus_durans.faa -0.194 0.349 1.433 0.337 2.652 156.000*
Candidatus_Nitrosopumilus_sp_AR2.faa 3.256 5.377 3.229 2.624 9.807 167.000*
Bradyrhizobium_oligotrophicum_S58.faa 1.211 6.366 8.871* 7.482 19.970* 163.000*
Synechococcus_sp_KORDI-52.faa 2.425 4.877 9.746* 3.460 13.434 161.000*
Dechloromonas_aromatica_RCB.faa 3.941 10.369 8.004* 10.100* 17.793* 162.000*
Desulfohalobium_retbaense_DSM_5692.faa 11.511* 9.170 3.521 4.749 9.052 167.000*
Methanosarcina_barkeriMS.faa 5.930 79.606* 5.469 0.208 10.704 177.000*
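For downstream scripting it can help to split each value into a number and its asterisk flag. The sketch below parses the default (no -comp) table shown above, assuming whitespace-separated columns and assuming that a trailing asterisk flags a score passing the chosen FDR threshold; the documentation does not state the meaning of the asterisk explicitly, and the function name is hypothetical.

```python
SAMPLE = """\
# mebs.pl -input gen_test -type genomic -fdr 0.01 -comp 0
\tsulfur\tcarbon\toxygen\tiron\tnitrogen\tmarkers
Archaeoglobus_profundus_DSM_5631.faa\t12.024*\t24.795*\t2.036\t0.911\t6.928\t171.000*
Enterococcus_durans.faa\t-0.194\t0.349\t1.433\t0.337\t2.652\t156.000*
"""


def parse_mebs_table(text: str) -> dict:
    """Return {genome: {column: (score, flagged)}} from the default output."""
    rows = {}
    header = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the command echo
        fields = line.split()
        if header is None:
            header = fields  # first non-comment line holds the column names
            continue
        genome, values = fields[0], fields[1:]
        rows[genome] = {
            name: (float(value.rstrip("*")), value.endswith("*"))
            for name, value in zip(header, values)
        }
    return rows


table = parse_mebs_table(SAMPLE)
for genome, scores in table.items():
    flagged = [name for name, (_, is_flagged) in scores.items() if is_flagged]
    print(genome, "flagged columns:", flagged)
```

This sketch does not handle the -comp output, whose header contains labels with spaces, nor tables containing NA values.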
Computational time
To run MEBS, you don’t need a lot of computational power, and it only takes a few minutes to run on the genomic dataset. Once the searches are done, the next run of MEBS on the same input directory takes only a few seconds.
On a MacBook Pro, 3.3 GHz Dual-Core Intel Core i7, 16 GB RAM:
First time running mebs in the input directory
real 1m41.583s
user 2m24.582s
sys 0m8.679s
Second time running mebs in the same input directory, takes just a few seconds to run
real 0m2.728s
user 0m1.781s
sys 0m0.608s
Running MEBS: Searches and Scores
Run the following example to analyze the metabolism and completeness of genomes or MAGs
perl mebs.pl -input /user_dir/path/ -type genomic -comp > user_data.tsv
Your output should look like this. So now let’s extract some information from this file.
sulfur carbon oxygen iron nitrogen markers <sulfur comp> sulfur_1 sulfur_2 sulfur_3 s
Archaeoglobus_profundus_DSM_5631.faa 12.024* 24.795* 2.036 0.911 6.928 171.000* 59.8 100.0 66.7 80.0 2
Enterococcus_durans.faa -0.194 0.349 1.433 0.337 2.652 156.000* 33.2 50.0 33.3 20.0 25.0 0.0 3
Candidatus_Nitrosopumilus_sp_AR2.faa 3.256 5.377 3.229 2.624 9.807 167.000* 46.4 75.0 100.0 60.0 1
Bradyrhizobium_oligotrophicum_S58.faa 1.211 6.366 8.871* 7.482 19.970* 163.000* 67.9 75.0 0.0 60.0 7
Synechococcus_sp_KORDI-52.faa 2.425 4.877 9.746* 3.460 13.434 161.000* 54.0 75.0 66.7 60.0 37.5 0
Dechloromonas_aromatica_RCB.faa 3.941 10.369 8.004* 10.100* 17.793* 162.000* 71.1 75.0 0.0 60.0 75.0 6
Desulfohalobium_retbaense_DSM_5692.faa 11.511* 9.170 3.521 4.749 9.052 167.000* 65.6 100.0 66.7 80.0 3
Methanosarcina_barkeriMS.faa 5.930 79.606* 5.469 0.208 10.704 177.000* 59.6 50.0 33.3 60.0 25.0 3
After running this command, mebs will generate two types of files for each input *.faa file. Go to the folder where your input files are located (the gen_test/ directory in this example) and you will see that 2 files are generated for each cycle:
- cycle.hmmsearch.tab
- cycle.score
As an example, let’s explore the sulfur entropy score files for the genome Desulfohalobium_retbaense_DSM_5692.faa.
less Desulfohalobium_retbaense_DSM_5692.faa.sulfur.score
# /Users/val/src/Github_repos/mebs/scripts/pfam_score.pl call:
# -input gen_test/Desulfohalobium_retbaense_DSM_5692.faa.sulfur.hmmsearch.tab -size real -bzip 0 -entropyfile cycles/sulfur/entropies.tab -minentropy -9 -random 0 -keggmap -pathway
# total HMMs with assigned entropy in cycles/sulfur/entropies.tab : 112
# Pfam entropy #matched_peptides
PF00037 0.088 46
PF00005 -0.001 45
PF12838 0.123 43
PF13187 0.166 38
PF00528 0.035 27
PF00890 0.000 25
PF07992 0.000 24
PF13183 0.124 22
PF00534 0.063 17
PF12831 0.017 13
PF13439 0.103 12
PF12800 0.363 12
PF00501 -0.035 12
PF14697 0.345 11
PF01370 0.041 10
PF00009 -0.001 10
PF04069 -0.249 9
PF13247 0.615 9
...
...
...
# Pfam entropy score: 11.511
This file shows the entropy of each pfam found in your input file and the number of peptides in your input that matched it. Domains with H’ values close to or greater than 1 are the most informative Pfam domains (enriched among S-based genomes), whereas H’ values close to 0 indicate non-informative ones. Negative values correspond to domains observed less often than expected. The calculated entropy score is shown at the bottom of this table and also appears in your output .tsv file. It is the sum of the entropy values of the pfams shown. Be aware that the score does not consider the abundance of the pfams; it only takes into account the pre-determined entropy values of the pfams that are present.
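The summation described above can be sketched in a few lines. The example assumes the three-column, whitespace-separated layout of the .score file shown earlier (Pfam accession, entropy, matched peptides); the function name is hypothetical and this is not the code used by pfam_score.pl. Each Pfam contributes its entropy once, regardless of how many peptides matched it.

```python
SCORE_EXCERPT = """\
# Pfam entropy #matched_peptides
PF00037 0.088 46
PF00005 -0.001 45
PF12838 0.123 43
...
"""


def sum_pfam_entropies(score_text: str) -> float:
    """Sum the entropy column over the Pfam rows of a .score file."""
    total = 0.0
    for line in score_text.splitlines():
        fields = line.split()
        # keep only data rows: accession, entropy, matched peptide count
        if len(fields) != 3 or not fields[0].startswith("PF"):
            continue
        total += float(fields[1])  # the peptide count (fields[2]) is ignored
    return round(total, 3)


print(sum_pfam_entropies(SCORE_EXCERPT))
```

Because the excerpt holds only the first three rows of the file, the printed sum is a partial total and is not expected to equal the 11.511 reported for the full file.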
To understand the significance of the pfam entropy score, view this Table with pre-computed Entropy Scores for a non-redundant set of microorganisms described in the original publication of MEBS. Click a cycle to sort by lowest or highest scores. The file containing the normalized scores for the same dataset is found in the data2vis/ directory.
The entropy scores can be analyzed and visualized as shown below. Please go to the next module to visualize and parse your results from mebs.pl.
Figure 4. Visualization options for the entropy scores derived from mebs.pl main script
Running MEBS: Parsing and visualization
The mebs_vis.py script uses the output generated by mebs.pl and creates an output directory containing several graphs and files for downstream analysis.
python3 mebs_vis.py -h
usage: mebs_vis.py [-h] [-o OUTDIR] [-im_format {png,pdf,ps,eps,svg,tif,jpg}]
[--im_res dpi]
filename
Parse mebs.pl output and creates several files and figures:
-File to map mebs normalized values to itol => itol_mebs.txt
-File with the metabolic completeness with names => input+completenes.tab
-File to be the output of F_MEBS_cluster.py -s none => input+2_cluster_mebs.txt
-Heatmap with normalized mebs values => inputmebs_heatmap.png
-Heatmap with metabolic completness of S and C => input+comp_heatmap.png
-Barplot with normalized mebs values => input+barplot.png
-Genomic completeness based on marker genes => input+genomic_completeness.tab
-Normalized mebs scores => input+norm_mebs.tab
positional arguments:
filename Input file derived from mebs.pl using -comp option.
optional arguments:
-h, --help show this help message and exit
-o OUTDIR, --outdir OUTDIR
Output folder [<filename>_mebs_vis]
-im_format {png,pdf,ps,eps,svg,tif,jpg}, -f {png,pdf,ps,eps,svg,tif,jpg}
Output format for images [png].
--im_res dpi, -r dpi Output resolution for images in dot per inch (dpi)
[dpi].
Example:
$ python3 mebs_vis.py gen_test.tsv
The easiest way to run the script is the following:
python3 mebs_vis.py gen_test.tsv
This will generate the following output
[OUTPUT] Output not specified.
[OUTPUT] Storing result in : gen_test.tsv_mebs_vis
[OUTPUT] ... gen_test.tsv_mebs_vis -> does not exists: creating
No handles with labels found to put in legend.
[END] Done........................
Please check the following files in folder 'gen_test.tsv_mebs_vis' :
0. Original data without asteriks: gen_test.tsv_mebs_vis/gen_test.tsv.noa
1. Heatmap displaying the metabolic completeness of N,Fe,S and CH4 pathways based on pfams: gen_test.tsvcomp_heatmap.png
2. Barplot with normalized MEBS score values: gen_test.tsv_barplot.png
3. Heatmap with normalized MEBS score values: gen_test.tsv_mebs_heatmap.png
4. Dotplot with normalized MEBS score values: gen_test.tsv_mebs_dotplot.png
5. Completeness file with description of the columns: gen_test.tsv_completenes.tab
6. Mapping file to itol with normalized MEBS scores: gen_test.tsv_itol_mebs.txt
7. Mapping file to itol with pfam metabolic completeness: gen_test.tsv_itol_mebs_comp.txt
8. File to be used as the input of F_MEBS_cluster.py -s none option gen_test.tsv_2_cluster_mebs.txt
9. Genomic completeness based on single copy marker genes: gen_test.tsv _genomic_completenes.tab
10. Normalized mebs scores gen_test.tsv_norm_mebs.tab
If you have a tree file loaded in itol, you can drag the itol mapping files (files 6 and 7 above) directly onto your tree
and customize the colors of the pathways and the scores, as in the following example:
https://itol.embl.de/tree/97981518041461538630153
Note: the -fdr option is not supported by mebs_vis.py at this time. If you run mebs.pl with an -fdr value, the resulting .tsv file contains NA values and is not compatible with mebs_vis.py: the seaborn.clustermap() function raises ValueError: The condensed distance matrix must contain only finite values.
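One way to catch this situation early is to scan the .tsv for NA cells before calling mebs_vis.py. The sketch below is a hypothetical helper that assumes tab-separated columns and that missing values are written literally as NA; the two-row table is made-up data used only to exercise the function.

```python
def cells_with_na(tsv_text: str) -> list:
    """Return (row_label, column_index) for every cell equal to 'NA'."""
    hits = []
    for line in tsv_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the command echo
        fields = line.split("\t")
        for index, value in enumerate(fields[1:], start=1):
            if value.strip().rstrip("*") == "NA":
                hits.append((fields[0], index))
    return hits


# Made-up table: one genome has a missing value in its second score column.
toy_table = "genome_1.faa\t1.000\tNA\ngenome_2.faa\t2.000\t3.000\n"

problems = cells_with_na(toy_table)
if problems:
    print("NA values found, mebs_vis.py is expected to fail:", problems)
else:
    print("No NA values found.")
```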
Normalized entropy scores
In the mebs_vis output directory you will see 11 files (4 png figures and 7 text files):
- gen_test.tsv.noa
- gen_test.tsv_2_cluster_mebs.txt
- gen_test.tsv_barplot.png
- gen_test.tsv_comp_heatmap.png
- gen_test.tsv_genomic_completenes.tab
- gen_test.tsv_itol_mebs.txt
- gen_test.tsv_itol_mebs_comp.txt
- gen_test.tsv_mebs_dotplot.png
- gen_test.tsv_mebs_heatmap.png
- gen_test.tsv_norm_mebs.tab
- gen_test.tsv_pfam_completenes.tab
Let’s explore the files:
The files containing the normalized entropy scores are:
- gen_test.tsv_norm_mebs.tab (raw normalized entropy scores)
- gen_test.tsv_itol_mebs.txt (file to drag to itol to plot the normalized entropy scores)
The above-mentioned files are used as input to generate the following figures:
- gen_test.tsv_mebs_dotplot.png
- gen_test.tsv_barplot.png
- gen_test.tsv_mebs_heatmap.png
- mebs_itol.jpeg. [Be aware] This file is not generated by the mebs_vis.py script; the user has to create it by dragging file number 6 in the list above into itol
Figure 5. Dotplot of the normalized scores for 8 example genomes
Figure 6. Barplot of the normalized scores for 8 example genomes
Figure 7. Heatmap of the normalized scores for 8 example genomes
Figure 8. Itol visualization using the normalized scores, see interactive version here
Metabolic pathway completeness
By using the -comp option, MEBS will use the predetermined mapping files for S, N, C, O and Fe located in the cycles directory, named pfa2kegg.tab.
For example, for the sulfur pathways the mapping file looks like this:
PFAM KO PATHWAY PATHWAY NAME
PF00890 K00394 1 aprAB
PF02910 K00394 1 aprAB
PF12139 K00395 1 aprAB
PF12838 1 aprAB
PF01087 2 apt/sat
PF01747 2 apt/sat
PF14306 2 apt/sat
PF01077 K11180 3 dsrABC
PF03460 K11180 3 dsrABC
PF04358 3 dsrABC
PF00037 K11181 3 dsrABC
PF04358 3 dsrABC
PF02872 K17224 4 Sox system
PF02872 K17224 4 Sox system
PF00034 0 4 Sox system
PF13442 0 4 Sox system
This is the mapping file that will explain the .tsv output from running mebs.pl. For example, “sulfur_1” corresponds to the pfams analyzed for “aprAB”. If your table shows 100% for sulfur_1, this indicates all 4 pfams considered under pathway 1 (aka sulfur_1) are present, and thus aprAB is present.
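The percentage reading described above can be illustrated with a small calculation. The sketch assumes that completeness is the share of a pathway's mapped pfams that were detected, which is how the text interprets 100% for sulfur_1; the function name and the set of detected pfams are hypothetical, and how mebs.pl treats pfams listed more than once in the mapping file is not specified here.

```python
def pathway_completeness(pathway_pfams: dict, detected: set) -> dict:
    """Percentage of each pathway's pfams found in the set of detected pfams."""
    result = {}
    for pathway, pfams in pathway_pfams.items():
        unique = set(pfams)  # assumption: each pfam counts once per pathway
        result[pathway] = round(100.0 * len(unique & detected) / len(unique), 1)
    return result


# Pathways 1 and 2 as listed in the sulfur mapping file above.
sulfur_mapping = {
    "sulfur_1": ["PF00890", "PF02910", "PF12139", "PF12838"],  # aprAB
    "sulfur_2": ["PF01087", "PF01747", "PF14306"],             # apt/sat
}

# Hypothetical set of pfams detected in one genome.
detected = {"PF00890", "PF02910", "PF12139", "PF12838", "PF01087", "PF01747"}

print(pathway_completeness(sulfur_mapping, detected))
```

In this toy case all four aprAB pfams are detected, so sulfur_1 is 100.0, while two of the three apt/sat pfams give 66.7 for sulfur_2.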
The following output files list the pathway names in tabular format together with their mapped pathway labels. In other words, they contain the presence/absence profile of the pfams belonging to either pathways or marker genes for each of the benchmarked cycles:
- gen_test.tsv_pfam_completenes.tab
- gen_test.tsv_itol_mebs_comp.txt (same file to be used to map into itol)
This profile file is used as input to generate the following heatmap.
Figure 9. Heatmap displaying the abundance profile of protein domains specific to biogeochemical cycles.
Genomic completeness
The new module incorporated in MEBS allows you to assess the completeness of your genomes or MAGs. The single-copy marker genes used in this analysis are found in the mapping file inside the markers directory. MEBS computes genomic completeness based on single-copy marker genes found in either bacteria, archaea, or both. Currently, MEBS uses a set of marker genes previously described in Gutierrez-Preciado et al., 2018 and a second set of marker genes incorporated in miComplete.
As a comparison, I computed the genomic completeness of the genomes found in gen_test with three different programs: MEBS, miComplete, and checkM.
As you can see in the graph below, the MEBS completeness values are comparable to the ones obtained with the other two widely used programs. This provides another method for quality control analysis of genomes and MAGs. MEBS further offers the advantage of computing both genomic completeness and metabolism. Additionally, because it is based on the completeness of single-copy marker genes in archaea or bacteria, it provides a preliminary taxonomic classification ahead of downstream analyses.
Figure 10. Comparison of genomic completeness computed with three different methods: mebs,
Running MEBS: Protein clustering
Protein clustering reduces the dimensionality of your data, which is crucial in the era of large omics datasets. With this module, you can analyze thousands of genomes by clustering them based on their predicted proteins, thereby collapsing these genomes into metabolic guilds, clusters, or groups. These guilds indicate which organisms perform specific functions or have increased fitness in specific niches, and they highlight potential resource partitioning within interwoven communities. All of this complements the previously described mebs analyses (normalized entropy scores and metabolic pathway completeness) of the individual genomes.
How to use mebs_clust.py with the Pfam v34.0 (October 2021, 19,179 entries)
See the MEBS release PfamV34, which contains the latest Pfam database and the corresponding files needed to obtain the presence/absence profile.
Warning! This is a heavy file, so make sure you have enough disk space before downloading it. The compressed file is 274 MB and the uncompressed file is 1.85 GB.
- First create a directory called pfam in the cycles directory, and cd into that directory
cd cycles/ && mkdir pfam && cd pfam
- Inside the pfam directory, download the current pfamv34 MEBS release
wget https://github.com/valdeanda/mebs/releases/download/Pfamv34_18102021/PfamV34_4MEBS_clustering.tar
# Uncompress the data. This might take a while due to the size of the database
tar -xvzf PfamV34_4MEBS_clustering.tar
x PfamV34_mapping.tab
x config.txt
x entropies.tab
x my_Pfam.pfam.hmm
## If you have curiosity, here is a command you can run to extract all the pfams that are currently available in the Pfamv34 database
grep ACC my_Pfam.pfam.hmm| cut -f 4 -d " " | cut -f 1 -d "." |sed '/^PF/!d' > pfamv34list.txt
less pfamv34list.txt |wc
rm PfamV34_4MEBS_clustering.tar
._config.txt
- Copy the config file into the config directory. This config file already contains path information required for mebs.pl
cp config.txt ../../config
# Go to the main mebs path
cd ../../
- Run mebs.pl as described previously to generate the output file, which will contain the presence/absence information of all your pfam domains. This process will take a while, depending on the number of input genomes. For example, with gen_test it takes ~30 min for MEBS to search all the pfams in the PfamV34 database. The second time you run the same command, it will only take about 1 min to generate the profile.
perl mebs.pl -input gen_test/ -type genomic -comp > pfamv34_profile.tsv
# mebs.pl -input gen_test/ -type genomic -fdr 0.01 -comp 1
- Extract only the columns containing the information of the pfams.
cut -f 1,83-19262 pfamv34_profile.tsv > input4clustering.tab
- Run mebs_clust.py with the input file generated by the above command.
The main script mebs_clust.py computes a clustering analysis based on a presence/absence matrix. In this case we use a presence/absence matrix of pfam domains, but you can use any type of presence/absence data. If no argument is given, the script computes the clustering using the Jaccard distance metric and the Ward method. For abundance data, a Bray-Curtis distance matrix is recommended.
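To make the default distance concrete, the sketch below computes the Jaccard distance between presence/absence profiles in plain Python. It only illustrates the metric; it is not the code used by mebs_clust.py, and the three toy profiles are made up. The metric and method names in the help text match those used by SciPy's distance and linkage functions, which suggests, but does not confirm, that the script builds on SciPy.

```python
from itertools import combinations


def jaccard_distance(a, b) -> float:
    """Jaccard distance between two binary presence/absence vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)    # present in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # present in at least one
    return 0.0 if either == 0 else 1.0 - both / either


# Made-up profiles: one row per genome, one column per pfam domain.
profiles = {
    "genome_A": [1, 1, 0, 1],
    "genome_B": [1, 0, 0, 1],
    "genome_C": [0, 0, 1, 0],
}

for left, right in combinations(profiles, 2):
    distance = jaccard_distance(profiles[left], profiles[right])
    print(f"{left} vs {right}: {distance:.3f}")
```

Genomes A and B share two of the three domains present in either, so their distance is small, whereas genome C shares no domain with the others and sits at the maximum distance of 1. A hierarchical method such as Ward then merges the closest profiles first.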
python3 mebs_clust.py -h
usage: mebs_clust.py [-h] [--outdir OUTDIR]
[-im_format {png,pdf,ps,eps,svg,tif,jpg}] [--im_res dpi]
[--cutoff CUTOFF]
[--method {single,complete,average,weighted,centroid,median,ward}]
[--distance {braycurtis,canberra,chebyshev,cityblock,correlation,euclidean,jaccard,mahalanobis}]
[--nolegend]
filename
positional arguments:
filename Input file pfam profile.
optional arguments:
-h, --help show this help message and exit
--outdir OUTDIR, -o OUTDIR
Output folder [<filename>_mebs_clust]
-im_format {png,pdf,ps,eps,svg,tif,jpg}, -f {png,pdf,ps,eps,svg,tif,jpg}
Output format for images [pdf].
--im_res dpi, -r dpi Output resolution for images in dot per inch (dpi)
[dpi].
--cutoff CUTOFF, -co CUTOFF
Cutoff threashold default 0.5
--method {single,complete,average,weighted,centroid,median,ward}, -m {single,complete,average,weighted,centroid,median,ward}
methods for calculating the distance between the newly
formed clusters [ward].
--distance {braycurtis,canberra,chebyshev,cityblock,correlation,euclidean,jaccard,mahalanobis}, -d {braycurtis,canberra,chebyshev,cityblock,correlation,euclidean,jaccard,mahalanobis}
The distance metric to use default [jaccard].
--nolegend, -nl If specified, the legend is not shown
Example:
$ python3 mebs_clust.py pfamprofile.tsv
Run the script with basic options
python3 mebs_clust.py input4clustering.tab
[OUTPUT] Output not specified.
[OUTPUT] Storing result in : input4clustering.tab_mebs_clust
[OUTPUT] ... input4clustering.tab_mebs_clust -> does not exists: creating
[INFO] Output path: input4clustering.tab_mebs_clust
[END] mebs_cust.py done 9.......................
[END] Please check your output files in folder 'input4clustering.tab_mebs_clust' :
This will generate a directory containing the information of the clustering.
- Individual files of each cluster containing the genomes belonging to that cluster
- Figure with the cluster and the specific cut-off used.
Figure 11. Clustering dendrogram using as input the genomes in the gen_test directory and the mebs_clust.py default settings
Citing papers
Overall cycles through time
In De Anda et al., 2018, we integrated the main entropy scores into a time-series analysis spanning a 2-year period to monitor fluctuations in the microbial-metabolic biogeochemical cycles of microbial mats under perturbed conditions.
Protein clustering
In Langwig-De Anda et al. 2021 we compared the protein content (>17 000 protein domains) over 1 500 Desulfobacterota, Myxococcota, and SAR324 genomes including 402 new MAGs reconstructed from hydrothermal vent and coastal bay sediments to understand their metabolic and ecological capabilities.
Mapping specific genes into a phylogenetic tree
Screening and selecting a specific set of pfam domains to map them onto a phylogeny