1 Requirements

To use MEBS2, the only requirement is to know which genomes are involved in a given metabolism. As an example, the test_input directory contains a list of 97 genera of sulfate/elemental sulfur-reducing microorganisms described in MEBSv1.

1.1 Input list

First, generate a file listing the genera of interest, one per line.

head test_input/srb.txt 
Desulfobacula
Desulfocaldus
Desulfocella
Desulfofrigus
Desulfonatronum
Desulfonauticus
Desulforegula
Desulforhabdus
Desulfospira
Desulfothermus
Desulfotignum
Thermodesulforhabdus
Archaeoglobus

1.2 Search the complete and non-redundant genomes of your input list

We provide the file assembly_refseq.nr2016.txt, derived from the RefSeq assembly summary file. For more information, see Stage 1 of MEBSv1.

while read -r genus; do grep "$genus" assembly_refseq.nr2016.txt; done < test_input/SRB.txt | cut -f 1,8 >> test_input/genomes_SRB.txt

From the 97 genera in the input list, we obtain a total of 207 complete and non-redundant genomes, with at least two genomes per genus.

sort genomes_SRB.txt | uniq -c | sort -rn
      4 GCF_000243155.2 Desulfitobacterium dehalogenans ATCC 51507
      4 GCF_000243135.2 Desulfitobacterium dichloroeliminans LMG P-21439
      4 GCF_000231405.2 Desulfitobacterium metallireducens DSM 15288
      4 GCF_000020365.1 Desulfobacterium autotrophicum HRM2
      4 GCF_000010045.1 Desulfitobacterium hafniense Y51
      2 GCF_001592435.1 Thermococcus peptonophilus
      2 GCF_001553605.1 Desulfovibrio fairfieldensis
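
The counts in the first column of the listing above show that the same accession can be written to genomes_SRB.txt more than once. In addition, a plain grep matches substrings, so a genus name such as Desulfobacter would also select Desulfobacterium genomes. The following is a minimal sketch, not the procedure used by MEBS, of a stricter selection. It assumes a grep that supports -w (whole-word matching, available in GNU and BSD grep) and runs on small made-up demo files: the accessions GCF_A.1 to GCF_C.1, the dummy columns and the file names under the temporary directory are all hypothetical.

```shell
work=$(mktemp -d)

# Demo stand-ins for the genus list and the assembly summary
# (8 tab-separated columns; columns 2-7 hold dummy values).
printf 'Desulfobacter\nArchaeoglobus\n' > "$work/genera.txt"
{
    printf 'GCF_A.1\tx\tx\tx\tx\tx\tx\tDesulfobacter postgatei\n'
    printf 'GCF_B.1\tx\tx\tx\tx\tx\tx\tDesulfobacterium autotrophicum HRM2\n'
    printf 'GCF_C.1\tx\tx\tx\tx\tx\tx\tArchaeoglobus profundus DSM 5631\n'
    printf 'GCF_C.1\tx\tx\tx\tx\tx\tx\tArchaeoglobus profundus DSM 5631\n'
} > "$work/assembly.txt"

# Select whole-word genus matches, keep accession and organism name
# (columns 1 and 8), and drop repeated lines.
while read -r genus; do
    [ -n "$genus" ] || continue          # skip empty lines in the list
    grep -w -- "$genus" "$work/assembly.txt"
done < "$work/genera.txt" | cut -f 1,8 | sort -u > "$work/genomes.txt"

# Number of distinct genomes selected.
n_genomes=$(wc -l < "$work/genomes.txt" | tr -d ' ')
cat "$work/genomes.txt"
```

On real data, the same loop would read test_input/SRB.txt and assembly_refseq.nr2016.txt instead of the demo files.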

1.3 Get the entropies of your input genomes

perl entropy.pl 

Program to compute Pfam entropies from a list of accession genomes of interest.

usage: entropy.pl [options] 

 -help          Brief help message
 
 -input_dom     tbl-format file with HMM matches produced by hmmsearch (required)

 -input_list     list of selected genomes of interest                   (required) 

 -names         optional list of Refseq assembly annotations to print
                scientific names instead of accesion codes             (optional)

1.4 GenF against the Pfam-A database

We have previously annotated the Gen and GenF datasets (release TODO), 88 GB in total. Directory to download them from: /realease… TODO


genomes_refseq_nr_22122016.faa.all.pfam.tab
genomes_refseq_nr_22122016_size100_cover10.faa.all.pfam.tab
genomes_refseq_nr_22122016_size150_cover10.faa.all.pfam.tab
genomes_refseq_nr_22122016_size200_cover10.faa.all.pfam.tab
genomes_refseq_nr_22122016_size250_cover10.faa.all.pfam.tab
genomes_refseq_nr_22122016_size300_cover10.faa.all.pfam.tab
genomes_refseq_nr_22122016_size30_cover10.faa.all.pfam.tab
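genomes_refseq_nr_22122016_size60_cover10.faa.all.pfam.tab

Compute the entropies for each of these files (the loop assumes they are stored in a directory named GenF_Pfam):

for i in GenF_Pfam/*.pfam.tab; do
    perl entropy.pl -input_dom "$i" -input_list genomes_SRB.txt > "$i.SRB.csv"
done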

1.5 Generate entropy file

Warning: all the files must contain the same number of domains (profiles), in the same column order. The script assumes that this is true and does not check the format of the input files, so it will not detect such errors.
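
Since the script does not validate its input, a quick check before running it can catch column mismatches. The sketch below compares the first line of every file with the first line of the first file. It assumes that each file starts with a header line naming the domains, which this document does not confirm. The demo file names are made up and the Pfam identifiers were chosen arbitrarily.

```shell
# Demo files standing in for the entropy.pl outputs (hypothetical content).
dir=$(mktemp -d)
printf 'genome\tPF00001\tPF00002\n' > "$dir/a.SRB.csv"
printf 'genome\tPF00001\tPF00002\n' > "$dir/b.SRB.csv"
printf 'genome\tPF00002\tPF00001\n' > "$dir/c.SRB.csv"   # columns swapped

# Print the number of .csv files in directory $1 whose first line
# differs from the first line of the first file.
check_headers() {
    ref=""
    bad=0
    for f in "$1"/*.csv; do
        header=$(head -n 1 "$f")
        if [ -z "$ref" ]; then
            ref=$header                  # first file sets the reference
        elif [ "$header" != "$ref" ]; then
            echo "header mismatch: $f" >&2
            bad=$((bad + 1))
        fi
    done
    echo "$bad"
}

mismatches=$(check_headers "$dir")
echo "files with a different header: $mismatches"
```

A result other than 0 means that at least one file lists its domains differently and should be regenerated before the entropies are extracted.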

python3 scripts/extract_entropies.py test_input_entropy

The above command generates the entropy file required to compute the score:


head test_input_entropy_entropies.tab 

For practical reasons, first rename the entropy file to entropies.tab and then move it into the input directory. To compute the score for several metabolisms, each input directory must contain its own entropy file and the list of genomes of interest.

mv test_input_entropy_entropies.tab test_input/entropies.tab

2 Scoring your data

To run the mebsv2.pl script, you need:

  • The entropy file generated with the command above, inside the test_input directory
  • The current Pfam-A database
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
  • A config file listing the path to your data: the first column is the name of your metabolism, the second column is the path to your input data, and the third column is the completeness file. TODO explain a bit more
less -S config/config.txt
Cycle   Path    Comple
srb     test_input/
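
Because each metabolism listed in the config file needs its own entropy file, it can help to verify this before scoring. The following sketch assumes that config.txt is whitespace-separated with one header line, as in the listing above, that paths are relative to the directory holding config/, and that the entropy file is always named entropies.tab. The temporary directory layout is a made-up demo.

```shell
# Demo layout standing in for config/config.txt and the input directory.
root=$(mktemp -d)
mkdir -p "$root/config" "$root/test_input"
printf 'Cycle\tPath\tComple\nsrb\ttest_input/\n' > "$root/config/config.txt"
: > "$root/test_input/entropies.tab"     # empty placeholder entropy file

# Count the metabolisms whose directory lacks entropies.tab.
missing=0
{
    read -r _header                      # skip the "Cycle Path Comple" line
    # comp (third column, completeness file) is read but not checked here.
    while read -r cycle path comp; do
        [ -n "$cycle" ] || continue      # ignore blank lines
        if [ ! -f "$root/$path/entropies.tab" ]; then
            echo "no entropies.tab for $cycle in $path" >&2
            missing=$((missing + 1))
        fi
    done
} < "$root/config/config.txt"
echo "metabolisms without entropy file: $missing"
```
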
perl mebsv2.pl 

  Program to compute MEBS for a set of genomic/metagenomic FASTA files in input folder.
  
  usage: mebsv2.pl [options] 

   -help    Brief help message
   
   -input   Folder containing FASTA peptide files (.faa)             (required)

   -type    Nature of input sequences, either 'genomic' or 'metagenomic'  (required)
   
   -db      Database to scann your input files                (required, default Pfam-A.hmm)
   

   -comp    Compute the metabolic completeness                            (optional)

Warning: this step is currently NOT RUNNING because of an error locating the HMM file:


perl mebsv2.pl -input test_genomes  -type genomic -db Pfam-A.hmm 
# mebsv2.pl -input test_genomes -type genomic  -comp 

1
    srb
Enterococcus_durans.faatest_input/

Error: File existence/permissions problem in trying to open HMM file 1.
HMM file 1 not found (nor an .h3m binary of it)

# ERROR: failed to generate test_genomes/Enterococcus_durans.faa.srb.hmmsearch.tab
    NA
Archaeoglobus_profundus_DSM_5631.faatest_input/

Error: File existence/permissions problem in trying to open HMM file 1.
HMM file 1 not found (nor an .h3m binary of it)

# ERROR: failed to generate test_genomes/Archaeoglobus_profundus_DSM_5631.faa.srb.hmmsearch.tab
    NA
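
In the log above, hmmsearch is asked to open an HMM file named 1 rather than Pfam-A.hmm, so the database path apparently does not reach it. The cause is not established here. As a small pre-flight sketch, the hypothetical helper check_db below only verifies that a database path points to a readable, non-empty file before anything is launched. It does not fix mebsv2.pl itself.

```shell
# Return 0 if the given HMM database is a readable, non-empty file.
check_db() {
    if [ -r "$1" ] && [ -s "$1" ]; then
        return 0
    fi
    echo "HMM database not found or empty: $1" >&2
    return 1
}

# Demo: a placeholder file passes, a path ending in the literal value 1
# (as seen in the log) does not.
demo=$(mktemp -d)
echo "placeholder" > "$demo/Pfam-A.hmm"
if check_db "$demo/Pfam-A.hmm"; then ok_real=yes; else ok_real=no; fi
if check_db "$demo/1" 2>/dev/null; then ok_one=yes; else ok_one=no; fi
echo "$ok_real $ok_one"
```
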