Readme MetaGaia

About MetaGaia

MetaGaia is a pipeline that streamlines calculating bin abundances, identifying the metabolic pathways present in each bin, and finding the metabolic pathways shared between phages and their hosts. Performing each of these functions helps characterize microbial communities and the dominant microbes within them. The pipeline is divided into three sections that perform different functions: 1) calculating bin abundances (abundance), 2) identifying metabolic pathways present in each bin (metabolism), and 3) finding commonalities between phages and their hosts (host-virus). Multiple requirements are needed and are specified in the "MetaGaia requirements" section of this protocol.

MetaGaia installation

The MetaGaia software is available as an open-source package distributed from a GitHub repository, so the natural way to install it is to clone the repository with the following command:

git clone https://github.com/valdeanda/MetaGaia

Python installation requirements (Linux)

To install Python 3, pip, and the required Python packages, run the following commands:

sudo apt-get install python3
sudo apt install python3-pip
pip install pandas
pip install numpy
pip install seaborn

Install dependencies with Conda (Linux)

To install all of the above dependencies with Conda using the supplied .yml file, run conda env create -f metagaia_conda_noarch.yml

MetaGaia requirements

NOTE: Please do NOT use symbolic links for fastq or fna files. Also, make sure headers are present and properly separated in input files.

Calculating bin abundances (abundance)

  1. All necessary fastq files have been downloaded and placed within the same directory. All files with the .fastq.gz extension in a directory will be analyzed.
  2. All necessary fna files have been downloaded and placed within the same directory.
  3. All necessary depth text files have been downloaded and placed within the same directory or a concatenated depth file has already been created.
  4. A file has been created that maps each bin to its respective sample site (see “Mapping bins to sample sites” for more information).
  5. The GTDB-Tk tool has been downloaded to map each bin to its respective taxa (see "Mapping taxa" for more information).

Identifying metabolic pathways (metabolism)

  1. The mapping_file.tsv file created in the "abundance" section of the pipeline.
  2. An account with the Integrated Microbial Genomes with Microbiome Samples (IMG) system is needed, along with the IMG annotations of all of the samples being analyzed.
  3. A file used to combine the annotation files (see "Creating IMG annotation" for more information).
  4. A user created file that contains the KEGG, COG, PFAM, and/or EC_NUMBER pathways of interest.

Finding commonalities between phages and their hosts (host-virus)

  1. The “database_counts” (complete_metabolic_profile.tsv) file from the metabolic_profile.py script that contains only one of the following databases: KEGG, COG, PFAM, or EC_NUMBER.
  2. The output from Vibrant that maps phage scaffolds to metabolic pathways.

Running abundance

Creating bin abundance input files

NOTE: To create the bin_sample file, read the "Mapping bins to sample sites" section below. To create the depth file, read the "Create initial depth files" section below as well.

  1. To create all the necessary input files, run the files_prep.py script. Two input files are required (a concatenated depth file is optional), along with the directories where the fastq and fna (and, optionally, depth) files are located (see "Download fastqs" for more information on where to obtain these). All fastq files must end with ".fastq.gz". One of the depth arguments must be present: depth or depth_dir. These are the arguments necessary to run this script:
usage: files_prep.py [-h] -b BIN_SAMPLES [-d DEPTH] [-f DEPTH_DIR] -q FASTQ_DIR -a FNA_DIR

optional arguments:
  -h, --help            show this help message and exit
  -b BIN_SAMPLES, --bin_samples BIN_SAMPLES
                        Input file containing 2 columns: 1st column contains bin names and 2nd column contains sample names. Requires header for each column, "Bin" and "Sampling_Site"
                        respectively. File must have tsv, csv, or txt file extension.
  -d DEPTH, --depth DEPTH
                        Input file containing 5 columns: contigName, contigLen, totalAvgDepth, Sample_Depth, and Depth. This file can be created with jgi_summarize_bam_contig_depths. File must
                        have tsv, csv, or txt file extension.
  -f DEPTH_DIR, --depth_dir DEPTH_DIR
                        Input directory path containing all the depth txt files that must be concatenated. No other txt files should be present within the directory!
  -q FASTQ_DIR, --fastq_dir FASTQ_DIR
                        Input directory path containing all the fastq files.
  -a FNA_DIR, --fna_dir FNA_DIR
                        Input directory path containing all the fna files.

Example of a concatenated depth file. NOTE: Make sure the scaffold names in the depth file(s) contain the unique identifier corresponding to the bin that each scaffold originates from. The Sample names in the depth file may need to be edited by the user to match the Sample names in reads_file.tsv. Also, the "Original_Contig_Name" column in both the depth and mapping files must match; append the respective "Sampling_Site" prefix to each scaffold in the depth file before running the script (see "Create initial depth files" for more information, and the prefix sketch after the example below). WARNING: Samples without depth values are removed if present.

head depth_file.tsv
contigName contigLen TotalAvgDepth SF_Jul11_USGS_4_1-sorted.bam
S4B_scaffold_41967_c1 2430.0 15.7917 15.7917
S4B_scaffold_22085_c1 3249.0 12.081 12.081
S4B_scaffold_39665_c1 2489.0 9.998289999999999 9.998289999999999
S4B_scaffold_36461_c1 2583.0 8.70695 8.70695
S4B_scaffold_62858_c1 2036.0 11.1448 11.1448
S4B_scaffold_47653_c1 2298.0 10.1196 10.1196
S4B_scaffold_28262_c1 2900.0 8.95927 8.95927
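
If the prefix still needs to be added, a minimal sketch like the one below can do it, assuming pandas is installed and using an illustrative per-sample file name and site identifier:

import pandas as pd

site_prefix = "S4B"  # hypothetical Sampling_Site identifier for this sample
depth = pd.read_csv("depth_sample.txt", sep="\t")  # hypothetical per-sample depth file

# Prepend the site identifier only where it is not already present
needs_prefix = ~depth["contigName"].str.startswith(site_prefix + "_")
depth.loc[needs_prefix, "contigName"] = site_prefix + "_" + depth.loc[needs_prefix, "contigName"]

depth.to_csv("depth_sample_prefixed.txt", sep="\t", index=False)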

Example of bin_sample file ("Sampling_Site" is a column of unique identifiers that identify the sample where each scaffold originates):

head bin_sample.tsv
Bin Sampling_Site
NoBin S0A
13_Jan_SF_Bin44 S4B
14_Jan_SF_Bin36 S5A
21_Jan_SF_Bin21 S2C
24_Jan_SF_Bin43 S4D

Here is an example of how to run this script:

python3 files_prep.py -b ../../data/example_input_files/files_prep/bin_sample.tsv -d ../../data/example_input_files/files_prep/depth_file.tsv -q ~/fastq_files/ -a ~/fna_files/

or (if depth files are not already concatenated):

python3 files_prep.py -b ../../data/example_input_files/files_prep/bin_sample.tsv -f ~/depth_files/ -q ~/fastq_files/ -a ~/fna_files/
  2. There should be four files named "reads_file.tsv" (Sample names may need to be edited by the user), "mapping_file.tsv", "depth_file.tsv", and "binsize_file.tsv" saved in the output folder.

  3. From this output, you need to rename reads_file.tsv and depth_file.tsv. These files will look like this:

head reads_file.tsv
Sample Reads
/home/mlangwig/Working_Dir/SF_Bay/fastq/SF_Jul11_sed_USGS_4_1_filter-METAGENOME.fastq.gz 135081488
/home/mlangwig/Working_Dir/SF_Bay/fastq/SF_Jan12_sed_USGS_4_1_filter-METAGENOME.fastq.gz 140529442
head depth_file_renamed.tsv
Original_Contig_Name contigLen totalAvgDepth Sample Depth
S4C_scaffold_7529_c1 6042.0 23.1232 SF_Jul11_USGS_4_1-sorted.bam 23.1232
S4C_scaffold_8778_c1 5493.0 9.78027 SF_Jul11_USGS_4_1-sorted.bam 9.78027
  4. Rename reads_file.tsv and depth_file.tsv so that the Sample column in reads_file.tsv matches the Sample column in depth_file.tsv. We could not automate this because different projects have unique and specific sample names:
head reads_file_renamed.tsv
Sample Reads
SF_Jul11_sed_USGS_4_1 135081488
SF_Jan12_sed_USGS_4_1 140529442
head depth_file_renamed.tsv
Original_Contig_Name contigLen totalAvgDepth Sample Depth
S4C_scaffold_7529_c1 6042.0 23.1232 SF_Jul11_USGS_4_1 23.1232
S4C_scaffold_8778_c1 5493.0 9.78027 SF_Jul11_USGS_4_1 9.78027

The following examples show how you could potentially rename these files:

To rename the Sample column in the reads_file.tsv, for example:

sed 's/\/fastq\//\t/g' reads_file.tsv | cut -f2,3 | sed 's/_filter-METAGENOME.fastq.gz//g' | sed 's/Reads/Sample\tReads/g' | sed 's/sed_//g' > reads_file_renamed.tsv

To rename the Sample column in the depth_file.tsv, example:

  1. View the unique names that need to be renamed:
cut -f3 depth_file.tsv | sort | uniq
  2. Rename:
sed 's/-sorted.bam//g' depth_file.tsv | sed 's/\.fasta.2000kb//g' > depth_file_renamed.tsv
  3. Make sure the names in the Sample column of depth_file_renamed.tsv match the names in the Sample column of reads_file_renamed.tsv:
cut -f3 depth_file_renamed.tsv | sort | uniq

cut -f1 reads_file_renamed.tsv | sort
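
As a final sanity check, a short sketch like this (assuming pandas; file names as above) compares the two Sample columns and prints any names that still disagree:

import pandas as pd

reads = pd.read_csv("reads_file_renamed.tsv", sep="\t")
depth = pd.read_csv("depth_file_renamed.tsv", sep="\t")

# Any name printed here still needs renaming on one side or the other
print("In depth but not in reads:", set(depth["Sample"]) - set(reads["Sample"]))
print("In reads but not in depth:", set(reads["Sample"]) - set(depth["Sample"]))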

Calculate relative bin abundance

  1. The bin_abundance.py script takes a list of input files (produced by running the files_prep.py script) that need to be created beforehand. The usage below describes each input file. A few reminders:
    • make sure the headers of the input files are tab separated
    • make sure you give absolute paths to files (no symbolic links)
    • make sure readablenum is large enough to convert all RelativeAbundance values to greater than 1 (a check sketch follows step 2 below)
usage: bin_abundance.py [-h] -r READS -m MAPPING -d DEPTH -s BINSIZE
                        [-n READABLENUM]

optional arguments:
  -h, --help            show this help message and exit
  -r READS, --reads READS
                        Total number of reads. Rows are samples column are
                        reads
  -m MAPPING, --mapping MAPPING
                        Tabular file containing Original_Contig_Name Bin Sample
  -d DEPTH, --depth DEPTH
                        Tabular file depth info Original_Contig_Name contigLen
                        Sample_Depth Depth
  -s BINSIZE, --binsize BINSIZE
                        Tabular file with Bin and corresponding Genome size
                        (bp) as columns
  -n READABLENUM, --readablenum READABLENUM
                        A large number to multiply the relative abundances so
                        that it is human readable.

Here is an example of how to run the script (using example data provided in the data folder):

python3 bin_abundance.py -r ../../data/example_input_files/bin_abundance/reads_file_renamed.tsv -m ../../data/example_input_files/bin_abundance/mapping_file.tsv -d ../../data/example_input_files/bin_abundance/depth_file_renamed.tsv -s ../../data/example_input_files/bin_abundance/binsize_file.tsv -n 1000000
  2. The output will be a file saved in the output directory. This file can then be used with the next script, bin_abundance_viz.py, to visualize the data.
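
To confirm that readablenum was large enough, a quick check along these lines may help (a sketch assuming pandas; the output file name is taken from the visualization example below, and the column name from the bin_abundance example):

import pandas as pd

abund = pd.read_csv("../output/bin_abundance_output.tsv", sep="\t")
too_small = abund[abund["RelativeAbundanceReadable"] <= 1]
if not too_small.empty:
    # Rerun bin_abundance.py with a larger -n value
    print(len(too_small), "rows have RelativeAbundanceReadable <= 1")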

Visualize relative bin abundance

ALL input files for this step should be in either tsv or csv format.

  1. With the bin_abundance_viz.py script, the file outputted from bin_abundance.py can be used to visualize the relative bin abundances, together with a file mapping each bin to its respective taxa (see "Mapping taxa" for more information) and a file mapping each sample to its respective site (see the sample2site example below). These are the arguments needed for the visualization script:
usage: bin_abundance_viz.py [-h] -b BIN_ABUNDANCE -t TAXONOMY_INFO -s
                            SAMPLE2SITE [-p PERCENT] [-w WIDTH] [-l HEIGHT]
                            [-d DPI] [-o OUT_FIG] [-c TAXA_COLOR]

optional arguments:
  -h, --help            show this help message and exit
  -b BIN_ABUNDANCE, --bin_abundance BIN_ABUNDANCE
                        Input tsv file outputted from the bin_abundance.py
                        script with file extension.
  -t TAXONOMY_INFO, --taxonomy_info TAXONOMY_INFO
                        Input tsv or csv file mapping taxonomy to each bin
                        with file extension.
  -s SAMPLE2SITE, --sample2site SAMPLE2SITE
                        Input tsv or csv file containing each sample to the
                        site it was taken from with file extension [""].
  -p PERCENT, --percent PERCENT
                        Percent of highest sample in each bin [10].
  -w WIDTH, --width WIDTH
                        Width of outputted clustermap figure [4].
  -l HEIGHT, --height HEIGHT
                        Height of outputted clustermap figure [5].
  -d DPI, --dpi DPI     Resolution for output figure file [300].
  -o OUT_FIG, --out_fig OUT_FIG
                        Stores the figure in the specified file path and
                        format [test.png].
  -c TAXA_COLOR, --taxa_color TAXA_COLOR
                        Input tsv or csv file containing the RGB color code
                        for each taxa with file extension [""].

Example of bin_abundance file:

head bin_abundance.tsv
Bin Sample Sampling_Site RelativeAbundance RelativeAbundanceReadable
13_May_SF_Bin93 SF_May12_USGS_13 S1B 6.6878693485613E-08 6.687869348561300
8_1_Oct_SF_Bin28 SF_Oct11_USGS_8_1 S5D 5.76751058810138E-08 5.767510588101380
4_1_Jan_SF_Bin44 SF_Jan12_USGS_4_1 S4A 1.52873009442108E-07 15.28730094421080
24_July_SF_Bin15 SF_Jul11_USGS_24 S3C 9.88355407641773E-08 9.883554076417730

Example of taxonomy_info file:

head taxonomy_info.tsv
Bin Taxa
13_Jan_SF_Bin36 Pacearchaeota
13_Jan_SF_Bin40 Thaumarchaeota_Nitrosopumilus
13_Jan_SF_Bin52 Thaumarchaeota_Nitrosopumilus
13_Jan_SF_Bin61 Thermoplasmata

Example of sample2site file - the Sample names should match the names in the Sample column of reads_file.tsv and depth_file_renamed.tsv; the Site names are the physical location’s name/code that each sample was taken from; each Sample should contain a unique Site name:

head sample2site.tsv
Sample Site
SF_Jan12_USGS_1 21
SF_Jan12_USGS_2 24
SF_Jul11_USGS_3 13
SF_Jul11_USGS_1 21

Example of taxa_color file:

head taxa_color.tsv
#a3c1ad
#ede4d9
#426ca6
#ffdab9
#fff5e5
#ffefd5
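
One possible way to generate such a file is to derive one hex code per taxon from the taxonomy_info file; this is only a sketch, using the seaborn dependency and an arbitrary palette name:

import pandas as pd
import seaborn as sns

taxa = pd.read_csv("taxonomy_info.tsv", sep="\t")

# One hex RGB code per unique taxon, one code per line, no header
hex_codes = sns.color_palette("husl", taxa["Taxa"].nunique()).as_hex()
pd.Series(hex_codes).to_csv("taxa_color.tsv", index=False, header=False)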

Here is an example of how to run this script:

python3 bin_abundance_viz.py -b ../output/bin_abundance_output.tsv -t ~/tax2bin.tsv -p 10 -w 8 -l 5 -d 500 -o figure1.jpg -c ~/taxa_color.tsv
  2. Outputs will be saved in the output directory with the user-specified name.

Running metabolism

NOTE: The only metabolic pathways that are available to analyze are in the KEGG, COG, PFAM, and EC_NUMBER databases.

Map metabolic pathways to the scaffold containing them and get the counts of each pathway per bin

  1. One requirement must be met before running the metabolic_profile.py script: each of the IMG annotation files must be downloaded (see "Creating IMG annotation" for more information) and contain the "Original_Contig_Name" column, created by merging the IMG file with a user-created mapping file containing the columns "IMG_Contig_Name" and "Original_Contig_Name" (a merge sketch follows the imganno example below).

  2. To get the pathways mapped to each scaffold and the number of times each pathway is found within a bin, run the metabolic_profile.py script. These are the arguments for this script:

usage: metabolic_profile.py [-h] [-i IMGANNO_FILE] [-p IMGANNO_PATH] -m MAPPING -d DATABASE [-c CONSISTENCY]

optional arguments:
  -h, --help            show this help message and exit
  -i IMGANNO_FILE, --imganno_file IMGANNO_FILE
                        Input file or files in tsv format. Rows are genes columns are IMG annotations.
  -p IMGANNO_PATH, --imganno_path IMGANNO_PATH
                        Input path containing only the IMG files. Rows are genes columns are IMG annotations in each file.
  -m MAPPING, --mapping MAPPING
                        A tsv file containing original contig name, sample, and bin columns. Created from the files_prep.py script.
  -d DATABASE, --database DATABASE
                        Database(s) of interest to merge together. Input as a list.
  -c CONSISTENCY, --consistency CONSISTENCY
                        Boolean value that determines if scaffolds not containing a value for each database should be kept. Leave blank if consistency check is not needed.

Example of imganno file:

head imganno1.tsv
Original_Contig_Name IMG_Contig_Name Locus_Tag IMG_Gene_ID Gene_Type Gene_Start Gene_Stop Gene_Length Homolog_Gene_ID Homolog_Taxon_ID Lineage_%ID
S1A_scaffold_1_c2 Ga0224507_10000001 Ga0224507_100000011 Ga0224507_10000001.1 CDS 2 805 803 641316135 641228499 65.06 Archaea;Thaumarchaeota;unclass
S1A_scaffold_1_c2 Ga0224507_10000001 Ga0224507_100000012 Ga0224507_10000001.3 CDS 943 1254 311 2518759324 2518645532 79.61
S1A_scaffold_1_c2 Ga0224507_10000001 Ga0224507_100000013 Ga0224507_10000001.5 CDS 1243 1911 668 2518890842 2518645576 82.81
S1A_scaffold_1_c2 Ga0224507_10000001 Ga0224507_100000014 Ga0224507_10000001.7 CDS 1945 3255 1310 641316132 641228499 88.97
S1A_scaffold_1_c2 Ga0224507_10000001 Ga0224507_100000015 Ga0224507_10000001.9 CDS 3248 3595 347 2518890840 2518645576 72.17
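
If an IMG annotation file lacks the "Original_Contig_Name" column, a minimal merge sketch along these lines could add it (assuming pandas; the file names are illustrative, and the mapping file is the user-created one described in step 1):

import pandas as pd

img = pd.read_csv("imganno1_raw.tsv", sep="\t")          # hypothetical IMG file without Original_Contig_Name
name_map = pd.read_csv("contig_name_map.tsv", sep="\t")  # columns: IMG_Contig_Name, Original_Contig_Name

# Attach Original_Contig_Name to every annotation row by IMG contig name
merged = name_map.merge(img, on="IMG_Contig_Name", how="right")
merged.to_csv("imganno1.tsv", sep="\t", index=False)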

Example of mapping file:

head mapping_file.tsv
Original_Contig_Name Bin Sampling_Site
S1B_scaffold_1283_c1 13_May_SF_Bin93 S1B
S1B_scaffold_1516_c1 13_May_SF_Bin93 S1B
S1B_scaffold_1603_c1 13_May_SF_Bin93 S1B
S1B_scaffold_1628_c1 13_May_SF_Bin93 S1B

Here is an example of how to run this script with a single IMG annotation file:

python3 metabolic_profile.py -i imgannotation.tsv -m ../../data/example_input_files/bin_abundance/mapping_file.tsv -d [KEGG, COG, PFAM, EC_NUMBER] -c True

Here is an example of how to run this script with multiple IMG annotation files:

python3 metabolic_profile.py -p /img_annofiles/ -m ../../data/example_input_files/bin_abundance/mapping_file.tsv -d [KEGG, COG, PFAM, EC_NUMBER] -c True
  3. Outputs will be saved in the output directory with the name "mapped_scaffolds.tsv" and with the database name followed by "complete_metabolic_profile.tsv". NOTE: Old files will NOT be removed, as each output is uniquely named.

Extract specific pathways to analyze

  1. One requirement must be met before running the pathway_extraction.py script: the user-defined database must be created (see "Defining metabolic database" for more information).

  2. To extract and analyze specific metabolic pathways, run the pathway_extraction.py script. These are the arguments for the script:

usage: pathway_extraction.py [-h] -c CUSTOMDATA -p PATHWAY_DATABASE -d DATABASE

optional arguments:
  -h, --help            show this help message and exit
  -c CUSTOMDATA, --customdata CUSTOMDATA
                        Input file in tsv format with at least one column with the header containing the pathways of interest. The headers of the metabolic pathway must correspond with its
                        respective database (eg. KEGG, COG, PFAM, or EC_NUMBER).
  -p PATHWAY_DATABASE, --pathway_database PATHWAY_DATABASE
                        Input file in tsv format that contains the counts of each pathways per bin.
  -d DATABASE, --database DATABASE
                        Database(s) of interest to merge together. Input as a list.

Example of customdata file (can have other descriptive columns if needed):

head custom_database.tsv
Pathway Subpathway KEGG COG PFAM Description
Nitrogen Annamox K20932 COG3391 PF18582 K20932; hydrazine synthase subunit [EC:1.7.2.7]
Nitrogen Annamox K20933 COG1858 K20933; hydrazine synthase subunit [EC:1.7.2.7]
Nitrogen Annamox K20934 COG0823 K20934; hydrazine synthase subunit [EC:1.7.2.7]
Nitrogen Annamox K20935 hdh; hydrazine dehydrogenase [EC:1.7.2.8]

Example of pathway_database file:

head final_merged_metabolic_profile.tsv
KEGG 13_Jan_SF_Bin1 13_Jan_SF_Bin10 13_Jan_SF_Bin12 13_Jan_SF_Bin13 13_Jan_SF_Bin14 13_Jan_SF_Bin2 13_Jan_SF_Bin21
KO:K00001 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00002 1.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00003 2.0 0.0 1.0 1.0 0.0 0.0 1.0
KO:K00004 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00005 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00006 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00008 0.0 0.0 0.0 0.0 0.0 0.0 0.0
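
Conceptually, the extraction keeps the profile rows whose IDs appear in the custom database. The following is only a rough sketch of that idea (assuming pandas; IDs may need the "KO:" prefix added or stripped to match, as the examples above suggest), not the script's actual logic:

import pandas as pd

custom = pd.read_csv("custom_database.tsv", sep="\t")
profile = pd.read_csv("final_merged_metabolic_profile.tsv", sep="\t")

# Normalize the custom KEGG IDs to the IMG-style "KO:K12345" format
wanted = set("KO:" + custom["KEGG"].dropna().astype(str))
extracted = profile[profile["KEGG"].isin(wanted)]
extracted.to_csv("extracted_pathways_sketch.tsv", sep="\t", index=False)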

Here is an example of how to run this script:

python3 pathway_extraction.py -c user-defined_database.tsv -p ../../output/final_merged_metabolic_profile.tsv -d [KEGG, COG, PFAM, EC_NUMBER]
  3. Outputs will be placed in the output directory with the name "extracted_pathways.tsv".

Running host-virus

Compare phage and host metabolisms to identify commonalities

  1. There are two requirements needed to run this script: 1) the “database_counts” (complete_metabolic_profile.tsv) file outputted using the metabolic_profile script from the “metabolism” portion of this pipeline and 2) the Vibrant output file mapping scaffolds to metabolic pathways.

  2. To analyze the unique and shared metabolic pathways between the two organisms, run the amg_hostvi.py script. These are the arguments for the script:

usage: amg_hostvi.py [-h] -p PATHWAY_DATABASE [-vf VIBRANT_FILE] [-vp VIBRANT_PATH] -d DATABASE

optional arguments:
  -h, --help            show this help message and exit
  -p PATHWAY_DATABASE, --pathway_database PATHWAY_DATABASE
                        Input file in tsv format that contains the counts of each pathway per bin.
  -vf VIBRANT_FILE, --vibrant_file VIBRANT_FILE
                        Input file from Vibrant in tsv format that contains phage scaffolds mapped to metabolic pathways.
  -vp VIBRANT_PATH, --vibrant_path VIBRANT_PATH
                        Input path from Vibrant files in tsv format that contains phage scaffolds mapped to metabolic pathways.
  -d DATABASE, --database DATABASE
                        Input only one of the following database names the user is interested in analyzing: KEGG or PFAM.

Example of pathway_database file:

head final_merged_metabolic_profile.tsv
KEGG 13_Jan_SF_Bin1 13_Jan_SF_Bin10 13_Jan_SF_Bin12 13_Jan_SF_Bin13 13_Jan_SF_Bin14 13_Jan_SF_Bin2 13_Jan_SF_Bin21
KO:K00001 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00002 1.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00003 2.0 0.0 1.0 1.0 0.0 0.0 1.0
KO:K00004 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00005 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00006 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00008 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Example of vibrant_output file:

head vibrant_output.tsv
protein scaffold AMG KO AMG KO name Pfam Pfam name
scaffold_1527_10 scaffold_1527 K03639 "moaA, CNX2; GTP 3',8-cyclase [EC:4.1.99.22]" PF04055.21 Radical SAM superfamily
scaffold_1527_19 scaffold_1527 K01939 "purA, ADSS; adenylosuccinate synthase [EC:6.3.4.4]" PF00709.21 Adenylosuccinate synthetase
scaffold_224_117 scaffold_224 K01939 "purA, ADSS; adenylosuccinate synthase [EC:6.3.4.4]" PF00709.21 Adenylosuccinate synthetase
scaffold_224_158 scaffold_224 K22227 "ahbD; heme synthase" PF13186.6 Iron-sulfur cluster-binding domain
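
For intuition, the comparison amounts to intersecting the phage AMG KOs with the KOs present in a host bin. Here is a simplified sketch (assuming pandas; the bin name is hypothetical and the column names follow the examples above; this is not the amg_hostvi.py internals):

import pandas as pd

profile = pd.read_csv("complete_metabolic_profile.tsv", sep="\t")
vibrant = pd.read_csv("vibrant_output.tsv", sep="\t")

# Normalize the Vibrant KOs to the profile's "KO:K12345" format
phage_kos = set("KO:" + vibrant["AMG KO"].dropna().astype(str))

bin_name = "13_Jan_SF_Bin1"  # hypothetical bin of interest
host_kos = set(profile.loc[profile[bin_name] > 0, "KEGG"])

print("Shared pathways:", sorted(host_kos & phage_kos))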

Here is an example of how to run this script with one file:

python3 amg_hostvi.py -p ../../output/complete_metabolic_profile.tsv -vf ~/vibrant_output.tsv -d KEGG

Here is an example of how to run this script with multiple files in a directory:

python3 amg_hostvi.py -p ../../output/complete_metabolic_profile.tsv -vp ~/vibrant_outputs/ -d KEGG
  3. Outputs will be placed in the output directory with the names "phage_host_metabolism.tsv" and "phage_host_mapping.tsv".

Additional commands

Download fastqs

Follow the download instructions found on IMG.

Mapping bins to sample sites

The user must manually map each bin to its respective sample location.
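
The site assignments cannot be automated, but a sketch like the following (assuming each bin is a single .fna file in one directory, with an illustrative directory name) can write the skeleton of bin_sample.tsv so that only the "Sampling_Site" column has to be filled in by hand:

import pathlib

import pandas as pd

bins = sorted(p.stem for p in pathlib.Path("fna_files").glob("*.fna"))  # hypothetical directory
pd.DataFrame({"Bin": bins, "Sampling_Site": ""}).to_csv("bin_sample.tsv", sep="\t", index=False)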

Download and annotate genomes using IMG

Annotate metagenomes using the IMG JGI platform.

After completion, you will download a directory for each of your metagenomic samples containing the following subdirectories:

  1. IMG Data: containing the following files
207502.assembled.COG
207502.assembled.crisprs
207502.assembled.EC
207502.assembled.faa
207502.assembled.fna
207502.assembled.gff
207502.assembled.KO
207502.assembled.last.blasttab
207502.assembled.names_map
207502.assembled.Pfam.blout
207502.assembled.phylodist
207502.assembled.product_names
207502.assembled.unsorted.Pfam.hmmout
3300033388
  2. Metagenome Report Tables: containing the following files
Table_6_-_207017.annotation_parameters.txt
Table_7_-_3300033141.metagenome_properties.txt
Table_8_-_3300033141.taxonomic_composition.txt
Table_9_-_3300033141.functional_diversity.txt

Create initial depth files

After you have annotated and downloaded your genomes using IMG’s JGI platform, there are two possible methods to create the initial depth txt files. The following two methods use BWA and JGI, respectively.

BWA method

Here is an example of how to create the depth txt files using BWA:

  1. Prep samples one at a time for mapping.
#Copy fq files from yggshare to folder
cp 4998_22_23_H2_S10_fastp_bb.fastq.gz /home/keappler/07_Lab_Protocols/Binning/Guaymas_2018_Samples/mapping_bwa/

#Run md5sum
for i in *.gz ; do md5sum $i > $i.md5sum ; done
  2. Run the script; an example for one sample is below.
#!/bin/bash
#STARTED 10:39 4/5/2021
fq=/home/keappler/07_Lab_Protocols/Binning/Guaymas_2018_Samples/mapping_bwa/4998_22_23_H2_S10_fastp_bb.fastq.gz
assembly=/home/keappler/07_Lab_Protocols/Binning/Guaymas_2018_Samples/mapping_bwa/2.5kb/renamed/2.5kb_4998_22_23_H2_S10.headers.removed.fa

#Index contigs >=2.5kb assembly subset created in previous steps

bwa index $assembly
echo "Done index"

# Align contigs >=2.5kb against concatenated, interleaved, and trimmed R1 and R2
bwa mem -t 70 $assembly -p $fq > 4998_22_23_H2_S10.sam
echo "Done mapping"

#samtools create bam files- needed for binning
samtools view -bS -F 4 -@10 4998_22_23_H2_S10.sam -o 4998_22_23_H2_S10.bam
echo "Done bam"

#samtools create sorted bam files- needed for jgi_summarize_bam_contig_depths
samtools sort -@10 4998_22_23_H2_S10.bam -o 4998_22_23_H2_S10_sorted.bam
echo "Done sorted bam"

#The file to put the contig by bam depth matrix > depth.txt
#Do not include variance from mean depth along the contig
jgi_summarize_bam_contig_depths  --outputDepth depth.txt 4998_22_23_H2_S10_sorted.bam --noIntraDepthVariance
echo "Done depth file"
  3. Run md5sum on the .sam file and delete it (it is large) before starting the next sample.
for i in *.sam ; do md5sum $i > $i.md5sum ; done
  4. Repeat the process for each sample.

JGI method

The script, jgi_summarize_bam_contig_depths.cpp, must be used to create the depth file needed by files_prep.py and can be found here along with its usage notes.

Here is an example of how to create the depth txt files using JGI:

  1. Index bin.
bwa index 13_Jan_SF_Bin5.fna
  2. BWA MEM to obtain SAM files.
for sample in `cat names.txt`; do bwa mem -t 30 13_Jan_SF_Bin5.fna -p /home/mlangwig/Working_Dir/SF_Bay/fastq/$sample.fastq.gz > $sample.sam; done
  3. Samtools view to convert SAM to BAM.
for sample in `cat names.txt`; do samtools view -bS -F 4 -@10  $sample.sam -o $sample.bam; done
  4. Obtain sorted bams.
for sample in `cat names.txt`; do samtools sort -@10 $sample.bam -o $sample-sorted.bam; done
  5. Use sorted bams to get Cov_ files for mmgenome2.
jgi_summarize_bam_contig_depths --outputDepth depth.txt *sorted.bam --noIntraDepthVariance
  6. Append the "Sampling_Site" prefix to all the scaffolds in a sample. "Sampling_Site" is a column of unique identifiers that identify the sample where each scaffold originates (see the prefix sketch in "Creating bin abundance input files").

  7. Repeat the process for each sample.

Mapping taxa

One way to map taxa to each bin is to use the GTDB-Tk tool found here: https://github.com/Ecogenomics/GTDBTk.

Creating IMG annotation

First, log on to the IMG server found here. Then, download and unzip the tar.gz file. Once you have the IMG annotation files, run the consolidateJGIdata.pl script to consolidate the data obtained from JGI IMG into one file. The script can be found here.

Defining metabolic database

When creating a user-defined database, there are multiple format requirements. The database file should contain one or more of the headers (KEGG, COG, PFAM, EC_NUMBER), depending on which databases the user wants to analyze. All other auxiliary columns will be mapped to the bins and added to the final output file. Lastly, in each of the database columns, format each pathway name as it appears in the IMG annotation file:

KEGG - KO:K12345
COG - COG12345
PFAM - pfam12345
EC_NUMBER - EC:EC12345
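
A small validation sketch (assuming pandas; the regex patterns are inferred from the formats above, and the EC pattern is deliberately loose) can flag entries that do not match:

import pandas as pd

patterns = {
    "KEGG": r"^KO:K\d+$",
    "COG": r"^COG\d+$",
    "PFAM": r"^pfam\d+$",
    "EC_NUMBER": r"^EC:",  # loose: only checks the prefix
}

custom = pd.read_csv("custom_database.tsv", sep="\t")
for col, pattern in patterns.items():
    if col in custom.columns:
        vals = custom[col].dropna().astype(str)
        bad = vals[~vals.str.match(pattern)]
        if not bad.empty:
            print(col, "entries to check:", list(bad))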

Creating Vibrant mapping

To create the Vibrant mapping file, please follow the instructions here. When using Vibrant, place the sample name of where each scaffold is found within the file name between specific characters (e.g., "VIBRANT_AMG_individuals_SF_Jul11_USGS_4_1_scaffold_2kb.tsv", where the sample name is SF_Jul11_USGS_4_1).
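
Assuming the "specific characters" are the fixed "VIBRANT_AMG_individuals_" prefix and the "_scaffold" suffix shown in the example, a hedged sketch of recovering the sample name from a file name would be:

import re

fname = "VIBRANT_AMG_individuals_SF_Jul11_USGS_4_1_scaffold_2kb.tsv"
match = re.match(r"VIBRANT_AMG_individuals_(.+?)_scaffold", fname)
if match:
    print(match.group(1))  # SF_Jul11_USGS_4_1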

Support and Development

This pipeline has been developed by the Baker lab at UT Austin. If there are any questions, comments, or concerns, please email us.

Cite

If you find this software useful, please cite us as:

Coming soon!