About MetaGaia
MetaGaia is a pipeline that streamlines the process of calculating bin abundances, identifying the metabolic pathways present in each bin, and finding the metabolic pathways shared between phages and their hosts. Together, these analyses help characterize microbial communities and the dominant microbes present within them. The pipeline is divided into three sections that perform different functions: 1) calculating bin abundances (abundance), 2) identifying metabolic pathways present in each bin (metabolism), and 3) finding commonalities between phages and their hosts (host-virus). Multiple requirements are needed; they are specified in the “MetaGaia requirements” section of this protocol.
MetaGaia installation
The MetaGaia software is available as an open-source package distributed from a GitHub repository. Thus, the natural way to install it is by cloning the repository with the following command:
git clone https://github.com/valdeanda/MetaGaia

Python installation requirements (Linux)
To install Python and the required packages, run:

sudo apt-get install python3
sudo apt install python3-pip
pip install pandas
pip install numpy
pip install seaborn
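Equivalently, the three Python packages can be installed with a single pip command:

pip install pandas numpy seaborn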
Install dependencies with Conda (Linux)
To install all of the above dependencies with Conda using the supplied .yml file, run:

conda env create -f metagaia_conda_noarch.yml
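After the environment is created, activate it before running any MetaGaia scripts. The environment name below is an assumption; use the name defined inside metagaia_conda_noarch.yml:

conda activate metagaia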
MetaGaia requirements
NOTE: Please do NOT use symbolic links for fastq or fna files. Also, make sure headers are present and properly separated in input files.
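Since symbolic links are not supported, one quick way to confirm that an input directory contains none is, for example:

#list any symbolic links in a directory; this should print nothing
find ~/fastq_files/ -type l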
Calculating bin abundances (abundance)
- All necessary fastq files have been downloaded and placed within the same directory. All files with the .fastq.gz extension in a directory will be analyzed.
- All necessary fna files have been downloaded and placed within the same directory.
- All necessary depth text files have been downloaded and placed within the same directory or a concatenated depth file has already been created.
- A file has been created that maps each bin to its respective sample site (see “Mapping bins to sample sites” for more information).
- The GTDB-Tk tool has been installed to map each bin to its respective taxa (see “Mapping taxa” for more information).
Identifying metabolic pathways (metabolism)
- The mapping_file.tsv file created in the “abundance” section of the pipeline.
- An account on the Integrated Microbial Genomes with Microbiome Samples (IMG) system, with the IMG annotation of all of the samples being analyzed.
- A file used to combine the annotation files.
- A user-created file that contains the KEGG, COG, PFAM, and/or EC_NUMBER pathways of interest.
Finding commonalities between phages and their hosts (host-virus)
- The “database_counts” (complete_metabolic_profile.tsv) file from the metabolic_profile.py script that contains only one of the following databases: KEGG, COG, PFAM, or EC_NUMBER.
- The output from Vibrant that maps phage scaffolds to metabolic pathways.
Running abundance
Creating bin abundance input files
NOTE: To create the bin_sample file, read the “Mapping bins to sample sites” section below. To create the depth file, read the “Create initial depth files” section below as well.
- To create all the input files necessary, run the files_prep.py script. Two input files are necessary (a concatenated depth file is optional), along with the directories where the fastq and fna (and, optionally, depth) files are located (see “Download fastqs” for more information on where to obtain these). All fastq files must end with “.fastq.gz”. One of the two depth arguments must be provided: depth or depth_dir. These are the arguments necessary to run this script:
usage: files_prep.py [-h] -b BIN_SAMPLES [-d DEPTH] [-f DEPTH_DIR] -q FASTQ_DIR -a FNA_DIR
optional arguments:
-h, --help show this help message and exit
-b BIN_SAMPLES, --bin_samples BIN_SAMPLES
Input file containing 2 columns: 1st column contains bin names and 2nd column contains sample names. Requires header for each column, "Bin" and "Sampling_Site"
respectively. File must have tsv, csv, or txt file extension.
-d DEPTH, --depth DEPTH
Input file containing 5 columns: contigName, contigLen, totalAvgDepth, Sample_Depth, and Depth. This file can be created with jgi_summarize_bam_contig_depths. File must
have tsv, csv, or txt file extension.
-f DEPTH_DIR, --depth_dir DEPTH_DIR
Input directory path containing all the depth txt files that must be concatenated. No other txt files should be present within the directory!
-q FASTQ_DIR, --fastq_dir FASTQ_DIR
Input directory path containing all the fastq files.
-a FNA_DIR, --fna_dir FNA_DIR
Input directory path containing all the fna files.

Example of a concatenated depth file (NOTE: Please make sure the scaffold names in the depth file(s) contain the unique identifier corresponding to the bin that each scaffold originates from. The Sample names in the depth file may need to be edited by the user to match the Sample names in reads_file.tsv. Also, the “Original_Contig_Name” column in both the depth and mapping files must match; append the respective “Sampling_Site” prefix to each scaffold in the depth file before running the script [see “Create initial depth files” for more information]. WARNING: Samples without depth values are removed if present.):
head depth_file.tsv

| contigName | contigLen | TotalAvgDepth | SF_Jul11_USGS_4_1-sorted.bam |
|---|---|---|---|
| S4B_scaffold_41967_c1 | 2430.0 | 15.7917 | 15.7917 |
| S4B_scaffold_22085_c1 | 3249.0 | 12.081 | 12.081 |
| S4B_scaffold_39665_c1 | 2489.0 | 9.998289999999999 | 9.998289999999999 |
| S4B_scaffold_36461_c1 | 2583.0 | 8.70695 | 8.70695 |
| S4B_scaffold_62858_c1 | 2036.0 | 11.1448 | 11.1448 |
| S4B_scaffold_47653_c1 | 2298.0 | 10.1196 | 10.1196 |
| S4B_scaffold_28262_c1 | 2900.0 | 8.95927 | 8.95927 |
Example of bin_sample file (“Sampling_Site” is a column with unique identifiers that identifies the sample where each scaffold originates)
head bin_sample.tsv

| Bin | Sampling_Site |
|---|---|
| NoBin | S0A |
| 13_Jan_SF_Bin44 | S4B |
| 14_Jan_SF_Bin36 | S5A |
| 21_Jan_SF_Bin21 | S2C |
| 24_Jan_SF_Bin43 | S4D |
Here is an example of how to run this script:
python3 files_prep.py -b ../../data/example_input_files/files_prep/bin_sample.tsv -d ../../data/example_input_files/files_prep/depth_file.tsv -q ~/fastq_files/ -a ~/fna_files/

or (if depth files are not already concatenated):
python3 files_prep.py -b ../../data/example_input_files/files_prep/bin_sample.tsv -f ~/depth_files/ -q ~/fastq_files/ -a ~/fna_files/

There should be four files named “reads_file.tsv” (Sample names may need to be edited by the user), “mapping_file.tsv”, “depth_file.tsv”, and “binsize_file.tsv” saved in the output folder. From this output, you need to rename reads_file.tsv and depth_file.tsv. These files will look like this:
head reads_file.tsv

| Sample | Reads |
|---|---|
| /home/mlangwig/Working_Dir/SF_Bay/fastq/SF_Jul11_sed_USGS_4_1_filter-METAGENOME.fastq.gz | 135081488 |
| /home/mlangwig/Working_Dir/SF_Bay/fastq/SF_Jan12_sed_USGS_4_1_filter-METAGENOME.fastq.gz | 140529442 |
head depth_file.tsv

| Original_Contig_Name | contigLen | totalAvgDepth | Sample | Depth |
|---|---|---|---|---|
| S4C_scaffold_7529_c1 | 6042.0 | 23.1232 | SF_Jul11_USGS_4_1-sorted.bam | 23.1232 |
| S4C_scaffold_8778_c1 | 5493.0 | 9.78027 | SF_Jul11_USGS_4_1-sorted.bam | 9.78027 |
- Rename reads_file.tsv and depth_file.tsv so that the Sample column in reads_file.tsv matches the Sample column in depth_file.tsv. We could not automate this because different projects have unique and specific sample names:
head reads_file_renamed.tsv

| Sample | Reads |
|---|---|
| SF_Jul11_sed_USGS_4_1 | 135081488 |
| SF_Jan12_sed_USGS_4_1 | 140529442 |
head depth_file_renamed.tsv

| Original_Contig_Name | contigLen | totalAvgDepth | Sample | Depth |
|---|---|---|---|---|
| S4C_scaffold_7529_c1 | 6042.0 | 23.1232 | SF_Jul11_USGS_4_1 | 23.1232 |
| S4C_scaffold_8778_c1 | 5493.0 | 9.78027 | SF_Jul11_USGS_4_1 | 9.78027 |
The following examples show how you could potentially rename these files:
To rename the Sample column in the reads_file.tsv, for example:
sed 's/\/fastq\//\t/g' reads_file.tsv | cut -f2,3 | sed 's/_filter-METAGENOME.fastq.gz//g' | sed 's/Reads/Sample\tReads/g' | sed 's/sed_//g' > reads_file_renamed.tsv

To rename the Sample column in the depth_file.tsv, for example:
- View unique names that need to be renamed
cut -f3 depth_file.tsv | sort | uniq

- Rename
sed 's/-sorted.bam//g' depth_file.tsv | sed 's/\.fasta.2000kb//g' > depth_file_renamed.tsv

- Make sure the names in the Sample column of depth_file_renamed.tsv match the names in the Sample column of reads_file_renamed.tsv
cut -f3 depth_file_renamed.tsv | sort | uniq
cut -f1 reads_file_renamed.tsv | sort

Calculate relative bin abundance
- The bin_abundance.py script then takes in a list of input files (produced by running the files_prep.py script) that need to be created beforehand. Here are the descriptions of each of the input files:
- Make sure the headers of the input files are tab-separated.
- Make sure absolute paths to the files are given (no symbolic links).
- Make sure readablenum is large enough to convert all RelativeAbundance values to greater than 1 (for example, a RelativeAbundance of 1.5e-07 multiplied by a readablenum of 100000000 becomes 15).
usage: bin_abundance.py [-h] -r READS -m MAPPING -d DEPTH -s BINSIZE
[-n READABLENUM]
optional arguments:
-h, --help show this help message and exit
-r READS, --reads READS
Total number of reads. Rows are samples column are
reads
-m MAPPING, --mapping MAPPING
Tabular file containing Original_Contig_Name Bin Sample
-d DEPTH, --depth DEPTH
Tabular file depth info Original_Contig_Name contigLen
Sample_Depth Depth
-s BINSIZE, --binsize BINSIZE
Tabular file with Bin and corresponding Genome size
(bp) as columns
-n READABLENUM, --readablenum READABLENUM
A large number to multiply the relative abundances so
that it is human readable.

Here is an example of how to run the script (using example data provided in the data folder):
python3 bin_abundance.py -r ../../data/example_input_files/bin_abundance/reads_file_renamed.tsv -m ../../data/example_input_files/bin_abundance/mapping_file.tsv -d ../../data/example_input_files/bin_abundance/depth_file_renamed.tsv -s ../../data/example_input_files/bin_abundance/binsize_file.tsv -n 1000000

- The output will be a file saved in the output directory. This file can then be used in the next script, bin_abundance_viz.py, to visualize the data.
Visualize relative bin abundance
ALL input files for this step should be in either tsv or csv format.
- With the bin_abundance_viz.py script, the files outputted from bin_abundance.py can be used to visualize the relative bin abundances, in addition to a file mapping each bin to its respective taxa (see “Mapping taxa” for more information) and a file mapping each sample to its respective site (see “Mapping sample to site” for more information). These are the arguments needed for the visualization script:
usage: bin_abundance_viz.py [-h] -b BIN_ABUNDANCE -t TAXONOMY_INFO -s
SAMPLE2SITE [-p PERCENT] [-w WIDTH] [-l HEIGHT]
[-d DPI] [-o OUT_FIG] [-c TAXA_COLOR]
optional arguments:
-h, --help show this help message and exit
-b BIN_ABUNDANCE, --bin_abundance BIN_ABUNDANCE
Input tsv file outputted from the bin_abundance.py
script with file extension.
-t TAXONOMY_INFO, --taxonomy_info TAXONOMY_INFO
Input tsv or csv file mapping taxonomy to each bin
with file extension.
-s SAMPLE2SITE, --sample2site SAMPLE2SITE
Input tsv or csv file containing each sample to the
site it was taken from with file extension [""].
-p PERCENT, --percent PERCENT
Percent of highest sample in each bin [10].
-w WIDTH, --width WIDTH
Width of outputted clustermap figure [4].
-l HEIGHT, --height HEIGHT
Height of outputted clustermap figure [5].
-d DPI, --dpi DPI Resolution for output figure file [300].
-o OUT_FIG, --out_fig OUT_FIG
Stores the figure in the specified file path and
format [test.png].
-c TAXA_COLOR, --taxa_color TAXA_COLOR
Input tsv or csv file containing the RGB color code
for each taxa with file extension [""].

Example of bin_abundance file:
head bin_abundance.tsv

| Bin | Sample | Sampling_Site | RelativeAbundance | RelativeAbundanceReadable |
|---|---|---|---|---|
| 13_May_SF_Bin93 | SF_May12_USGS_13 | S1B | 6.6878693485613E-08 | 6.687869348561300 |
| 8_1_Oct_SF_Bin28 | SF_Oct11_USGS_8_1 | S5D | 5.76751058810138E-08 | 5.767510588101380 |
| 4_1_Jan_SF_Bin44 | SF_Jan12_USGS_4_1 | S4A | 1.52873009442108E-07 | 15.28730094421080 |
| 24_July_SF_Bin15 | SF_Jul11_USGS_24 | S3C | 9.88355407641773E-08 | 9.883554076417730 |
Example of taxonomy_info file:
head taxonomy_info.tsv

| Bin | Taxa |
|---|---|
| 13_Jan_SF_Bin36 | Pacearchaeota |
| 13_Jan_SF_Bin40 | Thaumarchaeota_Nitrosopumilus |
| 13_Jan_SF_Bin52 | Thaumarchaeota_Nitrosopumilus |
| 13_Jan_SF_Bin61 | Thermoplasmata |
Example of sample2site file - the Sample names should match the names in the Sample column of reads_file.tsv and depth_file_renamed.tsv; the Site names are the physical location’s name/code that each sample was taken from; each Sample should contain a unique Site name:
head sample2site.tsv

| Sample | Site |
|---|---|
| SF_Jan12_USGS_1 | 21 |
| SF_Jan12_USGS_2 | 24 |
| SF_Jul11_USGS_3 | 13 |
| SF_Jul11_USGS_1 | 21 |
Example of taxa_color file:
head taxa_color.tsv

#ede4d9
#426ca6
#ffdab9
#fff5e5
#ffefd5
Here is an example of how to run this script:
python3 bin_abundance_viz.py -b ../output/bin_abundance_output.tsv -t ~/tax2bin.tsv -p 10 -w 8 -l 5 -d 500 -o figure1.jpg -c ~/taxa_color.tsv

- Outputs will be saved in the output directory with the user-specified name.
Running metabolism
NOTE: The only metabolic pathways that are available to analyze are in the KEGG, COG, PFAM, and EC_NUMBER databases.
Map metabolic pathways to the scaffolds containing them and get the counts of each pathway per bin
One requirement must be met before running the metabolic_profile.py script: each of the IMG annotation files must be downloaded (see “Creating IMG annotation” for more information) and must contain the “Original_Contig_Name” column (created by merging the IMG file with a user-created mapping file containing the columns “IMG_Contig_Name” and “Original_Contig_Name”).

To get the pathways mapped to each scaffold and the number of times each pathway is found within a bin, run the metabolic_profile.py script. These are the arguments for this script:
usage: metabolic_profile.py [-h] [-i IMGANNO_FILE] [-p IMGANNO_PATH] -m MAPPING -d DATABASE [-c CONSISTENCY]
optional arguments:
-h, --help show this help message and exit
-i IMGANNO_FILE, --imganno_file IMGANNO_FILE
Input file or files in tsv format. Rows are genes columns are IMG annotations.
-p IMGANNO_PATH, --imganno_path IMGANNO_PATH
Input path containing only the IMG files. Rows are genes columns are IMG annotations in each file.
-m MAPPING, --mapping MAPPING
A tsv file containing original contig name, sample, and bin columns. Created from the files_prep.py script.
-d DATABASE, --database DATABASE
Database(s) of interest to merge together. Input as a list.
-c CONSISTENCY, --consistency CONSISTENCY
Boolean value that determines if scaffolds not containing a value for each database should be kept. Leave blank if consistency check is not needed.

Example of imganno file:
head imganno1.tsv

| Original_Contig_Name | IMG_Contig_Name | Locus_Tag | IMG_Gene_ID | Gene_Type | Gene_Start | Gene_Stop | Gene_Length | Homolog_Gene_ID | Homolog_Taxon_ID | Lineage_%ID |
|---|---|---|---|---|---|---|---|---|---|---|
| S1A_scaffold_1_c2 | Ga0224507_10000001 | Ga0224507_100000011 | Ga0224507_10000001.1 | CDS | 2 | 805 | 803 | 641316135 | 641228499 | 65.06 |
| S1A_scaffold_1_c2 | Ga0224507_10000001 | Ga0224507_100000012 | Ga0224507_10000001.3 | CDS | 943 | 1254 | 311 | 2518759324 | 2518645532 | 79.61 |
| S1A_scaffold_1_c2 | Ga0224507_10000001 | Ga0224507_100000013 | Ga0224507_10000001.5 | CDS | 1243 | 1911 | 668 | 2518890842 | 2518645576 | 82.81 |
| S1A_scaffold_1_c2 | Ga0224507_10000001 | Ga0224507_100000014 | Ga0224507_10000001.7 | CDS | 1945 | 3255 | 1310 | 641316132 | 641228499 | 88.97 |
| S1A_scaffold_1_c2 | Ga0224507_10000001 | Ga0224507_100000015 | Ga0224507_10000001.9 | CDS | 3248 | 3595 | 347 | 2518890840 | 2518645576 | 72.17 |
Example of mapping file:
head mapping_file.tsv

| Original_Contig_Name | Bin | Sampling_Site |
|---|---|---|
| S1B_scaffold_1283_c1 | 13_May_SF_Bin93 | S1B |
| S1B_scaffold_1516_c1 | 13_May_SF_Bin93 | S1B |
| S1B_scaffold_1603_c1 | 13_May_SF_Bin93 | S1B |
| S1B_scaffold_1628_c1 | 13_May_SF_Bin93 | S1B |
Here is an example of how to run this script with a single IMG annotation file:
python3 metabolic_profile.py -i imgannotation.tsv -m ../../data/example_input_files/bin_abundance/mapping_file.tsv -d [KEGG, COG, PFAM, EC_NUMBER] -c True

Here is an example of how to run this script with multiple IMG annotation files:
python3 metabolic_profile.py -p /img_annofiles/ -m ../../data/example_input_files/bin_abundance/mapping_file.tsv -d [KEGG, COG, PFAM, EC_NUMBER] -c True

- Outputs will be saved in the output directory with the name formats “mapped_scaffolds.tsv” and the database name followed by “complete_metabolic_profile.tsv”. NOTE: Old files will NOT be removed, as each output is uniquely named.
Extract specific pathways to analyze
One requirement must be met to run the pathway_extraction.py script: the user-defined database must be created (see “Defining metabolic database” for more information).

To extract and analyze specific metabolic pathways, run the pathway_extraction.py script. These are the arguments for the script:
usage: pathway_extraction.py [-h] -c CUSTOMDATA -p PATHWAY_DATABASE -d DATABASE
optional arguments:
-h, --help show this help message and exit
-c CUSTOMDATA, --customdata CUSTOMDATA
Input file in tsv format with at least one column with the header containing the pathways of interest. The headers of the metabolic pathway must correspond with its
respective database (eg. KEGG, COG, PFAM, or EC_NUMBER).
-p PATHWAY_DATABASE, --pathway_database PATHWAY_DATABASE
Input file in tsv format that contains the counts of each pathways per bin.
-d DATABASE, --database DATABASE
Database(s) of interest to merge together. Input as a list.

Example of customdata file (can have other descriptive columns if needed):
head custom_database.tsv

| Pathway | Subpathway | KEGG | COG | PFAM | Description |
|---|---|---|---|---|---|
| Nitrogen | Annamox | K20932 | COG3391 | PF18582 | K20932; hydrazine synthase subunit [EC:1.7.2.7] |
| Nitrogen | Annamox | K20933 | COG1858 | | K20933; hydrazine synthase subunit [EC:1.7.2.7] |
| Nitrogen | Annamox | K20934 | COG0823 | | K20934; hydrazine synthase subunit [EC:1.7.2.7] |
| Nitrogen | Annamox | K20935 | | | hdh; hydrazine dehydrogenase [EC:1.7.2.8] |
Example of pathway_database file:
head final_merged_metabolic_profile.tsv

| KEGG | 13_Jan_SF_Bin1 | 13_Jan_SF_Bin10 | 13_Jan_SF_Bin12 | 13_Jan_SF_Bin13 | 13_Jan_SF_Bin14 | 13_Jan_SF_Bin2 | 13_Jan_SF_Bin21 |
|---|---|---|---|---|---|---|---|
| KO:K00001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| KO:K00002 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| KO:K00003 | 2.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| KO:K00004 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| KO:K00005 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| KO:K00006 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| KO:K00008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Here is an example of how to run this script:
python3 pathway_extraction.py -c user-defined_database.tsv -p ../../output/final_merged_metabolic_profile.tsv -d [KEGG, COG, PFAM, EC_NUMBER]

- Outputs will be placed in the output directory with the name “extracted_pathways.tsv”.
Running host-virus
Compare phage and host metabolisms to identify commonalities
There are two requirements needed to run this script: 1) the “database_counts” (complete_metabolic_profile.tsv) file outputted by the metabolic_profile.py script from the “metabolism” portion of this pipeline, and 2) the Vibrant output file mapping scaffolds to metabolic pathways.

To analyze the unique and shared metabolic pathways between the two organisms, run the amg_hostvi.py script. These are the arguments for the script:
usage: amg_hostvi.py [-h] -p PATHWAY_DATABASE [-vf VIBRANT_FILE] [-vp VIBRANT_PATH] -d DATABASE
optional arguments:
-h, --help show this help message and exit
-p PATHWAY_DATABASE, --pathway_database PATHWAY_DATABASE
Input file in tsv format that contains the counts of each pathway per bin.
-vf VIBRANT_FILE, --vibrant_file VIBRANT_FILE
Input file from Vibrant in tsv format that contains phage scaffolds mapped to metabolic pathways.
-vp VIBRANT_PATH, --vibrant_path VIBRANT_PATH
Input path from Vibrant files in tsv format that contains phage scaffolds mapped to metabolic pathways.
-d DATABASE, --database DATABASE
Input only one of the following database names the user is interested in analyzing: KEGG or PFAM.

Example of pathway_database file:
head final_merged_metabolic_profile.tsv

| KEGG | 13_Jan_SF_Bin1 | 13_Jan_SF_Bin10 | 13_Jan_SF_Bin12 | 13_Jan_SF_Bin13 | 13_Jan_SF_Bin14 | 13_Jan_SF_Bin2 | 13_Jan_SF_Bin21 |
|---|---|---|---|---|---|---|---|
| KO:K00001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| KO:K00002 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| KO:K00003 | 2.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| KO:K00004 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| KO:K00005 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| KO:K00006 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| KO:K00008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Example of vibrant_output file:

head vibrant_output.tsv

| protein | scaffold | AMG KO | AMG KO name | Pfam | Pfam name |
|---|---|---|---|---|---|
| scaffold_1527_10 | scaffold_1527 | K03639 | “moaA, CNX2; GTP 3’,8-cyclase [EC:4.1.99.22]” | PF04055.21 | Radical SAM superfamily |
| scaffold_1527_19 | scaffold_1527 | K01939 | “purA, ADSS; adenylosuccinate synthase [EC:6.3.4.4]” | PF00709.21 | Adenylosuccinate synthetase |
| scaffold_224_117 | scaffold_224 | K01939 | “purA, ADSS; adenylosuccinate synthase [EC:6.3.4.4]” | PF00709.21 | Adenylosuccinate synthetase |
| scaffold_224_158 | scaffold_224 | K22227 | “ahbD; heme synthase” | PF13186.6 | Iron-sulfur cluster-binding domain |
Here is an example of how to run this script with one file:
python3 amg_hostvi.py -p ../../output/complete_metabolic_profile.tsv -vf ~/vibrant_output.tsv -d KEGG

Here is an example of how to run this script with multiple files in a directory:
python3 amg_hostvi.py -p ../../output/complete_metabolic_profile.tsv -vp ~/vibrant_outputs/ -d KEGG

- Outputs will be placed in the output directory with the names “phage_host_metabolism.tsv” and “phage_host_mapping.tsv”.
Additional commands
Download fastqs
Follow the download instructions found on IMG.
Mapping bins to sample sites
The user must manually map each bin to its respective sample location.
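Although the mapping itself must be done by hand, a starting bin_sample file can be seeded from the directory of fna files and the “Sampling_Site” column then filled in manually. A minimal sketch, assuming the bin names match the fna file names used elsewhere in this protocol:

#write the required header, then one row per bin with an empty Sampling_Site column
printf "Bin\tSampling_Site\n" > bin_sample.tsv
for f in ~/fna_files/*.fna ; do printf "%s\t\n" "$(basename "$f" .fna)" >> bin_sample.tsv ; done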
Download and annotate genomes using IMG
Annotate metagenomes on the JGI IMG platform.
After completion, you will download a directory for each of your metagenomic samples containing the following subdirectories:
- IMG Data: containing the following files
207502.assembled.COG
207502.assembled.crisprs
207502.assembled.EC
207502.assembled.faa
207502.assembled.fna
207502.assembled.gff
207502.assembled.KO
207502.assembled.last.blasttab
207502.assembled.names_map
207502.assembled.Pfam.blout
207502.assembled.phylodist
207502.assembled.product_names
207502.assembled.unsorted.Pfam.hmmout
3300033388
- Metagenome Report Tables: containing the following files
Table_6_-_207017.annotation_parameters.txt
Table_7_-_3300033141.metagenome_properties.txt
Table_8_-_3300033141.taxonomic_composition.txt
Table_9_-_3300033141.functional_diversity.txt
Create initial depth files
After you have annotated and downloaded your genomes using IMG’s JGI platform, there are two possible methods to create the initial depth txt files. The following two methods use BWA and JGI, respectively.
BWA method
Here is an example of how to create the depth txt files using BWA:
- Prep samples one at a time for mapping.
#Copy fq files from yggshare to folder
cp 4998_22_23_H2_S10_fastp_bb.fastq.gz /home/keappler/07_Lab_Protocols/Binning/Guaymas_2018_Samples/mapping_bwa/
#Run md5sum
for i in *.gz ; do md5sum $i > $i.md5sum ; done

- Run the example script for one sample below.
#!/bin/bash
#STARTED 10:39 4/5/2021
fq=/home/keappler/07_Lab_Protocols/Binning/Guaymas_2018_Samples/mapping_bwa/4998_22_23_H2_S10_fastp_bb.fastq.gz
assembly=/home/keappler/07_Lab_Protocols/Binning/Guaymas_2018_Samples/mapping_bwa/2.5kb/renamed/2.5kb_4998_22_23_H2_S10.headers.removed.fa
#Index contigs >=2.5kb assembly subset created in previous steps
bwa index $assembly
echo "Done index"
# Align contigs >=2.5kb against concatenated, interleaved, and trimmed R1 and R2
bwa mem -t 70 $assembly -p $fq > 4998_22_23_H2_S10.sam
echo "Done mapping"
#samtools create bam files- needed for binning
samtools view -bS -F 4 -@10 4998_22_23_H2_S10.sam -o 4998_22_23_H2_S10.bam
echo "Done bam"
#samtools create sorted bam files- needed for jgi_summarize_bam_contig_depths
samtools sort -@10 4998_22_23_H2_S10.bam -o 4998_22_23_H2_S10_sorted.bam
echo "Done sorted bam"
#The file to put the contig by bam depth matrix > depth.txt
#Do not include variance from mean depth along the contig
jgi_summarize_bam_contig_depths --outputDepth depth.txt 4998_22_23_H2_S10_sorted.bam --noIntraDepthVariance
echo "Done depth file"- Run md5sum and delete .sam file (large) before starting next sample.
for i in *.sam ; do md5sum $i > $i.md5sum ; done- Repeat process for each sample.
JGI method
The jgi_summarize_bam_contig_depths.cpp script must be used to create the depth file needed by files_prep.py; it can be found here along with its usage notes.
Here is an example of how to create the depth txt files using JGI:
- Index bin.
bwa index 13_Jan_SF_Bin5.fna

- BWA MEM to obtain SAM files.

for sample in `cat names.txt`; do bwa mem -t 30 13_Jan_SF_Bin5.fna -p /home/mlangwig/Working_Dir/SF_Bay/fastq/$sample.fastq.gz > $sample.sam; done

- Samtools view to convert SAM to BAM.

for sample in `cat names.txt`; do samtools view -bS -F 4 -@10 $sample.sam -o $sample.bam; done

- Obtain sorted bams.

for sample in `cat names.txt`; do samtools sort -@10 $sample.bam -o $sample-sorted.bam; done

- Use sorted bams to get Cov_ files for mmgenome2.

jgi_summarize_bam_contig_depths --outputDepth depth.txt *sorted.bam --noIntraDepthVariance

- Append the “Sampling_Site” prefix to all the scaffolds in a sample (see the sketch after this list). “Sampling_Site” is a column with unique identifiers that identifies the sample where each scaffold originates.
- Repeat the process for each sample.
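A minimal sketch of the prefix step, assuming the sample's “Sampling_Site” identifier is S4B and the scaffold names in depth.txt begin with "scaffold_" (adjust both to your data):

#prepend the sampling-site identifier to every scaffold name, leaving the header line untouched
sed '2,$ s/^scaffold_/S4B_scaffold_/' depth.txt > S4B_depth.txt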
Mapping taxa
One way to map taxa to each bin is to use the GTDB-Tk tool found here: https://github.com/Ecogenomics/GTDBTk.
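For example, GTDB-Tk's classify workflow can be run on the directory of bin fna files (a sketch; the paths are assumptions, and GTDB-Tk's reference data must be installed separately):

#classify all bins in the fna directory and write results to gtdbtk_output/
gtdbtk classify_wf --genome_dir ~/fna_files/ --out_dir gtdbtk_output --extension fna --cpus 8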
Creating IMG annotation
First, log on to the IMG server found here. Then, download and unzip the tar.gz file. Once you have the IMG annotation files, run the consolidateJGIdata.pl script to consolidate the data obtained from JGI IMG into one file. The script can be found here.
Defining metabolic database
When creating a user-defined database, there are multiple format requirements. The metabolic databases of interest should contain one or multiple headers (KEGG, COG, PFAM, EC_NUMBER) depending on which database the user wants to analyze. All other auxiliary columns will be mapped to the bins and added to the final outputted file. Lastly, in each of the database columns, please format each pathway name as it is in the IMG annotation file:
KEGG - KO:K12345
COG - COG12345
PFAM - pfam12345
EC_NUMBER - EC:EC12345

Creating Vibrant mapping
To create the Vibrant mapping file, please follow the instructions here. When using Vibrant, please place the sample name where each scaffold is found within the file name, between specific characters (e.g., “VIBRANT_AMG_individuals_SF_Jul11_USGS_4_1_scaffold_2kb.tsv”).
Support and Development
This pipeline has been developed by the Baker lab at UT Austin. If there are any questions, comments, or concerns, please email us at either sahilbshah@berkeley.edu or valdeanda@utexas.edu.
Cite
If you find this software useful, please cite us as:
Coming soon!