Readme MetaGaia

About MetaGaia

MetaGaia is a pipeline that streamlines calculating bin abundances, identifying the metabolic pathways present in each bin, and finding the metabolic pathways shared between phages and their hosts. Performing each of these functions helps characterize microbial communities and the dominant microbes within them. The pipeline is divided into three sections that perform different functions: 1) calculating bin abundances (abundance), 2) identifying metabolic pathways present in each bin (metabolism), and 3) finding commonalities between phages and their hosts (host-virus). Multiple requirements are needed and are specified in the "MetaGaia requirements" section of this protocol.

MetaGaia installation

The MetaGaia software is available as an open-source package distributed from a GitHub repository, so the natural way to install it is to clone the repository with the following command:

git clone https://github.com/valdeanda/MetaGaia

Python installation requirements (Linux)

To install Python 3, pip, and the required Python packages, run the following commands:

sudo apt-get install python3
sudo apt install python3-pip
pip install pandas
pip install numpy
pip install seaborn

Install dependencies with Conda (Linux)

To install all of the above dependencies with Conda using the supplied .yml file, run conda env create -f metagaia_conda_noarch.yml

MetaGaia requirements

NOTE: Please do NOT use symbolic links for fastq or fna files. Also, make sure headers are present and properly separated in input files.

Calculating bin abundances (abundance)

  1. All necessary fastq files have been downloaded and placed within the same directory. All files with the .fastq.gz extension in a directory will be analyzed.
  2. All necessary fna files have been downloaded and placed within the same directory.
  3. All necessary depth text files have been downloaded and placed within the same directory or a concatenated depth file has already been created.
  4. A file has been created that maps each bin to its respective sample site (see “Mapping bins to sample sites” for more information).
  5. The GTDB-Tk tool has been downloaded to map each bin to its respective taxa (see "Mapping taxa" for more information).

Identifying metabolic pathways (metabolism)

  1. The mapping_file.tsv file created in the "abundance" section of the pipeline.
  2. An account with the Integrated Microbial Genomes with Microbiome Samples (IMG) system is needed, along with the IMG annotations of all of the samples being analyzed.
  3. A file used to combine the annotation files (see "Creating IMG annotation" for more information).
  4. A user created file that contains the KEGG, COG, PFAM, and/or EC_NUMBER pathways of interest.

Finding commonalities between phages and their hosts (host-virus)

  1. The “database_counts” (complete_metabolic_profile.tsv) file from the metabolic_profile.py script that contains only one of the following databases: KEGG, COG, PFAM, or EC_NUMBER.
  2. The output from Vibrant that maps phage scaffolds to metabolic pathways.

Running abundance

Creating bin abundance input files

NOTE: To create the bin_sample file, read the "Mapping bins to sample sites" section below. To create the depth file, read the "Create initial depth files" section below as well.

  1. To create all the necessary input files, run the files_prep.py script. Two input files are required (a concatenated depth file is optional), along with the directories where the fastq and fna (and, optionally, depth) files are located (see "Download fastqs" for more information on where to obtain these). All fastq files must end with ".fastq.gz". One of the depth arguments must be present: depth or depth_dir. These are the arguments necessary to run this script:
usage: files_prep.py [-h] -b BIN_SAMPLES [-d DEPTH] [-f DEPTH_DIR] -q FASTQ_DIR -a FNA_DIR

optional arguments:
  -h, --help            show this help message and exit
  -b BIN_SAMPLES, --bin_samples BIN_SAMPLES
                        Input file containing 2 columns: 1st column contains bin names and 2nd column contains sample names. Requires header for each column, "Bin" and "Sampling_Site"
                        respectively. File must have tsv, csv, or txt file extension.
  -d DEPTH, --depth DEPTH
                        Input file containing 5 columns: contigName, contigLen, totalAvgDepth, Sample_Depth, and Depth. This file can be created with jgi_summarize_bam_contig_depths. File must
                        have tsv, csv, or txt file extension.
  -f DEPTH_DIR, --depth_dir DEPTH_DIR
                        Input directory path containing all the depth txt files that must be concatenated. No other txt files should be present within the directory!
  -q FASTQ_DIR, --fastq_dir FASTQ_DIR
                        Input directory path containing all the fastq files.
  -a FNA_DIR, --fna_dir FNA_DIR
                        Input directory path containing all the fna files.

Example of a concatenated depth file. NOTE: Make sure the scaffold names in the depth file(s) contain the unique identifier corresponding to the bin that each scaffold originates from. The Sample names in the depth file may need to be edited by the user to match the Sample names in reads_file.tsv. Also, the "Original_Contig_Name" column in both the depth and mapping files must match; append the respective "Sampling_Site" prefix to each scaffold in the depth file before running the script (see "Create initial depth files" for more information, and the prefix sketch after the example below). WARNING: Samples without depth values are removed if present.

head depth_file.tsv
contigName contigLen TotalAvgDepth SF_Jul11_USGS_4_1-sorted.bam
S4B_scaffold_41967_c1 2430.0 15.7917 15.7917
S4B_scaffold_22085_c1 3249.0 12.081 12.081
S4B_scaffold_39665_c1 2489.0 9.998289999999999 9.998289999999999
S4B_scaffold_36461_c1 2583.0 8.70695 8.70695
S4B_scaffold_62858_c1 2036.0 11.1448 11.1448
S4B_scaffold_47653_c1 2298.0 10.1196 10.1196
S4B_scaffold_28262_c1 2900.0 8.95927 8.95927
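
If the prefix still needs to be added, a minimal sketch like the one below can do it, assuming pandas is installed and using an illustrative per-sample file name and site identifier:

import pandas as pd

site_prefix = "S4B"  # hypothetical Sampling_Site identifier for this sample
depth = pd.read_csv("depth_sample.txt", sep="\t")  # hypothetical per-sample depth file

# Prepend the site identifier only where it is not already present
needs_prefix = ~depth["contigName"].str.startswith(site_prefix + "_")
depth.loc[needs_prefix, "contigName"] = site_prefix + "_" + depth.loc[needs_prefix, "contigName"]

depth.to_csv("depth_sample_prefixed.txt", sep="\t", index=False)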

Example of bin_sample file ("Sampling_Site" is a column of unique identifiers that identify the sample where each scaffold originates):

head bin_sample.tsv
Bin Sampling_Site
NoBin S0A
13_Jan_SF_Bin44 S4B
14_Jan_SF_Bin36 S5A
21_Jan_SF_Bin21 S2C
24_Jan_SF_Bin43 S4D

Here is an example of how to run this script:

python3 files_prep.py -b ../../data/example_input_files/files_prep/bin_sample.tsv -d ../../data/example_input_files/files_prep/depth_file.tsv -q ~/fastq_files/ -a ~/fna_files/

or (if depth files are not already concatenated):

python3 files_prep.py -b ../../data/example_input_files/files_prep/bin_sample.tsv -f ~/depth_files/ -q ~/fastq_files/ -a ~/fna_files/
  2. There should be four files named "reads_file.tsv" (Sample names may need to be edited by the user), "mapping_file.tsv", "depth_file.tsv", and "binsize_file.tsv" saved in the output folder.

  3. From this output, you need to rename reads_file.tsv and depth_file.tsv. These files will look like this:

head reads_file.tsv
Sample Reads
/home/mlangwig/Working_Dir/SF_Bay/fastq/SF_Jul11_sed_USGS_4_1_filter-METAGENOME.fastq.gz 135081488
/home/mlangwig/Working_Dir/SF_Bay/fastq/SF_Jan12_sed_USGS_4_1_filter-METAGENOME.fastq.gz 140529442
head depth_file_renamed.tsv
Original_Contig_Name contigLen totalAvgDepth Sample Depth
S4C_scaffold_7529_c1 6042.0 23.1232 SF_Jul11_USGS_4_1-sorted.bam 23.1232
S4C_scaffold_8778_c1 5493.0 9.78027 SF_Jul11_USGS_4_1-sorted.bam 9.78027
  4. Rename reads_file.tsv and depth_file.tsv so that the Sample column in reads_file.tsv matches the Sample column in depth_file.tsv. We could not automate this because different projects have unique and specific sample names:
head reads_file_renamed.tsv
Sample Reads
SF_Jul11_sed_USGS_4_1 135081488
SF_Jan12_sed_USGS_4_1 140529442
head depth_file_renamed.tsv
Original_Contig_Name contigLen totalAvgDepth Sample Depth
S4C_scaffold_7529_c1 6042.0 23.1232 SF_Jul11_USGS_4_1 23.1232
S4C_scaffold_8778_c1 5493.0 9.78027 SF_Jul11_USGS_4_1 9.78027

The following examples show how you could potentially rename these files:

To rename the Sample column in the reads_file.tsv, for example:

sed 's/\/fastq\//\t/g' reads_file.tsv | cut -f2,3 | sed 's/_filter-METAGENOME.fastq.gz//g' | sed 's/Reads/Sample\tReads/g' | sed 's/sed_//g' > reads_file_renamed.tsv

To rename the Sample column in the depth_file.tsv, example:

  1. View the unique names that need to be renamed:
cut -f3 depth_file.tsv | sort | uniq
  2. Rename:
sed 's/-sorted.bam//g' depth_file.tsv | sed 's/\.fasta.2000kb//g' > depth_file_renamed.tsv
  3. Make sure the names in the Sample column of depth_file_renamed.tsv match the names in the Sample column of reads_file_renamed.tsv:
cut -f3 depth_file_renamed.tsv | sort | uniq

cut -f1 reads_file_renamed.tsv | sort
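
As a final sanity check, a short sketch like this (assuming pandas; file names as above) compares the two Sample columns and prints any names that still disagree:

import pandas as pd

reads = pd.read_csv("reads_file_renamed.tsv", sep="\t")
depth = pd.read_csv("depth_file_renamed.tsv", sep="\t")

# Any name printed here still needs renaming on one side or the other
print("In depth but not in reads:", set(depth["Sample"]) - set(reads["Sample"]))
print("In reads but not in depth:", set(reads["Sample"]) - set(depth["Sample"]))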

Calculate relative bin abundance

  1. The bin_abundance.py script takes a list of input files (produced by running the files_prep.py script) that need to be created beforehand. The usage below describes each input file. A few reminders:
    • make sure the headers of the input files are tab separated
    • make sure you give absolute paths to files (no symbolic links)
    • make sure readablenum is large enough to convert all RelativeAbundance values to greater than 1 (a check sketch follows step 2 below)
usage: bin_abundance.py [-h] -r READS -m MAPPING -d DEPTH -s BINSIZE
                        [-n READABLENUM]

optional arguments:
  -h, --help            show this help message and exit
  -r READS, --reads READS
                        Total number of reads. Rows are samples column are
                        reads
  -m MAPPING, --mapping MAPPING
                        Tabular file containing Original_Contig_Name Bin Sample
  -d DEPTH, --depth DEPTH
                        Tabular file depth info Original_Contig_Name contigLen
                        Sample_Depth Depth
  -s BINSIZE, --binsize BINSIZE
                        Tabular file with Bin and corresponding Genome size
                        (bp) as columns
  -n READABLENUM, --readablenum READABLENUM
                        A large number to multiply the relative abundances so
                        that it is human readable.

Here is an example of how to run the script (using example data provided in the data folder):

python3 bin_abundance.py -r ../../data/example_input_files/bin_abundance/reads_file_renamed.tsv -m ../../data/example_input_files/bin_abundance/mapping_file.tsv -d ../../data/example_input_files/bin_abundance/depth_file_renamed.tsv -s ../../data/example_input_files/bin_abundance/binsize_file.tsv -n 1000000
  2. The output will be a file saved in the output directory. This file can then be used with the next script, bin_abundance_viz.py, to visualize the data.
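
To confirm that readablenum was large enough, a quick check along these lines may help (a sketch assuming pandas; the output file name is taken from the visualization example below, and the column name from the bin_abundance example):

import pandas as pd

abund = pd.read_csv("../output/bin_abundance_output.tsv", sep="\t")
too_small = abund[abund["RelativeAbundanceReadable"] <= 1]
if not too_small.empty:
    # Rerun bin_abundance.py with a larger -n value
    print(len(too_small), "rows have RelativeAbundanceReadable <= 1")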

Visualize relative bin abundance

ALL input files for this step should be in either tsv or csv format.

  1. With the bin_abundance_viz.py script, the file outputted from bin_abundance.py can be used to visualize the relative bin abundances, together with a file mapping each bin to its respective taxa (see "Mapping taxa" for more information) and a file mapping each sample to its respective site (see the sample2site example below). These are the arguments needed for the visualization script:
usage: bin_abundance_viz.py [-h] -b BIN_ABUNDANCE -t TAXONOMY_INFO -s
                            SAMPLE2SITE [-p PERCENT] [-w WIDTH] [-l HEIGHT]
                            [-d DPI] [-o OUT_FIG] [-c TAXA_COLOR]

optional arguments:
  -h, --help            show this help message and exit
  -b BIN_ABUNDANCE, --bin_abundance BIN_ABUNDANCE
                        Input tsv file outputted from the bin_abundance.py
                        script with file extension.
  -t TAXONOMY_INFO, --taxonomy_info TAXONOMY_INFO
                        Input tsv or csv file mapping taxonomy to each bin
                        with file extension.
  -s SAMPLE2SITE, --sample2site SAMPLE2SITE
                        Input tsv or csv file containing each sample to the
                        site it was taken from with file extension [""].
  -p PERCENT, --percent PERCENT
                        Percent of highest sample in each bin [10].
  -w WIDTH, --width WIDTH
                        Width of outputted clustermap figure [4].
  -l HEIGHT, --height HEIGHT
                        Height of outputted clustermap figure [5].
  -d DPI, --dpi DPI     Resolution for output figure file [300].
  -o OUT_FIG, --out_fig OUT_FIG
                        Stores the figure in the specified file path and
                        format [test.png].
  -c TAXA_COLOR, --taxa_color TAXA_COLOR
                        Input tsv or csv file containing the RGB color code
                        for each taxa with file extension [""].

Example of bin_abundance file:

head bin_abundance.tsv
Bin Sample Sampling_Site RelativeAbundance RelativeAbundanceReadable
13_May_SF_Bin93 SF_May12_USGS_13 S1B 6.6878693485613E-08 6.687869348561300
8_1_Oct_SF_Bin28 SF_Oct11_USGS_8_1 S5D 5.76751058810138E-08 5.767510588101380
4_1_Jan_SF_Bin44 SF_Jan12_USGS_4_1 S4A 1.52873009442108E-07 15.28730094421080
24_July_SF_Bin15 SF_Jul11_USGS_24 S3C 9.88355407641773E-08 9.883554076417730

Example of taxonomy_info file:

head taxonomy_info.tsv
Bin Taxa
13_Jan_SF_Bin36 Pacearchaeota
13_Jan_SF_Bin40 Thaumarchaeota_Nitrosopumilus
13_Jan_SF_Bin52 Thaumarchaeota_Nitrosopumilus
13_Jan_SF_Bin61 Thermoplasmata

Example of sample2site file - the Sample names should match the names in the Sample column of reads_file.tsv and depth_file_renamed.tsv; the Site names are the physical location’s name/code that each sample was taken from; each Sample should contain a unique Site name:

head sample2site.tsv
Sample Site
SF_Jan12_USGS_1 21
SF_Jan12_USGS_2 24
SF_Jul11_USGS_3 13
SF_Jul11_USGS_1 21

Example of taxa_color file:

head taxa_color.tsv
#a3c1ad
#ede4d9
#426ca6
#ffdab9
#fff5e5
#ffefd5
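
One possible way to generate such a file is to derive one hex code per taxon from the taxonomy_info file; this is only a sketch, using the seaborn dependency and an arbitrary palette name:

import pandas as pd
import seaborn as sns

taxa = pd.read_csv("taxonomy_info.tsv", sep="\t")

# One hex RGB code per unique taxon, one code per line, no header
hex_codes = sns.color_palette("husl", taxa["Taxa"].nunique()).as_hex()
pd.Series(hex_codes).to_csv("taxa_color.tsv", index=False, header=False)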

Here is an example of how to run this script:

python3 bin_abundance_viz.py -b ../output/bin_abundance_output.tsv -t ~/tax2bin.tsv -p 10 -w 8 -l 5 -d 500 -o figure1.jpg -c ~/taxa_color.tsv
  2. Outputs will be saved in the output directory with the user-specified name.

Running metabolism

NOTE: The only metabolic pathways that are available to analyze are in the KEGG, COG, PFAM, and EC_NUMBER databases.

Map metabolic pathways to the scaffold containing them and get the counts of each pathway per bin

  1. One requirement must be met before running the metabolic_profile.py script: each of the IMG annotation files must be downloaded (see "Creating IMG annotation" for more information) and contain the "Original_Contig_Name" column, created by merging the IMG file with a user-created mapping file containing the columns "IMG_Contig_Name" and "Original_Contig_Name" (a merge sketch follows the imganno example below).

  2. To get the pathways mapped to each scaffold and the number of times each pathway is found within a bin, run the metabolic_profile.py script. These are the arguments for this script:

usage: metabolic_profile.py [-h] [-i IMGANNO_FILE] [-p IMGANNO_PATH] -m MAPPING -d DATABASE [-c CONSISTENCY]

optional arguments:
  -h, --help            show this help message and exit
  -i IMGANNO_FILE, --imganno_file IMGANNO_FILE
                        Input file or files in tsv format. Rows are genes columns are IMG annotations.
  -p IMGANNO_PATH, --imganno_path IMGANNO_PATH
                        Input path containing only the IMG files. Rows are genes columns are IMG annotations in each file.
  -m MAPPING, --mapping MAPPING
                        A tsv file containing original contig name, sample, and bin columns. Created from the files_prep.py script.
  -d DATABASE, --database DATABASE
                        Database(s) of interest to merge together. Input as a list.
  -c CONSISTENCY, --consistency CONSISTENCY
                        Boolean value that determines if scaffolds not containing a value for each database should be kept. Leave blank if consistency check is not needed.

Example of imganno file:

head imganno1.tsv
Original_Contig_Name IMG_Contig_Name Locus_Tag IMG_Gene_ID Gene_Type Gene_Start Gene_Stop Gene_Length Homolog_Gene_ID Homolog_Taxon_ID Lineage_%ID
S1A_scaffold_1_c2 Ga0224507_10000001 Ga0224507_100000011 Ga0224507_10000001.1 CDS 2 805 803 641316135 641228499 65.06 Archaea;Thaumarchaeota;unclass
S1A_scaffold_1_c2 Ga0224507_10000001 Ga0224507_100000012 Ga0224507_10000001.3 CDS 943 1254 311 2518759324 2518645532 79.61
S1A_scaffold_1_c2 Ga0224507_10000001 Ga0224507_100000013 Ga0224507_10000001.5 CDS 1243 1911 668 2518890842 2518645576 82.81
S1A_scaffold_1_c2 Ga0224507_10000001 Ga0224507_100000014 Ga0224507_10000001.7 CDS 1945 3255 1310 641316132 641228499 88.97
S1A_scaffold_1_c2 Ga0224507_10000001 Ga0224507_100000015 Ga0224507_10000001.9 CDS 3248 3595 347 2518890840 2518645576 72.17
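
If an IMG annotation file lacks the "Original_Contig_Name" column, a minimal merge sketch along these lines could add it (assuming pandas; the file names are illustrative, and the mapping file is the user-created one described in step 1):

import pandas as pd

img = pd.read_csv("imganno1_raw.tsv", sep="\t")          # hypothetical IMG file without Original_Contig_Name
name_map = pd.read_csv("contig_name_map.tsv", sep="\t")  # columns: IMG_Contig_Name, Original_Contig_Name

# Attach Original_Contig_Name to every annotation row by IMG contig name
merged = name_map.merge(img, on="IMG_Contig_Name", how="right")
merged.to_csv("imganno1.tsv", sep="\t", index=False)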

Example of mapping file:

head mapping_file.tsv
Original_Contig_Name Bin Sampling_Site
S1B_scaffold_1283_c1 13_May_SF_Bin93 S1B
S1B_scaffold_1516_c1 13_May_SF_Bin93 S1B
S1B_scaffold_1603_c1 13_May_SF_Bin93 S1B
S1B_scaffold_1628_c1 13_May_SF_Bin93 S1B

Here is an example of how to run this script with a single IMG annotation file:

python3 metabolic_profile.py -i imgannotation.tsv -m ../../data/example_input_files/bin_abundance/mapping_file.tsv -d [KEGG, COG, PFAM, EC_NUMBER] -c True

Here is an example of how to run this script with multiple IMG annotation files:

python3 metabolic_profile.py -p /img_annofiles/ -m ../../data/example_input_files/bin_abundance/mapping_file.tsv -d [KEGG, COG, PFAM, EC_NUMBER] -c True
  3. Outputs will be saved in the output directory with the name "mapped_scaffolds.tsv" and with the database name followed by "complete_metabolic_profile.tsv". NOTE: Old files will NOT be removed, as each output is uniquely named.

Extract specific pathways to analyze

  1. One requirement must be met before running the pathway_extraction.py script: the user-defined database must be created (see "Defining metabolic database" for more information).

  2. To extract and analyze specific metabolic pathways, run the pathway_extraction.py script. These are the arguments for the script:

usage: pathway_extraction.py [-h] -c CUSTOMDATA -p PATHWAY_DATABASE -d DATABASE

optional arguments:
  -h, --help            show this help message and exit
  -c CUSTOMDATA, --customdata CUSTOMDATA
                        Input file in tsv format with at least one column with the header containing the pathways of interest. The headers of the metabolic pathway must correspond with its
                        respective database (eg. KEGG, COG, PFAM, or EC_NUMBER).
  -p PATHWAY_DATABASE, --pathway_database PATHWAY_DATABASE
                        Input file in tsv format that contains the counts of each pathways per bin.
  -d DATABASE, --database DATABASE
                        Database(s) of interest to merge together. Input as a list.

Example of customdata file (can have other descriptive columns if needed):

head custom_database.tsv
Pathway Subpathway KEGG COG PFAM Description
Nitrogen Annamox K20932 COG3391 PF18582 K20932; hydrazine synthase subunit [EC:1.7.2.7]
Nitrogen Annamox K20933 COG1858 K20933; hydrazine synthase subunit [EC:1.7.2.7]
Nitrogen Annamox K20934 COG0823 K20934; hydrazine synthase subunit [EC:1.7.2.7]
Nitrogen Annamox K20935 hdh; hydrazine dehydrogenase [EC:1.7.2.8]

Example of pathway_database file:

head final_merged_metabolic_profile.tsv
KEGG 13_Jan_SF_Bin1 13_Jan_SF_Bin10 13_Jan_SF_Bin12 13_Jan_SF_Bin13 13_Jan_SF_Bin14 13_Jan_SF_Bin2 13_Jan_SF_Bin21
KO:K00001 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00002 1.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00003 2.0 0.0 1.0 1.0 0.0 0.0 1.0
KO:K00004 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00005 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00006 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00008 0.0 0.0 0.0 0.0 0.0 0.0 0.0
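
Conceptually, the extraction keeps the profile rows whose IDs appear in the custom database. The following is only a rough sketch of that idea (assuming pandas; IDs may need the "KO:" prefix added or stripped to match, as the examples above suggest), not the script's actual logic:

import pandas as pd

custom = pd.read_csv("custom_database.tsv", sep="\t")
profile = pd.read_csv("final_merged_metabolic_profile.tsv", sep="\t")

# Normalize the custom KEGG IDs to the IMG-style "KO:K12345" format
wanted = set("KO:" + custom["KEGG"].dropna().astype(str))
extracted = profile[profile["KEGG"].isin(wanted)]
extracted.to_csv("extracted_pathways_sketch.tsv", sep="\t", index=False)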

Here is an example of how to run this script:

python3 pathway_extraction.py -c user-defined_database.tsv -p ../../output/final_merged_metabolic_profile.tsv -d [KEGG, COG, PFAM, EC_NUMBER]
  3. Outputs will be placed in the output directory with the name "extracted_pathways.tsv".

Running host-virus

Compare phage and host metabolisms to identify commonalities

  1. There are two requirements needed to run this script: 1) the “database_counts” (complete_metabolic_profile.tsv) file outputted using the metabolic_profile script from the “metabolism” portion of this pipeline and 2) the Vibrant output file mapping scaffolds to metabolic pathways.

  2. To analyze the unique and shared metabolic pathways between the two organisms, run the amg_hostvi.py script. These are the arguments for the script:

usage: amg_hostvi.py [-h] -p PATHWAY_DATABASE [-vf VIBRANT_FILE] [-vp VIBRANT_PATH] -d DATABASE

optional arguments:
  -h, --help            show this help message and exit
  -p PATHWAY_DATABASE, --pathway_database PATHWAY_DATABASE
                        Input file in tsv format that contains the counts of each pathway per bin.
  -vf VIBRANT_FILE, --vibrant_file VIBRANT_FILE
                        Input file from Vibrant in tsv format that contains phage scaffolds mapped to metabolic pathways.
  -vp VIBRANT_PATH, --vibrant_path VIBRANT_PATH
                        Input path from Vibrant files in tsv format that contains phage scaffolds mapped to metabolic pathways.
  -d DATABASE, --database DATABASE
                        Input only one of the following database names the user is interested in analyzing: KEGG or PFAM.

Example of pathway_database file:

head final_merged_metabolic_profile.tsv
KEGG 13_Jan_SF_Bin1 13_Jan_SF_Bin10 13_Jan_SF_Bin12 13_Jan_SF_Bin13 13_Jan_SF_Bin14 13_Jan_SF_Bin2 13_Jan_SF_Bin21
KO:K00001 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00002 1.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00003 2.0 0.0 1.0 1.0 0.0 0.0 1.0
KO:K00004 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00005 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00006 0.0 0.0 0.0 0.0 0.0 0.0 0.0
KO:K00008 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Example of vibrant_output file:

head vibrant_output.tsv
protein scaffold AMG KO AMG KO name Pfam Pfam name
scaffold_1527_10 scaffold_1527 K03639 "moaA, CNX2; GTP 3',8-cyclase [EC:4.1.99.22]" PF04055.21 Radical SAM superfamily
scaffold_1527_19 scaffold_1527 K01939 "purA, ADSS; adenylosuccinate synthase [EC:6.3.4.4]" PF00709.21 Adenylosuccinate synthetase
scaffold_224_117 scaffold_224 K01939 "purA, ADSS; adenylosuccinate synthase [EC:6.3.4.4]" PF00709.21 Adenylosuccinate synthetase
scaffold_224_158 scaffold_224 K22227 "ahbD; heme synthase" PF13186.6 Iron-sulfur cluster-binding domain
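
For intuition, the comparison amounts to intersecting the phage AMG KOs with the KOs present in a host bin. Here is a simplified sketch (assuming pandas; the bin name is hypothetical and the column names follow the examples above; this is not the amg_hostvi.py internals):

import pandas as pd

profile = pd.read_csv("complete_metabolic_profile.tsv", sep="\t")
vibrant = pd.read_csv("vibrant_output.tsv", sep="\t")

# Normalize the Vibrant KOs to the profile's "KO:K12345" format
phage_kos = set("KO:" + vibrant["AMG KO"].dropna().astype(str))

bin_name = "13_Jan_SF_Bin1"  # hypothetical bin of interest
host_kos = set(profile.loc[profile[bin_name] > 0, "KEGG"])

print("Shared pathways:", sorted(host_kos & phage_kos))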

Here is an example of how to run this script with one file:

python3 amg_hostvi.py -p ../../output/complete_metabolic_profile.tsv -vf ~/vibrant_output.tsv -d KEGG

Here is an example of how to run this script with multiple files in a directory:

python3 amg_hostvi.py -p ../../output/complete_metabolic_profile.tsv -vp ~/vibrant_outputs/ -d KEGG
  3. Outputs will be placed in the output directory with the names "phage_host_metabolism.tsv" and "phage_host_mapping.tsv".

Additional commands

Download fastqs

Follow the download instructions found on IMG.

Mapping bins to sample sites

The user must manually map each bin to its respective sample location.
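
The site assignments cannot be automated, but a sketch like the following (assuming each bin is a single .fna file in one directory, with an illustrative directory name) can write the skeleton of bin_sample.tsv so that only the "Sampling_Site" column has to be filled in by hand:

import pathlib

import pandas as pd

bins = sorted(p.stem for p in pathlib.Path("fna_files").glob("*.fna"))  # hypothetical directory
pd.DataFrame({"Bin": bins, "Sampling_Site": ""}).to_csv("bin_sample.tsv", sep="\t", index=False)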

Download and annotate genomes using IMG

Annotate metagenomes using the IMG JGI platform.

After completion, you will download a directory for each of your metagenomic samples containing the following subdirectories:

  1. IMG Data: containing the following files
207502.assembled.COG
207502.assembled.crisprs
207502.assembled.EC
207502.assembled.faa
207502.assembled.fna
207502.assembled.gff
207502.assembled.KO
207502.assembled.last.blasttab
207502.assembled.names_map
207502.assembled.Pfam.blout
207502.assembled.phylodist
207502.assembled.product_names
207502.assembled.unsorted.Pfam.hmmout
3300033388
  2. Metagenome Report Tables: containing the following files
Table_6_-_207017.annotation_parameters.txt
Table_7_-_3300033141.metagenome_properties.txt
Table_8_-_3300033141.taxonomic_composition.txt
Table_9_-_3300033141.functional_diversity.txt

Create initial depth files

After you have annotated and downloaded your genomes using IMG’s JGI platform, there are two possible methods to create the initial depth txt files. The following two methods use BWA and JGI, respectively.

BWA method

Here is an example of how to create the depth txt files using BWA:

  1. Prep samples one at a time for mapping.
#Copy fq files from yggshare to folder
cp 4998_22_23_H2_S10_fastp_bb.fastq.gz /home/keappler/07_Lab_Protocols/Binning/Guaymas_2018_Samples/mapping_bwa/

#Run md5sum
for i in *.gz ; do md5sum $i > $i.md5sum ; done
  2. Run the script; an example for one sample is below.
#!/bin/bash
#STARTED 10:39 4/5/2021
fq=/home/keappler/07_Lab_Protocols/Binning/Guaymas_2018_Samples/mapping_bwa/4998_22_23_H2_S10_fastp_bb.fastq.gz
assembly=/home/keappler/07_Lab_Protocols/Binning/Guaymas_2018_Samples/mapping_bwa/2.5kb/renamed/2.5kb_4998_22_23_H2_S10.headers.removed.fa

#Index contigs >=2.5kb assembly subset created in previous steps

bwa index $assembly
echo "Done index"

# Align contigs >=2.5kb against concatenated, interleaved, and trimmed R1 and R2
bwa mem -t 70 $assembly -p $fq > 4998_22_23_H2_S10.sam
echo "Done mapping"

#samtools create bam files- needed for binning
samtools view -bS -F 4 -@10 4998_22_23_H2_S10.sam -o 4998_22_23_H2_S10.bam
echo "Done bam"

#samtools create sorted bam files- needed for jgi_summarize_bam_contig_depths
samtools sort -@10 4998_22_23_H2_S10.bam -o 4998_22_23_H2_S10_sorted.bam
echo "Done sorted bam"

#The file to put the contig by bam depth matrix > depth.txt
#Do not include variance from mean depth along the contig
jgi_summarize_bam_contig_depths  --outputDepth depth.txt 4998_22_23_H2_S10_sorted.bam --noIntraDepthVariance
echo "Done depth file"
  3. Run md5sum on the .sam file and delete it (it is large) before starting the next sample.
for i in *.sam ; do md5sum $i > $i.md5sum ; done
  4. Repeat the process for each sample.

JGI method

The script, jgi_summarize_bam_contig_depths.cpp, must be used to create the depth file needed by files_prep.py and can be found here along with its usage notes.

Here is an example of how to create the depth txt files using JGI:

  1. Index bin.
bwa index 13_Jan_SF_Bin5.fna
  2. BWA MEM to obtain SAM files.
for sample in `cat names.txt`; do bwa mem -t 30 13_Jan_SF_Bin5.fna -p /home/mlangwig/Working_Dir/SF_Bay/fastq/$sample.fastq.gz > $sample.sam; done
  3. Samtools view to convert SAM to BAM.
for sample in `cat names.txt`; do samtools view -bS -F 4 -@10  $sample.sam -o $sample.bam; done
  4. Obtain sorted bams.
for sample in `cat names.txt`; do samtools sort -@10 $sample.bam -o $sample-sorted.bam; done
  5. Use sorted bams to get Cov_ files for mmgenome2.
jgi_summarize_bam_contig_depths --outputDepth depth.txt *sorted.bam --noIntraDepthVariance
  6. Append the "Sampling_Site" prefix to all the scaffolds in a sample. "Sampling_Site" is a column of unique identifiers that identify the sample where each scaffold originates (see the prefix sketch in "Creating bin abundance input files").

  7. Repeat the process for each sample.

Mapping taxa

One way to map taxa to each bin is to use the GTDB-Tk tool found here: https://github.com/Ecogenomics/GTDBTk.

Creating IMG annotation

First, log on to the IMG server found here. Then, download and unzip the tar.gz file. Once you have the IMG annotation files, run the consolidateJGIdata.pl script to consolidate the data obtained from JGI IMG into one file. The script can be found here.

Defining metabolic database

When creating a user-defined database, there are multiple format requirements. The database file should contain one or more of the headers (KEGG, COG, PFAM, EC_NUMBER), depending on which databases the user wants to analyze. All other auxiliary columns will be mapped to the bins and added to the final output file. Lastly, in each of the database columns, format each pathway name as it appears in the IMG annotation file:

KEGG - KO:K12345
COG - COG12345
PFAM - pfam12345
EC_NUMBER - EC:EC12345
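
A small validation sketch (assuming pandas; the regex patterns are inferred from the formats above, and the EC pattern is deliberately loose) can flag entries that do not match:

import pandas as pd

patterns = {
    "KEGG": r"^KO:K\d+$",
    "COG": r"^COG\d+$",
    "PFAM": r"^pfam\d+$",
    "EC_NUMBER": r"^EC:",  # loose: only checks the prefix
}

custom = pd.read_csv("custom_database.tsv", sep="\t")
for col, pattern in patterns.items():
    if col in custom.columns:
        vals = custom[col].dropna().astype(str)
        bad = vals[~vals.str.match(pattern)]
        if not bad.empty:
            print(col, "entries to check:", list(bad))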

Creating Vibrant mapping

To create the Vibrant mapping file, please follow the instructions here. When using Vibrant, place the sample name of where each scaffold is found within the file name between specific characters (e.g., "VIBRANT_AMG_individuals_SF_Jul11_USGS_4_1_scaffold_2kb.tsv", where the sample name is SF_Jul11_USGS_4_1).
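
Assuming the "specific characters" are the fixed "VIBRANT_AMG_individuals_" prefix and the "_scaffold" suffix shown in the example, a hedged sketch of recovering the sample name from a file name would be:

import re

fname = "VIBRANT_AMG_individuals_SF_Jul11_USGS_4_1_scaffold_2kb.tsv"
match = re.match(r"VIBRANT_AMG_individuals_(.+?)_scaffold", fname)
if match:
    print(match.group(1))  # SF_Jul11_USGS_4_1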

Support and Development

This pipeline has been developed by the Baker lab at UT Austin. If there are any questions, comments, or concerns, please email us.

Cite

If you find this software useful, please cite us as:

Coming soon!