SARS-CoV-2 Wastewater Surveillance Tutorial#
The occurence of SARS-CoV-2 in wastewater can be used as an early warning system for the presence of the virus in a community. The virus is shed in the stool of infected individuals, and can be detected in wastewater before clinical cases are reported. This can be particularly useful for monitoring the presence of new variants of the virus, which may be more transmissible or resistant to vaccines. To monitor occurrence of SARS-CoV-2 variants in waste water, the RNA of the virus is amplified with PCR, and the viral genome is sequenced to identify specific mutations that are characteristic of different variants. This tutorial introduces the type of analysis that is required to translate raw sequencing data into epidemiological information as part of the national surveillance program of SARS-CoV-2 variants in wastewater.
The data for this tutorial is from [Bagutti and Hug, 2023].
Setting up your work directory#
The tutorial assumes that you have installed V-pipe using the quick install installation documentation, and that the workflow is setup with the following structure:
vp-analysis
├── V-pipe
├── mambaforge
└── work
vp-analysisis the main directory where you have installed V-pipeV-pipeis the directory with V-pipe’s own codemambaforgehas dependencies to start using V-pipe (bioconda, conda-forge, mamba, snakemake)workis the directory where you have performed the test analysis
Preparing the input data#
In addition to the raw sequencing data (fastq.gz) files, we also need to provide information about the SARS-CoV-2 variants we would like to detect. For this tutorial all required input files are provided in the correct format. This set up can serve as an example for your own data.
Set up the working directory#
We provide the test data for this tutorial as a tarball that you can download and extract and will result in a directory called work_cowwid:
cd vp-analysis/
wget https://vpipe-gsod.s3.eu-central-1.amazonaws.com/cowwid_tutorial.tar.gz
tar -xvf cowwid_tutorial.tar.gz
This will download and create a directory with files that you would need to provide. More information about how to set this up for your own data can be found at organizing data.
After preparing the input files, you can initiate the project with:
cd work_cowwid
../V-pipe/init_project.sh
The above two steps will result in the following directory structure:
work_cowwid
├── config.yaml
├── deconv_linear_logit_quasi_strat.yaml
├── samples
│ └── sample1
│ ├── 2021-11-15
│ │ └── raw_data
│ │ ├── Ba210453_2021-11-15_R1.fastq
│ │ └── Ba210453_2021-11-15_R2.fastq
│ ├── ...
│ |
│ ├── 2021-12-28
│ │ └── raw_data
│ │ ├── Ba210487_2021-12-28_R1.fastq
│ │ └── Ba210487_2021-12-28_R2.fastq
│ └── 2021-12-29
│ └── raw_data
│ ├── Ba210488_2021-12-29_R1.fastq
│ └── Ba210488_2021-12-29_R2.fastq
├── samples.tsv
├── timeline.tsv
└── vpipe
In addition to the raw fastq files, we have prepared samples.tsv, timeline.tsv (more about timeline.tsv later) and the deconvolution configuration file: deconv_linear_logit_quasi_strat.yaml. Here are the first few lines of samples.tsv:
sample1 2021-11-15 251 v41
sample1 2021-11-16 251 v41
sample1 2021-11-17 251 v41
sample1 2021-11-18 251 v41
Note that samples.tsv contains the read length in the third column and the protocol used for PCR amplification and sequencing in the fourth column. The identifier v41 depicts that our raw fastq files were generated according to the ARTIC V4.1 nCov-2019 primers. The required information for this primer set is part of the V-pipe repository, and can be found in resources/sars-cov2/primers/v41.
Prepare Variant of Concern (VOC) data#
V-pipe requires the information for each variant to be stored in a specific yaml format. The yaml file consists at least of two parts: the variant information and the list of mutations that are characteristic of the variant. The following is a minimal (and shortened) example of the yaml file for the delta variant:
variant:
voc: 'VOC-21APR-02'
who: 'delta'
short: 'de'
pangolin: 'B.1.617.2'
mut:
210: 'G>T'
241: 'C>T'
3037: 'C>T'
4181: 'G>T'
6402: 'C>T'
These yaml files are available from the COJAC GitHub repository for most of the variants that are currently of interest. For our tutorial, we will download the yaml files for the delta, omicron BA.1, and omicron BA.2 variants. For this make a directory called vocs in your work directory (so in work_cowwid):
mkdir vocs
And download the yaml files for the delta, omicron BA.1, and omicron BA.2 variants:
cd vocs
wget https://raw.githubusercontent.com/cbg-ethz/cojac/master/voc/delta_mutations_full.yaml
wget https://raw.githubusercontent.com/cbg-ethz/cojac/master/voc/omicron_ba1_mutations_full.yaml
wget https://raw.githubusercontent.com/cbg-ethz/cojac/master/voc/omicron_ba2_mutations_full.yaml
These yaml files are used to define the mutations that are characteristic of the delta, omicron BA.1, and omicron BA.2 variants. The yaml files can also be created with the help of the COJAC tool, which queries the Cov-Spectrum database to identify the mutations that are characteristic of each variant. Follow the preparing VOC configuration to learn how to use other resources to generate the yaml files for the delta, omicron BA.1, and omicron BA.2 variants.
Run COJAC#
The main feature of COJAC is to detect the presence of a certain variant based on the presence of a combination of mutations. In order to specify the input files and output charactertistics, let’s change the config.yaml file:
general:
virus_base_config: 'sars-cov-2'
input:
samples_file: samples.tsv
variants_def_directory: vocs/
output:
trim_primers: true
You can do this with the following command (but of course also by simply using copy-paste):
cat <<EOT > config.yaml
general:
virus_base_config: 'sars-cov-2'
input:
samples_file: samples.tsv
variants_def_directory: vocs/
output:
trim_primers: true
EOT
After setting the configuration, we can run V-pipe to do the preprocessing, and run COJAC to detect the presence of the variants:
cd ..
./vpipe --cores 8 allCooc results/cohort_cooc_report.v41.csv
This will generate several COJAC output files, including results/cohort_cooc.v41.csv, which contains the co-occurrence information for each sample, here’s an example of the first few lines of the file:
sample |
batch |
amplicon |
frac |
cooc |
count |
mut_all |
mut_oneless |
om2 |
de |
om1 |
|---|---|---|---|---|---|---|---|---|---|---|
sample1 |
15.11.2021 |
1 |
1 |
2 |
184 |
184 |
1 |
1 |
||
sample1 |
15.11.2021 |
30 |
0.95049505 |
2 |
101 |
96 |
3 |
1 |
||
sample1 |
15.11.2021 |
37 |
0.9408284 |
2 |
169 |
159 |
7 |
1 |
||
sample1 |
15.11.2021 |
73 |
1 |
2 |
13 |
13 |
0 |
1 |
||
sample1 |
15.11.2021 |
76 |
1 |
2 |
15 |
15 |
64 |
1 |
||
sample1 |
15.11.2021 |
78 |
1 |
2 |
219 |
219 |
8 |
1 |
All columns are explained in the COJAC documentation:
count: total count of amplicons carrying the sites of interest
mut_all: amplicons carrying mutations on all site of interest (e.g.: variant mutations observed on all sites)
mut_oneless: amplicons where one mutation is missing (e.g.: only 2 out of 3 sites carried the variant mutation, 1 sites carries wild-type)
frac: fraction (mut_all/count) or empty if no counts
cooc: number of considered site (e.g.: 2 sites of interests) or empty if no counts
In the above example, we see that we have evidence of the delta variant in the sample from 15.11.2021 at amplicon 1, 30, 37, 73, 76, and 78.
LolliPop#
Now that we have evidence for the presence of variants, we can use LolliPop to answer the question: in which relative proportions are the variants in the water?
Timeline#
Because Lollipop performs a time-series analysis, we need to provide information on the date of sampling. In this tutorial we do this with a timeline file. For more information and alternative methods see Specifying timeline and location information. The timeline file should contain the date of each sample, and the location where the sample was taken. This file contains the same information as the samples.tsv file, but with the addition of the location of the sample. An example of timeline.tsv for the first few samples of our dataset would be:
sample batch reads proto location_code date location
sample1 2021-11-15 251 v41 Ba 2021-11-15 Basel (BS)
sample1 2021-11-16 251 v41 Ba 2021-11-16 Basel (BS)
sample1 2021-11-17 251 v41 Ba 2021-11-17 Basel (BS)
…
Note
Note that:
The timeline tsv contains a header line (
samples.tsvdoes not).In addition to the first four columns of
samples.tsv, onlylocationanddateare necessary for LolliPop.
Now we need to provide the timeline.tsv file in config.yaml at tallymut under timeline_file and provide the deconvolution configuration. Add these lines to config.yaml:
tallymut:
timeline_file: timeline.tsv
deconvolution:
deconvolution_config: deconv_linear_logit_quasi_strat.yaml
You can do this with the following command (but of course also by simply using copy-paste):
cat <<EOT >> config.yaml
tallymut:
timeline_file: timeline.tsv
deconvolution:
deconvolution_config: deconv_linear_logit_quasi_strat.yaml
EOT
Run LolliPop#
This command will run the deconvolution:
./vpipe --cores 8 deconvolution
Visualizing the results#
V-pipe will produce results/deconvoluted.tsv.zst, which is a Zstandard-compressed table that can be directly reads into pandas for further processing. We would like visualize that using python. First install the necessary packages:
cd ../../
# activate the base conda environment
. vp-analysis/*forge*/bin/activate ''
# create the environment
mamba create -n cowwid-plot -c conda-forge -c bioconda pandas matplotlib seaborn zstandard
# activate the environment
. vp-analysis/*forge*/bin/activate cowwid-plot
After installing you can use this python script that reads the deconvoluted table and plots the proportion of each variant over time.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# read ../results/deconvoluted.tsv.zst into a pandas dataframe
df = pd.read_csv('vp-analysis/work_cowwid/results/deconvoluted.tsv.zst', sep='\t')
# Filter the DataFrame
df_filtered = df[df['variant'] != 'undetermined']
# Set the plot style
sns.set_theme(style="whitegrid")
# Create the plot
plt.figure(figsize=(10, 6))
for variant, group_data in df_filtered.groupby('variant'):
plt.plot(group_data['date'], group_data['proportion'], label=variant)
plt.fill_between(group_data['date'], group_data['proportionLower'], group_data['proportionUpper'], alpha=0.1)
# Customize the plot
plt.xlabel('Date')
plt.ylabel('Proportion')
plt.legend(title='Variant')
plt.xticks(rotation=45)
plt.gca().xaxis.set_major_locator(plt.matplotlib.dates.WeekdayLocator(interval=1))
plt.title('Proportion of Variants Over Time')
plt.tight_layout()
# save as png
plt.savefig('vp-analysis/work_cowwid/proportion_of_variants_over_time.png')
# Show the plot
plt.show()
