---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,md
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.14.5
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---

<!-- markdownlint-configure-file { "MD010": { "ignore_code_languages" : [ "tsv", "bash" ] } } -->
# SARS-CoV-2 Tutorial

This tutorial shows the basics of how to interact with V-pipe.

## Requirements

The tutorial assumes that you have installed V-pipe using the [quick install installation documentation](quick-install-v-pipe-and-conda), and that the workflow is setup with the following structure:

```text
vp-analysis
├── V-pipe
├── mambaforge
└── work
```

- `vp-analysis` is the main directory where you have installed V-pipe
- `V-pipe` is the directory with V-pipe's own code
- `mambaforge` has dependencies to start using V-pipe (bioconda, conda-forge, mamba, snakemake)
- `work` is the directory where you have performed the test analysis

## Setting up the work directory

We will create a fresh work directory for this tutorial. A V-pipe work directory typically contains the following files and directories:

- `config.yaml`: the configuration file. For example to tell V-pipe where to find the samples and the reference genome. All configuration options are described in the [configuration schema](configuring-the-workflow)
- `samples.tsv`: a tab-separated file listing the samples to be processed. The first two columns are mandatory and represent the hierarchical levels of the samples. The third and fourth column are optional and contain the read length and protocol name. 
- `vpipe`: a wrapper script to start the workflow
- `samples/`: the directory containing the raw data of the samples

And after running the workflow:

- `results/`: the results of the workflow
- `.snakemake`: the directory containing the snakemake working files

For your convenience, you can set up a boilerplate working directory with the script `init.sh`. This will copy a `config.yaml` and the `vpipe` wrapper script to get started: 

```bash
cd vp-analysis

mkdir -p work_sarscov2
cd work_sarscov2
../V-pipe/init_project.sh
```

## Preparing the dataset

In this section, we will prepare the dataset. We will download the reads from the SRA and prepare the directory structure as described in the [configuration documentation](organizing-data).

The dataset has been generated by [doi:10.1093/nsr/nwaa036](https://doi.org/10.1093/nsr/nwaa036), and is avaible from the sequence read archive (SRA): [SRR10903401](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR10903401) and [SRR10903402](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR10903402).

Below you can find a bash script that downloads the reads from the SRA and prepares the directory structure. Run these commands in the `work_sarscov2` directory.

```bash
download_reads () {
  # SRR ID
  SRR_ID=$1
  # path to the directory containing the reads on the ENA FTP
  FTP_PATH=$2

  # creating the raw_data directory
  mkdir -p samples/$SRR_ID/20200102/raw_data

  # downloading the reads from ENA
  # note that we rename the _1 suffix to _R1 and _2 to _R2
  curl \
  -o samples/$SRR_ID/20200102/raw_data/"${SRR_ID}"_R1.fastq.gz \
  ftp://ftp.sra.ebi.ac.uk/"$FTP_PATH"/"$SRR_ID"/"$SRR_ID"_1.fastq.gz

  curl \
  -o samples/$SRR_ID/20200102/raw_data/"${SRR_ID}"_R2.fastq.gz \
  ftp://ftp.sra.ebi.ac.uk/"$FTP_PATH"/"$SRR_ID"/"$SRR_ID"_2.fastq.gz
}

# executing the function for the two SRR IDs
download_reads SRR10903401 vol1/fastq/SRR109/001
download_reads SRR10903402 vol1/fastq/SRR109/002
```

The downloaded files should have the following structure:

```text
📁samples
├───📁SRR10903401
│   └───📁20200102
│       └───📁raw_data
│           ├───🧬SRR10903401_R1.fastq
│           └───🧬SRR10903401_R2.fastq
└───📁SRR10903402
    └───📁20200102
        └───📁raw_data
            ├───🧬SRR10903402_R1.fastq
            └───🧬SRR10903402_R2.fastq
```

## Configuration

In the `work_sarscov2`  directory you can find the file `config.yaml`. Open it in your editor and copy-paste the contents of the yaml file below in there. Our data is from the SARS-CoV-2 virus. Therefore, we can use the reference genome and annotation files provided by V-pipe, that we can specify at `general.virus_base_config`. In addition to that, we will set the `input.read_length` to 150 and `output.snv` to `true` to enable the SNV calling:

```yaml
general:
    virus_base_config: 'sars-cov-2'

input:
    samples_file: samples.tsv
    read_length: 150

output:
    trim_primers: false
    snv: true
    local: false
    global: false
    visualization: false
    diversity: false
    QA: false
    upload: false
    dehumanized_raw_reads: false
```

You can also get it from the V-pipe repository with:

```bash
cp ../V-pipe/docs/example_sarscov2_data/config.yaml .
```

## Running V-pipe

After having prepared the configuration, we can continue with a dry run to see what gets executed and to create the `samples.tsv` file:

```bash
./vpipe --dryrun --cores 2
```

As this is your first run of V-pipe, it will automatically generate the sample collection table (`samples.tsv`). Check `samples.tsv` in your editor. It is always a good idea check the content of the `samples.tsv` file, as it is used to collect the samples for the analysis. Of course, you can also provide `samples.tsv` yourself, before running the pipeline. If you did not use the expected directory structure, this file might end up empty or some entries might be missing. If so, you can safely delete it and re-run with option `--dry-run` to regenerate it. More information on the `samples.tsv` file can be found in the [documentation](setting-up-samplestsv).

Finally, we can run the V-pipe analysis. The first run will take a while because it will install all necessary software dependencies with conda:

```bash
./vpipe -p --cores 2
# -p and --cores (and all other options) are passed to snakemake. -p is for printing shell cmds. 
# takes a while to run, needs to install packages
```

```{note}
Note that `vpipe` is a wrapper for `snakemake`. All options that are passed to `vpipe` are options to snakemake. More information about snakemake options can be found in the [snakemake documentation](https://snakemake.readthedocs.io/en/stable/executing/cli.html).
```

## Output

The output of the SNV calling is aggregated in a standard [VCF](https://en.wikipedia.org/wiki/Variant_Call_Format) file, located in `results/`_​{hierarchy}​_`/variants/SNVs/snvs.vcf`, you can open it with your favorite VCF tools for visualisation or downstream processing.
It is also available in a tabular format in `results/​`_{hierarchy}​_`/variants/SNVs/snvs.csv`.

### Expected output

The small dataset that we used in this tutorial section has been analyzed by [doi:10.1093/nsr/nwaa036](https://doi.org/10.1093/nsr/nwaa036). The results of the original analysis (using bwa, samtools mpileup, and bcftools) are displayed in Table 2 in the article:

Using either the VCF or CSV files, compare with the results given out by V-pipe (with bwa and ShoRAH).

<!-- By browing to results/SRR10903401/20200102/variants/SNVs/snvs.csv I only find Shohrah output. Where's the BWA output? Or should I check alighments with IGV or something?   -->

<!-- In the vcf:
NC_045512.2	19164	.	C	T	100	PASS	Freq1=0.1666;Freq2=0.2404;Freq3=0.2298;Post1=1;Post2=1;Post3=1;Fvar=8;Rvar=9;Ftot=41;Rtot=44;Pval=1;Qval=1 -->

<!-- Ftot + Rtot = 41 + 44 = 85. In the paper: REF: 40	ALT: 12 = 52. Due to filtering?  -->

<!-- 1821, 26314 and 26590 not reported at all by ShoRAH.  -->

<!-- All together, this needs more context and explanation. -->


* For positions 19164 and 24323 of SRR10903401 and position 11563 of SRR10903402, we expect to see similar results in V-pipe.


* For the remaining positions (1821, 26314 and 26590 of SRR10903401), we expect that ShoRAH will consider the variants of poor quality and reject them because there is very little support ( <= than 5 reads supporting the alt).

<!-- This is documentation: -->