V-pipe HIV Tutorial

This tutorial shows you how to apply V-pipe to HIV sequencing data. The data originate from Abrahams et al. (2019), Science Translational Medicine 11.513 (DOI: 10.1126/scitranslmed.aaw5589). The same data are used for testing the installation. We will show you how to set up the configuration file, run the pipeline, and interpret the output.

Requirements

The tutorial assumes that you have installed V-pipe following the quick install documentation, and that the workflow is set up with the following structure:

vp-analysis
├── V-pipe
├── mambaforge
└── work
  • vp-analysis is the main directory where you have installed V-pipe

  • V-pipe is the directory with V-pipe’s own code

  • mambaforge has dependencies to start using V-pipe (bioconda, conda-forge, mamba, snakemake)

  • work is the directory where you have performed the test analysis

Setting up the work directory

We will create a fresh work directory for this tutorial. A V-pipe work directory typically contains the following files and directories:

  • config.yaml: the configuration file. It tells V-pipe, for example, where to find the samples and the reference genome. All configuration options are described in the configuration schema.

  • samples.tsv: a tab-separated file listing the samples to be processed. The first two columns are mandatory and represent the hierarchical levels of the samples. The third and fourth columns are optional and contain the read length and the protocol name.

  • vpipe: a wrapper script to start the workflow

  • samples/: the directory containing the raw data of the samples

And after running the workflow:

  • results/: the results of the workflow

  • .snakemake: the directory containing the snakemake working files

For your convenience, you can set up a boilerplate working directory with the script init_project.sh. It copies a config.yaml and the vpipe wrapper script into the current directory to get you started:

cd vp-analysis

mkdir -p work_hiv
cd work_hiv
../V-pipe/init_project.sh
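As a quick sanity check after initialisation, the following minimal sketch confirms that the two files mentioned above are present. The function name check_workdir is hypothetical and not part of V-pipe; the file names are the ones listed in this tutorial.

```shell
# Sketch (not part of V-pipe): confirm that the files which the init script
# is described as creating are present in a work directory.
check_workdir() {
    dir="${1:-.}"
    missing=0
    for f in config.yaml vpipe; do
        if [ -e "$dir/$f" ]; then
            echo "found: $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done
    # non-zero exit status if anything is missing
    return "$missing"
}

check_workdir . || echo "work directory looks incomplete; re-run the init script"
```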

Preparing the dataset

As described in the configuration documentation, V-pipe expects the input samples to be organized in a two-level hierarchy. In the directory vp-analysis/V-pipe/docs/example_HIV_data you can find a small dataset in the correct format, which we will use in this tutorial. The files have the following structure:

📁samples
├───📁CAP217
│   └───📁4390
│       └───📁raw_data
│           ├───🧬reads_R1.fastq
│           └───🧬reads_R2.fastq
└───📁CAP188
    ├───📁4
    │   └───📁raw_data
    │       ├───🧬reads_R1.fastq
    │       └───🧬reads_R2.fastq
    └───📁30
        └───📁raw_data
            ├───🧬reads_R1.fastq
            └───🧬reads_R2.fastq

Now, copy this dataset to the work_hiv directory:

cp -r ../V-pipe/docs/example_HIV_data/samples .
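To confirm that the copied data follow the expected two-level layout, a small check along these lines may help. It is only a sketch: check_samples is a hypothetical helper, and it assumes paired-end files whose names contain R1 and R2, as in the example dataset.

```shell
# Sketch (not part of V-pipe): walk <datadir>/<level1>/<level2>/ and report
# whether each sample has both R1 and R2 read files under raw_data/.
check_samples() {
    datadir="${1:-samples}"
    rc=0
    for sample in "$datadir"/*/*/; do
        [ -d "$sample" ] || continue
        if ls "$sample"raw_data/*R1* >/dev/null 2>&1 &&
           ls "$sample"raw_data/*R2* >/dev/null 2>&1; then
            echo "ok: ${sample%/}"
        else
            echo "missing reads: ${sample%/}"
            rc=1
        fi
    done
    return "$rc"
}

check_samples samples || echo "some samples do not match the expected layout"
```

For the example dataset, every sample directory should be reported as ok.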

Configuration

References

V-pipe comes with pre-configured references. You can specify the species you are using in the configuration file at general.virus_base_config. If you are working with a reference that is not pre-configured, you can specify it in the configuration file at input.reference. For more information see the documentation.

In this tutorial we will use a reference already available in V-pipe/resources/hiv/HXB2.fasta, but instead of specifying virus_base_config we will specify the reference directly in the configuration file. We will also separately specify the genome annotation (GFF file) and the metainfo file.

Populating config.yaml

In the work_hiv directory you can find the file config.yaml. Open it in your editor and add the following content:

general:
    virus_base_config: ""
    aligner: bwa
    snv_caller: shorah
    haplotype_reconstruction: haploclique

input:
    # the references are part of the repository in this case:
    reference: "../V-pipe/resources/hiv/HXB2.fasta"
    metainfo_file: "../V-pipe/resources/hiv/metainfo.yaml"
    gff_directory: "../V-pipe/resources/hiv/gffs/"
    datadir: samples/
    # we specify the read length here, as it is not the default 250:
    read_length: 301
    samples_file: samples.tsv
    paired: true

snv:
    consensus: false

output:
    snv: true
    local: true
    global: true
    visualization: true
    QA: false
    diversity: true

You can also get this yaml file from the V-pipe repository:

cp ../V-pipe/docs/example_HIV_data/config.yaml .

Note

YAML files use spaces for indentation. You can indent with 2 or 4 spaces, but not with tabs. If your YAML file is wrongly formatted, an online YAML validator can help you find the problem.
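Because tab characters are a common cause of YAML errors, a quick local check can be sketched as follows. The helper name has_tabs is hypothetical; the sketch only looks for tab characters and does not validate the YAML syntax itself.

```shell
# Sketch: detect tab characters in a YAML file.
has_tabs() {
    # exit status 0 if the file contains at least one tab character
    grep -q "$(printf '\t')" "$1"
}

if [ -f config.yaml ]; then
    if has_tabs config.yaml; then
        echo "config.yaml contains tabs on these lines:"
        grep -n "$(printf '\t')" config.yaml
    else
        echo "config.yaml: no tabs found"
    fi
fi
```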

Running V-pipe

Before running, check what will be executed:

./vpipe --dryrun

As this is your first run of V-pipe, it will automatically generate the sample collection table (samples.tsv). Open samples.tsv in your editor: it is always a good idea to check its content, as it is used to collect the samples for the analysis. You can also provide samples.tsv yourself before running the pipeline. If you did not use the expected directory structure, this file might end up empty or some entries might be missing. If so, you can safely delete it and re-run with the option --dryrun to regenerate it. More information on the samples.tsv file can be found in the documentation.
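The comparison between samples.tsv and the directory layout can also be scripted. The following is a minimal sketch, not V-pipe's own logic: list_samples is a hypothetical helper, and it assumes that samples.tsv contains only data rows whose first two columns are the two hierarchy levels.

```shell
# Sketch: derive "<level1><TAB><level2>" pairs from the directory layout.
list_samples() {
    datadir="${1:-samples}"
    for d in "$datadir"/*/*/raw_data; do
        [ -d "$d" ] || continue
        sample="${d%/raw_data}"      # e.g. samples/CAP217/4390
        level2="${sample##*/}"       # e.g. 4390
        parent="${sample%/*}"        # e.g. samples/CAP217
        level1="${parent##*/}"       # e.g. CAP217
        printf '%s\t%s\n' "$level1" "$level2"
    done
}

# Compare against the first two columns of samples.tsv, if both exist.
if [ -f samples.tsv ] && [ -d samples ]; then
    if [ "$(list_samples samples | sort)" = "$(cut -f1,2 samples.tsv | sort)" ]; then
        echo "samples.tsv matches the samples/ directory"
    else
        echo "samples.tsv differs from the samples/ directory"
    fi
fi
```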

Finally, we can run the V-pipe analysis. The first run will take a while because it will install all necessary software dependencies with conda:

cd vp-analysis/work_hiv/

./vpipe -p --cores 2
# -p and --cores (and all other options) are passed to snakemake. -p is for printing shell cmds. 
# takes a while to run, needs to install packages

Note

vpipe is a wrapper for snakemake: all options passed to vpipe are forwarded to snakemake. More information about snakemake options can be found in the snakemake documentation.

Output

The output of the SNV calling step is aggregated in a standard VCF file, located in results/{hierarchy}/variants/SNVs/snvs.vcf. You can open it with your favorite VCF tools for visualisation or downstream processing. It is also available in a tabular format in results/{hierarchy}/variants/SNVs/snvs.csv.
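For a first look at the calls without dedicated VCF tools, a few lines of shell are enough. This is only a sketch: list_snvs and count_snvs are hypothetical helpers, the column positions follow the standard VCF layout (CHROM, POS, ID, REF, ALT, ...), and the glob assumes that {hierarchy} expands to the two sample levels used in this tutorial.

```shell
# Sketch: print CHROM, POS, REF and ALT for every record in a VCF file,
# skipping the header lines that start with '#'.
list_snvs() {
    awk -F '\t' '!/^#/ && NF >= 5 { print $1 "\t" $2 "\t" $4 "\t" $5 }' "$1"
}

# Number of variant records in a VCF file.
count_snvs() {
    list_snvs "$1" | wc -l | tr -d ' '
}

# Assumed layout: results/<level1>/<level2>/variants/SNVs/snvs.vcf
for vcf in results/*/*/variants/SNVs/snvs.vcf; do
    [ -f "$vcf" ] || continue
    echo "$vcf: $(count_snvs "$vcf") variant record(s)"
done
```

Calling list_snvs on a single file, for example the snvs.vcf of one sample, prints one tab-separated line per variant.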