This lesson is in the early stages of development (Alpha version)

SV Exercises

Overview

Teaching: 0 min
Exercises: 120 min
Questions
  • How do the calls from short read data compare to those from the long read data?

  • How do the

Objectives
  • Align long read data with minimap2

  • Call SVs in short and long read data

  • Regenotype LR SVs in a short read data set

Exercises

  1. Run manta on our SR data for chromosomes 1, 6, and 20
  2. Run minimap2 on chromosome 20
  3. Run samtools merge for LR data
  4. Run sniffles on merged LR bam

Before we really start

Super fun exercise

cd
wget https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools.static
chmod a+x bedtools.static
mv bedtools.static miniconda3/envs/siw/bin/bedtools

First

We’ll make our output directory and change our current directory to it.

mkdir /workshop/output/sv
cd /workshop/output/sv

Manta

There are two steps to running manta on our data. First, we tell Manta what the inputs are by running configManta.py.

Relevant files - don’t try to execute this, it is just to help you understand what files you need.

_BAM_ -> /data/alignment/combined/NA12878.dedup.bam

_REF_ -> /data/alignment/references/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa

Run the following code blocks one at a time.

/software/manta-1.6.0.centos6_x86_64/bin/configManta.py \
 --bam _BAM_ \
 --referenceFasta _REF_ \
 --runDir _OUT_
./_OUT_/runWorkflow.py \
 --mode local \
 --jobs 8 \
 --memGb unlimited 

Challenge

Run manta on NA12878’s parents (NA12891 and NA12892)

ls -lh _OUT_/results/variants/diploidSV.vcf.gz
-rw-r--r-- 1 student student 137K Aug 27 22:54 manta_NA12878/results/variants/diploidSV.vcf.gz
-rw-r--r-- 1 student student 141K Aug 27 23:04 manta_NA12891/results/variants/diploidSV.vcf.gz
-rw-rw-r-- 1 student student 141K Aug 27 23:07 manta_NA12892/results/variants/diploidSV.vcf.gz

Note - data compression

Storing the raw text of our analyses is very inefficient so we often compress our data using data compression programs (gzip, bgzip). This is comes at a cost of convenience though, we can’t use our “normal” tools (cat, less, grep) to access these files in the same way we couldn’t use less to read our BAM files yesterday. However, many tools come with an equivalent for compressed file usage - zcat, zless, zgrep.

Minimap2

Challenge

If we take the first 1000 lines of a fastq file how many reads do we get?

zcat /data/SV/long_read/inputs/NA12878_NRHG.chr20.fq.gz \
| head -n 20000 \
| minimap2 \
    -ayYL \
    --MD \
    --cs \
    -z 600,200 \
    -x map-ont \
    -t 8 \
    -R "@RG\tID:NA12878\tSM:NA12878\tPL:ONT\tPU:PromethION" \
    /data/alignment/references/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.mmi \
    /dev/stdin \
| samtools \
    sort \
    -M \
    -m 4G \
    -O bam \
> NA12878.minimap2.bam

Flags

Sniffles

sniffles \
 --threads 8 \
 --input /data/SV/long_read/minimap2/NA12878_NRHG.minimap2.chr1-6-20.bam \
 --vcf NA12878_NRHG.sniffles.vcf

Regenotyping

Another super fun exercise

We need to make a copy of a file and change some of its contents.

cp /data/SV/inputs/NA12878.paragraph_manifest.txt .

vi NA12878.paragraph_manifest.txt

Change /data/SV/bams/NA12878.chr1-6-20.bam to /data/alignment/combined/NA12878.dedup.bam.

cat

id	path	depth	read length	sex
NA12878	/data/alignment/combined/NA12878.dedup.bam	33.94	150	F

Relevant files - don’t try to execute this, it is just to help you understand what files you need.

_VCF_ -> /data/SV/inputs/HGSVC_NA12878.chr1-6-20.vcf.gz

_MANIFEST_ -> ./NA12878.paragraph_manifest.txt

_REF_ -> /data/alignment/references/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa
~/paragraph-v2.4a/bin/multigrmpy.py \
 -i _VCF_ \
 -m _MANIFEST_ \
 -r _REF_ \
 -o _OUT_ \
 --threads 8 \
 -M 400

Secondary challenges

Manta

How many variants do we call using manta?

Solution

zgrep -vc "^#" 

What is the breakdown by type?

Sniffles

How many variants do we call using sniffles?

Solution

zgrep -vc "^#" NA12878_NRHG.sniffles.vcf.gz 
4209

What is the breakdown by type?

Solution

zgrep -v "^#" NA12878_NRHG.sniffles.vcf.gz | cut -f 8 | cut -f 2 -d ";" | sort | uniq -c 
    11 SVTYPE=BND
  1727 SVTYPE=DEL
     1 SVTYPE=DUP
  2464 SVTYPE=INS
     6 SVTYPE=INV

Discussion

Key Points