SV Exercises

Overview

Teaching: 0 min
Exercises: 120 min

Questions

How do the calls from short read data compare to those from the long read data?

How do the

Objectives

Align long read data with minimap2

Call SVs in short and long read data

Regenotype LR SVs in a short read data set

Exercises

Run manta on our SR data for chromosomes 1, 6, and 20
Run minimap2 on chromosome 20
Run samtools merge for LR data
Run sniffles on merged LR bam

Before we really start

Super fun exercise

cd
wget https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools.static
chmod a+x bedtools.static
mv bedtools.static miniconda3/envs/siw/bin/bedtools

First

We’ll make our output directory and change our current directory to it.

mkdir /workshop/output/sv
cd /workshop/output/sv

Manta

There are two steps to running manta on our data. First, we tell Manta what the inputs are by running configManta.py.

Relevant files - don’t try to execute this, it is just to help you understand what files you need.
_BAM_ -> /data/alignment/combined/NA12878.dedup.bam

_REF_ -> /data/alignment/references/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa

Run the following code blocks one at a time.

/software/manta-1.6.0.centos6_x86_64/bin/configManta.py \
 --bam _BAM_ \
 --referenceFasta _REF_ \
 --runDir _OUT_

./_OUT_/runWorkflow.py \
 --mode local \
 --jobs 8 \
 --memGb unlimited 

Challenge

Run manta on NA12878’s parents (NA12891 and NA12892)

ls -lh _OUT_/results/variants/diploidSV.vcf.gz

-rw-r--r-- 1 student student 137K Aug 27 22:54 manta_NA12878/results/variants/diploidSV.vcf.gz
-rw-r--r-- 1 student student 141K Aug 27 23:04 manta_NA12891/results/variants/diploidSV.vcf.gz
-rw-rw-r-- 1 student student 141K Aug 27 23:07 manta_NA12892/results/variants/diploidSV.vcf.gz

Note - data compression

Storing the raw text of our analyses is very inefficient so we often compress our data using data compression programs (gzip, bgzip). This is comes at a cost of convenience though, we can’t use our “normal” tools (cat, less, grep) to access these files in the same way we couldn’t use less to read our BAM files yesterday. However, many tools come with an equivalent for compressed file usage - zcat, zless, zgrep.

Minimap2

Challenge

If we take the first 1000 lines of a fastq file how many reads do we get?

zcat /data/SV/long_read/inputs/NA12878_NRHG.chr20.fq.gz \
| head -n 20000 \
| minimap2 \
    -ayYL \
    --MD \
    --cs \
    -z 600,200 \
    -x map-ont \
    -t 8 \
    -R "@RG\tID:NA12878\tSM:NA12878\tPL:ONT\tPU:PromethION" \
    /data/alignment/references/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.mmi \
    /dev/stdin \
| samtools \
    sort \
    -M \
    -m 4G \
    -O bam \
> NA12878.minimap2.bam

Flags

Minimap
- -a : output in the SAM format
- -y : Copy input FASTA/Q comments to output
- -Y : use soft clipping for supplementary alignments
- -L : write CIGAR with >65535 ops at the CG tag
- –MD : output the MD tag
- –cs : output the cs tag
- -z : Z-drop score and inversion Z-drop score
- -x : preset
- -t : number of threads
- -R : SAM read group line
Samtools
- -M : Use minimiser for clustering unaligned/unplaced reads
- -m : Set maximum memory per thread
- -O : Specify output format

Sniffles

sniffles \
 --threads 8 \
 --input /data/SV/long_read/minimap2/NA12878_NRHG.minimap2.chr1-6-20.bam \
 --vcf NA12878_NRHG.sniffles.vcf

Regenotyping

Another super fun exercise

We need to make a copy of a file and change some of its contents.
cp /data/SV/inputs/NA12878.paragraph_manifest.txt .

vi NA12878.paragraph_manifest.txt
Change /data/SV/bams/NA12878.chr1-6-20.bam to /data/alignment/combined/NA12878.dedup.bam.

cat

id	path	depth	read length	sex
NA12878	/data/alignment/combined/NA12878.dedup.bam	33.94	150	F

Relevant files - don’t try to execute this, it is just to help you understand what files you need.
_VCF_ -> /data/SV/inputs/HGSVC_NA12878.chr1-6-20.vcf.gz

_MANIFEST_ -> ./NA12878.paragraph_manifest.txt

_REF_ -> /data/alignment/references/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa

~/paragraph-v2.4a/bin/multigrmpy.py \
 -i _VCF_ \
 -m _MANIFEST_ \
 -r _REF_ \
 -o _OUT_ \
 --threads 8 \
 -M 400

Secondary challenges

Manta

How many variants do we call using manta?
Solution
zgrep -vc "^#" 
What is the breakdown by type?

Sniffles

How many variants do we call using sniffles?

Solution

zgrep -vc "^#" NA12878_NRHG.sniffles.vcf.gz

What is the breakdown by type?

Solution

zgrep -v "^#" NA12878_NRHG.sniffles.vcf.gz | cut -f 8 | cut -f 2 -d ";" | sort | uniq -c 

SVTYPE=BND
SVTYPE=DEL
SVTYPE=DUP
SVTYPE=INS
SVTYPE=INV

Discussion

Key Points

previous episode

NYGC Sequence Informatics Workshop

next episode

SV Exercises

Overview

Exercises

Before we really start

Super fun exercise

First

Manta

Relevant files - don’t try to execute this, it is just to help you understand what files you need.

Challenge

Note - data compression

Minimap2

Challenge

Flags

Sniffles

Regenotyping

Another super fun exercise

Relevant files - don’t try to execute this, it is just to help you understand what files you need.

Secondary challenges

Manta

Solution

Sniffles

Solution

Solution

Discussion

Key Points

previous episode

next episode