SV Exercises
Overview
Teaching: 0 min
Exercises: 120 minQuestions
How do the calls from short read data compare to those from the long read data?
How do the
Objectives
Align long read data with minimap2
Call SVs in short and long read data
Regenotype LR SVs in a short read data set
Exercises
- Run manta on our SR data for chromosomes 1, 6, and 20
- Run minimap2 on chromosome 20
- Run samtools merge for LR data
- Run sniffles on merged LR bam
Before we really start
Super fun exercise
cd wget https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools.static chmod a+x bedtools.static mv bedtools.static miniconda3/envs/siw/bin/bedtools
First
We’ll make our output directory and change our current directory to it.
mkdir /workshop/output/sv
cd /workshop/output/sv
Manta
There are two steps to running manta on our data. First, we tell Manta what the
inputs are by running configManta.py
.
Relevant files - don’t try to execute this, it is just to help you understand what files you need.
_BAM_ -> /data/alignment/combined/NA12878.dedup.bam _REF_ -> /data/alignment/references/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa
Run the following code blocks one at a time.
/software/manta-1.6.0.centos6_x86_64/bin/configManta.py \
--bam _BAM_ \
--referenceFasta _REF_ \
--runDir _OUT_
./_OUT_/runWorkflow.py \
--mode local \
--jobs 8 \
--memGb unlimited
Challenge
Run manta on NA12878’s parents (NA12891 and NA12892)
ls -lh _OUT_/results/variants/diploidSV.vcf.gz
-rw-r--r-- 1 student student 137K Aug 27 22:54 manta_NA12878/results/variants/diploidSV.vcf.gz
-rw-r--r-- 1 student student 141K Aug 27 23:04 manta_NA12891/results/variants/diploidSV.vcf.gz
-rw-rw-r-- 1 student student 141K Aug 27 23:07 manta_NA12892/results/variants/diploidSV.vcf.gz
Note - data compression
Storing the raw text of our analyses is very inefficient so we often compress our data using data compression programs (gzip, bgzip). This is comes at a cost of convenience though, we can’t use our “normal” tools (cat, less, grep) to access these files in the same way we couldn’t use less to read our BAM files yesterday. However, many tools come with an equivalent for compressed file usage - zcat, zless, zgrep.
Minimap2
Challenge
If we take the first 1000 lines of a fastq file how many reads do we get?
zcat /data/SV/long_read/inputs/NA12878_NRHG.chr20.fq.gz \
| head -n 20000 \
| minimap2 \
-ayYL \
--MD \
--cs \
-z 600,200 \
-x map-ont \
-t 8 \
-R "@RG\tID:NA12878\tSM:NA12878\tPL:ONT\tPU:PromethION" \
/data/alignment/references/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.mmi \
/dev/stdin \
| samtools \
sort \
-M \
-m 4G \
-O bam \
> NA12878.minimap2.bam
Flags
- Minimap
- -a : output in the SAM format
- -y : Copy input FASTA/Q comments to output
- -Y : use soft clipping for supplementary alignments
- -L : write CIGAR with >65535 ops at the CG tag
- –MD : output the MD tag
- –cs : output the cs tag
- -z : Z-drop score and inversion Z-drop score
- -x : preset
- -t : number of threads
- -R : SAM read group line
- Samtools
- -M : Use minimiser for clustering unaligned/unplaced reads
- -m : Set maximum memory per thread
- -O : Specify output format
Sniffles
sniffles \
--threads 8 \
--input /data/SV/long_read/minimap2/NA12878_NRHG.minimap2.chr1-6-20.bam \
--vcf NA12878_NRHG.sniffles.vcf
Regenotyping
Another super fun exercise
We need to make a copy of a file and change some of its contents.
cp /data/SV/inputs/NA12878.paragraph_manifest.txt . vi NA12878.paragraph_manifest.txt
Change
/data/SV/bams/NA12878.chr1-6-20.bam
to/data/alignment/combined/NA12878.dedup.bam
.
cat
id path depth read length sex
NA12878 /data/alignment/combined/NA12878.dedup.bam 33.94 150 F
Relevant files - don’t try to execute this, it is just to help you understand what files you need.
_VCF_ -> /data/SV/inputs/HGSVC_NA12878.chr1-6-20.vcf.gz _MANIFEST_ -> ./NA12878.paragraph_manifest.txt _REF_ -> /data/alignment/references/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa
~/paragraph-v2.4a/bin/multigrmpy.py \
-i _VCF_ \
-m _MANIFEST_ \
-r _REF_ \
-o _OUT_ \
--threads 8 \
-M 400
Secondary challenges
Manta
How many variants do we call using manta?
Solution
zgrep -vc "^#"
What is the breakdown by type?
Sniffles
How many variants do we call using sniffles?
Solution
zgrep -vc "^#" NA12878_NRHG.sniffles.vcf.gz
4209
What is the breakdown by type?
Solution
zgrep -v "^#" NA12878_NRHG.sniffles.vcf.gz | cut -f 8 | cut -f 2 -d ";" | sort | uniq -c
11 SVTYPE=BND 1727 SVTYPE=DEL 1 SVTYPE=DUP 2464 SVTYPE=INS 6 SVTYPE=INV
Discussion
Key Points