This lesson is in the early stages of development (Alpha version)

Cancer Genomics

Overview

Teaching: 2h min
Exercises: 3h min
Questions
Objectives

Open your terminal

To start we will open a terminal.

  1. Go to the link given to you at the workshop
  2. Paste the notebook link next to your name into your browser
  3. Select “Terminal” from the “JupyterLab” launcher (or blue button with a plus in the upper left corner)
  4. After you have done this put up a green sticky not if you see a flashing box next to a $

Slides

You can view Nico’s talk here

Generate somatic variant calls with Mutect2

Mutect2 is a somatic mutation caller in the Genome Analysis Toolkit (GATK), designed for detecting single-nucleotide variants (SNVs) and INDELs (insertions and deletions) in cancer genomes. The input for Mutect2 includes preprocessed alignment files (CRAM files). The CRAMs include sequencing reads that have been aligned to a reference genome, sorted, duplicate-marked, and base quality score recalibrated. Mutect2 can call variants from paired tumor and normal sample CRAM files, and also has a tumor-only mode. Mutect2 uses the matched normal to additionally exclude rare germline variation not captured by the germline resource and individual-specific artifacts.

Executing Mutect2

In this example Mutect2 will be run in a few intervals. This will speed up a minature test run of a Mutect2 command. In reality you might decide to restrict Mutect2 to the WGS calling regions recommended in the Mutect2 resource bundle.

# create a BED file of chr22 intervals to do a fast test run
cat \
  /data/cancer_genomics/allSitesVariant.bed \
  | grep "^chr22" \
  > ~/workspace/output/allSitesVariant.chr22.bed

# run Mutect2 calling command
gatk Mutect2 \
  -R /data/cancer_genomics/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  --intervals ~/workspace/output/allSitesVariant.chr22.bed \
  -I /data/cancer_genomics/COLO-829_2B.variantRegions.cram \
  -I /data/cancer_genomics/COLO-829BL_1B.variantRegions.cram \
  -O ~/workspace/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.unfiltered.vcf

Going further with Mutect2 runs (this is a slower command not run in the workshop)

Real projects should be filtered for technical artifacts with a Panel of Normals (PON) and with common germline variants. The --germline-resource and --panel-of-normals flags are used if Mutect2 is the tool filtering your somatic calls. In the example below the population allele frequency for the alleles that are not in the germline resource is 0.00003125. Read more in the Mutect2 documentation.

gatk Mutect2 \
  -R /data/cancer_genomics/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  -I /data/cancer_genomics/COLO-829_2B.variantRegions.cram \
  -I /data/cancer_genomics/COLO-829BL_1B.variantRegions.cram \
  --germline-resource AF_ONLY_GNOMAD.vcf.gz \
  --af-of-alleles-not-in-resource 0.00003125 \
  --panel-of-normals YOUR_PON.vcf.gz \
  -O ~/workspace/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.unfiltered.vcf

Filter somatic variant calls

After calling the variants, we filter out low-quality calls using the FilterMutectCalls tool which applies various filters like minimum variant-supporting read depth and mapping quality to distinguish true somatic mutations from artifacts. The full set of filters is described in the Mutect2 GitHub repository.

Filtering Mutect2 calls

gatk FilterMutectCalls \
  --reference /data/cancer_genomics/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  -V ~/workspace/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.unfiltered.vcf \
  --stats ~/workspace/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.unfiltered.vcf.stats \
  -O ~/workspace/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.vcf

Anatomy of a VCF file & bcftools

VCF file spec

MAF file spec

bcftools

What is a VCF?

Variant Call Format (VCF) files are a widely used file format for representing genetic variation, with an official specification maintained by the Global Alliance for Genomics & Health (GA4GH). This organization also maintains the specifications for the SAM, BAM, and CRAM file formats. The VCF specification has evolved over time, and can now represent SNVs, INDELs, structural variants, and copy number variants, as well as any annotations. A large ecosystem of software exists for parsing and manipulating VCFs, the most useful of which is BCFtools.

BCFtools is maintained by the same folks as SAMtools, and shares a lot of underlying code (namely the HTSlib library). For most routine tasks, such as sorting, filtering, annotating, and summarizing, BCFtools provides utilities so you don’t have to worry about your own implementation. For more complicated queries and analysis, you may need to write your own code, or use a command line tool with more sophisticated expressions like vcfexpress.

Another commonly used file format for representing variants is the Mutation Annotation Format (MAF). This format’s origins lie in The Cancer Genome Atlas (TCGA) project. It’s only meant to represent SNVs and INDELs, and the specification is quite rigid with respect to the allowed annotations. However, the data is represented in a convenient tabular format and the maftools R package exists for easy plotting, analysis, and comparison to TCGA.

File structures

Like SAM/BAM files, VCFs are split into two parts: the header, followed by the variant calls.

Exercise: Print the help menu for BCFtools

$ bcftools -h

Exercise: Print just the header with BCFtools, print just the variants with bcftools view

$ # view VCF header
$ bcftools view -h ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.vcf
$ # alternately...
$ bcftools head ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.vcf
$ # view the variant calls
$ bcftools view -H ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.vcf | head

Screenshot 2025-05-30 at 3 37 08 PM

VCFs can contain information about multiple samples. In a cancer context, this is usually the paired tumor and normal, but you can imagine a situation in which we have multiple tumors from the same patient (e.g. a primary and metastasis). Each variant call can be thought of as having two parts, variant-level information (e.g., position, reference allele, impact on gene coding sequence), and more fine-grained sample-level information for a given variant (e.g. VAF, sequencing depth).

Screenshot 2025-05-30 at 3 37 13 PM

Exercise: Filter out the Mutect2 calls with bcftools filter

Typically, we want to filter somatic variants to reduce false positives. The FILTER column indicates the relevant filters for a given variant call. Before filtering, this field is blank or contains just a period (“.”). In the earlier filtering step, Mutect2 “soft-filtered” the calls. That is, it filled in the FILTER column for us, but did not yet remove the calls. Sometimes it’s helpful to inspect these flagged calls to debug some downstream issue. To “hard filter”, we want to only keep calls with a PASS in the FILTER column.

Solution

bcftools filter -i "FILTER='PASS'" \
-o ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.vcf \
~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.vcf

Note: bcftools filter can also utilize other fields for filtering (e.g. VAF). Read more about filtering expressions here.

Screenshot 2025-05-30 at 4 02 09 PM

The INFO field contains variant level annotations such as gene impact and population frequency. These are typically added by tools such as Ensembl VEP, or bcftools annotate.

Each annotation should have a corresponding entry in the header indicating how the annotation should be parsed for bcftools and other tools (e.g. python’s Pysam or R’s VariantAnnotation), and a plaintext description for you, the user.

Screenshot 2025-05-30 at 4 02 39 PM

The FORMAT column describes the order of sample-level annotations. Each sample is given a column after the FORMAT column, with the order described in FORMAT, and each annotation detailed in the header.

Screenshot 2025-05-30 at 4 02 58 PM

Exercise: What samples are contained in the Mutect2 VCF?

Hint

Try using bcftools head

Finally, the reference genome contigs used in variant calling are also listed in the header.

Bgzip and tabix indexing

Tabix: fast retrieval of sequence features from generic TAB-delimited files

Sort the Mutect2 VCF, bgzip, and tabix-index VCF with bcftools sort, bgzip, and tabix

Oftentimes, we want to access variants based on their position in the genome. Doing this in a reasonable amount of time requires an index. BCFtools comes with the utilities bgzip and tabix for accomplishing this. Bgzip implements a modified version of the gzip compression algorithm, compressing the data in blocks of a predetermined size. These bgzipped files are perfectly standards-compliant gzip files and will still work with zcat/gunzip/etc. Tabix is a general purpose algorithm for indexing coordinate-sorted genomic coordinate data in tabular format that relies on bgzip compression to work.

Solution

bcftools sort \
-o ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.sorted.vcf \
~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.vcf
bgzip ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.sorted.vcf 
tabix ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.sorted.vcf.gz

# Shorter answer (done in a single short bcftools command)

bcftools sort \
-O z -Wtbi \
-o ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.sorted.vcf.gz \
~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.vcf

Solution

bcftools view \
~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.sorted.vcf.gz \
'chr22:10000000-110000000'

Other BCFtools examples

Extracting variants and plotting the VAF distribution

Extract chromosome, position, and tumor VAF from the VCF. In order to extract the correct VAF values, you’ll need to provide BCFtools with the tumor sample name listed in the header.

Extracting information from VCFs

Solution

# copy a VCF from a bucket
mkdir ~/workshop/data/
gcloud storage cp \
gs://nygc-workshop-2024/cancer_genomics/COLO-829_2B--COLO-829BL_1B.snv.indel.results.v7.annotated.vcf.gz* \
~/workshop/data/

bcftools query \
-s COLO-829_2B \
-f "%CHROM\t%POS\t[%AF]\n" \
~/workshop/data/COLO-829_2B--COLO-829BL_1B.snv.indel.results.v7.annotated.vcf.gz \
> ~/workshop/output/COLO-829_2B--COLO-829BL_1B.vafs.tsv

Plot VAF plot for chr21 and chr22. Identify something different on chr22. (for loop)

Solution (run in R terminal)

# pdf('~/workshop/output/vaf_histogram.pdf') # use lines like these to save to a file when needed)
f = '~/workshop/output/COLO-829_2B--COLO-829BL_1B.vafs.tsv'
x = read.table(f, h=F, stringsAsFactors=F,
               sep='\t', col.names=c('chr', 'pos', 'vaf')) 

for (chr in unique(x$chr)) {
   hist(x$vaf[x$chr == chr], breaks=20, main=chr)
}
# dev.off() # use lines like these to save to a file when needed)

What do you think the source of the low VAF variants on chr22 could be?

Inspecting alignments with IGV

Good candidates:

  1. chr22:22949473 - a straightforward clonal SNV
  2. chr1:3243220 - a straightforward deletion
  3. chr2:48028500 - a straightforward insertion

Bad candidates:

  1. chr22:24,098,035 and chr22:24,098,086 - supporting reads have multiple mismatches, soft-clip on the left
  2. chr22:29289914-29290094 - Five clustered variants all supported by the same set of reads, with soft-clip on the left. “Group by base at …” is our friend here
  3. chr22:19132585 - 7bp insertion with flanking mismatches and soft-clipping

Bonus exercise: What organism is the contamination coming from?

Solution

Right-click one of the artifact-supporting reads > Copy read sequence > paste into BLAST Screenshot 2025-05-30 at 4 04 32 PM

Mutational Signatures (working in Python)

Resources:

Step-by-step version:

  1. Create input vcf folder
mkdir -p /home/student/workshop/input_vcfs
mkdir -p /home/student/workshop/output

cp /data/cancer_genomics/COLO-829_2B--COLO-829BL_1B.snv.indel.high_confidence.v7.annotated.vcf \
~/workshop/input_vcfs/

cd ~/workshop/
  1. Install genome
from SigProfilerMatrixGenerator import install as genInstall

genInstall.install('GRCh38', rsync=False, bash=True)
  1. Generate mutational spectrum count matrix
from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGeneratorFunc as matGen

matrices = matGen.SigProfilerMatrixGeneratorFunc("SIW_2025", "GRCh38", "/home/student/workshop/input_vcfs/", plot=True, exome=False, bed_file=None, chrom_based=False, tsb_stat=False, seqInfo=False, cushion=100)

Compare the SBS96 count matrix to the COSMIC database here. Looks closest to SBS7a (UV exposure) (/home/student/workshop/input_vcfs/output/plots/SBS_96_plots_SIW_2025.pdf) Screenshot 2025-05-30 at 4 06 22 PM

  1. Deconvolve mutational signatures
    from SigProfilerAssignment import Analyzer as Analyze
    
    Analyze.cosmic_fit(samples="/home/student/workshop/input_vcfs/output/SBS/SIW_2025.SBS96.all", 
                    output="/home/student/workshop/output", 
                    input_type="matrix", context_type="96", 
                    genome_build="GRCh38")
    
  2. Look at plots of mutational signatures (/home/student/workshop/output/Assignment_Solution/Activities/Assignment_Solution_Activity_Plots.pdf)

    Solution

    SBS5: Unknown clock-like signature SBS7a: UV-light exposure SBS7b: UV-light expsure SBS38: Unknown. Found only in ultraviolet light associated melanomas suggesting potential indirect damage from UV-light.

    Screenshot 2025-05-30 at 4 08 40 PM

Cohort-level analysis

Most cancer genomics studies involve the study of a whole cohort of samples, not just one or two. Furthermore, we often want to compare to previously-published data to uncover new biology, and as a way to sanity-check our data – e.g, did something go wrong during the many steps to get from biopsy to variant calls?

To dip our toe into cohort-level analysis, we’ll be using the R maftools package and curated TCGA data from its companion package, TCGAmutations. To start, let’s spin up an interactive R session.

Maftools user guide

TCGAmutations user guide

Start an interactive R session

Next, let’s load up the R packages we’ll need. Check what TCGA datasets are available and load up the breast cancer cohort.

Solution


dir.create("~/site-library")
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager", lib="~/site-library")
.libPaths("~/site-library")
library(BiocManager)
BiocManager::install("maftools", lib="~/site-library")
BiocManager::install("PoisonAlien/TCGAmutations", lib="~/site-library")
library(maftools)
library(TCGAmutations)

tcga_available()

brca = tcga_load(study = "BRCA")

In an interactive R session, just typing the variable will print its contents. Let’s check what’s inside our brca variable.

Next, let’s plot a basic summary of the cohort, including the variant type, coding consequences, reference and alternate alleles, tumor mutation burden, and the top 10 most mutated genes.

Solution

# pdf('maf_summary.pdf' # use when you need to make a pdf
plotmafSummary(maf=brca, rmOutlier=TRUE, addStat='median', dashboard=TRUE, titvRaw=FALSE)
# dev.off()

Screenshot 2025-05-30 at 4 10 02 PM

We can expand out the top 10 most mutated genes into an oncoprint (also called an oncoplot), showing distribution of mutations in each sample. Oncoprints can contain information from SNVs and INDELs, but also copy number variants.

Exercise: Generate an oncoprint of the top 10 most mutated genes in the breast cancer cohort

Solution

# pdf('oncoprint.pdf')
oncoplot(maf=brca, top=10) 
# dev.off()

Screenshot 2025-05-30 at 4 10 29 PM

It can also be useful to zoom in on specific genes, to understand how the mutations are distributed within the genes – are they clustered in a few “hotspots”?

Exercise: Generate lollipop plots of top 3 most frequently mutated genes in the oncoprint

Solution

#pdf('lollipop.pdf')
lollipopPlot(maf=brca, gene='PIK3CA', AACol=HGVSp_Short, showMutationRate=TRUE)
lollipopPlot(maf=brca, gene='TTN', AACol=’HGVSp_Short’, showMutationRate=TRUE)
lollipopPlot(maf=brca, gene='TP53', AACol=’HGVSp_Short’, showMutationRate=TRUE)
#dev.off()

Screenshot 2025-05-30 at 4 11 21 PM Screenshot 2025-05-30 at 4 11 09 PM Screenshot 2025-05-30 at 4 11 00 PM

Notice anything different when comparing the distribution of mutations in TTN versus PIK3CA and TP53? TTN is a very long gene and accumulates mutations by chance. Therefore, mutations in this gene are uniformly distributed, rather than clustered in regions of functional importance. The distribution of mutations in a gene’s coding sequence is a signal utilized in some driver gene discovery tools such as OncodriveCLUST.

As mentioned earlier, it can be helpful to comapre a cohort against previously-published data. Let’s compare this cohort’s tumor mutation burden (TMB) to the rest of TCGA.

Solution

# pdf('tcga_comparison.pdf')
tcgaCompare(maf=brca, cohortName='Workshop', logscale=TRUE, capture_size=50)
# dev.off()

Screenshot 2025-05-30 at 4 11 51 PM

The oncoprint hinted at patterns of mutual exclusivity. We can check this in a more statistically rigorous manner with the somaticInteractions function.

Exercise: Check patterns of co-occurence and mutual exclusivity

Solution

# pdf('interactions.pdf')
somaticInteractions(maf=brca, top=20, pvalue=c(0.05, 0.1))
# dev.off()

Screenshot 2025-05-30 at 4 12 08 PM

Key Points