Cancer Genomics
Overview
Teaching: 2h min
Exercises: 3h minQuestions
Objectives
Open your terminal
To start we will open a terminal.
- Go to the link given to you at the workshop
- Paste the notebook link next to your name into your browser
- Select “Terminal” from the “JupyterLab” launcher (or blue button with a plus in the upper left corner)
- After you have done this put up a green sticky not if you see a flashing box next to a
$
Slides
You can view Nico’s talk here
Generate somatic variant calls with Mutect2
Mutect2 is a somatic mutation caller in the Genome Analysis Toolkit (GATK), designed for detecting single-nucleotide variants (SNVs) and INDELs (insertions and deletions) in cancer genomes. The input for Mutect2 includes preprocessed alignment files (CRAM files). The CRAMs include sequencing reads that have been aligned to a reference genome, sorted, duplicate-marked, and base quality score recalibrated. Mutect2 can call variants from paired tumor and normal sample CRAM files, and also has a tumor-only mode. Mutect2 uses the matched normal to additionally exclude rare germline variation not captured by the germline resource and individual-specific artifacts.
Executing Mutect2
In this example Mutect2 will be run in a few intervals. This will speed up a minature test run of a Mutect2 command. In reality you might decide to restrict Mutect2 to the WGS calling regions recommended in the Mutect2 resource bundle.
# create a BED file of chr22 intervals to do a fast test run cat \ /data/cancer_genomics/allSitesVariant.bed \ | grep "^chr22" \ > ~/workspace/output/allSitesVariant.chr22.bed # run Mutect2 calling command gatk Mutect2 \ -R /data/cancer_genomics/GRCh38_full_analysis_set_plus_decoy_hla.fa \ --intervals ~/workspace/output/allSitesVariant.chr22.bed \ -I /data/cancer_genomics/COLO-829_2B.variantRegions.cram \ -I /data/cancer_genomics/COLO-829BL_1B.variantRegions.cram \ -O ~/workspace/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.unfiltered.vcf
Going further with Mutect2 runs (this is a slower command not run in the workshop)
Real projects should be filtered for technical artifacts with a Panel of Normals (PON) and with common germline variants. The
--germline-resource
and--panel-of-normals
flags are used if Mutect2 is the tool filtering your somatic calls. In the example below the population allele frequency for the alleles that are not in the germline resource is 0.00003125. Read more in the Mutect2 documentation.gatk Mutect2 \ -R /data/cancer_genomics/GRCh38_full_analysis_set_plus_decoy_hla.fa \ -I /data/cancer_genomics/COLO-829_2B.variantRegions.cram \ -I /data/cancer_genomics/COLO-829BL_1B.variantRegions.cram \ --germline-resource AF_ONLY_GNOMAD.vcf.gz \ --af-of-alleles-not-in-resource 0.00003125 \ --panel-of-normals YOUR_PON.vcf.gz \ -O ~/workspace/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.unfiltered.vcf
Filter somatic variant calls
After calling the variants, we filter out low-quality calls using the FilterMutectCalls tool which applies various filters like minimum variant-supporting read depth and mapping quality to distinguish true somatic mutations from artifacts. The full set of filters is described in the Mutect2 GitHub repository.
Filtering Mutect2 calls
gatk FilterMutectCalls \ --reference /data/cancer_genomics/GRCh38_full_analysis_set_plus_decoy_hla.fa \ -V ~/workspace/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.unfiltered.vcf \ --stats ~/workspace/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.unfiltered.vcf.stats \ -O ~/workspace/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.vcf
Anatomy of a VCF file & bcftools
What is a VCF?
Variant Call Format (VCF) files are a widely used file format for representing genetic variation, with an official specification maintained by the Global Alliance for Genomics & Health (GA4GH). This organization also maintains the specifications for the SAM, BAM, and CRAM file formats. The VCF specification has evolved over time, and can now represent SNVs, INDELs, structural variants, and copy number variants, as well as any annotations. A large ecosystem of software exists for parsing and manipulating VCFs, the most useful of which is BCFtools.
BCFtools is maintained by the same folks as SAMtools, and shares a lot of underlying code (namely the HTSlib library). For most routine tasks, such as sorting, filtering, annotating, and summarizing, BCFtools provides utilities so you don’t have to worry about your own implementation. For more complicated queries and analysis, you may need to write your own code, or use a command line tool with more sophisticated expressions like vcfexpress.
Another commonly used file format for representing variants is the Mutation Annotation Format (MAF). This format’s origins lie in The Cancer Genome Atlas (TCGA) project. It’s only meant to represent SNVs and INDELs, and the specification is quite rigid with respect to the allowed annotations. However, the data is represented in a convenient tabular format and the maftools R package exists for easy plotting, analysis, and comparison to TCGA.
File structures
Like SAM/BAM files, VCFs are split into two parts: the header, followed by the variant calls.
Exercise: Print the help menu for BCFtools
$ bcftools -h
Exercise: Print just the header with BCFtools, print just the variants with bcftools view
$ # view VCF header
$ bcftools view -h ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.vcf
$ # alternately...
$ bcftools head ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.vcf
$ # view the variant calls
$ bcftools view -H ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.vcf | head
VCFs can contain information about multiple samples. In a cancer context, this is usually the paired tumor and normal, but you can imagine a situation in which we have multiple tumors from the same patient (e.g. a primary and metastasis). Each variant call can be thought of as having two parts, variant-level information (e.g., position, reference allele, impact on gene coding sequence), and more fine-grained sample-level information for a given variant (e.g. VAF, sequencing depth).
Exercise: Filter out the Mutect2 calls with
bcftools filter
Typically, we want to filter somatic variants to reduce false positives. The FILTER column indicates the relevant filters for a given variant call. Before filtering, this field is blank or contains just a period (“.”). In the earlier filtering step, Mutect2 “soft-filtered” the calls. That is, it filled in the FILTER column for us, but did not yet remove the calls. Sometimes it’s helpful to inspect these flagged calls to debug some downstream issue. To “hard filter”, we want to only keep calls with a PASS in the FILTER column.
Solution
bcftools filter -i "FILTER='PASS'" \ -o ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.vcf \ ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.vcf
Note: bcftools filter can also utilize other fields for filtering (e.g. VAF). Read more about filtering expressions here.
The INFO field contains variant level annotations such as gene impact and population frequency. These are typically added by tools such as Ensembl VEP, or bcftools annotate
.
Each annotation should have a corresponding entry in the header indicating how the annotation should be parsed for bcftools and other tools (e.g. python’s Pysam or R’s VariantAnnotation), and a plaintext description for you, the user.
The FORMAT column describes the order of sample-level annotations. Each sample is given a column after the FORMAT column, with the order described in FORMAT, and each annotation detailed in the header.
Exercise: What samples are contained in the Mutect2 VCF?
Hint
Try using
bcftools head
Finally, the reference genome contigs used in variant calling are also listed in the header.
Bgzip and tabix indexing
Tabix: fast retrieval of sequence features from generic TAB-delimited files
Sort the Mutect2 VCF, bgzip, and tabix-index VCF with bcftools sort, bgzip, and tabix
Oftentimes, we want to access variants based on their position in the genome. Doing this in a reasonable amount of time requires an index. BCFtools comes with the utilities bgzip and tabix for accomplishing this. Bgzip implements a modified version of the gzip compression algorithm, compressing the data in blocks of a predetermined size. These bgzipped files are perfectly standards-compliant gzip files and will still work with zcat/gunzip/etc. Tabix is a general purpose algorithm for indexing coordinate-sorted genomic coordinate data in tabular format that relies on bgzip compression to work.
Solution
bcftools sort \ -o ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.sorted.vcf \ ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.vcf bgzip ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.sorted.vcf tabix ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.sorted.vcf.gz # Shorter answer (done in a single short bcftools command) bcftools sort \ -O z -Wtbi \ -o ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.sorted.vcf.gz \ ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.vcf
Print the variants overlapping the interval “chr22:10000000-110000000” in the uncompressed VCF with bcftools view. Try the same thing with the bgzipped+indexed VCF
Solution
bcftools view \ ~/workshop/output/COLO-829_2B--COLO-829_829BL_1B.mutect2.pass.sorted.vcf.gz \ 'chr22:10000000-110000000'
Other BCFtools examples
- isec: perform set operations on multiple VCFs (e.g. intersection, union, set differences)
- merge: merge multiple VCFs
- norm: normalize indel representation, specify how multi-allelic sites should be handled
- NOTE: Normalizing INDELs is required before annotation, and when comparing output from multiple variant callers
- reheader: Rename samples and/or change contigs in the header
- query: Extract information from a VCF in a user-defined format
Extracting variants and plotting the VAF distribution
Extract chromosome, position, and tumor VAF from the VCF. In order to extract the correct VAF values, you’ll need to provide BCFtools with the tumor sample name listed in the header.
Extracting information from VCFs
Solution
# copy a VCF from a bucket mkdir ~/workshop/data/ gcloud storage cp \ gs://nygc-workshop-2024/cancer_genomics/COLO-829_2B--COLO-829BL_1B.snv.indel.results.v7.annotated.vcf.gz* \ ~/workshop/data/ bcftools query \ -s COLO-829_2B \ -f "%CHROM\t%POS\t[%AF]\n" \ ~/workshop/data/COLO-829_2B--COLO-829BL_1B.snv.indel.results.v7.annotated.vcf.gz \ > ~/workshop/output/COLO-829_2B--COLO-829BL_1B.vafs.tsv
Plot VAF plot for chr21 and chr22. Identify something different on chr22. (for loop)
Solution (run in R terminal)
# pdf('~/workshop/output/vaf_histogram.pdf') # use lines like these to save to a file when needed) f = '~/workshop/output/COLO-829_2B--COLO-829BL_1B.vafs.tsv' x = read.table(f, h=F, stringsAsFactors=F, sep='\t', col.names=c('chr', 'pos', 'vaf')) for (chr in unique(x$chr)) { hist(x$vaf[x$chr == chr], breaks=20, main=chr) } # dev.off() # use lines like these to save to a file when needed)
What do you think the source of the low VAF variants on chr22 could be?
Inspecting alignments with IGV
Good candidates:
- chr22:22949473 - a straightforward clonal SNV
- chr1:3243220 - a straightforward deletion
- chr2:48028500 - a straightforward insertion
Bad candidates:
- chr22:24,098,035 and chr22:24,098,086 - supporting reads have multiple mismatches, soft-clip on the left
- chr22:29289914-29290094 - Five clustered variants all supported by the same set of reads, with soft-clip on the left. “Group by base at …” is our friend here
- chr22:19132585 - 7bp insertion with flanking mismatches and soft-clipping
Bonus exercise: What organism is the contamination coming from?
Solution
Right-click one of the artifact-supporting reads > Copy read sequence > paste into BLAST
Mutational Signatures (working in Python)
Resources:
- COSMIC SBS signature database
- SigProfilerAssignment documentation
- Paper
- Pre-run test inputs/outputs:
mutational_signatures/pre_run_files
Step-by-step version:
- Create input vcf folder
mkdir -p /home/student/workshop/input_vcfs mkdir -p /home/student/workshop/output cp /data/cancer_genomics/COLO-829_2B--COLO-829BL_1B.snv.indel.high_confidence.v7.annotated.vcf \ ~/workshop/input_vcfs/ cd ~/workshop/
- Install genome
from SigProfilerMatrixGenerator import install as genInstall genInstall.install('GRCh38', rsync=False, bash=True)
- Generate mutational spectrum count matrix
from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGeneratorFunc as matGen matrices = matGen.SigProfilerMatrixGeneratorFunc("SIW_2025", "GRCh38", "/home/student/workshop/input_vcfs/", plot=True, exome=False, bed_file=None, chrom_based=False, tsb_stat=False, seqInfo=False, cushion=100)
Compare the SBS96 count matrix to the COSMIC database here. Looks closest to SBS7a (UV exposure) (/home/student/workshop/input_vcfs/output/plots/SBS_96_plots_SIW_2025.pdf)
- Deconvolve mutational signatures
from SigProfilerAssignment import Analyzer as Analyze Analyze.cosmic_fit(samples="/home/student/workshop/input_vcfs/output/SBS/SIW_2025.SBS96.all", output="/home/student/workshop/output", input_type="matrix", context_type="96", genome_build="GRCh38")
- Look at plots of mutational signatures
(/home/student/workshop/output/Assignment_Solution/Activities/Assignment_Solution_Activity_Plots.pdf)
Solution
SBS5: Unknown clock-like signature SBS7a: UV-light exposure SBS7b: UV-light expsure SBS38: Unknown. Found only in ultraviolet light associated melanomas suggesting potential indirect damage from UV-light.
Cohort-level analysis
Most cancer genomics studies involve the study of a whole cohort of samples, not just one or two. Furthermore, we often want to compare to previously-published data to uncover new biology, and as a way to sanity-check our data – e.g, did something go wrong during the many steps to get from biopsy to variant calls?
To dip our toe into cohort-level analysis, we’ll be using the R maftools package and curated TCGA data from its companion package, TCGAmutations. To start, let’s spin up an interactive R session.
Start an interactive R session
Next, let’s load up the R packages we’ll need. Check what TCGA datasets are available and load up the breast cancer cohort.
Solution
dir.create("~/site-library") if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager", lib="~/site-library") .libPaths("~/site-library") library(BiocManager) BiocManager::install("maftools", lib="~/site-library") BiocManager::install("PoisonAlien/TCGAmutations", lib="~/site-library") library(maftools) library(TCGAmutations) tcga_available() brca = tcga_load(study = "BRCA")
In an interactive R session, just typing the variable will print its contents. Let’s check what’s inside our brca variable.
Next, let’s plot a basic summary of the cohort, including the variant type, coding consequences, reference and alternate alleles, tumor mutation burden, and the top 10 most mutated genes.
Solution
# pdf('maf_summary.pdf' # use when you need to make a pdf plotmafSummary(maf=brca, rmOutlier=TRUE, addStat='median', dashboard=TRUE, titvRaw=FALSE) # dev.off()
We can expand out the top 10 most mutated genes into an oncoprint (also called an oncoplot), showing distribution of mutations in each sample. Oncoprints can contain information from SNVs and INDELs, but also copy number variants.
Exercise: Generate an oncoprint of the top 10 most mutated genes in the breast cancer cohort
Solution
# pdf('oncoprint.pdf') oncoplot(maf=brca, top=10) # dev.off()
It can also be useful to zoom in on specific genes, to understand how the mutations are distributed within the genes – are they clustered in a few “hotspots”?
Exercise: Generate lollipop plots of top 3 most frequently mutated genes in the oncoprint
Solution
#pdf('lollipop.pdf') lollipopPlot(maf=brca, gene='PIK3CA', AACol=HGVSp_Short, showMutationRate=TRUE) lollipopPlot(maf=brca, gene='TTN', AACol=’HGVSp_Short’, showMutationRate=TRUE) lollipopPlot(maf=brca, gene='TP53', AACol=’HGVSp_Short’, showMutationRate=TRUE) #dev.off()
![]()
![]()
Notice anything different when comparing the distribution of mutations in TTN versus PIK3CA and TP53? TTN is a very long gene and accumulates mutations by chance. Therefore, mutations in this gene are uniformly distributed, rather than clustered in regions of functional importance. The distribution of mutations in a gene’s coding sequence is a signal utilized in some driver gene discovery tools such as OncodriveCLUST.
As mentioned earlier, it can be helpful to comapre a cohort against previously-published data. Let’s compare this cohort’s tumor mutation burden (TMB) to the rest of TCGA.
Solution
# pdf('tcga_comparison.pdf') tcgaCompare(maf=brca, cohortName='Workshop', logscale=TRUE, capture_size=50) # dev.off()
The oncoprint hinted at patterns of mutual exclusivity. We can check this in a more statistically rigorous manner with the somaticInteractions
function.
Exercise: Check patterns of co-occurence and mutual exclusivity
Solution
# pdf('interactions.pdf') somaticInteractions(maf=brca, top=20, pvalue=c(0.05, 0.1)) # dev.off()
Key Points