This lesson is in the early stages of development (Alpha version)

Structural Variation in long reads

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • What are the advantages/disadvantages of long reads?

  • How might we leverage a combination of long and short read data?

Objectives
  • Investigate how long-read data can improve SV calling.

Long read platforms

The two major platforms in long read sequencing are PacBio and Oxford Nanopore.

PacBio’s flagship is the Revio, which produces reads in the 5kb to 35kb range with very high accuracy.

Figures: PacBio; PacBio read length and quality

Oxford Nanopore (ONT) produces sequencers ranging in size from the MinION, which is roughly the size of a smartphone, to the PromethION, the high-throughput instrument that we have at NYGC. Read outputs differ somewhat between platforms, but the MinION has been shown to produce N50 read lengths over 100kb, with maximum read lengths greater than 800kb, using ONT’s ultra-long sequencing prep. (The N50 is the read length at which reads of that length or longer contain half of all sequenced bases.) The PromethION can produce even greater N50 values and megabase-long reads. These reads typically have lower overall base quality than PacBio reads, but ONT has been steadily improving the base quality of its data.
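Since N50 is the read-length statistic quoted above, a minimal sketch of how it can be computed from a list of read lengths may help. The function name and the example lengths are illustrative assumptions, not output from any real run.

```python
def n50(read_lengths):
    """Return the N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    if not read_lengths:
        raise ValueError("read_lengths must not be empty")
    total_bases = sum(read_lengths)
    running = 0
    # Walk from the longest read down until half the bases are covered.
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running * 2 >= total_bases:
            return length


# Hypothetical read lengths in bases (made up for illustration).
example_lengths = [2_000, 5_000, 10_000, 40_000, 120_000]
example_n50 = n50(example_lengths)
print(example_n50)
```

Note that a single very long read can dominate the N50 of a small data set, which is why N50 is usually reported alongside total yield and maximum read length.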

Figures: ONT MinION; ONT PromethION; ONT PromethION Read Length; ONT PromethION Quality

Advantages of long reads

The advantage of long reads is that they map much more uniquely to the genome and can often span the repetitive elements that cause mapping quality issues with short reads. With long reads we can detect much larger events, and when an event lies entirely inside a read we can determine its breakpoints with much higher accuracy.
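To illustrate why an event contained entirely within one read gives precise breakpoints, here is a minimal sketch that walks a CIGAR string and reports large insertions and deletions with their reference coordinates. The function name, the 50-base size cutoff and the example CIGAR string are assumptions made for illustration; real callers work from BAM records and handle many more cases.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def inline_events(ref_start, cigar, min_size=50):
    """Return (type, ref_pos, size) for insertions and deletions of at
    least min_size bases found inside a single alignment.

    ref_start is the 0-based reference position where the alignment begins.
    """
    events = []
    ref_pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "I" and length >= min_size:
            events.append(("INS", ref_pos, length))
        elif op == "D" and length >= min_size:
            events.append(("DEL", ref_pos, length))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref_pos += length
    return events


# Hypothetical alignment: a 120-base deletion and a 75-base insertion.
example_events = inline_events(1000, "500M120D300M2I200M75I100M")
print(example_events)  # [('DEL', 1500, 120), ('INS', 2120, 75)]
```

Because the read aligns on both sides of each event, the breakpoint comes directly from the alignment coordinates instead of being inferred from read-pair distance or depth.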

SV calling in long reads

Alignment

Sniffles, an alignment-based SV caller, uses a three-step approach. First, it scans the read alignments for split reads and inline events; inline events are insertions and deletions that occur entirely within a read. Second, it places these candidate SVs into bins and merges neighboring bins, using a repeat-aware approach, to form clusters of SV candidates. Third, each cluster is re-analyzed and a final determination is made based on read support, expected coverage changes and breakpoint variance.
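The general idea of grouping per-read candidates and filtering by read support can be sketched as follows. This is a simplified illustration, not the Sniffles implementation: it is not repeat aware, it ignores coverage and breakpoint variance, and the function names, distance threshold and support threshold are all assumptions.

```python
from collections import defaultdict


def summarize(sv_type, cluster, min_support):
    """Turn one cluster of (pos, size) candidates into a call, or drop it."""
    if len(cluster) < min_support:
        return []
    positions = sorted(pos for pos, _ in cluster)
    sizes = sorted(size for _, size in cluster)
    return [{
        "type": sv_type,
        "pos": positions[len(positions) // 2],   # median position
        "size": sizes[len(sizes) // 2],          # median size
        "support": len(cluster),
    }]


def cluster_candidates(candidates, max_distance=100, min_support=3):
    """Merge same-type candidates whose positions are within max_distance
    of the previous candidate, keeping clusters with enough reads."""
    by_type = defaultdict(list)
    for sv_type, pos, size in candidates:
        by_type[sv_type].append((pos, size))

    calls = []
    for sv_type, items in by_type.items():
        items.sort()
        cluster = [items[0]]
        for item in items[1:]:
            if item[0] - cluster[-1][0] <= max_distance:
                cluster.append(item)
            else:
                calls.extend(summarize(sv_type, cluster, min_support))
                cluster = [item]
        calls.extend(summarize(sv_type, cluster, min_support))
    return sorted(calls, key=lambda call: call["pos"])


# Hypothetical per-read candidates: (type, reference position, size).
example_candidates = [
    ("DEL", 1500, 120), ("DEL", 1498, 118), ("DEL", 1503, 121),
    ("INS", 9000, 60), ("DEL", 50000, 300),
]
example_calls = cluster_candidates(example_candidates)
print(example_calls)
```

In this toy input only the deletion near position 1500 is seen in three reads, so it is the only call that survives the support filter.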

Figure: Sniffles

Assembly

We touched on assembly in the short read section, but here we mean whole genome assembly rather than the local assembly used with short reads. By assembling as much of the genome as possible, including novel insertions, we create a bigger picture of our sample. These assembled fragments, called contigs, can then be aligned to the reference. The contigs act as a sort of ultra-long read, as they represent many reads stitched together.

Drawbacks

Genotyping LR SVs in SR data

Challenge

Given the section title, what two approaches might we take in creating a hybrid SV call set that uses both long and short reads?

Paragraph

Paragraph

Figure: Pangenie

Solution

We can:

  1. Sequence a number of individuals with long reads and genotype those SV calls in our short read sample set.
  2. Leverage existing long read SV callsets and genotype those SVs in our short read sample set.
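Both options end with a genotyping step: for each known SV, decide whether a short read sample carries zero, one or two copies. A minimal sketch of that decision, assuming we already have counts of short reads supporting the reference and alternate alleles, is shown below. The function name and thresholds are illustrative assumptions; PanGenie itself takes a different, k-mer based approach, so this is not its method.

```python
def genotype_from_support(ref_reads, alt_reads, min_depth=5,
                          het_low=0.2, hom_alt_low=0.8):
    """Call a diploid genotype from allele-supporting read counts.

    Returns a VCF-style genotype string, or './.' when there are too
    few informative reads to make a call.
    """
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return "./."
    alt_fraction = alt_reads / depth
    if alt_fraction < het_low:
        return "0/0"   # homozygous reference
    if alt_fraction < hom_alt_low:
        return "0/1"   # heterozygous
    return "1/1"       # homozygous alternate


# Hypothetical read counts for one SV in three samples.
for ref_count, alt_count in [(20, 0), (12, 10), (1, 19)]:
    print(ref_count, alt_count, genotype_from_support(ref_count, alt_count))
```

Fixed thresholds like these are easy to reason about but ignore mapping bias near SVs, which is one reason dedicated genotypers use more elaborate models.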

Question

Does anyone know of any other technologies being used for structural variation?

Callback

Figure: SV tech comparison

Key Points

  • Long reads offer significant advantages over short reads for SV calling.

  • Genotyping of long-read discovered SVs in short-read data allows for some scalability.