Introduction
In other articles we have described a progressive or catastrophic loss of quality as a sequencing run proceeds. On some runs, though, a loss of quality affects only part of a flowcell, and sometimes only certain sequencing cycles. Transient quality loss can escape normal read trimming, since the quality at the 3′ end of the read can still be good. Depending on the nature of these losses, they can introduce particular types of error into the sequence data, and these can generate artefacts if they are not identified.
The Symptoms and Diagnosis
The way to look for positional quality losses in sequence data is to plot the quality of the base calls against the physical position on the sequencer from which they came. On Illumina sequencers the area of the flowcell is artificially divided into swaths (top and bottom surface of the flowcell) and these are further sub-divided into tiles (arbitrary areas which are analysed separately in the calling pipeline). Looking at the per-tile quality will identify these types of errors.
Since quality is expected to fall generally as sequencing cycles increase, it is useful to normalise by cycle when plotting quality. You can therefore create a heatmap where hotspots represent tiles where the average Phred score on that tile in that cycle was lower than the average across all tiles for the same sequencing cycle. A good plot would therefore be blank.
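The normalisation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code FastQC uses: it assumes read IDs in the common Illumina layout `@instrument:run:flowcell:lane:tile:x:y`, Phred+33 encoded qualities, and well-formed four-line FASTQ records. The function names `fastq_records`, `tile_of` and `per_tile_deviation` are hypothetical.

```python
from collections import defaultdict


def fastq_records(lines):
    """Yield (header, quality) pairs from an iterable of FASTQ lines.

    Assumes well-formed records of exactly four lines each.
    """
    it = iter(lines)
    for header in it:
        next(it)            # sequence line (not needed here)
        next(it)            # '+' separator line
        qual = next(it)
        yield header.rstrip("\n"), qual.rstrip("\n")


def tile_of(header):
    """Return the tile field from a read header.

    Assumed layout: @instrument:run:flowcell:lane:tile:x:y [comment]
    """
    return header.split()[0].split(":")[4]


def per_tile_deviation(records, offset=33):
    """Return {(tile, cycle): tile mean quality - mean over all tiles}.

    Negative values mark tiles that are worse than average in that cycle.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for header, qual in records:
        tile = tile_of(header)
        for cycle, char in enumerate(qual, start=1):
            sums[(tile, cycle)] += ord(char) - offset
            counts[(tile, cycle)] += 1

    # Mean quality for every tile in every cycle
    tile_means = {key: sums[key] / counts[key] for key in sums}

    # Mean of the per-tile means for each cycle
    by_cycle = defaultdict(list)
    for (tile, cycle), mean in tile_means.items():
        by_cycle[cycle].append(mean)
    cycle_means = {c: sum(v) / len(v) for c, v in by_cycle.items()}

    return {(t, c): m - cycle_means[c] for (t, c), m in tile_means.items()}
```

Plotting the returned values with tiles on one axis and cycles on the other gives the kind of heatmap described above. In this sketch a run with no positional problems yields values close to zero everywhere.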
There are a few different ways in which problems can manifest themselves.
Firstly you can have a seemingly random loss of quality at different positions and cycles:
This is usually indicative of a generalised problem with the run, and with fragility in the base calls. The most common reason for this type of failure would be overloading of the flow cell.
A related problem can appear when you have a broad loss of quality over four areas of the flowcell. This type of patterning generally occurs when the overall quality of the run is somewhat low, but the flowcell is not overloaded. A common cause would be runs with very biased sequence composition. The highlighted areas are the edges of the flowcell, where the ability of the imaging system to read the signal is somewhat degraded. The loss of quality in these cases is not normally terrible, and the data is often usable. Please see the separate article on biased composition runs for specific discussion of this issue.
Secondly you can have a quality loss in specific areas which is not present from the start but remains for the remainder of the run.
These types of issue normally represent an obstruction to the imaging system – something as simple as dirt falling on the surface of the flowcell, or something washing through the flowcell and getting stuck. You often see them occurring in pairs as any obstruction would affect both the top and bottom swaths. Sequences from these areas will usually be cleaned by quality trimming.
Finally, the most problematic version of this issue is a temporary loss of quality over a restricted area. This normally occurs from something washing through into the flowcell, obstructing a few cycles, and then washing out.
The problem here is that, because the loss of quality is not at the end of the reads, the poor quality data is not removed by trimming. The most common cause of this type of error is air bubbles becoming trapped in the flowcell. These also cause a secondary issue: an air bubble not only stops the imaging from working correctly, it also prevents the sequencing reagents from getting to the clusters under the bubble. The effect is that clusters under a bubble can skip sequencing chemistry cycles, and the last base read before the bubble arrived is read repeatedly. When the bubble washes out, sequencing resumes from the template position it had reached when the bubble arrived. This means that the sequences are artificially extended, so that the reads appear to contain insertions.
If these reads are being used for variant detection then these artefactual insertions can confuse the interpretation of downstream results.
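To make the mechanism concrete, here is a small idealised simulation of a single cluster, assuming the chemistry stalls completely for a fixed number of cycles and that the imaging system keeps re-reading the last incorporated base. Real base calls under a bubble would also be noisy and of low quality, which this sketch ignores. The function name `read_with_bubble` and its parameters are hypothetical.

```python
def read_with_bubble(template, bubble_start, bubble_cycles, read_length):
    """Return the base calls for one cluster when chemistry stalls.

    template      -- the true sequence of the cluster
    bubble_start  -- first affected cycle (1-based, must be >= 2)
    bubble_cycles -- number of consecutive cycles the bubble covers
    read_length   -- total number of sequencing cycles
    """
    if bubble_start < 2:
        raise ValueError("at least one base must be read before the bubble")

    calls = []
    pos = 0  # index of the next template base to be incorporated
    for cycle in range(1, read_length + 1):
        stalled = bubble_start <= cycle < bubble_start + bubble_cycles
        if stalled:
            # No chemistry this cycle: the previous base is read again
            calls.append(calls[-1])
        else:
            calls.append(template[pos])
            pos += 1
    return "".join(calls)


template = "ACGTACGTAC"
# Bubble covers cycles 4-6 of a 10-cycle read
print(read_with_bubble(template, bubble_start=4, bubble_cycles=3, read_length=10))
# prints ACGGGGTACG
```

Aligned against the template, the read `ACGGGGTACG` looks like `ACG` followed by a three-base `GGG` insertion and then the expected `TACG`, which is the artefactual insertion pattern described above.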
Mitigation
If you can see a loss of quality in specific tiles of the flowcell then it is possible to remove these tiles from the downstream analysis. The Illumina processing pipeline has the ability to exclude specific tiles during processing, and the tile position is recorded in the ID line of each sequence in a native Illumina FASTQ file, so you can also filter the data at that level.
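Filtering at the FASTQ level could look something like the sketch below. It assumes read IDs in the common Illumina layout `@instrument:run:flowcell:lane:tile:x:y` and well-formed four-line records. The function name `drop_tiles` is hypothetical, and tiles are given as (lane, tile) pairs because tile numbers repeat across lanes.

```python
def drop_tiles(lines, bad_tiles):
    """Yield FASTQ lines, skipping reads that come from unwanted tiles.

    lines     -- iterable of FASTQ lines (four lines per read)
    bad_tiles -- iterable of (lane, tile) pairs, e.g. {("1", "2104")}
    """
    bad = {(str(lane), str(tile)) for lane, tile in bad_tiles}
    it = iter(lines)
    for header in it:
        # Collect the full record: header, sequence, '+', quality
        record = [header, next(it), next(it), next(it)]
        # Assumed ID layout: @instrument:run:flowcell:lane:tile:x:y
        fields = header.split()[0].split(":")
        if (fields[3], fields[4]) not in bad:
            yield from record


# Example usage with file names that are placeholders:
# with open("run.fastq") as src, open("run.filtered.fastq", "w") as dst:
#     dst.writelines(drop_tiles(src, {("1", "2104")}))
```

For paired-end data the same filter can be applied to both files with the same tile list. Both mates of a pair carry the same lane and tile in their IDs, so the two files should stay in step.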
Prevention
Other than standard degassing and clean working procedures, there is little that can be done to avoid this type of problem occurring occasionally. Performing positional as well as general quality checks should allow you to spot runs in which this type of problem has occurred.
Lessons Learnt
Investigate even small subsets of data with noticeable quality losses since they have the potential to introduce significant biological noise into downstream analysis.
Software
The FastQC per-tile quality plot and the BamQC per-base indel plot will most often spot these types of error.
2 thoughts on "Position specific failures of flowcells"
John Longinotto
Great article! Really cleared-up in my mind how these sorts of errors can arise.
I think it is also worth noting that Base Quality Score Recalibration (BQSR) from GATK can also identify and correct these sorts of systematic biases, and should always be applied when calling variants. It also does more than just cycle checks 🙂
James Hadfield
Nice post. This highlights another reason why variants detected in Illumina sequencing data sets should be checked to see if they all come from the same flowcell/lane/tile. As sequencing depth increases the likelihood of having your whole experimental dataset coming from one flowcell increases, and this increases the risk that a positional artefact (non-extension under a bubble introducing insertions) looks real.