Introduction
Illumina sequencing technology is based on sequencing by synthesis. Rather than sequencing the entirety of one sequence before moving on to another, all of the sequences in a run are sequenced simultaneously. Sequencing progresses by running a chemistry cycle, which adds a tagged base to the end of the molecules in each cluster (each cluster generates a single output sequence), followed by an imaging step in which the newly added base is read.
It is commonly observed that as the number of sequencing cycles increases, the average quality of the base calls, as reported by the Phred scores produced by the sequencer, falls. The rate at which this fall happens varies with the type of sequencer used, the version of the sequencing chemistry and the nature of the library being sequenced.
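For reference, the standard definition of a Phred score relates the score to the estimated probability that a base call is wrong. The symbols Q (the score) and p (the error probability) are notation introduced here for illustration:

```latex
Q = -10 \log_{10} p
\qquad\Longleftrightarrow\qquad
p = 10^{-Q/10}
```

So a call at Q30 carries an estimated 1 in 1,000 chance of being wrong, and a call at Q20 a 1 in 100 chance. Because the scale is logarithmic, a drop of 10 in Phred score along the read corresponds to a tenfold rise in the estimated error rate.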
The Symptoms
This degradation in base call quality is most often observed in a plot of Phred scores vs chemistry cycle. This plot is produced internally by the Illumina sequencing software, but it is also easily reproduced by many of the common QC packages.
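To give a feel for what sits behind such a plot, here is a minimal sketch that computes the mean Phred score at each cycle from FASTQ text. It is not how any particular QC package does it. It assumes four-line FASTQ records (no wrapped sequence lines) and Phred+33 quality encoding; the function name `mean_quality_per_cycle` and the two example reads are made up for illustration.

```python
import io
from collections import defaultdict


def mean_quality_per_cycle(fastq_handle, offset=33):
    """Return a list whose element i is the mean Phred score at cycle i + 1.

    Assumes four-line FASTQ records and an ASCII offset of 33 (Phred+33).
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line_number, line in enumerate(fastq_handle):
        # The quality string is the fourth line of every record.
        if line_number % 4 != 3:
            continue
        for cycle, char in enumerate(line.rstrip("\r\n")):
            totals[cycle] += ord(char) - offset
            counts[cycle] += 1
    return [totals[c] / counts[c] for c in sorted(counts)]


# Two invented reads: the second loses quality in its last two cycles.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\nII5+\n"
)
means = mean_quality_per_cycle(example)
print(means)  # [40.0, 40.0, 30.0, 25.0]
```

On a real run you would pass an open file handle instead of the in-memory example, and a QC package would normally show the spread of scores at each cycle rather than only the mean.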
Diagnosis
This loss of quality is really an expected side effect of the way the Illumina platform works. In the Illumina sequencing system you do not sequence a single molecule, but rather an ensemble of identical molecules called a cluster. Clusters are necessary in order to generate enough signal to be seen by the imaging system, but they also introduce the possibility of generating mixed signals.
When you run a chemistry cycle the assumption is that every molecule on the flowcell is extended by one base, but this isn’t actually true. Although most molecules will be extended, a very small proportion will fail to extend and will remain on the previous base. This means that after a few cycles of chemistry the signal coming from a cluster will be a mix of signal from the current base and some signal from the previous few bases. This effect is called phasing. Illumina have a system in place to try to detect and correct for it, but the correction can never be perfect, so as more cycles of sequencing are performed the signal from a cluster becomes more mixed and the ability to determine the correct base call diminishes, eventually to the point that the true signal is effectively lost.
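The way these small failures accumulate can be illustrated with a toy model. The sketch below assumes that every molecule independently misses its extension with the same fixed probability at every cycle, and it ignores both the software correction and any other source of signal loss. The function name `lag_distribution` and the 0.5% failure rate are illustrative assumptions, not measured Illumina values.

```python
def lag_distribution(cycles, p_lag):
    """Return a list where element k is the fraction of molecules in a
    cluster that are k bases behind after the given number of cycles.

    Toy model: each molecule misses an extension with probability p_lag
    at every cycle, independently of everything else.
    """
    dist = [1.0]  # before any chemistry, every molecule is in phase
    for _ in range(cycles):
        new = [0.0] * (len(dist) + 1)
        for lag, fraction in enumerate(dist):
            new[lag] += fraction * (1.0 - p_lag)  # extended: lag unchanged
            new[lag + 1] += fraction * p_lag      # missed: one base further behind
        dist = new
    return dist


P_LAG = 0.005  # hypothetical per-cycle failure rate, chosen for illustration

for n in (10, 50, 100, 150):
    in_phase = lag_distribution(n, P_LAG)[0]
    print(f"cycle {n:3d}: {in_phase:.3f} of the cluster reports the current base")
```

In this model the in-phase fraction after n cycles is simply (1 - p_lag) ** n, so even a failure rate of half a percent per cycle leaves only about 61% of the cluster on the current base by cycle 100. The rest of the signal comes from earlier bases, which is the mixing described above.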
Mitigation and Prevention
There isn’t really any workaround for this. The Illumina base calling software already has code in place to try to correct for this effect, and improvements in the sequencing chemistry have aimed to increase the fidelity of the extensions, but further progress will need to come from Illumina themselves.
Software
This effect can easily be detected with the FastQC "Per base sequence quality" plot.
3 thoughts on "Loss of base call accuracy with increasing sequencing cycles"
JH
I’d add that tools like BQSR account for machine cycle as one of many covariates to attempt to recalibrate base qualities.
dridk
Could you explain why the first 6 bases all have the same Phred score in every read?
Simon Andrews
I think it’s to do with the way that the Illumina system does its calibration. You can see that in the early stages of a run there are a couple of places where the scaling of base calls changes. The machine does an initial calibration from the first few cycles before it goes back to do the base calling, but it also seems to update the calibration a few bases in when it’s looked at some more data. I’d guess that the second calibration changes the dynamic range of the calls (the max Phred score seems to go up at that point certainly). The values aren’t completely fixed as a poor run will still have poor Phred scores at the start, but I suspect the calls are less sensitive to begin with.