On most sequencing platforms each base call comes with a corresponding measure of the likelihood that the call is incorrect. The reported values are Phred scores, defined as -10 log10(p), so higher Phred scores equate to lower error probabilities. To allow Phred scores to be represented by a single character in file formats, an encoding scheme is used in which each numerical score is mapped to an ASCII character. Since using raw ASCII values in the normal range of Phred scores (roughly 1 – 60) would require a lot of non-printing characters, an offset is applied to the Phred values before the ASCII lookup so that more conventional characters are used.
On all recently generated data the offset used before the ASCII lookup is Phred+33, but historically two schemes were in use. Illumina used Phred+64, which puts the scores in the letter range of ASCII, whereas the Sanger Institute used Phred+33, which aligns with the start of the normal printing ASCII characters.
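The two offsets can be sketched as a simple lookup in either direction (a minimal illustration; the function names are just examples, not part of any standard library):

```python
# Sketch: encoding and decoding a Phred score under the two historical
# offsets. Phred = -10 * log10(p_error), so Phred 30 means a 1-in-1000
# chance the base call is wrong.

def phred_to_char(phred, offset=33):
    """Encode a numeric Phred score as a single ASCII character."""
    return chr(phred + offset)

def char_to_phred(c, offset=33):
    """Decode an ASCII quality character back to a numeric Phred score."""
    return ord(c) - offset

# The same score produces different characters under each scheme:
assert phred_to_char(30, offset=33) == "?"  # Sanger / Phred+33
assert phred_to_char(30, offset=64) == "^"  # old Illumina / Phred+64
```

This is also where misinterpretation bites: decoding a Phred+64 character with the Phred+33 offset inflates every score by 31.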
Many applications which use Phred scores in their analysis allow you to specify the encoding used in each file so that the qualities are correctly interpreted. Setting this value incorrectly can result in quality scores being wildly misinterpreted and consequently very poor results. We have seen many examples of datasets reported to have one encoding which actually use another, so relying on correct metadata annotation is not a guaranteed way to catch this problem.
The major sequence databases have standardised on Phred+33 as the encoding for all of their data, and will re-encode Phred+64 samples which are submitted to them. We have, however, seen examples where samples were initially entered with the wrong encoding, so Phred+64 encoded data persists in the repositories.
Depending on the analysis platform used, the only symptoms of this problem may be poor output from the pipeline – e.g. poor mapping efficiency, or SNP call rates which are too high or too low. Some software will be intelligent enough to recognise that the range of values it's seeing is unlikely to be correct, but this may only produce a warning rather than an error, so it can be easy to miss.
Most encoding mistakes are Phred+64 data which is interpreted as Phred+33. In this case the quality of the base calls is vastly over-estimated. In many cases you can work out that the encoding must be wrong from the range of observed values in the data. For example, in the figure above the reported encoding was Phred+33 but the range of observed values was only consistent with Phred+64. The approach is not foolproof though – because the Phred+33 and Phred+64 ranges overlap, it is possible to generate datasets which are compatible with both encoding schemes. This can happen with aggressively quality-trimmed Phred+33 datasets, where all values are either very good or very bad and the resulting range is compatible with both schemes.
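The range-based check described above can be sketched roughly as follows. The thresholds are assumptions for illustration: a quality character below ASCII 59 can only plausibly be Phred+33, while a minimum at or above ASCII 64 combined with characters beyond the usual Phred+33 ceiling suggests Phred+64; anything in between is ambiguous, as noted above.

```python
# A minimal sketch of the range-based heuristic for guessing the quality
# encoding. The cut-offs (59 and 64/74) are illustrative assumptions,
# not values from any specific tool.

def guess_encoding(quality_strings):
    """Guess the quality encoding from observed quality characters."""
    lo = min(min(ord(c) for c in q) for q in quality_strings)
    hi = max(max(ord(c) for c in q) for q in quality_strings)
    if lo < 59:
        return "phred+33"   # characters this low never occur in Phred+64
    if lo >= 64 and hi > 74:
        return "phred+64"   # high range only consistent with Phred+64
    return "ambiguous"      # overlapping range; either scheme fits

# Low-value characters like '#' and '!' pin the encoding to Phred+33:
assert guess_encoding(["IIII#!FF"]) == "phred+33"
# A uniformly high range points to Phred+64 instead:
assert guess_encoding(["ffffeeGG"]) == "phred+64"
```

Note that an aggressively trimmed Phred+33 dataset could land in the `"ambiguous"` branch, which is exactly the failure mode the paragraph above warns about.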
Some software will attempt to auto-detect the encoding used, so checking the detected encoding is a good idea. FastQC does not allow the user to specify the quality encoding, but works it out from the data. The detected encoding is reported at the top of the per-base quality plot: Phred+64 encoding is listed as Illumina 1.5 (or lower), and Phred+33 is listed as Sanger / Illumina 1.9.
As long as you can detect that this problem has occurred, you should be able to rerun the analysis with the appropriate encoding flags on the software you're using.
New datasets, especially those from public repositories, should be checked with standard QC software and the detected encoding compared with what was expected.
Don’t believe the reported quality encoding on datasets. Check the range of observed values.
SRR619473 in NCBI's Short Read Archive should have Phred+33 encoding (all SRA files should), but the range we found when we extracted it was clearly Phred+64. Note that this may eventually be fixed by NCBI, so re-check the encoding before using this dataset in any training.