Introduction

In any library generated by random fragmentation or random priming there should be no reason to expect the base composition in a given cycle to differ from that in any other.  The overall composition of the library might differ between runs (GC content of the source species, enrichment of certain bases, chemical modifications of bases etc.), but these factors should affect all cycles equally.  It is a general tenet of QC that clearly observable structure in datasets which are supposed to be unstructured should be a cause for concern, yet we often see fairly strong positional sequence biases at the start of random primed libraries.  So should this be a cause for concern?

The Symptoms

The problem described here is most clearly seen in a per-base sequence content plot.  A typical example of an RNA-Seq library affected by this issue is shown below:

[Figure: random_priming_bias, per-base sequence content plot of an affected RNA-Seq library]

You can clearly see the biased sequence in the first ~12 bases of the run. This bias then dissipates over the rest of the run which shows the expected parallel tracks in the base content for each base.  This happens in pretty much all RNA-Seq libraries to a greater or lesser extent.
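For readers who want to see what such a plot is built from, here is a minimal sketch of the per-position base percentages.  The function names (per_base_content, read_fastq_sequences) are made up for this illustration, and the FASTQ reader assumes an uncompressed file with four lines per record.  It is not how FastQC itself is implemented.

```python
from collections import Counter


def per_base_content(reads, bases="ACGT"):
    """Return one dict per read position giving the percentage of each base.

    Characters outside `bases` (for example N) are left out of the total.
    """
    if not reads:
        return []
    length = max(len(read) for read in reads)
    profile = []
    for pos in range(length):
        # Only reads long enough to cover this position contribute.
        counts = Counter(read[pos] for read in reads if len(read) > pos)
        total = sum(counts[b] for b in bases)
        profile.append(
            {b: (100.0 * counts[b] / total if total else 0.0) for b in bases}
        )
    return profile


def read_fastq_sequences(path):
    """Yield the sequence line of each record in a plain-text FASTQ file.

    Assumes strict four-line records (an assumption of this sketch).
    """
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip().upper()


# Toy reads: the first position is always G, the second is evenly mixed.
toy_reads = ["GACT", "GCTA", "GTAC", "GGGG"]
for position, percentages in enumerate(per_base_content(toy_reads), start=1):
    print(position, percentages)
```

In an unbiased library the four percentages would stay roughly constant across positions; in a random primed library the first dozen or so positions would be expected to wander before settling down.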

The cause of this bias turns out to be the random priming step in library production.  Priming should be driven by a pool of random hexamers which in theory are all present at equal frequency in the priming mix and all prime with equal efficiency.  In the real world this isn’t the case: certain hexamers are favoured during the priming step, resulting in the biased composition over the region of the library primed by the random primers.
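One way to look at hexamer preference directly is to compare how often each hexamer occurs at the very start of the reads with how often it occurs further into the same reads.  The sketch below does that.  The function names and the choice of interior window (interior_offset=24) are assumptions made for illustration, not a published method or part of any tool.

```python
from collections import Counter


def kmer_frequencies(reads, offset=0, k=6):
    """Relative frequency of each k-mer found at `offset` in the reads."""
    counts = Counter()
    for read in reads:
        kmer = read[offset:offset + k]
        # Skip reads that are too short or contain ambiguous bases here.
        if len(kmer) == k and set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def start_vs_interior_ratio(reads, interior_offset=24, k=6):
    """Ratio of read-start frequency to interior frequency per k-mer.

    Values well above 1 suggest a k-mer is favoured at the primed end.
    k-mers never seen in the interior window are omitted.
    """
    start = kmer_frequencies(reads, 0, k)
    interior = kmer_frequencies(reads, interior_offset, k)
    return {
        kmer: freq / interior[kmer]
        for kmer, freq in start.items()
        if kmer in interior
    }


# Toy reads of length 12, so the interior window starts at position 6.
toy_reads = ["AAAAAACCCCCC", "AAAAAAAAAAAA", "CCCCCCAAAAAA", "AAAAAACCCCCC"]
print(start_vs_interior_ratio(toy_reads, interior_offset=6))
```

With real data you would want many reads before trusting any individual ratio, since there are 4096 possible hexamers and rare ones will be noisy.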

The question then arises as to whether this bias has any implications for downstream analyses.  There are a couple of potential concerns:

  1. It’s possible that there is increased mis-priming as part of the bias, introducing an increased number of mis-called bases at the start of the sequence
  2. It’s possible that the selection bias introduced will have a significant effect on the ability of the library to fairly measure the content of the original RNA population, due to certain sequences being unduly favoured

In practice it seems that neither of these concerns is really a problem.  The bias at the start of the sequences appears to be the result of biased selection of fragments from the library rather than of base call errors, so the bases themselves are correct and high levels of predicted SNPs are not an issue.  The biased selection, though, doesn’t appear to be strong enough to cause major headaches in downstream quantitation of data.  A strong bias would result in very uneven coverage of different parts of a transcript based on its sequence content, and most RNA-Seq libraries do not show these types of localised biases (excepting biases from mappability and other factors beyond this effect).  Also, the biases are very similar between libraries, so any artefacts which were introduced should cancel out when doing any kind of differential analysis.

Diagnosis

This problem is most easily detected with the FastQC per-base sequence content plot.
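If you want to pull the numbers behind that plot out of a FastQC report rather than reading them off the image, the sketch below parses the relevant module from the text report.  It assumes the tab-separated layout of fastqc_data.txt as we understand it (a ">>Per base sequence content" line carrying the status, a "#Base" header row, and a ">>END_MODULE" terminator).  The function names and the imbalance measure are our own illustration and are not FastQC's scoring.

```python
def parse_per_base_content(lines):
    """Extract the per-base sequence content module from FastQC report lines.

    Returns (status, rows) where rows is a list of (position_label, {base: %}).
    Position labels are kept as strings because they may be ranges ("10-14").
    """
    status, header, rows, inside = None, None, [], False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">>Per base sequence content"):
            inside = True
            parts = line.split("\t")
            status = parts[1] if len(parts) > 1 else None
        elif inside and line.startswith(">>END_MODULE"):
            break
        elif inside and line.startswith("#"):
            # Read the column order from the file instead of hard-coding it.
            header = line.lstrip("#").split("\t")
        elif inside and line and header is not None:
            fields = line.split("\t")
            values = {b: float(v) for b, v in zip(header[1:], fields[1:])}
            rows.append((fields[0], values))
    return status, rows


def imbalance(values):
    """Larger of the A/T and G/C percentage differences at one position."""
    return max(abs(values["A"] - values["T"]), abs(values["G"] - values["C"]))


# Small inline example standing in for a real fastqc_data.txt file.
example = [
    ">>Per base sequence content\tfail",
    "#Base\tG\tA\tT\tC",
    "1\t40.0\t10.0\t30.0\t20.0",
    "10-14\t25.0\t25.0\t25.0\t25.0",
    ">>END_MODULE",
]
status, rows = parse_per_base_content(example)
for label, values in rows:
    print(label, imbalance(values))
```

For a real report you could pass an open file handle in place of the example list, since the function only needs an iterable of lines.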

Mitigation

People often suggest fixing this issue by 5′ trimming of the reads to remove the biased portion, but this is not a fix.  Since the biased composition is created by the selection of sequencing fragments and not by base call errors, the only effect of trimming would be to change from having a library which starts over biased positions to having a library which starts slightly downstream of biased positions.
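A toy model makes the point concrete.  In the sketch below a made-up transcript is "primed" only at positions carrying a favoured hexamer, and the resulting reads are then trimmed at the 5′ end.  The transcript, the favoured hexamer GGGACT and the function names are all invented for illustration, and real priming is a matter of degree rather than all-or-nothing.

```python
def biased_fragments(transcript, favoured, read_len, k=6):
    """Return (start, read) for each position whose k-mer is favoured."""
    return [
        (pos, transcript[pos:pos + read_len])
        for pos in range(len(transcript) - read_len + 1)
        if transcript[pos:pos + k] in favoured
    ]


def trim_5prime(fragments, n):
    """Remove n bases from the 5' end of each read, shifting its start."""
    return [(pos + n, read[n:]) for pos, read in fragments]


# Invented transcript containing the favoured hexamer at three positions.
transcript = (
    "GGGACT" + "TTTAAACCC" + "GGGACT" + "AAACCCTTT" + "GGGACT" + "CCCTTTAAATTT"
)
fragments = biased_fragments(transcript, {"GGGACT"}, read_len=12)
trimmed = trim_5prime(fragments, 6)

print("starts before trimming:", [pos for pos, _ in fragments])
print("starts after trimming: ", [pos for pos, _ in trimmed])
```

The trimmed reads no longer contain the favoured hexamer, but there are exactly as many of them and each one still comes from immediately downstream of a favoured site.  The selection that produced the library is unchanged.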

Prevention

Ultimately the only fix for this issue will be the introduction of new library preparation kits with a less bias-prone priming step.

Lessons Learnt

Whilst the warnings generated by this problem reflect a real issue, it’s not something which can be fixed, and it doesn’t seem to have any serious consequences for downstream analysis.  Ironically, if you are producing RNA-Seq libraries it would make for better QC to focus on libraries which didn’t have this artefact in them, as they would be the ones which were truly suspicious.

References

Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010 Jul;38(12):e131. doi: 10.1093/nar/gkq224. Epub 2010 Apr 14.

 

January 31, 2016
