A single lane on an Illumina sequencer can generate in excess of 200 million reads – far in excess of what is needed for many common sequencing applications. It is therefore common to put multiple samples in the same lane, and to use barcode sequences to be able to determine the original source for each read so you can separate them later on.
The most common use of barcoding is to mix together multiple samples from the same experiment, where all of the samples will be of the same type and from the same species, but we have frequently seen cases where people want to mix very different libraries together. Whilst we have strongly recommended against this for a long time (and don’t even allow samples to be annotated that way in our LIMS), people persist in doing this so it seemed worthwhile illustrating why this is a bad idea.
The root problem here is that barcoding is not a foolproof technology. Even with stringent separation criteria (all barcodes must be read exactly correctly over their entire length), you will still get a small amount of cross contamination between the libraries within a single lane. Whilst this will be a randomly selected subset of reads and will represent only a fraction of a percent of the total data, this can still have a devastating effect on the analysis of some library types and is very difficult to fix retrospectively.
This issue is well illustrated in by the following genome view of two ChIP-Seq lanes which are ChIPing for the same factor.
On the face of it it looks like the top sample has a number of extra peaks which would appear to be novel binding events. A closer examination however makes these novel peaks appear somewhat suspicious. In particular you can see that the novel peaks are directional (red = top strand, blue = bottom strand), whereas ChIP peaks would normally not be expected to exhibit any directionality.
If we look more closely at one of the peaks then things get more suspicious.
Now we look more closely it becomes very obvious that this is an issue of contamination. The extra peaks sit within the exons of transcripts and their directionality is always opposite to the direction of the feature. Our suspicion therefore would be that this ChIP sample has become contaminated with a reverse strand specific RNA-Seq library. Looking at the sequencer annotation we find that this is exactly what has happened – the user had mixed ChIP and RNA-Seq samples in the same lane and even though the libraries hadn’t seen each other until they were mixed within the flowcell, we still see this mis-annotation of reads.
The level of cross contamination seen here is very high compared to what you’d normally expect, but in this case it was exacerbated by a secondary effect – this was a highly duplicated library, so the user had deduplicated their data. The deduplication had removed a large proportion of the ChIP library, but the contaminating RNA-Seq data was not duplicated and wasn’t affected by the deduplication – amplifying the effect of the contamination. The same sort of effect will happen in other libraries though but may not be as obviously visible.
There is no way to completely prevent mis-annotation of barcoded reads in a sample. There are enough cases where you see a perfect, but incorrect, barcode that there’s nothing you can do to remove these. This may be an optical issue (mismatching the reads from barcodes and inserts), or some kind of internal recombination the flowcell, but really the exact source doesn’t matter for the purposes of this article.
The mitigation here is to make sure that the way you mix your libraries together will mean that it won’t matter if a very small percent of reads leak from one library to another. Generally this means keeping the same library type in the same lane.
If you have two similar libraries (genomic libraries for example) in the same lane and a tiny proportion of crossover occurs then since the overall distribution of reads between the two libraries is likely to be similar the proportional effect on any given position within the libraries will be negligible. However if your libraries have strong biases in their composition (RNA-Seq for example) then a random selection of reads from these will not be randomly distributed over the genome. A highly expressed gene in RNA-Seq could potentially have millions of reads over it, so that tens or hundreds of these reads could end up cross contaminating other libraries, making it look like something interesting was happening at this position.
To illustrate this I took a pretty normal RNA-Seq library with ~20million reads in it and randomly select 2000 reads from it (that’s around 0.01% of the data – easily within the range of mis-annotated reads). You can see that the reads are not uniformly distributed over the whole genome but preferentially fall within a few regions which relate to the most highly expressed genes.
Looking in more detail at a single locus you can see that although the absolute number of reads is not enormous, if these were overlaid on a flat background they would appear to be a significant local enrichment. For enrichment seeking library types such as ChIP this could easily be misinterpreted, and if you were looking for rare events in a SNP detection or something sensitive such as a Hi-C experiment then this could appear to be a hugely enriched signal.
Sequencers aren’t perfectly clean in their ability to separate barcode mixes put into them. You can minimise the effect of cross-contamination by only mixing samples from within the same experiment, but even then you might want to consider whether your results could be biased by the addition of a small number of randomly selected reads from another library in the same lane.