One common mode of failure for sequencing experiments is poor alignment efficiency when mapping the sequence to the expected reference. There are many potential reasons for this, but one of the most common is that a significant proportion (or possibly all) of the library comes from a species other than the expected one. When material is from the wrong species, or is contaminated with a secondary species, the contaminating species is often one which is commonly used in the lab which created the library. It can therefore be useful to screen libraries against a range of commonly used species to determine the degree to which the library matches what is expected.
Often the initial symptom is a loss of mapping efficiency, but sometimes even the wrong species can map with reasonable efficiency and problems are not revealed until later in the analysis, so some pre-emptive QC on the libraries is useful.
In a simple case you should see the vast majority of the sequences mapping to the expected species.
This plot also illustrates that it can be useful to distinguish uniquely mapped from multi-mapped reads. In this case the library is supposed to be mouse, but around 20% maps to rat due to the overall similarity between the species. Separating uniquely and multiply mapped sequences allows us to see quickly that the rat-mapped sequence is all multi-mapped and probably also maps to mouse. In an extreme case you could see what looks like reasonable mapping to the wrong species if the library is composed of sequences which are very easy to map (low complexity or short sequences, for example).
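The distinction between species-specific and shared hits can be sketched in a few lines of Python. This is only an illustration of the idea, not how FastQ Screen is implemented: the function names, the category labels and the input format (a dictionary of alignment counts per species for each read) are all assumptions made for the example.

```python
from collections import Counter

def classify_read(hits_per_species):
    """Classify one read from its alignment counts per species.

    hits_per_species is a hypothetical input: {species_name: number_of_alignments}.
    """
    matched = {s: n for s, n in hits_per_species.items() if n > 0}
    if not matched:
        return "no_hits"
    if len(matched) > 1:
        # The read aligns to more than one species, so it cannot
        # tell those species apart.
        return "multiple_species"
    # Exactly one species matched: was it a single hit or several?
    (count,) = matched.values()
    return "unique_one_species" if count == 1 else "multi_hit_one_species"

def summarise(reads):
    """Tally, for each species, how its mapped reads break down by category."""
    summary = {}
    for hits in reads:
        category = classify_read(hits)
        for species, n in hits.items():
            if n > 0:
                summary.setdefault(species, Counter())[category] += 1
    return summary

# Toy data: 8 reads hit only mouse, 2 reads hit both mouse and rat.
reads = [{"mouse": 1, "rat": 0}] * 8 + [{"mouse": 3, "rat": 2}] * 2
for species, counts in summarise(reads).items():
    print(species, dict(counts))
```

In this toy data every read that hits rat also hits mouse, which is the pattern you would expect from cross mapping between similar genomes rather than from real rat material.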
Where there is significant contamination or a sample switch, this type of screening will not only identify the problem, but will also provide a clue as to the source of the contamination. It will also show whether the library is completely the wrong species, suggesting a sample switch, or only partially contaminated. The example below was supposed to be a mouse library, but can clearly be seen to be mostly E. coli, with some human sequence as well.
It’s always worth investigating the whole content of your library. Even with partial contamination, some of the contaminant's reads will cross map to the species you expect, and this can be difficult to spot if this type of screening is not run.
The plots shown in this article were created with FastQ Screen, a program which screens a proportion of your library against a set of species you select, and which you can customise to match the environment you’re working in.
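Screening only a proportion of the library keeps this kind of QC quick. As a rough sketch of how a random subset of reads could be drawn from a FASTQ file before mapping, the Python below uses reservoir sampling. It is not FastQ Screen's own code; the function names and the assumption of a simple four-line-per-record FASTQ file are made up for the example.

```python
import random

def read_fastq(handle):
    """Yield (header, sequence, separator, quality) tuples from a FASTQ stream.

    Assumes the simple layout of exactly four lines per record.
    """
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline()
        sep = handle.readline()
        qual = handle.readline()
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               sep.rstrip("\n"), qual.rstrip("\n"))

def subsample(records, k, seed=0):
    """Reservoir sampling: keep k records chosen uniformly from a stream.

    The stream is read once and only k records are held in memory.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir
```

A call such as `subsample(read_fastq(open("library.fastq")), 100000)` (the file name is a placeholder) would give a subset to map against each candidate genome. A fixed seed makes the subset reproducible between runs.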
7 thoughts on "Contamination with a different species you can guess"
It’s also worth noting that, for example, short reads from E. coli will map to the human genome (even uniquely), if the MapQ score cutoff is allowed to be quite low. See Figure 7 in http://www.ncbi.nlm.nih.gov/pubmed/23872968.
Yup, nice figure! Though they should also align to E. coli, so should show up with FastQ Screen. Plus the reads in that figure are on the short side; I haven’t seen many 20bp runs for a while now.
The chapter is mostly concerned with some specific types of experiments that can only produce short reads, e.g. CLIP or ancient DNA sequencing. In those cases, the choice of mapper and settings is much more important than in standard DNA-seq or RNA-seq.
The heavily cross mapping example shown above is CLIP-Seq, which produces both short and biased (AT rich) sequences – they’re a bit of a nightmare to work with! This is why it’s really useful to map across a set of species and be able to see whether reads are mapping uniquely to a given species.
Maybe you can try BWA-PSSM, which is particularly good for mapping reads with funny error models, like from CLIP. http://bwa-pssm.binf.ku.dk/
For these types of super-biased libraries I don’t think that ultimately any mapper is going to get a ‘correct’ answer to this. As soon as you have tens (or more) of 100% identical matches for a sequence then no amount of processing is really going to distinguish these. Overall, dealing with ambiguously mapping reads is, as far as I can see, the biggest problem remaining in read mapping, and the default inclusion of ambiguously mapping reads is the biggest source of artefacts we see in people’s analyses (hence the number of articles on this site which revolve around this topic!).
To screen for contamination from species you wouldn’t guess, you can use a k-mer-based approach such as that provided by Kraken (https://ccb.jhu.edu/software/kraken/). With Kraken you still have to provide a database of sequences to screen against, but this database can comprise thousands of genomes while lookup will still be very fast.