Contamination with a different species you can guess

One of the biggest problems with sequencing libraries is that the material might be contaminated with something unexpected. One of the simplest forms of contamination is material from a different species than expected. In many cases the rogue material will come from a species you can guess, based on the other species commonly used in your lab. Screening for this type of contamination will help you spot contaminated samples, and can also reveal when samples have been switched completely.

Introduction

One common mode of failure for sequencing experiments is poor alignment efficiency when mapping the sequences to the expected reference. There are many potential reasons for this, but one of the most common is that a significant proportion (or possibly all) of the library comes from a species other than the expected one. When material is from the wrong species, or is contaminated with a secondary species, the contaminating species is often one which is commonly used in the lab which created the library. It can therefore be useful to screen libraries against a range of commonly used species to determine how well the library matches what is expected.

The Symptoms

Often the first symptom observed is a loss of mapping efficiency, but sometimes even the wrong species can map with reasonable efficiency, and problems are not revealed until later in the analysis. Some pre-emptive QC on the libraries is therefore useful.

In a simple case you should see the vast majority of the sequences mapping to the expected species.

[Figure: good_fastq_screen]

This plot also illustrates that it can be useful to distinguish uniquely mapped from multi-mapped reads. In this case the library is supposed to be mouse, but around 20% maps to rat due to the overall similarity between the species. Separating uniquely and multiply mapped sequences allows us to quickly see that the rat-mapped sequence is all multi-mapped and probably also maps to mouse. In an extreme case you could see what looks like reasonable mapping to the wrong species if the library is composed of sequences which are very easy to map (low complexity or short sequences, for example).

[Figure: multi_mapping_fastq_screen]
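The idea of separating reads which hit only one species from reads which also hit others can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not how FastQ Screen is implemented: the function name `screen_summary`, the input layout (one dictionary of alignment counts per read) and the example numbers are all assumptions made up for this sketch.

```python
from collections import Counter


def screen_summary(reads):
    """Count, per genome, reads hitting only that genome vs. other genomes too.

    `reads` is a hypothetical input: one dict per read, mapping a genome
    name to the number of alignments that read produced against it.
    """
    summary = {}
    for hits in reads:
        genomes_hit = [genome for genome, n in hits.items() if n > 0]
        for genome in genomes_hit:
            if len(genomes_hit) == 1:
                key = "only_this_genome"
            else:
                key = "also_other_genomes"
            summary.setdefault(genome, Counter())[key] += 1
    return summary


# Made-up example: 8 reads hit mouse only, 2 reads hit both mouse and rat.
example_reads = [{"Mouse": 1, "Rat": 0}] * 8 + [{"Mouse": 1, "Rat": 1}] * 2
summary = screen_summary(example_reads)

for genome, counts in summary.items():
    print(genome, dict(counts))
```

In this toy input every rat hit also hits mouse, so rat has no reads of its own. That is the pattern you would expect from cross mapping between similar species rather than from real rat material.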

Where there is significant contamination or a sample switch, this type of screening will not only identify the problem, but will also provide a clue as to the source of the contamination. It will also tell you whether the library is completely the wrong species, suggesting a sample switch, or only partially contaminated. The example below was supposed to be a mouse library, but can clearly be seen to be mostly E. coli with some human sequence as well.

[Figure: contaminated_fastq_screen]
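The distinction between a sample switch and partial contamination could be turned into a rough automated check along the following lines. This is a minimal sketch under stated assumptions: the function `interpret_screen`, its two thresholds and the percentages in the example are hypothetical and were chosen only for illustration. Sensible cutoffs will depend on the library type and the species screened.

```python
def interpret_screen(expected, unique_pct, min_expected=50.0, min_contaminant=5.0):
    """Give a rough verdict from per-species uniquely mapped percentages.

    `unique_pct` maps a species name to the percentage of screened reads
    mapping uniquely to it. Both thresholds are arbitrary assumptions.
    """
    expected_pct = unique_pct.get(expected, 0.0)
    # Other species with enough unique reads to be worth reporting,
    # strongest first.
    contaminants = sorted(
        (s for s, pct in unique_pct.items()
         if s != expected and pct >= min_contaminant),
        key=lambda s: unique_pct[s],
        reverse=True,
    )
    if contaminants and expected_pct < min_expected:
        verdict = "possible sample switch"
    elif contaminants:
        verdict = "partial contamination"
    elif expected_pct >= min_expected:
        verdict = "as expected"
    else:
        # Poor mapping, but none of the screened species explains it.
        verdict = "unexplained"
    return verdict, contaminants


# Illustrative numbers only, loosely shaped like the example above.
print(interpret_screen("Mouse", {"Mouse": 1.0, "E. coli": 70.0, "Human": 10.0}))
```

The "unexplained" outcome matters: a library can map poorly without matching any of the species you guessed, in which case a screen against a fixed panel will not identify the source.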

Lessons Learnt

It’s always worth investigating the whole content of your library. Even with partial contamination, some reads from the contaminant will cross-map to the species you expect, and this can be difficult to spot unless this type of screening is run.

Software

The plots shown in this article were created with FastQ Screen, a program which screens a proportion of your library against a set of species which you select and can customise to match the environment you’re working in.
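Screening only a proportion of a library keeps the check quick. As a rough illustration of that idea, the sketch below keeps a random fraction of the records in a FASTQ file. It is not taken from FastQ Screen itself: the helper names `read_fastq` and `subsample_fastq` are hypothetical, and it assumes plain four-line FASTQ records with no wrapped sequence lines.

```python
import io
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as lists of four lines (assumes unwrapped records)."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def subsample_fastq(handle, fraction, seed=0):
    """Yield each record with probability `fraction` (0.0 to 1.0)."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    for record in read_fastq(handle):
        if rng.random() < fraction:
            yield record


# Small in-memory example standing in for a real file handle.
fake_fastq = "".join(f"@read{i}\nACGTACGT\n+\nIIIIIIII\n" for i in range(100))
subset = list(subsample_fastq(io.StringIO(fake_fastq), fraction=0.1))
print(f"kept {len(subset)} of 100 records")
```

For a real file you would pass an open text handle, for example one from `gzip.open(path, "rt")` for compressed data, and write the kept records out before screening them.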

February 1, 2016 Simon Andrews

Software : FastQ Screen

7 thoughts on "Contamination with a different species you can guess"

Peter

It’s also worth noting that, for example, short reads from E. coli will map to the human genome (even uniquely), if the MapQ score cutoff is allowed to be quite low. See Figure 7 in http://www.ncbi.nlm.nih.gov/pubmed/23872968.

March 3, 2016 at 1:17 pm

Phil Ewels

Yup, nice figure! Though they should also align to E. coli, so should show up with FastQ Screen. Plus the reads in that figure are on the short side – I haven’t seen many 20bp runs for a while now.
- March 3, 2016 at 1:32 pm

Peter

The chapter is mostly concerned with some specific types of experiments that can only produce short reads, e.g. CLIP or ancient DNA sequencing. In those cases, the choice of mapper and settings is much more important than in standard DNA-seq or RNA-seq.

March 3, 2016 at 1:56 pm

Simon Andrews

The heavily cross mapping example shown above is CLIP-Seq, which produces both short and biased (AT rich) sequences – they’re a bit of a nightmare to work with! This is why it’s really useful to map across a set of species and be able to see whether reads are mapping uniquely to a given species.
- March 3, 2016 at 1:58 pm
- Peter
  
  Maybe you can try BWA-PSSM, which is particularly good for mapping reads with funny error models, like from CLIP. http://bwa-pssm.binf.ku.dk/
  - March 5, 2016 at 4:05 pm
  - Simon Andrews
    
    For these types of super-biased libraries I don’t think that ultimately any mapper is going to get a ‘correct’ answer to this. As soon as you have tens (or more) of 100% identical matches for a sequence then no amount of processing is really going to distinguish these. Overall, dealing with ambiguously mapping reads is, as far as I can see, the biggest problem remaining in read mapping, and the default inclusion of ambiguously mapping reads is the biggest source of artefacts we see in people’s analyses (hence the number of articles on this site which revolve around this topic!).
    - March 7, 2016 at 9:07 am

Stephan

To screen for contamination from species you can’t guess, you can use a k-mer-based approach as provided by Kraken (https://ccb.jhu.edu/software/kraken/). With Kraken you also have to provide a database of sequences to screen against, but this database can comprise thousands of genomes while lookup will still be very fast.

September 29, 2017 at 10:38 am
