Illumina Patterned Flow Cells Generate Duplicated Sequences

The latest Illumina sequencers – such as the HiSeq X, HiSeq 3000 and HiSeq 4000 – use patterned flow cells to discriminate between much more densely packed DNA clusters. While this technology substantially increases the number of reads generated per sequencing run, it may also lead to an increased number of duplicates, thereby negating the improved yield and potentially making subsequent data analysis more difficult. Further investigation shows that these putative sequencing duplicates are generally in close two-dimensional proximity on the flow cell, which may provide an opportunity to develop bioinformatics solutions to identify and discard such artefacts.
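As a rough illustration of how flow cell proximity could be used to spot such duplicates, the sketch below parses the lane, tile and x/y coordinates from standard Illumina read names and flags reads that share a sequence with an earlier read on the same tile within a given distance. The function names and the `max_distance` threshold are hypothetical, and identical sequence is used here only as a simple stand-in for identical mapping position.

```python
from collections import defaultdict
from math import hypot


def parse_read_name(name):
    """Extract (lane, tile, x, y) from an Illumina read name.

    Assumes the colon-separated layout
    instrument:run:flowcell:lane:tile:x:y, optionally followed by a
    space and further fields.
    """
    fields = name.lstrip("@").split()[0].split(":")
    lane, tile, x, y = (int(value) for value in fields[3:7])
    return lane, tile, x, y


def flag_proximal_duplicates(reads, max_distance=2500):
    """Return names of reads that look like proximity duplicates.

    reads: iterable of (name, sequence) tuples.
    max_distance: hypothetical threshold in flow cell coordinate units.
    A read is flagged if an earlier, unflagged read with the same
    sequence lies on the same lane and tile within max_distance.
    """
    seen = defaultdict(list)  # (sequence, lane, tile) -> [(x, y), ...]
    flagged = []
    for name, sequence in reads:
        lane, tile, x, y = parse_read_name(name)
        key = (sequence, lane, tile)
        if any(hypot(x - px, y - py) <= max_distance for px, py in seen[key]):
            flagged.append(name)
        else:
            seen[key].append((x, y))
    return flagged
```

This version only compares reads within a single tile; duplicates that straddle a tile boundary would need extra handling.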

March 2, 2017 HiSeq, All Applications, Bowtie2, HiCUP, Picard

Soft-clipping of reads may add potentially unwanted alignments to repetitive regions

Soft-clipping of sequencing reads allows the masking of portions of the reads that do not align to the genome from end to end, which may be desirable for certain types of analysis (e.g. detection of structural variants). For standard alignment processes, however, soft-clipping may incorrectly trim reads and lead to the mis-assignment of reads, primarily to repetitive regions. This phenomenon appears to vary in severity between sequencing applications, with bisulfite sequencing being the worst affected.
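One simple way to see how much soft-clipping an aligner has applied is to inspect the CIGAR string of each alignment, where soft-clipped bases are recorded with the `S` operation. The sketch below is a minimal, illustrative parser; the function names are hypothetical and any cutoff for "too much clipping" is left to the reader.

```python
import re

# One CIGAR element: a length followed by a SAM operation code
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that consume bases of the read sequence stored in the record
QUERY_OPS = set("MIS=X")


def parse_cigar(cigar):
    """Return a list of (length, operation) tuples for a CIGAR string."""
    return [(int(length), op) for length, op in CIGAR_ELEMENT.findall(cigar)]


def soft_clipped_bases(cigar):
    """Return (left, right) soft-clipped lengths of an alignment."""
    # Hard clips (H) may sit outside the soft clips, so ignore them here
    ops = [element for element in parse_cigar(cigar) if element[1] != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


def soft_clipped_fraction(cigar):
    """Return the fraction of read bases that were soft-clipped."""
    read_length = sum(n for n, op in parse_cigar(cigar) if op in QUERY_OPS)
    if read_length == 0:
        return 0.0
    return sum(soft_clipped_bases(cigar)) / read_length
```

Summarising this fraction per library makes it easier to compare how strongly different applications are affected.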

May 16, 2016 All Applications, Bismark, Bowtie2, bwa-meth, HISAT2, SeqMonk, Trim Galore!

Illumina 2 colour chemistry can overcall high confidence G bases

With the introduction of the NextSeq system, Illumina changed the way their image data was acquired so that instead of capturing 4 images per cycle they needed only 2. This speeds up image acquisition significantly, but it also introduces a problem where high quality calls for G bases can be made where there is actually no signal on the flow cell.
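Because these spurious G calls carry high quality scores, ordinary quality trimming will not remove them. A minimal sketch of one alternative is to trim long runs of G from the 3′ end of a read regardless of quality. The function name and the `min_run` cutoff below are hypothetical choices for illustration only.

```python
def trim_trailing_g(sequence, quality, min_run=10):
    """Remove a run of G bases from the 3' end of a read.

    sequence, quality: the read and its quality string (same length).
    min_run: hypothetical minimum run length before trimming applies,
    so that genuine reads ending in a few Gs are left untouched.
    Returns the (possibly shortened) sequence and quality strings.
    """
    stripped = sequence.rstrip("G")
    run_length = len(sequence) - len(stripped)
    if run_length >= min_run:
        return stripped, quality[:len(stripped)]
    return sequence, quality
```

Note that a real G-rich 3′ end would be trimmed as well, so the cutoff trades lost signal against retained artefacts.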

May 4, 2016 NextSeq, All Applications, Cutadapt, FastQC

PBAT and single-cell (scBS-Seq) libraries may generate chimeric read pairs

Paired-end libraries generated by Post Bisulfite Adapter Tagging (PBAT) often suffer from poorer mapping efficiencies when compared to standard whole genome shotgun Bisulfite-Seq libraries. In addition to the usual suspects that have a detrimental impact on mapping efficiency we found that a substantial proportion of paired-end PBAT libraries appears to consist of chimeric reads that map to different places in the genome, not unlike Hi-C type experiments. Chimeric reads also affect single-cell libraries (scBS-seq) as they are constructed using a PBAT approach.
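To illustrate what "chimeric" means here, the sketch below classifies a read pair from the positions of its two mates. It assumes the mates have been aligned independently of each other; the `Alignment` type, the function name and the `max_separation` threshold are all hypothetical.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Mapped position of one mate (hypothetical minimal record)."""
    chrom: str
    pos: int


def is_chimeric_pair(mate1, mate2, max_separation=1000):
    """Return True if the two mates do not look like one fragment.

    A pair is called chimeric when the mates map to different
    chromosomes, or to the same chromosome but further apart than
    max_separation (a hypothetical upper bound on fragment size).
    """
    if mate1.chrom != mate2.chrom:
        return True
    return abs(mate1.pos - mate2.pos) > max_separation
```

Counting the pairs for which this returns True gives a quick estimate of how much of a library is affected.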

March 18, 2016 Illumina, Methylation, PBAT, Bismark, Cutadapt, SeqMonk, Trim Galore!

Mispriming in PBAT libraries causes methylation bias and poor mapping efficiencies

Random priming in PBAT libraries introduces drastic biases in the base composition and methylation levels especially at the 5′ end of all reads. As a result, affected bases should be removed from the libraries before the alignment step.

March 11, 2016 Illumina, Methylation, PBAT, BamQC, Bismark, FastQC, Trim Galore!

Barcode splitting doesn’t work as expected

When running multiple samples in the same sequencing reaction, the different libraries are usually created with unique sequence tags (barcodes) to allow the sequences from the lane to be separated by sample. Problems during this splitting are common and can have serious effects on downstream analysis.
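The splitting step itself can be sketched as matching each observed barcode against the expected ones while tolerating a small number of mismatches. The function names and the `max_mismatches` default are illustrative assumptions, not the behaviour of any particular demultiplexing tool.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(observed, expected, max_mismatches=1):
    """Assign an observed barcode to a sample.

    expected: dict mapping sample name -> barcode sequence.
    Returns the sample name, or None when no expected barcode is close
    enough or when more than one is (an ambiguous assignment).
    """
    hits = [
        sample
        for sample, barcode in expected.items()
        if len(barcode) == len(observed)
        and hamming(observed, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None
```

Tracking how many reads return `None`, and which unexpected barcodes are most common among them, is often the quickest way to notice that splitting has gone wrong.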

February 12, 2016 All Technologies, All Applications, Data Processing

Library end-repair reaction introduces methylation biases in paired-end (PE) Bisulfite-Seq applications

Library construction of standard directional BS-Seq samples often consists of several steps, including sonication, end-repair, A-tailing and adapter ligation. Since the end-repair step typically uses unmethylated cytosines for the fill-in reaction, the filled-in bases will generally appear unmethylated after bisulfite conversion, irrespective of their true genomic methylation state.

February 12, 2016 Illumina, BS-Seq, Methylation, Bismark, Data Processing

Deduplicating ambiguously mapped data makes it look like repeats are enriched

If a dataset contains some degree of technical duplication and has not been filtered to keep only uniquely mapping reads, then deduplicating it will make it look as if repeat sequences are enriched.
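The effect can be illustrated with a small calculation, assuming that the aligner places each ambiguously mapping read at random on one of several equivalent repeat copies. Duplicates of a unique fragment all land on the same position and collapse to one read, whereas duplicates of a repeat-derived fragment are scattered over different copies and partly survive deduplication. The function name and the uniform random placement are assumptions made for this sketch.

```python
def expected_reads_after_dedup(copies, duplicates):
    """Expected reads left after deduplication for one fragment.

    copies: number of equivalent genomic positions the fragment can
    map to (1 for a uniquely mapping fragment).
    duplicates: number of identical reads derived from the fragment.
    Assumes each read is placed uniformly at random on one copy, so the
    result is the expected number of distinct copies that get a read.
    """
    return copies * (1 - (1 - 1 / copies) ** duplicates)


unique = expected_reads_after_dedup(copies=1, duplicates=4)
repeat = expected_reads_after_dedup(copies=10, duplicates=4)
print(f"unique: {unique:.2f} reads, 10-copy repeat: {repeat:.2f} reads")
```

With 4-fold duplication, a unique fragment is reduced to 1 read while a fragment from a 10-copy repeat keeps about 3.4 reads on average, which shows up as an apparent enrichment over repeats.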

February 11, 2016 ChIP-Seq, Data Processing, SeqMonk

Read-through adapters can appear at the ends of sequencing reads

Many sequencing platforms require the addition of specific adapter sequences to the end of the fragments to be sequenced. For an individual fragment, if the length of the sequencing read is longer than the fragment to be sequenced then the read will continue into the adapter sequence on the end. Unless it is removed this adapter sequence will cause problems for downstream mapping, assembly or other analysis.
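A minimal sketch of how read-through adapter sequence can be removed is shown below. It looks for the first position at which the rest of the read matches the start of the adapter, which covers both a complete adapter inside the read and a partial adapter at the very end. Only exact matches are handled, and the function name and `min_overlap` default are hypothetical; dedicated trimming tools also tolerate sequencing errors.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove 3' adapter read-through from a read (exact matching only).

    read: the sequenced bases.
    adapter: the adapter sequence expected after the fragment.
    min_overlap: hypothetical minimum number of adapter bases that must
    be present at the end of the read before it is trimmed.
    Returns the read truncated at the adapter start, or unchanged.
    """
    for start in range(len(read) - min_overlap + 1):
        tail = read[start:]
        # Full adapter inside the read, or a partial adapter at the end
        if adapter.startswith(tail[:len(adapter)]):
            return read[:start]
    return read


# Example using the start of the common Illumina adapter sequence
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTT", "AGATCGGAAGAGC"))
```

A low `min_overlap` removes more partial adapters but also trims some genuine sequence that happens to match the adapter start by chance.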

February 7, 2016 Cutadapt, FastQC, Skewer, Trim Galore!