To reduce memory use in the de Bruijn graph step, neither ABySS nor SOAPdenovo takes advantage of paired-end information in the initial contig construction. This makes scaffolding an important step, with the following functions: 1) revise initial contigs if the contig sequence conflicts with paired-end mapping results; 2) link contigs by paired-end information to form scaffolds; 3) merge contigs by paired-end information and overlap to form longer contigs or 'scaftigs'. A scaftig is a continuous sequence formed by multiple initial contigs lined up in a scaffold with putative sequence overlaps. Initial contigs that form a scaftig but were not linked as one in the original de Bruijn graph can be explained by an overlap shorter than the minimum requirement ([k-1] bp) or by repetitive sequences that were either removed or cut out. With paired-end link support and the estimated overlap or distance between two initial contigs, it is possible to join them confidently through a smaller (<(k-1) bp) overlap, a recovered repeat sequence or a local assembly.
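The idea of joining two linked contigs through an overlap shorter than (k-1) bp can be sketched as follows. This is a minimal illustration only, not the procedure used by either assembler: the function name and the `min_overlap` and `tolerance` parameters are hypothetical, and the paired-end distance estimate is assumed to be given as a signed gap, with a negative value meaning an expected overlap.

```python
def join_with_short_overlap(left, right, est_gap, k, min_overlap=3, tolerance=5):
    """Try to merge two linked contigs whose overlap is shorter than k-1.

    est_gap is the paired-end estimate of the distance between the contigs;
    a negative value means the contigs are expected to overlap.
    Returns the merged sequence, or None if no consistent overlap is found.
    """
    if est_gap >= 0:
        return None  # contigs are expected to be separated, not overlapping
    expected = -est_gap
    best = None
    # Only overlaps below k-1 are tried: longer ones would already have
    # been joined in the de Bruijn graph.
    upper = min(k - 2, len(left), len(right))
    for size in range(min_overlap, upper + 1):
        if abs(size - expected) > tolerance:
            continue  # inconsistent with the paired-end estimate
        if left[-size:] == right[:size]:
            if best is None or abs(size - expected) < abs(best - expected):
                best = size
    if best is None:
        return None
    return left + right[best:]


merged = join_with_short_overlap("ACGTACGGAT", "GGATTTCA", est_gap=-4, k=8)
```

Requiring the overlap length to agree with the paired-end estimate is what makes a short, otherwise ambiguous overlap acceptable in this sketch.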
Although ABySS does not explicitly claim a scaffolding step, it has a second phase that utilises paired-end information to form scaftigs. The reads are first aligned to the initial contigs to create a linked contig set, with mispaired or misaligned links filtered out. Applying a minimum link number cut-off (five by default in ABySS) to each contig, the program searches through its linked contigs for single unique paths, and consistent paths are used for stitching potential scaftigs. A branch-and-bound method is applied to constrain exhaustive searches in complex repeat structures and so reduce unnecessary computation. In the first release of ABySS, the contig N50 increased from 860 bp in the initial contigs to 1,499 bp; the resulting contigs summed to 2.18 Gb and covered 68 per cent of the human genome. The updated version of ABySS adds an option, '-scaffold' (disabled by default), that could fill Ns (stretches of unknown nucleotides with estimated lengths) between two linked, but not overlapping, contigs. The performance of this scaffold assembly has not been reported, however.
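The link filtering and unique-path search described above can be illustrated with a simplified sketch. It assumes that read-pair support has already been counted per ordered contig pair. The function names and data layout are hypothetical, and the walk simply stops at any branch rather than performing the bounded search over alternative paths that ABySS applies.

```python
from collections import defaultdict


def filter_links(pair_counts, min_links=5):
    """Keep contig links supported by at least min_links read pairs."""
    graph = defaultdict(list)
    for (src, dst), count in pair_counts.items():
        if count >= min_links:
            graph[src].append(dst)
    return graph


def walk_unique_path(graph, start):
    """Follow links from start while each contig has exactly one successor."""
    path = [start]
    seen = {start}
    current = start
    while len(graph.get(current, [])) == 1:
        nxt = graph[current][0]
        if nxt in seen:  # stop on cycles
            break
        path.append(nxt)
        seen.add(nxt)
        current = nxt
    return path


# Hypothetical read-pair counts between contigs.
pair_counts = {
    ("c1", "c2"): 12,
    ("c2", "c3"): 7,
    ("c2", "c4"): 2,   # below the cut-off, so discarded
    ("c3", "c5"): 6,
    ("c3", "c6"): 9,   # c3 branches, so the walk stops there
}
graph = filter_links(pair_counts, min_links=5)
path = walk_unique_path(graph, "c1")
```

In this example the weakly supported link from c2 to c4 is removed by the cut-off, which leaves c2 with a single successor and lets the path extend to c3.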
SOAPdenovo has an explicit scaffolding step, in which it also aligns reads back to the initial contigs, creates a contig linkage graph and builds scaffolds. The process is similar to that of ABySS, apart from some different assembly parameter settings and the processing of repeats. SOAPdenovo defines contigs with multiple entries and multiple exits in the linkage graph as repeats and masks them during scaffolding, which means that it uses only unique sequences in the genome for scaffolding. An important step in SOAPdenovo is gap closure, which closes the internal gaps of scaffolds to build scaftigs. SOAPdenovo uses paired-end information to retrieve reads near a gap and performs overlap-based local assembly, trying to extend the contigs into the gap and, ideally, to close it. This could, in principle, resolve gaps caused by inadequate overlap and by repeats, as long as the initial scaffolding is accurate and the paired-end insert size can span them. Gap closure significantly improves the continuity of the assembly: the contig N50 increased from 886 bp to 4,611 bp (without paired-end reads from long-insert libraries), and the resulting contigs cover 85 per cent of the human genome. Note that in ABySS, genome coverage could reach 90 per cent without the 100 bp minimum length threshold for a valid contig, which suggests that the differences are mainly attributable to contigs smaller than 100 bp.
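The repeat definition used for masking can be made concrete with a small sketch. It assumes the linkage graph is given as a list of directed contig pairs. The function names are hypothetical, and this is an illustration of the stated rule rather than SOAPdenovo's actual implementation.

```python
from collections import defaultdict


def find_repeat_contigs(links):
    """Return contigs with more than one entry and more than one exit."""
    incoming = defaultdict(set)
    outgoing = defaultdict(set)
    for src, dst in links:
        outgoing[src].add(dst)
        incoming[dst].add(src)
    nodes = set(incoming) | set(outgoing)
    return {
        n for n in nodes
        if len(incoming.get(n, ())) > 1 and len(outgoing.get(n, ())) > 1
    }


def mask_repeats(links):
    """Drop every link that touches a repeat contig before scaffolding."""
    repeats = find_repeat_contigs(links)
    return [(s, d) for s, d in links if s not in repeats and d not in repeats]


# Hypothetical linkage graph: contig "r" has two entries and two exits.
links = [("a", "r"), ("b", "r"), ("r", "c"), ("r", "d"), ("c", "e")]
repeats = find_repeat_contigs(links)
unique_links = mask_repeats(links)
```

After masking, only links between unique contigs remain, which is why the gaps left by masked repeats have to be recovered later by gap closure.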
It should be emphasised that, in short-read assembly, connecting initial contigs to form scaftigs is more important than it is in conventional Sanger assembly. Theoretically, a repeat can be resolved either by long reads or by paired-end reads that span it. This is the underlying principle that allows short reads to be assembled into large genomes, as long as the insert sizes of the paired-end libraries fit the repeat characteristics of the genome. However, precise and sufficient use of paired-end reads then becomes the critical technical challenge in short-read assembly. How well the small gaps are filled could make a large difference to the results.
At this stage, SOAPdenovo outperforms ABySS in both contig length and genome coverage. As the two programs give broadly similar results in initial contig construction, a major reason for the difference could be SOAPdenovo's more intensive use of the data in scaffolding and gap closure. In addition, considering that ABySS could achieve 90 per cent genome coverage by including small contigs (length < 100 bp), pre-assembly read error correction in SOAPdenovo may also play an important role in improving the quality of contigs and the reliability of paired-end mapping, making the final outcome less fragile.
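The length comparison here rests on the contig N50 values quoted earlier. As a reminder of what that statistic measures, the sketch below computes N50 under its usual definition: the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The `min_length` parameter is an assumption added here to show how a minimum contig length threshold, such as the 100 bp cut-off mentioned above, changes the statistic. The contig lengths are made up for illustration.

```python
def n50(lengths, min_length=0):
    """Return the N50 of the contigs that are at least min_length long."""
    ordered = sorted((n for n in lengths if n >= min_length), reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # no contigs passed the threshold


# Hypothetical contig lengths in bp.
contigs = [50, 80, 200, 300]
n50_all = n50(contigs)
n50_filtered = n50(contigs, min_length=100)
```

Dropping short contigs raises N50 while lowering the total length that contributes to genome coverage, so the two measures need to be read together.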
The two genome assemblies presented in the SOAPdenovo paper [8] also show striking differences in scaffolding results, which are mainly explained by the use of multiple paired-end libraries with step-wise increasing insert sizes. Larger-insert paired-end reads can organise scaffolds of considerable size, as long as the shorter-insert ones have resolved the interleaving problem. This suggests that an appropriate strategy should be chosen to sequence and assemble a human genome.