De Novo Metagenome Assembly of Short Read Sequencing
List of Assemblers
From Table 1 of "New approaches for metagenome assembly with short reads". Popular tools are in bold.
Tool | Method | Key concepts | Reference |
---|---|---|---|
SPAdes and metaSPAdes | dBg | SPAdes started out as a tool for resolving uneven coverage in single-cell genome data; metaSPAdes builds a metagenome-specific pipeline on top of SPAdes. Uses multiple kmer sizes for the dBg, starting with the lowest kmer size and adding hypothetical kmers (preferably of the smallest useful size) to connect the graph. | Bankevich et al. 2012 [18]; Nurk et al. 2017 [43] |
MEGAHIT | dBg | Solid kmers (occurring more often than a set threshold) and mercy kmers (the remainder); mercy kmers that occur between two solid kmers in a read are kept; builds a succinct dBg (dBg with Burrows-Wheeler Transform); removes tips and bubbles, and progressively removes edges with low local coverage; with increasing kmer size, extracts kmers from contigs and reads and builds the next graph. | Li et al. 2015 [22]; Li et al. 2016 [39] |
IDBA-UD | dBg | Build graph; remove dead ends (<2 k-1); merge bubbles; break graph on progressive (local) depth; error correction in reads (map reads to confident contigs; reads which match in all but a few bases can be ‘corrected’ to map perfectly); use mate pair info to build a ‘local’ assembly, avoiding repeats and chimeras; hold trivial contigs, remove reads; make next graph; after k_max, partition graph and clip tips based on progressive (local) depth; paired-end reads require long contigs to be effective. | Peng et al. 2012 [27] |
BBAP | OLC | Blast-based overlap assembly, with optional intermediary assembly stage. | Lin et al. 2017 [33] |
Genovo | OLC | Generative probabilistic model; applies a series of hill-climbing steps iteratively until convergence; randomly (CRP prior) picks a contig to align read ‘i’ to; breaks up chimeric contigs by taking the edge reads off contigs every ~5 iterations. | Laserson et al. 2011 [34]; Afiahayati et al. 2013 [35] |
IVA (iterative virus assembler) | OLC | Aimed at viruses. Greedy kmer-based extension. The most abundant kmer in the set is used as a seed, and this seed is grown out using a read that perfectly maps to it. A new kmer is drawn from the prefix of this read, which must be much more abundant than any other of the same size and occur more than 10 times in the data set. | Hunt et al. 2015 [36] |
MAP | OLC | Reads are filtered before overlap (reducing the number of pairwise alignments made); simple paths are found first; mate pair support is used to simplify paths; edges with contradictory or insufficient mate pair support are removed. | Lai et al. 2012 [37] |
MegaGTA | dBg | Guided assembly targeting specific genes. Employs HMM profile model, iterative kmers and succinct dBg. | Li et al. 2017 [38] |
MetaVelvet | dBg | dBg is first built with Velvet; population structure is estimated from the coverage of nodes (Poisson distributions); the dBg is partitioned into hypothetical subgraphs (possibly different species) using the coverage peaks as a guide; only nodes from the primary distribution are considered, and chimeric and repeat contigs are identified and split using paired-end info and coverage differences. An assembly is produced for the primary distribution; the procedure is repeated for the next. | Namiki et al. 2012 [40] |
MetaVelvet-SL | dBg | Similar to MetaVelvet, but chimeric contigs are identified using an SVM trained on (paired ends, coverage, contig lengths) for each dinucleotide (AA, AT, ..., GG); a training set is generated from a similar population, the SVM is trained on this, then passed over the dBg for decomposition. | Afiahayati et al. 2015 [30] |
Omega | OLC | Read prefix/suffix (+/−) are stored in hashes; graph is built of V(r); simple paths (1 in, 1 out) are contracted, and transitive edges are reduced; tips removed (<10r) and bubbles are removed (hold edges with more r); minimum cost flow analysis for short (<1000 bp) contigs; Mate pair inserts are estimated from the assembly now, used to support contigs; scaffolding with long mate pair reads; remaining unresolved contigs are merged on similar coverage. | Haider et al. 2014 [28] |
PRICE | Hybrid | Reads are ‘collapsed’ if identical, then if near identical; a (single-strand) dBg is then essentially used to assemble by greedy walking, starting at the highest coverage; identical contigs are collapsed, then near-identical contigs (ungapped) and finally gapped. | Ruby et al. 2013 [31] |
Ray Meta | dBg | Extension of Ray: no graph partitioning is performed; does not use a single peak for kmer coverage (min and peak coverage are specific to each seed path); heuristics-based graph traversal; graph is coloured according to an expected taxonomic profile. | Boisvert et al. 2012 [29] |
SAVAGE | OLC | Aimed at viral quasi-species recovery. Strict overlap conditions reproduce quasi-species assembly with minimal misassemblies. | Baaijens et al. 2017 [41] |
Snowball | Iterative joining | Guided assembly targeting specific genes. Overlapping paired-end reads are merged, then assigned to profile domains. Consensus reads are assembled for each domain by iterative joining. | Gregor et al. 2016 [42] |
VICUNA | Overlap | A min-hash algorithm based on a pairwise genetic distance threshold: inexact matching first (reads with similar or identical hashes are merged), then string matching of the prefix/suffix of hashes; (optional) target-like reads are kept first (similar reads are binned, and the similarity of the bin is used), everything else is removed. | Yang et al. 2012 [24] |
Xander | dBg | Guided assembly targeting specific genes. Employs HMM profile model. | Wang et al. 2015 [44] |
dBg = De Bruijn Graph
OLC = Overlap Layout Consensus
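As a rough illustration of the dBg entries in the table, the sketch below builds a de Bruijn graph from reads, keeping only kmers that occur more often than a threshold. This is a simplified stand-in for the "solid kmer" idea described for MEGAHIT; mercy kmers, the succinct representation and graph cleaning are not modelled. The function names, the `threshold` parameter and the toy reads are assumptions made for this example and do not come from any of the listed tools.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every kmer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_dbg(reads, k, threshold=0):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are kmers.

    Only kmers seen more than `threshold` times are kept
    (hypothetical parameter, a simplified "solid kmer" filter).
    """
    graph = defaultdict(set)
    for kmer, n in count_kmers(reads, k).items():
        if n > threshold:
            # Edge from the kmer's (k-1)-prefix to its (k-1)-suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Toy reads (made up for illustration).
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = build_dbg(reads, k=4, threshold=1)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

With `threshold=1`, kmers seen only once (for example those at the ends of the toy reads) are dropped, which is the kind of filtering that removes many sequencing errors but can also discard genuine low-coverage sequence.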
Refer to Langmead's course for additional background reading on assembly.
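For the OLC entries in the table, the overlap step can be sketched as a naive all-pairs search for suffix-prefix matches between reads. This is a minimal sketch under stated assumptions: exact matches only, no reverse complements, and quadratic time in the number of reads. The names `overlap`, `overlap_graph` and `min_length`, and the toy reads, are hypothetical choices for this example rather than the interface of any listed assembler.

```python
from itertools import permutations


def overlap(a, b, min_length=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_length` exists.
    """
    start = 0
    while True:
        # Look for the first min_length bases of b somewhere in a.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        # Check whether the rest of a from here is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def overlap_graph(reads, min_length=3):
    """Map each ordered read pair (a, b) to its overlap length, if any."""
    edges = {}
    for a, b in permutations(reads, 2):
        length = overlap(a, b, min_length)
        if length > 0:
            edges[(a, b)] = length
    return edges


# Toy reads (made up for illustration).
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
edges = overlap_graph(reads, min_length=4)
for (a, b), length in edges.items():
    print(a, "->", b, "overlap", length)
```

The layout and consensus stages would then walk this graph and merge the overlapping reads; those steps are omitted here.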