De Novo Metagenome Assembly of Short Read Sequencing

List of Assemblers

From Table 1 of "New approaches for metagenome assembly with short reads". Popular tools are in bold.

Tool	Method	Key concepts	Reference
SPAdes and metaSPAdes	dBg	SPAdes started out as a tool for resolving uneven coverage in single-cell genome data; metaSPAdes builds a metagenome-specific pipeline on top of SPAdes. Uses multiple kmer sizes for the dBg, starting with the lowest kmer size and adding hypothetical kmers (preferably of the smallest useful size) to connect the graph.	Bankevich et al. 2012 [18]; Nurk et al. 2017 [43]
MEGAHIT	dBg	Kmers are split into solid kmers (those occurring more often than a set threshold) and mercy kmers (the remainder); mercy kmers that occur between two solid kmers in a read are kept; build a succinct dBg (dBg with Burrows-Wheeler Transform); remove tips and bubbles, and progressively remove edges with low local coverage; increase the kmer size, extract kmers from contigs and reads, and build the next graph.	Li et al. 2015 [22]; Li et al. 2016 [39]
IDBA-UD	dBg	Build graph; remove dead ends (<2 k-1); merge bubbles; break graph on progressive (local) depth; error correction in reads (map reads to confident contigs; reads which match in all but a few bases can be ‘corrected’ to map perfectly); use mate pair info to build a ‘local’ assembly, avoiding repeats and chimeras; hold trivial contigs, remove reads; make next graph; after k_max, partition the graph and clip tips based on progressive (local) depth. Paired-end reads require long contigs to be effective.	Peng et al. 2012 [27]
BBAP	OLC	Blast-based overlap assembly, with optional intermediary assembly stage.	Lin et al. 2017 [33]
Genovo	OLC	Generative probabilistic model; applies a series of hill-climbing steps iteratively until convergence; randomly (CRP prior) picks a contig to align read ‘i’ to; breaks up chimeric contigs by taking the edge reads off contigs every ~5 iterations.	Laserson et al. 2011 [34]; Afiahayati et al. 2013 [35]
IVA (iterative virus assembler)	OLC	Aimed at viruses. Greedy kmer-based extension. The most abundant kmer in the set is used as a seed, and this seed is grown out using a read that perfectly maps to it. A new kmer is drawn from the prefix of this read, which must be much more abundant than any other of the same size and occur more than 10 times in the data set.	Hunt et al. 2015 [36]
MAP	OLC	Reads are filtered before overlap (to reduce the number of pairwise alignments made); simple paths are found first; mate pair support is used to simplify paths; edges with contradictory or insufficient mate pair support are removed.	Lai et al. 2012 [37]
MegaGTA	dBg	Guided assembly targeting specific genes. Employs HMM profile model, iterative kmers and succinct dBg.	Li et al. 2017 [38]
MetaVelvet	dBg	dBg is first built with Velvet; population structure is estimated from the coverage of nodes (Poisson distributions); dBg is partitioned into hypothetical subgraphs (possibly different species) using the coverage peaks as a guide; only nodes from the primary distribution are considered, and chimeric and repeat contigs are identified and split using paired-end info and coverage differences. An assembly is produced for the primary distribution; the procedure is then repeated for the next.	Namiki et al. 2012 [40]
MetaVelvet-SL	dBg	Similar to MetaVelvet, but the decision for identifying chimeric contigs is done using an SVM trained on (Paired ends, coverage, contig lengths) for each dinucleotide (AA, AT...GG); a training set is generated from a similar population, the SVM is trained on this, then passed over the dBg for decomposition.	Afiahayati et al. 2015 [30]
Omega	OLC	Read prefix/suffix (+/−) are stored in hashes; graph is built of V(r); simple paths (1 in, 1 out) are contracted, and transitive edges are reduced; tips removed (<10r) and bubbles are removed (hold edges with more r); minimum cost flow analysis for short (<1000 bp) contigs; Mate pair inserts are estimated from the assembly now, used to support contigs; scaffolding with long mate pair reads; remaining unresolved contigs are merged on similar coverage.	Haider et al. 2014 [28]
PRICE	Hybrid	Reads are ‘collapsed’ if identical, then if near identical; a (single-strand) dBg is then, in essence, used to assemble by greedy walking, starting at the highest coverage; identical contigs are collapsed, then near-identical contigs (ungapped) and finally gapped ones.	Ruby et al. 2013 [31]
Ray Meta	dBg	Extension of Ray—no graph partitioning performed, doesn’t use a single peak for kmer coverage, min and peak coverage are specific for each seed path; heuristics-based graph traversal; graph is coloured according to an expected taxonomic profile.	Boisvert et al. 2012 [29]
SAVAGE	OLC	Aimed at viral quasi-species recovery. Strict overlap conditions reproduce quasi-species assembly with minimal misassemblies.	Baaijens et al. 2017 [41]
Snowball	Iterative joining	Guided assembly targeting specific genes. Overlapping paired-end reads are merged, then assigned to profile domains. Consensus reads are assembled for each domain by iterative joining.	Gregor et al. 2016 [42]
VICUNA	Overlap	A min hash algorithm based on a pairwise genetic distance threshold; inexact matching first (reads with similar or identical hashes are merged), then string matching of the prefixes/suffixes of hashes; (optional) target-like reads are kept first (similar reads are binned, and the similarity of the bin is used), everything else is removed.	Yang et al. 2012 [24]
Xander	dBg	Guided assembly targeting specific genes. Employs HMM profile model.	Wang et al. 2015 [44]
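
As a rough illustration of the solid/mercy kmer idea described for MEGAHIT in the table, the sketch below counts kmers, marks those above a threshold as solid, and then rescues low-count kmers that sit between two solid kmers within the same read. The function names (`kmers`, `solid_and_mercy`) and the toy reads are hypothetical, and this is a simplified reading of the table entry, not MEGAHIT's actual implementation.

```python
from collections import Counter


def kmers(read, k):
    """Return the kmers of a read in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def solid_and_mercy(reads, k, threshold):
    """Split kmers into solid ones and rescued 'mercy' ones.

    Assumption: a kmer is solid if it occurs more than `threshold` times,
    and a non-solid kmer is kept as a mercy kmer if it lies between two
    solid kmers in at least one read.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    solid = {km for km, c in counts.items() if c > threshold}

    mercy = set()
    for read in reads:
        kms = kmers(read, k)
        solid_pos = [i for i, km in enumerate(kms) if km in solid]
        if len(solid_pos) >= 2:
            # Everything strictly between the first and last solid kmer
            # of this read is flanked by solid kmers on both sides.
            for km in kms[solid_pos[0] + 1:solid_pos[-1]]:
                if km not in solid:
                    mercy.add(km)
    return solid, mercy


reads = ["ACGTT", "ACGA", "AGTT"]
solid, mercy = solid_and_mercy(reads, k=3, threshold=1)
print(sorted(solid))  # ['ACG', 'GTT']
print(sorted(mercy))  # ['CGT']
```

In this toy input, `CGT` occurs only once but is flanked by the solid kmers `ACG` and `GTT` in the first read, so it is kept; `CGA` and `AGT` are also rare but have a solid kmer on only one side, so they are dropped.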

dBg = De Bruijn Graph
OLC = Overlap Layout Consensus
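
To make the two abbreviations concrete, here is a minimal sketch of the core operation behind each approach: splitting reads into kmers to form de Bruijn graph edges, and finding the longest suffix-prefix overlap between two reads, which is the first step of OLC. The function names and the `min_len` parameter are hypothetical, and real assemblers add error handling, reverse complements, and far more compact data structures.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it.

    Every kmer in a read contributes one edge from its prefix
    (first k-1 bases) to its suffix (last k-1 bases).
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def longest_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


reads = ["ACGTAC", "GTACGG"]
graph = de_bruijn_edges(reads, k=4)
print(sorted(graph["ACG"]))                 # ['CGG', 'CGT']
print(longest_overlap(reads[0], reads[1]))  # 4
```

The node `ACG` has two outgoing edges because it appears in different contexts in the two reads, which is the kind of branch a dBg assembler has to resolve. The OLC view instead records that the two reads overlap by four bases (`GTAC`).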

Refer to Langmead's course for additional background reading on assembly.

References

Metagenome-assembled genomes provide new insight into the microbial diversity of two thermal pools in Kamchatka, Russia
Choice of assembly software has a critical impact on virome characterisation
metaSPAdes: a new versatile metagenomic assembler