As a Bioinformatician, you are expected to be efficient with large volumes of data. You should be able to easily and effectively tackle complex tasks on entire genomes as quickly as (if not quicker than) someone working with a single gene. With a favourite scripting language firmly in hand, this is not usually a problem… until it comes to genome assembly.
Many programs boast faster and more accurate results than ever before; academic programs will often explain how they work, while commercial ones often will not. For the confused Bioinformatician, the answer is usually to ‘build your own’. Unfortunately, that is not usually feasible when assembling a genome.
So how do you choose what will work best in a continuously developing field where the sequencing technologies and software are racing each other in an elaborate and confusing fashion?
After assembling 18 novel genomes, I would like to share the following wisdom:
optimisation beats innovation
There are a multitude of products on the market, and as yet no one really knows which is best; when someone does work it out, the answer will still only hold for a given genome in a given situation.
Most of the reliable programs are based on one of two algorithms: Overlap-Layout-Consensus (OLC) for long reads, and the de Bruijn graph approach for short reads. So just pick one (safe in the knowledge that they are conceptually much the same) and spend your time optimising the hell out of it, because everyone’s data is different… even when it is supposed to be the same!
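To make the de Bruijn idea concrete, here is a toy sketch in Python (my assumed scripting language): chop the reads into k-mers, then link each k-mer’s prefix to its suffix. Real assemblers add error correction, graph simplification and repeat resolution on top of this, so treat it as an illustration only:

```python
# Toy de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers.
# Real assemblers add error correction and graph simplification.
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Two overlapping toy "reads" from the sequence ACGTACGA
for node, successors in de_bruijn_graph(["ACGTAC", "GTACGA"], k=4).items():
    print(node, "->", successors)
```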
…and here is how to do that:
- Know your data – Do you have short or long reads, paired or unpaired? What are your paired-end distances (insert sizes)? Which direction do the pairs face? You WILL need to know this information! (A quick read-stats sketch follows this list.)
- Filtering – Screen for vectors, screen for contamination. Do your Illumina reads contain Ns? Get rid of them! Low complexity? Get rid of it! Are your read ends low quality? Mask them! The more data that goes in, the more aggressive the filtering will need to be. (A minimal filtering sketch also follows the list.)
- Choose quickly – Software changes frequently, and often in ways that make very little difference to the overall result. Labouring for weeks over the choice of software is confusing and ultimately pointless, as is trialling vast numbers of programs (I say this from experience). Generally, well-established programs that are frequently updated are the most reliable.
- Know your program – Commercial software will usually use less memory and time than academic software, but here is my caution: do you know what it is actually doing? Getting the best results out of your chosen program is usually more effective than trying different programs, so be sure you know what it expects and what it actually does.
- Scaffold carefully – The biggest danger with scaffolding is not knowing exactly what the software is doing and, more importantly, WHAT IT EXPECTS. Scaffolders (built-in and stand-alone) expect pairs to be a certain distance apart and facing a certain direction. This is where steps 1 and 2 come into play… you MUST get it right. The scaffolder will not usually tell you if you get it wrong, and it may not be until quite far down the line that you notice! (The last sketch below shows one way to check.)
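For step 1, the fastest way to know your data is to pull basic numbers straight out of the FASTQ files rather than trusting what the sequencing centre told you. A minimal sketch in plain Python (the file name is a placeholder for your own data):

```python
# "Know your data": read count and length range straight from the
# FASTQ file. The file name below is a placeholder.
import gzip

def fastq_lengths(path):
    """Yield the length of every read in a (possibly gzipped) FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the sequence is every 4th line, offset 1
                yield len(line.strip())

lengths = list(fastq_lengths("reads_1.fastq.gz"))
print("reads:", len(lengths))
print("length min/mean/max:", min(lengths),
      sum(lengths) / len(lengths), max(lengths))
```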
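For step 2, here is a deliberately simple filtering sketch: it drops reads containing Ns and trims low-quality (Phred+33) bases from the 3′ end. The thresholds are illustrative assumptions, and dedicated trimming tools will do all of this far more robustly:

```python
# Sketch of the filtering step: discard reads with Ns, trim
# low-quality 3' ends. Thresholds and file names are illustrative.
import gzip

def filter_fastq(in_path, out_path, min_qual=20, min_len=30):
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline().rstrip() for _ in range(4)]
            if not record[0]:
                break
            header, seq, plus, qual = record
            if "N" in seq:  # discard reads with ambiguous bases
                continue
            # trim from the 3' end while quality (Phred+33) is too low
            end = len(qual)
            while end > 0 and ord(qual[end - 1]) - 33 < min_qual:
                end -= 1
            if end >= min_len:  # keep only reads still long enough
                fout.write(f"{header}\n{seq[:end]}\n{plus}\n{qual[:end]}\n")

filter_fastq("reads_1.fastq.gz", "reads_1.filtered.fastq")
```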
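And for step 5, it pays to measure the insert size and pair orientation yourself, from reads mapped back to a draft assembly, before handing anything to a scaffolder. A sketch assuming pysam is installed (any SAM/BAM parser would do; the BAM name is a placeholder):

```python
# Sanity-check what the scaffolder will see: insert size and pair
# orientation from a BAM of reads mapped to the draft assembly.
# Assumes pysam; the file name is a placeholder.
import statistics
import pysam

sizes, orientations = [], {"FR": 0, "RF": 0, "same": 0}
with pysam.AlignmentFile("reads_vs_draft.bam", "rb") as bam:
    for read in bam:
        # count each properly mapped pair once, via its leftmost read
        if not read.is_proper_pair or read.template_length <= 0:
            continue
        sizes.append(read.template_length)
        if read.is_reverse == read.mate_is_reverse:
            orientations["same"] += 1
        elif not read.is_reverse:
            orientations["FR"] += 1  # forward-reverse: typical paired-end
        else:
            orientations["RF"] += 1  # reverse-forward: typical mate-pair
        if len(sizes) >= 100_000:  # a sample is plenty for a sanity check
            break

print("median insert:", statistics.median(sizes))
print("orientation counts:", orientations)
```

If the dominant orientation or the median insert does not match what the scaffolder is configured for, stop and fix that before going any further.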
So those are my steps for optimising a genome assembly. I hope they are useful to someone, because many of them were painful to learn!