After 2 years of further NGS development, I think it’s finally time to update my comments on de novo genome assembly.
Some things that I stated remain true: it is still better to optimize your assembly protocol than to try all the different software on the market. And wonderfully enough, we now have the Assemblathon 2 to help us make our choices.
Sadly some things did not come to pass:
Although Oxford Nanopore are finally letting people work with their data, these miracle machines are not delivering the read lengths dreamed up two years ago (apparently a mean read length of 5.4kb, ranging to a rather impressive ~147k, although these reads appear to be rare and problematic).
However, some things have improved:
PacBio, in my opinion, moves from strength to strength, but still with an alarming error rate (20% indels!? – From my own data, that is). But more on this another time.
And now all of a sudden we have the gift of 300bp Illumina MiSeq reads, which is pretty cool.
EDIT// And Illumina Long Jumping Distance reads with insert sizes up to 40kb?? Illumina, you are spoiling us.
And cooler still; BioNano promises to scaffold our genomes like nothing before it, as well as a myriad of other promises, and I am just now beginning to get my hands on some of this data.
I’ll let you know how it goes…
These last few weeks (after finally settling into this new country) have been devoted to the improvement of the Mercurialis genome that I have been assembling – no easy task.
Currently I am working on a number of strategies that the brilliant Oksana Riba Grognuz and I have been discussing. These include:
- Using the program FLASH to merge overlapping Illumina paired end reads, then re-assembling with Newbler or AllPathsLG
- Using the pre-release of SOAPdenovo2
- Patching scaffold gaps using a program that I am developing (more to come on this later!)
However… what has been taking up most of my time lately has been: E. coli CONTAMINATION of my raw reads.
Now most papers that I have read on this matter suggest that the low coverage and lack of homology with these reads will ensure their removal during the graph-building stage. Sounds sensible. So I was rather surprised when, as a standard part of checking ‘things that might have ruined my assembly’, I used BLAST to compare my scaffolds to the E. coli genome and found around one read’s worth of this pesky bacterium perched on the ends of a number of my scaffolds.
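For anyone wanting to run the same sort of check, here is a minimal Python sketch of how BLAST hits might be sorted into "scaffold end" versus "internal". It assumes the scaffolds were the BLAST query, that the output is in the standard tabular format (-outfmt 6, where columns 7 and 8 are the query start and end), and that you already have a dictionary of scaffold lengths. The function names and the 500 bp end window are illustrative choices, not the exact procedure I used.

```python
from collections import defaultdict


def parse_blast_tabular(lines):
    """Yield (scaffold_id, start, end) from BLAST tabular (-outfmt 6) lines.

    Assumes the scaffolds were the query, so columns 7 and 8 (qstart, qend)
    are coordinates on the scaffold.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qstart, qend = int(fields[6]), int(fields[7])
        yield fields[0], min(qstart, qend), max(qstart, qend)


def classify_hits(blast_lines, scaffold_lengths, end_window=500):
    """Split hits per scaffold into those near an end and those in the middle.

    end_window is a hypothetical cut-off: a hit counts as 'end' if it starts
    within end_window bases of the scaffold start or finishes within
    end_window bases of the scaffold end.
    """
    result = defaultdict(lambda: {"end": [], "internal": []})
    for scaffold, start, end in parse_blast_tabular(blast_lines):
        length = scaffold_lengths[scaffold]
        near_end = start <= end_window or end > length - end_window
        result[scaffold]["end" if near_end else "internal"].append((start, end))
    return dict(result)
```

If nearly every hit lands in the "end" bin, as it did for me, contamination stuck onto scaffold tips looks more likely than shared repeats scattered through the genome.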
My original suspicion was that there were simply TEs or other homologous elements lurking in my rather messy plant genome. Unfortunately, a closer inspection revealed no homology in the middle of any of my scaffolds, making this less likely. But still, we all know that TEs are hard to assemble and my paired end distances are short (well we might not all know that), so it is still possible.
I have just been supplied with some mate pair data, and I have removed reads homologous to E. coli from my data. I shall be trying various combinations and strategies to get to the bottom of this story, and I will let you know how it plays out.
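Removing the offending reads is simple enough once you have their names. The sketch below is one way it could be done in Python, and not necessarily how I did it: it assumes a plain four-lines-per-record FASTQ file, a set of contaminant read names already collected (for example from the BLAST or mapping step), and mates that share a name apart from a trailing /1 or /2. The function names are made up for illustration.

```python
def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def pair_name(header):
    """Strip the leading '@', any description, and a trailing /1 or /2."""
    parts = header[1:].split()
    name = parts[0] if parts else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def drop_contaminants(in_handle, out_handle, contaminant_names):
    """Write every read whose pair name is not flagged; return (kept, dropped)."""
    kept = dropped = 0
    for header, seq, plus, qual in read_fastq(in_handle):
        if pair_name(header) in contaminant_names:
            dropped += 1
            continue
        out_handle.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
        kept += 1
    return kept, dropped
```

Running the same set of names over both the forward and reverse files drops both mates of a flagged pair, which keeps the two files in sync for the assembler.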
In Switzerland, the air is cold and clean, the food is expensive and the people seem to be friendly.
My new postdoctoral position came with a shiny new affiliation to the Swiss Institute of Bioinformatics (SIB), and I found myself invited to the SIB days conference before I had even begun working – a slightly embarrassing conversation starter when people ask who you work with and you don’t know.
The reason for this confusion is, of course, that working for the SIB means being under the banner of a Group Leader: a wise and extremely computer literate young(ish) guru who provides you with advice, contacts, and most importantly, a MAC. These mighty Bioinformaticians run huge labs of researchers who range from the biologist who can hold their own with a Markov Model to the dedicated computer scientists who may not know what a gene does exactly but can tell you exactly how it is programmed into their databases and where the memory cache would be most efficient in order to retrieve complex queries.
Working with a group leader does not necessarily mean a lot of contact with them, as your lab head does not have to be your group leader. Hence my not knowing who my group leader was.
The conference began with a long talk on our duties; a bizarrely motivating experience where we were treated like a special operations unit tasked with being deployed (or embedded as we say here) into the most inhospitable of biology labs in order to educate, advise and code. Fortunately this mission statement had the desired effect of making you want to immediately find yourself a wet-lab worker and explain in fine detail how you could make their data entries more efficient and how it is your duty, responsibility and most desperate desire to do so.
Basically they make you feel special.
With this in mind – now that it has finally been confirmed that I am special – I will start to update more regularly with anything that I have learned that might be worthwhile to others.
Unfortunately, seemingly as with most of science, there is never a perfect way of doing something. Or anything really. But perhaps knowing how this massive collective of Swiss Bioinformaticians do things might make others feel special too.
As a Bioinformatician, it is expected that you should be efficient with large volumes of data. You should be able to easily and effectively tackle complex tasks with entire genomes as quickly as (if not more quickly than) someone working with a single gene. With a favorite scripting language firmly in hand, this is not usually a problem… until it comes to genome assembly.
Many programs exist that boast faster and more accurate results than ever before; academic programs will often explain how they are made, commercial ones often will not. For the confused Bioinformatician the answer is usually to ‘build your own’. Unfortunately, this is not usually feasible for assembling a genome.
So how do you choose what will work best in a continuously developing field where the sequencing technologies and software are racing each other in an elaborate and confusing fashion?
After assembling 18 novel genomes, I would like to share the following wisdom:
optimization beats innovation
There are a multitude of products on the market, and as of yet no one really knows which is best, and when they do, it will still only be for a given genome in a given situation.
Most of the reliable programs are based on one of two algorithms: the Overlap Layout Consensus for long reads, and the de Bruijn approach for short reads. So just pick one (safe in the knowledge that they are basically the same) and spend your time optimising the hell out of it, because everyone’s data is different… even when it is supposed to be the same!
…and here is how to do that:
- Know your data – Do you have short or long reads, paired or unpaired? What are your paired end distances? What direction do the pairs face? You WILL need to know this information!
- Filtering – Screen for vectors, screen for contamination. Do your Illumina reads contain “N’s”? Get rid of them! Low complexity? Get rid of it! Are your read ends low quality? Mask them! The more data that goes in, the more aggressive the filtering will need to be.
- Choose quickly – Software changes frequently, and often in ways that make very little difference to the overall result. Laboring for weeks over the choice of software is confusing and ultimately pointless, as is trialling vast numbers of programs (I say this from experience). Generally, well established programs that are being frequently updated are the most reliable.
- Know your program – Commercial software will usually use less memory and time than academic software, but here is my caution: Do you know what it is actually doing? Getting the best results out of your chosen program is usually more effective than trying different programs. Be sure you know what it expects, and what it is actually doing.
- Scaffold carefully – The largest danger with scaffolding is not knowing exactly what the software is doing and, more importantly, WHAT IT EXPECTS. Scaffolders (built-in and stand-alone) will expect pairs to be a certain distance apart and to face a certain direction. This is where steps 1 and 2 come into play… you MUST get it right. The scaffolder will not usually tell you if you have not, and it may not be until quite far down the line that you notice!
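To make the filtering step a little more concrete, here is a rough Python sketch of a per-read filter. It is an illustration under assumptions rather than a recommended pipeline: it assumes Phred+33 quality encoding, it trims low quality 3' ends rather than masking them (a simpler swap for the masking suggested above), and the thresholds (minimum quality 20, minimum length 50, 80% single-base cut-off for low complexity) are arbitrary values that you would tune to your own data.

```python
def trim_low_quality_end(seq, qual, min_q=20, offset=33):
    """Trim bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def is_low_complexity(seq, max_fraction=0.8):
    """Crude check: True if one base makes up more than max_fraction of the read."""
    if not seq:
        return True
    seq = seq.upper()
    return max(seq.count(base) for base in "ACGT") / len(seq) > max_fraction


def filter_read(seq, qual, min_len=50, min_q=20):
    """Return the trimmed (seq, qual), or None if the read should be discarded."""
    # Trim the low quality tail first, then judge what is left.
    seq, qual = trim_low_quality_end(seq, qual, min_q=min_q)
    if len(seq) < min_len:
        return None
    if "N" in seq.upper():
        return None
    if is_low_complexity(seq):
        return None
    return seq, qual
```

For paired data, remember that if one mate is thrown out, its partner has to be removed from the paired file as well (or moved to a singletons file), otherwise the assembler and scaffolder will be handed pairs that no longer match up.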
So those are my steps for genome optimisation, and advice on assembly. I hope they are useful to someone, because many of them were painful to learn!
Today marks the creation of my very own Bioinformatics website, which is likely to be constantly evolving in a glorious Lamarckian style!
I will be updating every month, or whenever there is something to say.
For now I will leave with my favourite scientific quote taken out of context (I wish I could remember where exactly it was from):
… nothing would change if time were to flow backwards…
Jordan et al., 2005