Today a colleague of mine asked the following question:
"Assuming I need to build from 0, a chromosome of a fish, with short reads but no other reference whatsoever [de novo assembly]:
- how much work is that?
- Is there a generic software (like SAMtools) that will align the reads in a scaffold one can use?
- Basically, given a reasonably clear pipeline in terms of software, is it still blood sweat and tears or is it just a matter of getting it on a cluster?"
Very grateful for any suggestions, sources of information, software etc.
Answer
If you want to rely on sequencing data alone, you have a problem.
To get a feeling of what kind of results to expect, consider this paper published recently in Nature Genetics. They tried to assemble a whale genome de novo. They had 7 (!) paired-end libraries with different insert lengths ranging from 170bp to 20kb. Read lengths were mostly 100bp and in some cases 49bp. Average genome coverage was 91x.
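To put a figure like 91x into perspective, here is a minimal sketch of the usual back-of-the-envelope relation: average coverage is the total number of sequenced bases divided by the genome size. The function names and the 1 Gbp genome size are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def mean_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Number of reads required to reach a target average depth, rounded up."""
    if read_length_bp <= 0:
        raise ValueError("read_length_bp must be positive")
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


if __name__ == "__main__":
    # Hypothetical genome size (1 Gbp), assumed for illustration only.
    genome_size = 1_000_000_000
    n = reads_needed(91, 100, genome_size)
    print(f"{n:,} reads of 100 bp for 91x on a {genome_size:,} bp genome")
    print(f"check: {mean_coverage(n, 100, genome_size):.1f}x")
```

Under these assumptions, 91x with 100 bp reads already means on the order of hundreds of millions of reads, which gives a rough idea of how much data has to be stored and processed.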
Even with this much data, the finished assembly still consisted of over 100,000 contigs.
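A raw contig count only tells part of the story. A common way to summarize how fragmented an assembly is (not something quoted above, just the standard metric) is the N50: the contig length such that contigs of at least that length contain half of all assembled bases. A minimal sketch, with made-up contig lengths:

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the
    total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Made-up contig lengths, for illustration only.
    toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print("N50:", n50(toy_contigs))
```

In practice you would read the lengths from the assembler's FASTA output; the higher the N50 relative to the genome size, the less fragmented the assembly.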
So you really can't assemble a high-quality, complex (i.e. large) genome from short-read sequencing data alone using the standard techniques.
That said, recent approaches such as libraries with much longer read lengths (here) or the use of Hi-C data (here and here) do offer a way of getting high-quality complex genome assemblies using only sequencing data.