Assembler for de novo assembly of large genomes

Te Chin Chu, Chen Hua Lu, Tsunglin Liu, Greg C. Lee, Wen Hsiung Li, Arthur Chun Chieh Shih

Research output: Contribution to journal › Article

Abstract

Assembling a large genome using next generation sequencing reads requires large computer memory and a long execution time. To reduce these requirements, we propose an extension-based assembler, called JR-Assembler, where J and R stand for "jumping" extension and read "remapping." First, it uses the read count to select good quality reads as seeds. Second, it extends each seed by a whole-read extension process, which expedites the extension process and can jump over short repeats. Third, it uses a dynamic back trimming process to avoid extension termination due to sequencing errors. Fourth, it remaps reads to each assembled sequence, and if an assembly error occurs by the presence of a repeat, it breaks the contig at the repeat boundaries. Fifth, it applies a less stringent extension criterion to connect low-coverage regions. Finally, it merges contigs by unused reads. An extensive comparison of JR-Assembler with current assemblers using datasets from small, medium, and large genomes shows that JR-Assembler achieves a better or comparable overall assembly quality and requires lower memory use and less central processing unit time, especially for large genomes. Finally, a simulation study shows that JR-Assembler achieves a superior performance on memory use and central processing unit time than most current assemblers when the read length is 150 bp or longer, indicating that the advantages of JR-Assembler over current assemblers will increase as the read length increases with advances in next generation sequencing technology.
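The first two steps described in the abstract (picking a seed by read count, then extending it a whole read at a time) can be illustrated with a toy sketch. This is not the JR-Assembler implementation: the function names (`pick_seed`, `best_extension`, `assemble`) and the `min_overlap` parameter are hypothetical, reads are assumed error-free and on one strand, and back trimming, remapping, contig breaking, and contig merging are all omitted.

```python
from collections import Counter


def pick_seed(reads):
    """Pick the most frequent read as the seed (toy stand-in for read-count selection)."""
    counts = Counter(reads)
    # Highest count first; ties broken lexicographically for determinism.
    return min(counts, key=lambda r: (-counts[r], r))


def best_extension(contig, reads, min_overlap):
    """Return (overlap_length, read) for the read whose prefix best overlaps the contig end."""
    best = None
    for read in sorted(reads):
        # The read must contribute at least one new base, so the overlap is < len(read).
        longest = min(len(read) - 1, len(contig))
        for k in range(longest, min_overlap - 1, -1):
            if contig.endswith(read[:k]):
                if best is None or k > best[0]:
                    best = (k, read)
                break  # longest overlap for this read found
    return best


def assemble(reads, min_overlap):
    """Greedy rightward extension from one seed, appending a whole read per step."""
    contig = pick_seed(reads)
    unused = set(reads) - {contig}
    while unused:
        hit = best_extension(contig, unused, min_overlap)
        if hit is None:
            break  # no read overlaps the contig end by at least min_overlap
        overlap, read = hit
        contig += read[overlap:]  # append only the non-overlapping suffix
        unused.remove(read)
    return contig


if __name__ == "__main__":
    # Made-up 6 bp reads tiling a made-up 14 bp sequence; the first read is duplicated
    # so that it has the highest read count and becomes the seed.
    reads = ["ATGCGT", "ATGCGT", "GCGTAC", "GTACGT", "ACGTTA", "GTTAGC"]
    print(assemble(reads, min_overlap=4))
```

Each step appends the whole non-overlapping suffix of a read rather than a single base, which is the sense in which a whole-read extension can move faster than base-by-base extension. A realistic assembler would also need to handle sequencing errors, reverse complements, and leftward extension.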

Original language: English
Pages (from-to): E3417-E3424
Journal: Proceedings of the National Academy of Sciences of the United States of America
Volume: 110
Issue number: 36
DOIs: 10.1073/pnas.1314090110
Publication status: Published - 2013 Sep 3

ASJC Scopus subject areas

  • General

Cite this

Assembler for de novo assembly of large genomes. / Chu, Te Chin; Lu, Chen Hua; Liu, Tsunglin; Lee, Greg C.; Li, Wen Hsiung; Shih, Arthur Chun Chieh.

In: Proceedings of the National Academy of Sciences of the United States of America, Vol. 110, No. 36, 03.09.2013, p. E3417-E3424.

@article{36adbd020bb44705a4d35e791370ebf9,
title = "Assembler for de novo assembly of large genomes",
abstract = "Assembling a large genome using next generation sequencing reads requires large computer memory and a long execution time. To reduce these requirements, we propose an extension-based assembler, called JR-Assembler, where J and R stand for {"}jumping{"} extension and read {"}remapping.{"} First, it uses the read count to select good quality reads as seeds. Second, it extends each seed by a whole-read extension process, which expedites the extension process and can jump over short repeats. Third, it uses a dynamic back trimming process to avoid extension termination due to sequencing errors. Fourth, it remaps reads to each assembled sequence, and if an assembly error occurs by the presence of a repeat, it breaks the contig at the repeat boundaries. Fifth, it applies a less stringent extension criterion to connect low-coverage regions. Finally, it merges contigs by unused reads. An extensive comparison of JR-Assembler with current assemblers using datasets from small, medium, and large genomes shows that JR-Assembler achieves a better or comparable overall assembly quality and requires lower memory use and less central processing unit time, especially for large genomes. Finally, a simulation study shows that JR-Assembler achieves a superior performance on memory use and central processing unit time than most current assemblers when the read length is 150 bp or longer, indicating that the advantages of JR-Assembler over current assemblers will increase as the read length increases with advances in next generation sequencing technology.",
author = "Chu, {Te Chin} and Lu, {Chen Hua} and Liu, Tsunglin and Lee, {Greg C.} and Li, {Wen Hsiung} and Shih, {Arthur Chun Chieh}",
year = "2013",
month = "9",
day = "3",
doi = "10.1073/pnas.1314090110",
language = "English",
volume = "110",
pages = "E3417--E3424",
journal = "Proceedings of the National Academy of Sciences of the United States of America",
issn = "0027-8424",
number = "36",
}

TY - JOUR

T1 - Assembler for de novo assembly of large genomes

AU - Chu, Te Chin

AU - Lu, Chen Hua

AU - Liu, Tsunglin

AU - Lee, Greg C.

AU - Li, Wen Hsiung

AU - Shih, Arthur Chun Chieh

PY - 2013/9/3

Y1 - 2013/9/3

AB - Assembling a large genome using next generation sequencing reads requires large computer memory and a long execution time. To reduce these requirements, we propose an extension-based assembler, called JR-Assembler, where J and R stand for "jumping" extension and read "remapping." First, it uses the read count to select good quality reads as seeds. Second, it extends each seed by a whole-read extension process, which expedites the extension process and can jump over short repeats. Third, it uses a dynamic back trimming process to avoid extension termination due to sequencing errors. Fourth, it remaps reads to each assembled sequence, and if an assembly error occurs by the presence of a repeat, it breaks the contig at the repeat boundaries. Fifth, it applies a less stringent extension criterion to connect low-coverage regions. Finally, it merges contigs by unused reads. An extensive comparison of JR-Assembler with current assemblers using datasets from small, medium, and large genomes shows that JR-Assembler achieves a better or comparable overall assembly quality and requires lower memory use and less central processing unit time, especially for large genomes. Finally, a simulation study shows that JR-Assembler achieves a superior performance on memory use and central processing unit time than most current assemblers when the read length is 150 bp or longer, indicating that the advantages of JR-Assembler over current assemblers will increase as the read length increases with advances in next generation sequencing technology.

UR - http://www.scopus.com/inward/record.url?scp=84883326440&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84883326440&partnerID=8YFLogxK

U2 - 10.1073/pnas.1314090110

DO - 10.1073/pnas.1314090110

M3 - Article

C2 - 23966565

AN - SCOPUS:84883326440

VL - 110

SP - E3417

EP - E3424

JO - Proceedings of the National Academy of Sciences of the United States of America

JF - Proceedings of the National Academy of Sciences of the United States of America

SN - 0027-8424

IS - 36

ER -