Assembler for de novo assembly of large genomes

Te Chin Chu, Chen Hua Lu, Tsunglin Liu, Greg C Lee, Wen Hsiung Li, Arthur Chun Chieh Shih

Research output: Contribution to journal › Article

20 Citations (Scopus)

Abstract

Assembling a large genome using next-generation sequencing reads requires large computer memory and a long execution time. To reduce these requirements, we propose an extension-based assembler, called JR-Assembler, where J and R stand for "jumping" extension and read "remapping." First, it uses the read count to select good-quality reads as seeds. Second, it extends each seed by a whole-read extension process, which expedites the extension and can jump over short repeats. Third, it uses a dynamic back-trimming process to avoid extension termination due to sequencing errors. Fourth, it remaps reads to each assembled sequence, and if an assembly error is caused by the presence of a repeat, it breaks the contig at the repeat boundaries. Fifth, it applies a less stringent extension criterion to connect low-coverage regions. Finally, it merges contigs using unused reads. An extensive comparison of JR-Assembler with current assemblers using datasets from small, medium, and large genomes shows that JR-Assembler achieves a better or comparable overall assembly quality and requires lower memory use and less central processing unit time, especially for large genomes. In addition, a simulation study shows that JR-Assembler outperforms most current assemblers in memory use and central processing unit time when the read length is 150 bp or longer, indicating that the advantages of JR-Assembler over current assemblers will increase as the read length increases with advances in next-generation sequencing technology.
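
The first two steps of the abstract (seed selection by read count, then whole-read extension) can be illustrated with a toy sketch. This is not the JR-Assembler implementation: the function names `pick_seeds`, `overlap`, and `extend_right` and the parameters `min_count` and `min_overlap` are hypothetical, and the sketch assumes error-free, single-strand reads with exact suffix-prefix matching. It omits back trimming, remapping, repeat handling, and contig merging.

```python
from collections import Counter


def pick_seeds(reads, min_count=2):
    """Return distinct reads seen at least min_count times, most frequent first.

    Assumption: a read that occurs several times is less likely to
    contain a sequencing error, so it is a reasonable seed.
    """
    counts = Counter(reads)
    return [read for read, n in counts.most_common() if n >= min_count]


def overlap(contig, read, min_overlap):
    """Length of the longest contig suffix that equals a read prefix, or 0.

    The overlap is capped at len(read) - 1 so that an accepted read
    always contributes at least one new base.
    """
    longest = min(len(contig), len(read) - 1)
    for k in range(longest, min_overlap - 1, -1):
        if contig.endswith(read[:k]):
            return k
    return 0


def extend_right(seed, reads, min_overlap=4):
    """Greedily extend seed to the right, one whole read at a time."""
    contig = seed
    unused = set(reads) - {seed}
    while True:
        best_read, best_k = None, 0
        for read in sorted(unused):  # sorted only to make ties deterministic
            k = overlap(contig, read, min_overlap)
            if k > best_k:
                best_read, best_k = read, k
        if best_read is None:
            return contig  # no unused read overlaps the contig end
        # Append the entire non-overlapping tail of the read at once,
        # rather than extending base by base.
        contig += best_read[best_k:]
        unused.remove(best_read)


if __name__ == "__main__":
    reads = ["ACGTTGCA", "ACGTTGCA", "TGCAAGGC", "AGGCTTAC", "TTACCGAT"]
    seed = pick_seeds(reads)[0]
    print(extend_right(seed, reads))
```

Appending a whole read per step is what lets an extension of this kind cover many bases in one iteration. A real assembler would also have to handle reverse complements, mismatches, and ambiguous overlaps, which this sketch deliberately leaves out.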

Original language: English
Journal: Proceedings of the National Academy of Sciences of the United States of America
Volume: 110
Issue number: 36
DOI: 10.1073/pnas.1314090110
Publication status: Published - 2013 Sep 3

ASJC Scopus subject areas

  • General

Cite this

Assembler for de novo assembly of large genomes. / Chu, Te Chin; Lu, Chen Hua; Liu, Tsunglin; Lee, Greg C; Li, Wen Hsiung; Shih, Arthur Chun Chieh.

In: Proceedings of the National Academy of Sciences of the United States of America, Vol. 110, No. 36, 03.09.2013.

@article{36adbd020bb44705a4d35e791370ebf9,
title = "Assembler for de novo assembly of large genomes",
abstract = "Assembling a large genome using next generation sequencing reads requires large computer memory and a long execution time. To reduce these requirements, we propose an extension-based assembler, called JR-Assembler, where J and R stand for {"}jumping{"} extension and read {"}remapping.{"} First, it uses the read count to select good quality reads as seeds. Second, it extends each seed by a whole-read extension process, which expedites the extension process and can jump over short repeats. Third, it uses a dynamic back trimming process to avoid extension termination due to sequencing errors. Fourth, it remaps reads to each assembled sequence, and if an assembly error occurs by the presence of a repeat, it breaks the contig at the repeat boundaries. Fifth, it applies a less stringent extension criterion to connect low-coverage regions. Finally, it merges contigs by unused reads. An extensive comparison of JR-Assembler with current assemblers using datasets from small, medium, and large genomes shows that JR-Assembler achieves a better or comparable overall assembly quality and requires lower memory use and less central processing unit time, especially for large genomes. Finally, a simulation study shows that JR-Assembler achieves a superior performance on memory use and central processing unit time than most current assemblers when the read length is 150 bp or longer, indicating that the advantages of JR-Assembler over current assemblers will increase as the read length increases with advances in next generation sequencing technology.",
author = "Chu, {Te Chin} and Lu, {Chen Hua} and Liu, Tsunglin and Lee, {Greg C} and Li, {Wen Hsiung} and Shih, {Arthur Chun Chieh}",
year = "2013",
month = "9",
day = "3",
doi = "10.1073/pnas.1314090110",
language = "English",
volume = "110",
journal = "Proceedings of the National Academy of Sciences of the United States of America",
issn = "0027-8424",
number = "36",
}

TY - JOUR

T1 - Assembler for de novo assembly of large genomes

AU - Chu, Te Chin

AU - Lu, Chen Hua

AU - Liu, Tsunglin

AU - Lee, Greg C

AU - Li, Wen Hsiung

AU - Shih, Arthur Chun Chieh

PY - 2013/9/3

Y1 - 2013/9/3

AB - Assembling a large genome using next generation sequencing reads requires large computer memory and a long execution time. To reduce these requirements, we propose an extension-based assembler, called JR-Assembler, where J and R stand for "jumping" extension and read "remapping." First, it uses the read count to select good quality reads as seeds. Second, it extends each seed by a whole-read extension process, which expedites the extension process and can jump over short repeats. Third, it uses a dynamic back trimming process to avoid extension termination due to sequencing errors. Fourth, it remaps reads to each assembled sequence, and if an assembly error occurs by the presence of a repeat, it breaks the contig at the repeat boundaries. Fifth, it applies a less stringent extension criterion to connect low-coverage regions. Finally, it merges contigs by unused reads. An extensive comparison of JR-Assembler with current assemblers using datasets from small, medium, and large genomes shows that JR-Assembler achieves a better or comparable overall assembly quality and requires lower memory use and less central processing unit time, especially for large genomes. Finally, a simulation study shows that JR-Assembler achieves a superior performance on memory use and central processing unit time than most current assemblers when the read length is 150 bp or longer, indicating that the advantages of JR-Assembler over current assemblers will increase as the read length increases with advances in next generation sequencing technology.

UR - http://www.scopus.com/inward/record.url?scp=84883326440&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84883326440&partnerID=8YFLogxK

U2 - 10.1073/pnas.1314090110

DO - 10.1073/pnas.1314090110

M3 - Article

VL - 110

JO - Proceedings of the National Academy of Sciences of the United States of America

JF - Proceedings of the National Academy of Sciences of the United States of America

SN - 0027-8424

IS - 36

ER -