GSP Document

PROGRAM PARAMETERS:

fqKmerFreq:

Usage: fqKmerFreq_v1.06 <inputList> <formatFlag: 0 for FASTQ, 1 for FASTA> <K-mer length>
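
To illustrate what counting K-mer frequencies means, the following Python sketch counts every K-mer in a set of reads and then tabulates how many distinct K-mers occur once, twice, and so on. It is an illustration only, not the fqKmerFreq implementation: the function names are hypothetical, reads are passed as in-memory strings rather than through an input list, the two strands are not merged, and the *.countMH file format is not reproduced.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count occurrences of every K-mer of length k across all reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        # A read of length n contains n - k + 1 overlapping K-mers.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip K-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def frequency_histogram(kmer_counts):
    """Map each frequency to the number of distinct K-mers seen that often."""
    return Counter(kmer_counts.values())


if __name__ == "__main__":
    reads = ["ACGTACGT", "CGTACGTA"]
    hist = frequency_histogram(count_kmers(reads, 4))
    for freq in sorted(hist):
        print(freq, hist[freq])
```

A frequency table of this kind is what a genome size predictor works from: K-mers seen only once or twice are often sequencing errors, while the bulk of the distribution reflects the sequencing depth.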

gncov:

Usage: gncov_v1.06 [-options] -t <tuple count file> -o <output file prefix>

-t <tuple count file> #the tuple frequency table files, *.countMH

-o <output file prefix> #the output prefix

Options:

-l (8-15) <M length> Default:[8] #upper boundary frequency k

-k (17-25) <K-mer length> Default:[25] #K-mer width

-i (1000-3000) <iteration count> Default:[3000] #number of iterations

-m (0-1000) <mutation ratio> Default:[200] #ratio to disturb during the iteration

-r (25-100) <reads length> Default:[70] #Average read length

-c (0.01-5000) <initial coverage> Default:[30] #initial coverage input

-e (0-10) <error cut-off> Default:[5] #the K-mer frequency cut-off

-g Small Optimal Flag: On [Optional] (2-3 fold) Default:[OFF] #optimize for small data sets

-s Stable Optimal Flag: On [Optional] Default:[OFF] #turn on for more stable results
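
The ranges listed above can be checked before gncov is launched. The sketch below is a hypothetical Python helper, not part of GSP: it copies the ranges and defaults from the option list, rejects out-of-range values, and returns the argument list for a gncov call. It assumes that -g and -s are plain switches that take no value, which the option list does not state explicitly.

```python
# (minimum, maximum, default) for each numeric option, as documented above.
GNCOV_OPTIONS = {
    "-l": (8, 15, 8),
    "-k": (17, 25, 25),
    "-i": (1000, 3000, 3000),
    "-m": (0, 1000, 200),
    "-r": (25, 100, 70),
    "-c": (0.01, 5000, 30),
    "-e": (0, 10, 5),
}


def build_gncov_args(count_file, prefix, small=False, stable=False, **overrides):
    """Return the gncov argument list, validating each numeric override.

    Overrides are keyed by option letter, e.g. k=25, r=70.
    """
    args = ["gncov_v1.06", "-t", count_file, "-o", prefix]
    for name, value in overrides.items():
        flag = "-" + name
        if flag not in GNCOV_OPTIONS:
            raise ValueError(f"unknown option {flag}")
        low, high, _default = GNCOV_OPTIONS[flag]
        if not low <= value <= high:
            raise ValueError(f"{flag} must be in [{low}, {high}], got {value}")
        args += [flag, str(value)]
    # Assumption: -g and -s are switches without a value.
    if small:
        args.append("-g")
    if stable:
        args.append("-s")
    return args


if __name__ == "__main__":
    print(build_gncov_args("reads.list.25mer.countMH", "reads.list.25mer", k=25, r=70))
```

Options that are not overridden are left off the command line, so gncov falls back to its own defaults.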

Compile:

make

make install

Run:

/bin/fqKmerFreq_v1.06 #Count the K-mer frequency

/bin/gncov_v1.06 #predict the genome size based on the K-mer frequency

Sample:

Sample input list file (open)

/bin/fqKmerFreq_v1.06 AE005174v2_fq.list 0 25

/bin/gncov_v1.06 -t AE005174v2_fq.list.25mer.countMH -o AE005174v2_fq.list.25mer -k 25 -r 70
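
In the sample above, the second command reads a file named after the input list and the K-mer length. A minimal Python sketch of that two-step run follows. The naming rule `<inputList>.<K>mer.countMH` is inferred from this one sample rather than documented, and the function name is hypothetical.

```python
import shlex


def pipeline_commands(input_list, k, read_length, format_flag=0):
    """Build the two commands of a GSP run as argument lists."""
    # Inferred from the sample: fqKmerFreq writes <inputList>.<K>mer.countMH
    prefix = f"{input_list}.{k}mer"
    count_file = prefix + ".countMH"
    return [
        ["/bin/fqKmerFreq_v1.06", input_list, str(format_flag), str(k)],
        ["/bin/gncov_v1.06", "-t", count_file, "-o", prefix,
         "-k", str(k), "-r", str(read_length)],
    ]


if __name__ == "__main__":
    for cmd in pipeline_commands("AE005174v2_fq.list", 25, 70):
        # Print only; pass each list to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Keeping the K-mer length in one variable guarantees that the value given to fqKmerFreq matches the -k value given to gncov.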

Sample output files:

AE005174v2_fq.list.25mer.countMH (open)

AE005174v2_fq.list.25mer.report (open)

Additional document 1 (pdf)

Additional document 2 (pdf)

TEST DATA SET

E.coli 10-fold data set (download)

Staphylococcus aureus strain MW2 data set (download)

Staphylococcus aureus strain MW2 data Result (download)

Complete Staphylococcus aureus strain MW2 data set (link)
