Welcome to GSP HomePage




(An Efficient l-mer frequency genome size predictor)


Workflow Chart:
PROGRAM PARAMETER:
fqKmerFreq:
Usage: fqKmerFreq_v1.06 <inputList> <formatFlag: 0 for FASTQ, 1 for FASTA> <K-mer length>
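
To make the counting step concrete, here is a minimal Python sketch of what a K-mer frequency table contains. It is an illustration only and not the fqKmerFreq implementation: the function names `count_kmers` and `frequency_histogram` are hypothetical, skipping K-mers that contain N is an assumption, and the layout of the real *.countMH file is not described on this page.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (K-mer) across all reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # assumption: ignore K-mers with ambiguous bases
                counts[kmer] += 1
    return counts


def frequency_histogram(counts):
    """Map each occurrence count to the number of distinct K-mers seen that often."""
    return Counter(counts.values())


# Two toy reads, K = 4 (real runs use K between 17 and 25).
reads = ["ACGTACGT", "CGTACGTA"]
counts = count_kmers(reads, 4)
hist = frequency_histogram(counts)
print(dict(counts))
print(dict(hist))
```

The histogram (how many distinct K-mers occur once, twice, and so on) is the kind of summary a genome size predictor works from.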

gncov:
Usage: gncov_v1.06 [-options] -t <tuple count> -o <output prefix>
   -t   <tuple count file>      #the tuple frequency table file, *.countMH
   -o   <output file prefix>    #the output prefix
Options:
   -l   (8-15)        <M length>           Default:[8]       #upper boundary frequency k
   -k   (17-25)       <K-mer length>       Default:[25]      #K-mer width
   -i   (1000-3000)   <iteration times>    Default:[3000]    #number of iterations
   -m   (0-1000)      <mutation ratio>     Default:[200]     #ratio of disturbance during the iteration
   -r   (25-100)      <read length>        Default:[70]      #average read length
   -c   (0.01-5000)   <initial coverage>   Default:[30]      #initial coverage input
   -e   (0-10)        <error cut-off>      Default:[5]       #the K-mer frequency cut-off
   -g   Small Optimal Flag: On [Optional] (2-3 fold)   Default:[OFF]   #optimize for small data sets
   -s   Stable Optimal Flag: On [Optional]             Default:[OFF]   #turn on for more stable results
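
The -k, -r, and -c options are linked by a standard relation: a read of length L contains L - K + 1 K-mers, so the K-mer coverage equals the base coverage times (L - K + 1) / L. The Python sketch below applies that relation together with a simple peak-based size estimate. It is a textbook approximation assumed here for illustration, not the iterative procedure gncov runs; the function names are hypothetical, and treating -e as a threshold at or below which low-frequency K-mers are ignored is an assumption.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Convert base coverage to K-mer coverage for reads of a given length."""
    return base_coverage * (read_length - k + 1) / read_length


def estimate_genome_size(histogram, error_cutoff):
    """Peak-based estimate: total K-mer occurrences divided by the peak depth.

    histogram maps an occurrence count to the number of distinct K-mers
    with that count. Counts at or below error_cutoff are dropped
    (assumption: they are mostly sequencing errors).
    """
    kept = {freq: n for freq, n in histogram.items() if freq > error_cutoff}
    if not kept:
        raise ValueError("no K-mers above the error cut-off")
    peak_depth = max(kept, key=kept.get)  # most common occurrence count
    total_kmers = sum(freq * n for freq, n in kept.items())
    return total_kmers / peak_depth


# Defaults from the options table: -c 30, -r 70, -k 25.
expected_depth = kmer_coverage(30, 70, 25)

# Toy histogram: a low-frequency error tail plus a peak at depth 20.
toy_hist = {1: 500, 2: 50, 19: 100, 20: 400, 21: 100}
size = estimate_genome_size(toy_hist, error_cutoff=5)
print(expected_depth, size)
```

With the default settings, a base coverage of 30 corresponds to a K-mer coverage of about 19.7, which is why the read length and K-mer width are both needed to interpret the frequency table.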
Compile:
make
make install

Run:
/bin/fqKmerFreq_v1.06  #Count the K-mer frequency
/bin/gncov_v1.06  #predict the genome size based on the K-mer frequency

Sample:
Sample input list file
/bin/fqKmerFreq_v1.06 AE005174v2_fq.list 0 25
/bin/gncov_v1.06 -t AE005174v2_fq.list.25mer.countMH -o AE005174v2_fq.list.25mer -k 25 -r 70
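
The two sample commands can also be assembled programmatically. The Python sketch below builds the argument lists and checks -k and -r against the ranges in the options table. The helper names are hypothetical, and the output naming pattern (`<list>.<K>mer.countMH`) is inferred from the sample above rather than documented, so treat it as an assumption.

```python
def build_count_command(list_file, k=25, fasta=False,
                        binary="/bin/fqKmerFreq_v1.06"):
    """Argument list for the counting step: <inputList> <formatFlag> <K-mer length>."""
    return [binary, list_file, "1" if fasta else "0", str(k)]


def build_gncov_command(list_file, k=25, read_length=70,
                        binary="/bin/gncov_v1.06"):
    """Argument list for the prediction step, with range checks from the options table."""
    if not 17 <= k <= 25:
        raise ValueError("-k must be between 17 and 25")
    if not 25 <= read_length <= 100:
        raise ValueError("-r must be between 25 and 100")
    prefix = f"{list_file}.{k}mer"  # assumption: naming inferred from the sample
    return [binary, "-t", prefix + ".countMH", "-o", prefix,
            "-k", str(k), "-r", str(read_length)]


count_cmd = build_count_command("AE005174v2_fq.list")
gncov_cmd = build_gncov_command("AE005174v2_fq.list")
print(" ".join(count_cmd))
print(" ".join(gncov_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute the step; the commands are only printed here so the sketch runs without the GSP binaries installed.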

Sample output file
AE005174v2_fq.list.25mer.countMH
AE005174v2_fq.list.25mer.report

Additional document 1  (pdf)
Additional document 2  (pdf)

TEST DATA SET
E. coli 10-fold data set (download)

Staphylococcus aureus strain MW2 data set (download)
Staphylococcus aureus strain MW2 data Result (download)

Complete Staphylococcus aureus strain MW2 data set (link)



Copyright 2009-2010, Zhejiang University, China. All rights reserved.

Permission is granted to download and use GSP freely for academic purposes. Use by non-academics requires a license. Contact: Zhejiang University Ph.D., email: shangood@zju.edu.cn.

If you hope to know the genome size before de novo assembly, this is a definite must-have. It is beyond simple!
--goodgood


What's NEW?

2010-04-07
  GSP 1.06 released
2010-03-15
  GSP 1.05 released
2010-03-09
  GSP 1.04 released
2010-03-05
  GSP 1.03 released
2010-01-28
  GSP 1.0 released
2009-12-14
  GSP Registered at SourceForge.net