AGSNP is an annotation-based, genome-wide SNP discovery pipeline package that uses NGS data for large and complex genomes without a reference genome sequence. Roche 454 shotgun reads with low genome coverage of one individual are annotated in order to distinguish single-copy sequences and repeat junctions from repetitive sequences and sequences shared by paralogous genes. Multiple genome equivalents of shotgun reads of another individual, generated with SOLiD or Solexa, are then mapped to the annotated Roche 454 reads to identify putative SNPs. The pipeline is suitable for SNP discovery in genomic libraries of complex genomes and is applicable to all current NGS platforms, provided that at least one of them generates relatively long reads. The pipeline package, with a user's guide, is available upon request.

You FM, Huo N, Deal KR, Gu YQ, Luo MC, McGuire PE, Dvorak J, Anderson OD. Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence. BMC Genomics, 2011, 12(1):59.
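
The annotate-then-map idea summarized above can be illustrated with a toy example. The sketch below is not the AGSNP code: the function name, the annotation labels, and both thresholds (`min_depth`, `min_alt_fraction`) are hypothetical, and real read mapping is replaced by pre-computed, gap-free alignment offsets. It only shows the principle that a putative SNP is reported when short reads from the second individual consistently disagree with an annotated long read from the first, and only when that long read falls in a class considered callable.

```python
from collections import Counter

# Hypothetical labels for annotation classes in which SNPs are called;
# reads annotated as repeats or as shared by paralogous genes are skipped.
CALLABLE_CLASSES = {"gene", "single_copy", "repeat_junction"}


def call_snps(ref_seq, annotation, alignments, min_depth=3, min_alt_fraction=0.9):
    """Return (position, ref_base, alt_base) tuples for one annotated long read.

    ref_seq     -- sequence of the annotated long read (first individual)
    annotation  -- annotation class assigned to that read
    alignments  -- list of (start_offset, short_read_sequence) pairs
                   from the second individual, assumed gap-free
    The two thresholds are illustrative assumptions, not AGSNP defaults.
    """
    if annotation not in CALLABLE_CLASSES:
        return []

    # Build one base-count column per reference position.
    columns = [Counter() for _ in ref_seq]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(ref_seq):
                columns[pos][base] += 1

    snps = []
    for pos, counts in enumerate(columns):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few short reads to make a call
        base, n = counts.most_common(1)[0]
        if base != ref_seq[pos] and n / depth >= min_alt_fraction:
            snps.append((pos, ref_seq[pos], base))
    return snps


reads = [(0, "ACGAAC"), (2, "GAACGT"), (1, "CGAACG")]
print(call_snps("ACGTACGT", "single_copy", reads))
print(call_snps("ACGTACGT", "repeat", reads))
```

In this toy input, all three short reads carry an A where the long read has a T at position 3, so that position is reported for the single-copy read and nothing is reported for the read annotated as a repeat.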

Function                     Program                          Dependency
Roche 454 read annotation    BLAST search                     Blast2 package
                             Gene annotation                  BWA package
                             Assembly                         gsAssembler (Newbler) (454 Life Technologies)
                             Removing artificial duplicates   cd-hit-454
SNP discovery                Format conversion utilities      BWA package
                             Read mapping and SNP calling     BWA package
                             SNP filtering                    SummarizeGBSSNPs.jar
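
Before running a pipeline with external dependencies like those listed above, it can help to check that the tools are installed. The following is a minimal sketch, not part of AGSNP. The executable names are assumptions based on how these packages are commonly distributed (`blastall` for legacy NCBI BLAST, `bwa` for BWA, `runAssembly` for Newbler, `cd-hit-454` for CD-HIT); the AGSNP user's guide may expect different names or paths. SummarizeGBSSNPs.jar is a Java archive rather than an executable on PATH, so it is not covered here.

```python
import shutil

# Assumed executable names; not taken from the AGSNP user's guide.
ASSUMED_EXECUTABLES = {
    "Blast2 package": "blastall",
    "BWA package": "bwa",
    "gsAssembler (Newbler)": "runAssembly",
    "cd-hit-454": "cd-hit-454",
}


def missing_dependencies(executables=ASSUMED_EXECUTABLES, which=shutil.which):
    """Return the dependency names whose executable is not found on PATH."""
    return [name for name, exe in executables.items() if which(exe) is None]


if __name__ == "__main__":
    for name in missing_dependencies():
        print(f"not found on PATH: {name}")
```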


Genome-wide SNP discovery in the large and complex Aegilops tauschii genome

The AGSNP pipeline package was used for SNP discovery between two accessions of Ae. tauschii, AL8/78 and AS75, the parents of the F2 mapping population used for the construction of an Ae. tauschii genetic map (Luo et al., 2009). Aegilops tauschii is the core genome of the Triticum-Aegilops alliance and the diploid source of the wheat D genome. Its genome is 4.02 Gb in size and contains 90% repetitive sequences. It is also an important source of germplasm in wheat breeding and a diploid model for the wheat D genome.

Genomic DNA of Ae. tauschii accession AL8/78 was sequenced with the Roche 454 NGS platform. Genomic DNA and cDNA of Ae. tauschii accession AS75 were sequenced primarily with SOLiD, although some Solexa genomic sequences were also generated. A total of 195,631 putative SNPs were discovered in gene sequences, 155,580 in other uncharacterized single-copy regions, and 145,907 in repeat junctions. SNPs were dispersed across the whole genome. To assess the false positive SNP discovery rate, DNA fragments containing putative SNPs were PCR amplified from AL8/78 and AS75 genomic DNA and resequenced with ABI 3730xl. In a sample of 186 randomly selected putative SNPs, 84.0% in gene regions and 88.0% in repeat junctions were validated.
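
The three counts reported above can be combined into an overall total and per-class shares. The short calculation below uses only the numbers given in the text; the dictionary keys are informal labels chosen for this example.

```python
# Putative SNP counts as reported in the text above.
snp_counts = {
    "gene sequences": 195_631,
    "other single-copy regions": 155_580,
    "repeat junctions": 145_907,
}

total = sum(snp_counts.values())
print(f"total putative SNPs: {total:,}")  # 497,118

# Share of each class in the total.
for region, count in snp_counts.items():
    print(f"{region}: {count:,} ({count / total:.1%})")
```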

Putative SNPs discovered in Aegilops tauschii:
