Whole genome sequencing of a single Bos taurus animal for SNP discovery
| Content Provider | Semantic Scholar |
|---|---|
| Author | Eck, Sebastian H.; Benet-Pagès, Anna; Flisikowski, Krzysztof; Meitinger, Thomas; Fries, Ruedi; Strom, Tim Matthias |
| Copyright Year | 2009 |
| Abstract | Background: The majority of the 2 million bovine SNPs currently available in dbSNP have been identified in a single breed, Hereford cattle, during the bovine genome project. In an attempt to evaluate the variation of a second breed, we have produced a whole genome sequence at low coverage of a single Fleckvieh bull. Results: We generated 24 gigabases of sequence, mainly using 36-bp paired-end reads, resulting in an average 7.4-fold sequence depth. This coverage was sufficient to identify 2.44 million SNPs, 82% of which were previously unknown, and 115,000 small indels. 9360 of the SNPs cause non-synonymous substitutions within coding regions. A comparison with the genotypes of the same animal generated on a 50k oligonucleotide chip revealed a detection rate of 74% and 30% for homozygous and heterozygous SNPs, respectively. We further determined the allele frequencies of 196 SNPs in 48 Fleckvieh and 48 Braunvieh bulls. 95% of the SNPs were polymorphic with an average minor allele frequency of 24.5%. The distribution of the minor allele frequency of tested SNPs was nearly uniform, with 83% of the SNPs having a minor allele frequency larger than 5%. Conclusions: This work provides the first single cattle genome by next-generation sequencing. The chosen approach, low- to medium-coverage re-sequencing, added more than 2 million novel SNPs to the currently publicly available SNP resource, providing a valuable resource for the construction of high-density oligonucleotide arrays in the context of genome-wide association studies. Background: The bovine reference genome sequence assembly resulted from the combination of shotgun and bacterial artificial chromosome (BAC) sequencing of an inbred Hereford cow and her sire using capillary sequencing. The majority of more than two million bovine SNPs deposited in dbSNP represent polymorphisms detected in these two Hereford animals [1]. Recently, van Tassel et al.
[2] contributed more than twenty-three thousand SNPs to the bovine SNP collection by next-generation sequencing of reduced representation libraries. The study involved 66 cattle representing different lines of a dairy breed (Holstein) and the 7 most common beef breeds (Angus, Red Angus, Charolais, Gelbvieh, Hereford, Limousin and Simmental). These SNPs, together with SNPs deposited in dbSNP, were used to compile arrays with up to 50,000 SNPs. The arrays have been used to implement a new approach to animal breeding, termed genomic selection [3, 4]. Although this approach has been applied successfully to predict breeding values in dairy cattle, the underlying SNP resource is far from complete. SNP selection for the Illumina BovineSNP50 array, for instance, has been optimized to provide high minor allele frequencies (MAFs) for the Holstein breed. The full extent of common SNP variation in Holstein and other breeds is still unexplored. Although the average r² between adjacent markers of the BovineSNP50 array is greater than 0.2, the minimal linkage disequilibrium (LD) required for genomic prediction to be sufficiently accurate, there is a considerable number of marker pairs with an r² of zero [3]. Since preliminary data indicate that the extent of LD in cattle breeds is only slightly larger than in humans, it has been estimated that up to 300,000 SNPs will be necessary to achieve optimal marker coverage throughout the cattle genome [5-8]. Circumventing any pooling or enrichment protocols, we sequenced just a single Fleckvieh animal to identify a large number of candidate SNPs. We demonstrate that this approach represents an effective strategy towards a comprehensive resource for common SNPs. Results and Discussion. Sequencing and alignment: The genomic DNA sequenced in this study was obtained from a single blood sample of a Fleckvieh breeding bull. Whole-genome sequencing was performed on an Illumina Genome Analyzer II using 3 different small-insert paired-end libraries.
We generated 36-bp reads on 44 paired-end lanes and 9 single-end lanes, resulting in 24 Gb of mappable sequence. 87% of the aligned bases had a phred-like quality score of 20 or more, as calculated by the ELAND alignment software [9]. To account for the varying read quality, we trimmed the ends of reads when necessary to a minimum of 32 bases. Read mapping, subsequent assembly and SNP calling were performed using the re-sequencing software MAQ [10]. Apparently duplicated paired-end reads (7.6%) were removed. 605,630,585 (93.6%) of the paired-end reads were successfully mapped in mate-pairs to the assembly bosTau4.0 from October 2007 [11], which has a length of 2.73 Gb. Additionally, 23,872,053 of the paired-end reads (3.6%) were mapped as singles. Of the 25,808,311 single-end reads, 93.2% could be aligned to the genome. Together, 98.0% of the genome (98.1% of the autosomes and 93.9% of the X chromosome) was covered by reads, resulting in a 7.4-fold coverage across the entire genome (7.58-fold across the autosomes and 4.13-fold across the X chromosome) and a 6.2-fold sequence depth using only the uniquely aligned reads. The final distribution of mapped read depth sampled at every position of the autosomal chromosomes showed a slight over-dispersion compared to the Poisson distribution, which gives the theoretical minimum (Fig. 1a). Part of this over-dispersion can be accounted for by the dependence of the read depth on the GC-content, with the average read depth peaking at approximately 57% GC-content (Fig. 1b) [9, 12]. SNP and indel detection: We focused our further analysis on SNP identification. We applied stringent criteria in order to keep the false-positive detection rate low. An outline of the analysis procedure, comprising SNP identification and validation, is given in Figure 2. SNPs were called with the MAQ software.
Using mainly the default parameters, particularly a minimum read depth of 3 and a minimum consensus quality of 20, SNPs could be assessed in sequence reads that together comprised 68% (1.87 Gb) of the genome. To exclude sequencing artifacts that we have observed in other experiments, the output of MAQ was further filtered using custom-developed scripts. These artifacts include cases where all sequenced variant alleles at a given position are only indicated by reads from one strand and have a lower than average base quality at the variant position. For a SNP call, we required that the average base quality be at least 20 and that at least 20% of the reads come from opposite strands. Using these parameters, the MAQ software called 2,921,556 putative SNPs, which were reduced by our custom filters to a final set of 2.44 million SNPs. Of these SNPs, 1,694,546 (69.4%) were homozygous and 749,091 (30.6%) were heterozygous. The low proportion of heterozygous SNPs is mainly due to the relatively low sequence depth and our stringent SNP calling requirements. The rate of heterozygous SNP detection is expected to rise with increasing coverage (Additional data file 1). It has been estimated that at least 20-30 fold coverage is needed to detect 99% of the heterozygous variants [10]. We further performed a genome-wide survey of small insertion and deletion events (indels). Indels called by MAQ were only retained if they were indicated by at least 10% of high-quality reads from each strand. This criterion was applied to exclude possible sequencing artifacts and resulted in the identification of 115,371 indels (68,354 deletions and 47,017 insertions). The majority of them had a length of 1-4 base pairs, with the largest having a length of 15 bp (Fig. 3). Next, we compared the identified SNP and indel variants with those already published.
Since the dbSNP set is not yet mapped to the bosTau4 assembly, we compared our findings with the 2.08 million SNPs mapped by the Baylor College Bovine Genome Project. The comparison showed that 18% (451,914) of the SNPs were shared between both sets (Table 1). Functional annotation: We used the RefSeq (9518 genes) and Ensembl (28,045 genes) gene sets to functionally annotate the detected variants (Table 1). Using the RefSeq genes as reference, we found 7619 coding SNPs (3139 leading to non-synonymous amino-acid substitutions), 40 SNPs at canonical splice sites and 6292 SNPs in untranslated regions (UTRs). Additionally, 203 indels were located in coding regions, with almost all of them (201) causing a frame-shift in the corresponding gene. The remaining two indels comprise single amino-acid deletions. The Ensembl gene set is larger and also includes gene predictions. Thus, more variants are detected using this set. We identified 22,070 coding SNPs (9360 non-synonymous substitutions), 148 SNPs at donor or acceptor splice sites and 8114 SNPs in UTRs. Furthermore, we identified 425 indels in Ensembl-annotated coding regions. The majority of them (414) cause a frame-shift in the reading frame of the associated gene, 9 indels lead to single amino-acid deletions and 2 were single amino-acid insertions. Comparison of sequence and array results: We assessed the accuracy and completeness of the sequence-based SNP calls by comparing them with the genotypes of the same animal generated with an Illumina BovineSNP50 array. This chip contains 54,001 SNPs, of which 48,188 map to the current assembly (bosTau4). Of those, 48,025 SNPs were successfully genotyped. 22,299 homozygous calls exhibited the reference allele, leaving 12,043 homozygous and 13,683 heterozygous SNPs that differed from the reference sequence assembly.
We used these 25,726 positions, together with 16 positions where only the MAQ call differed from the reference sequence, to examine the accuracy and sensitivity of SNP calling in more detail. We first estimated the proportion of concordant calls. Of the 12,043 homozygous array-based calls that differed from the reference sequence, 8974 (74.51%) were also called by MAQ. In 8949 (99.72%) of these positions, both platforms showed concordant genotypes. Of the 13,683 heterozygous array-based calls, MAQ called only 5882 (42.98%) positions, and only 4157 (70.67%) of |
| File Format | PDF HTM / HTML |
| Alternate Webpage(s) | http://genomebiology.com/content/pdf/gb-2009-10-8-r82.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |
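
The custom SNP filter described in the abstract text above (average base quality of at least 20, and at least 20% of the reads from opposite strands) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' scripts: the `VariantRead` type and the `passes_custom_filter` function are hypothetical names, and "at least 20% of the reads from opposite strands" is interpreted here as the minority strand contributing at least 20% of the variant-supporting reads.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VariantRead:
    """One read supporting the variant allele (hypothetical structure)."""
    base_quality: int      # phred-like quality of the base at the variant position
    forward_strand: bool   # True if the read maps to the forward strand


def passes_custom_filter(
    reads: Sequence[VariantRead],
    min_avg_quality: float = 20.0,            # threshold stated in the text
    min_minor_strand_fraction: float = 0.20,  # assumed reading of "20% from opposite strands"
) -> bool:
    """Return True if a putative SNP survives the quality and strand checks."""
    if not reads:
        return False

    # Average base quality at the variant position across supporting reads.
    avg_quality = sum(r.base_quality for r in reads) / len(reads)

    # Fraction of supporting reads that come from the less represented strand.
    forward = sum(1 for r in reads if r.forward_strand)
    minor_strand = min(forward, len(reads) - forward)
    minor_fraction = minor_strand / len(reads)

    return avg_quality >= min_avg_quality and minor_fraction >= min_minor_strand_fraction


if __name__ == "__main__":
    both_strands = [VariantRead(30, True)] * 4 + [VariantRead(30, False)]
    one_strand = [VariantRead(30, True)] * 5
    print(passes_custom_filter(both_strands))
    print(passes_custom_filter(one_strand))
```

A call supported only by reads from one strand is rejected regardless of base quality, which corresponds to the artifact class the text describes. The minimum read depth of 3 and the minimum consensus quality of 20 are MAQ parameters applied upstream and are not modeled here.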