20
Utilization of Sequence on Relatives to Improve Analysis of Individuals' Low-coverage NGS data
Low-coverage sequence data are expected to yield low call rates under the prevailing paradigm, in which genotypes are first “called” from the sequence data of each individual independently and subsequent analyses (including determination of haplotypes) depend on those called genotypes. However, provided 200+ individuals are sequenced, the number of haplotypes present in the region surrounding a gene should typically be considerably smaller than the number of individuals, so the effective sequence coverage per haplotype should be considerably higher than the coverage per individual, especially for the most heavily represented haplotypes. Given a set of haplotypes spanning the population for a defined genomic region, the likelihood that each sequencing read of an individual (mapped to that region) originated from each of the haplotypes can be computed. Pooling those likelihoods over the reads of each individual provides the likelihood of each individual carrying each haplotype, and conditioning on the pedigree through a peeling algorithm provides the probability distribution for each individual’s paternal and maternal haplotypes. Provided an individual has 100+ sequencing reads and there is sufficient pedigree structure, these distributions should often be relatively unambiguous. The haplotype assignment probabilities for each individual are then combined with the read likelihoods to compute the posterior probability of assigning each read to each haplotype. For individuals whose haplotypes are determined unambiguously, there are three possible cases: if the individual is homozygous, the read is assigned unambiguously to that haplotype; if the individual’s two haplotypes differ in sequence over the span of the read, the read will usually be assigned unambiguously to one of them; and if the two haplotypes are identical over the span of the read, the read is assigned with equal probability to each.
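The likelihood and read-assignment steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the uniform per-base error model, the ERROR_RATE value, the function names, and the toy haplotypes and reads are all assumptions made for illustration, and the pedigree peeling step is omitted entirely (each individual is treated as unrelated).

```python
import math
from itertools import combinations_with_replacement

ERROR_RATE = 0.01  # assumed per-base sequencing error rate (hypothetical value)

def read_likelihood(read, start, haplotype, err=ERROR_RATE):
    """P(read | haplotype), assuming independent, uniform per-base errors."""
    lik = 1.0
    for offset, base in enumerate(read):
        match = haplotype[start + offset] == base
        lik *= (1.0 - err) if match else (err / 3.0)
    return lik

def diplotype_log_likelihoods(reads, haplotypes):
    """Pool read likelihoods into a log-likelihood for each unordered haplotype pair.

    Each read is assumed equally likely to come from either haplotype of the pair.
    """
    result = {}
    for a, b in combinations_with_replacement(sorted(haplotypes), 2):
        total = 0.0
        for read, start in reads:
            la = read_likelihood(read, start, haplotypes[a])
            lb = read_likelihood(read, start, haplotypes[b])
            total += math.log(0.5 * la + 0.5 * lb)
        result[(a, b)] = total
    return result

def read_assignment(read, start, haplotypes, pair):
    """Posterior probability that a read came from each haplotype of a known pair."""
    a, b = pair
    if a == b:  # homozygous: the read is assigned unambiguously
        return {a: 1.0}
    la = read_likelihood(read, start, haplotypes[a])
    lb = read_likelihood(read, start, haplotypes[b])
    return {a: la / (la + lb), b: lb / (la + lb)}

# Toy data (hypothetical): three haplotypes over an 8-base region and
# three mapped reads given as (sequence, start position).
haplotypes = {"H1": "ACGTACGT", "H2": "ACGTTCGT", "H3": "ACCTACGT"}
reads = [("ACGTA", 0), ("GTTCG", 2), ("CGT", 5)]

log_liks = diplotype_log_likelihoods(reads, haplotypes)
best_pair = max(log_liks, key=log_liks.get)
for read, start in reads:
    print(read, read_assignment(read, start, haplotypes, best_pair))
```

In this toy case the third read spans a stretch where the two haplotypes of the best-supported pair are identical, so it is split equally between them, which corresponds to the third case described above. In the full approach, the pair likelihoods would feed the peeling algorithm rather than being maximized per individual.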
The reads assigned (probabilistically) to each haplotype are pooled over individuals and assembled to refine that haplotype’s sequence, aided by the generally deeper coverage and the homogeneous, haploid nature of haplotypes as compared to individuals. An iterative algorithm that takes advantage of these concepts has been developed and is being tested on a 13 kb region surrounding the myostatin gene in 268 beef bulls (including 80 sire-son pairs) of seven breeds and their crosses, sequenced at an average genomic depth of approximately 2X. This algorithm is based on the alternative paradigm of determining the underlying haplotypes directly from the sequence data and pedigree, then deriving genotypes (if needed) and performing other analyses subsequently.
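The pooling step can be sketched in the same spirit. The example below replaces true assembly with a simple weighted per-position vote, which is an assumption made for brevity; the function name update_haplotype and the toy reads and weights are hypothetical. Each read contributes its posterior assignment probability as a weight, and positions with no coverage keep the previous base.

```python
from collections import defaultdict

def update_haplotype(previous, weighted_reads):
    """Re-estimate one haplotype from probabilistically assigned reads.

    previous: current haplotype sequence (string).
    weighted_reads: iterable of (read, start, weight), where weight is the
    posterior probability that the read belongs to this haplotype.
    """
    counts = [defaultdict(float) for _ in previous]
    for read, start, weight in weighted_reads:
        for offset, base in enumerate(read):
            counts[start + offset][base] += weight
    consensus = []
    for pos, old_base in enumerate(previous):
        if counts[pos]:
            # Take the base with the largest pooled weight at this position.
            consensus.append(max(counts[pos], key=counts[pos].get))
        else:
            # No read covers this position, so keep the previous base.
            consensus.append(old_base)
    return "".join(consensus)

# Toy data (hypothetical): reads pooled over several individuals.
previous = "ACGTACGT"
pooled = [("ACGTT", 0, 0.9), ("GTTCG", 2, 1.0), ("ACGTA", 0, 0.2)]
updated = update_haplotype(previous, pooled)
print(updated)
```

An iterative scheme would alternate between this update and the read-assignment step until the haplotype sequences stop changing; how the authors' algorithm handles convergence is not stated here.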
USDA is an equal opportunity provider and employer.
Keywords: low-coverage NGS data