Abstract
The Li and Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel. For small panels, the probabilistic parameterization of LS enables modeling the uncertainties of such mosaics. However, LS becomes inefficient when sample size is large, because of its linear time complexity. Recently the PBWT, an efficient data structure capturing the local haplotype matching among haplotypes, was proposed to offer a fast method for giving some optimal solution (Viterbi) to the LS HMM. Previously, we introduced the minimal positional substring cover (MPSC) problem as an alternative formulation of LS whose objective is to cover a query haplotype by a minimum number of segments from haplotypes in a reference panel. The MPSC formulation allows the generation of a haplotype threading in time constant to sample size (O(N)). This allows haplotype threading on very large biobank-scale panels on which the LS model is infeasible. Here, we present new results on the solution space of the MPSC. In addition, we derived a number of optimal algorithms for MPSC, including solution enumerations, the length maximal MPSC, and h-MPSC solutions. In doing so, our algorithms reveal the solution space of LS for large panels. We show that our method is informative in terms of revealing the characteristics of biobank-scale data sets and can improve genotype imputation.</p>