Abstract
Accurate estimation of population allele frequency (AF) is crucial for gene discovery and genetic diagnostics. However, determining AF for frameshift-inducing small insertions and deletions (indels) faces challenges due to discrepancies in mapping and variant calling methods. Here, we propose an innovative approach to assess indel AF. We developed CRAFTS-indels (Calculating Regional Allele Frequency Targeting Small indels), an algorithm that combines AF of distinct indels within a given region and provides "regional AF" (rAF). We tested and validated CRAFTS-indels using three independent datasets: gnomAD v2 (n=125,748 samples), an internal dataset (IGM; n=39,367), and the UK BioBank (UKBB; n=469,835). By comparing rAF against standard AF, we identified rare indels with rAF exceeding standard AF (sAF≤10-4 and rAF>10-4) as "rAF-hi" indels. Notably, a high percentage of rare indels were "rAF-hi", with a higher proportion in gnomAD v2 (11-20%) and IGM (11-22%) compared to the UKBB (5-9% depending on the CRAFTS-indels' parameters). Analysis of the overlap of regions based on their rAF with low complexity regions and with ClinVar classification supported the pertinence of rAF. Using the internal dataset, we illustrated the utility of CRAFTS-indel in the analysis of de novo variants and the potential negative impact of rAF-hi indels in gene discovery. In summary, annotation of indels with cohort specific rAF can be used to handle some of the limitations of current annotation pipelines and facilitate detection of novel gene disease associations. CRAFTS-indels offers a user-friendly approach to providing rAF annotation. It can be integrated into public databases such as gnomAD, UKBB and used by ClinVar to revise indel classifications.</p>