Abstract
Motivation: While there are a variety of file formats for storing reference-sequence-aligned genotype data, many are complex or inefficient. Programming language support for such formats is often limited. A file format that is simple to understand and implement, yet fast and small, is helpful for research on highly scalable statistical and population genetics methods.
Results: We present the Indexable Genotype Data (IGD) file format, a simple uncompressed binary format that can be more than 100× faster and 3.5× smaller than vcf.gz on biobank-scale whole-genome sequence data. The implementation for reading and writing IGD in Python is under 350 lines of code, which reflects the simplicity of the format.
Availability and implementation: A C++ library for reading and writing IGD, and tooling to convert .vcf.gz files, can be found at https://github.com/aprilweilab/picovcf. A Python library is at https://github.com/aprilweilab/pyigd.