Publication 16441

Title:	IGD: a simple, efficient genotype data format
Journal:	Bioinformatics Advances
Published:	26 Dec 2024
Pubmed:	https://pubmed.ncbi.nlm.nih.gov/40980555/
DOI:	https://doi.org/10.1093/bioadv/vbaf205
URL:	https://academic.oup.com/bioinformaticsadvances/advance-article-pdf/doi/10.1093/bioadv/vbaf205/64140915/vbaf205.pdf

Abstract

Motivation: While there are a variety of file formats for storing reference-sequence-aligned genotype data, many are complex or inefficient. Programming language support for such formats is often limited. A file format that is simple to understand and implement, yet fast and small, is helpful for research on highly scalable statistical and population genetics methods.

Results: We present the Indexable Genotype Data (IGD) file format, a simple uncompressed binary format that can be more than 100× faster and 3.5× smaller than vcf.gz on biobank-scale whole-genome sequence data. The implementation for reading and writing IGD in Python is under 350 lines of code, which reflects the simplicity of the format.

Availability and implementation: A C++ library for reading and writing IGD, and tooling to convert .vcf.gz files, can be found at https://github.com/aprilweilab/picovcf. A Python library is at https://github.com/aprilweilab/pyigd.

2 Authors

Drew DeHaas
Xinzhu Wei

1 Application

Application ID	Title
97908	Scalable and accurate methods for understanding human complex traits
