Variant Module
This module defines classes for representing and annotating different types of genetic variants, including SNVs, indels, structural variants, and tandem repeats. It is not meant for storing every variant from whole-genome or whole-exome sequencing. There is currently no support for storing very large numbers of variants (on the order of hundreds of millions, which would be about 40-50 WGS samples). That support might come in the future.
If you have a smaller subset of variants, for example the result of filtering a VCF file, you may be able to use this module to represent them and store them in the knowledge base database.
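As an illustration of what "a filtered VCF file" could mean in practice, the following is a minimal sketch in plain Python that keeps only records whose FILTER column is PASS. The helper name `iter_pass_records` and the example data are hypothetical and are not part of this module. The dictionary keys are chosen to mirror the keyword arguments used in the class examples in this document.

```python
import io


def iter_pass_records(handle):
    """Yield one dict per VCF data line whose FILTER column is PASS.

    Hypothetical helper for illustration; it is not part of benchmate.
    Multi-allelic ALT values are passed through unchanged.
    """
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header, and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, vid, ref, alt, _qual, flt = fields[:7]
        if flt != "PASS":
            continue
        yield {
            "chrom": chrom,
            "pos": int(pos),
            "id": vid,
            "ref": ref,
            "alt": alt,
            "filter": flt,
        }


# Tiny made-up VCF used only to exercise the helper.
example_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t12345\t.\tA\tT\t99\tPASS\t.\n"
    "1\t22345\t.\tG\tC\t10\tLowQual\t.\n"
)
records = list(iter_pass_records(example_vcf))
```

For real files, a dedicated VCF parser is usually a better choice than splitting lines by hand; the sketch only shows the kind of reduced record set this module is aimed at.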
Classes
BaseVariant
Description:
Base class for all variant types. Stores core attributes such as chromosome, position, filter status, ID, and annotations.
Public Methods & Usage:
from benchmate.variant.variant import BaseVariant
# Create a base variant
variant = BaseVariant(chrom="1", pos=12345, filter="PASS")
# Add an annotation
variant.add_annotation("impact", "HIGH")
# Query an annotation
impact = variant.query_annotation("impact")
SequenceVariant
Description:
Represents SNV and indel variants. Extends BaseVariant with reference/alternate alleles and sample/callset-specific fields.
Public Methods & Usage:
from benchmate.variant.variant import SequenceVariant
# Create a sequence variant
seq_var = SequenceVariant(
    chrom="1", pos=12345, ref="A", alt="T", qual=99.0, gt="0/1", dp=30
)
# Add and query annotations (inherited)
seq_var.add_annotation("gene", "BRCA1")
gene = seq_var.query_annotation("gene")
StructuralVariant
Description:
Represents structural variants (e.g., INS, DEL, INV, DUP, BND, CNV). Extends BaseVariant with SV-specific fields.
Public Methods & Usage:
from benchmate.variant.variant import StructuralVariant
# Create a structural variant
sv = StructuralVariant(
    chrom="2", pos=20000, svtype="DEL", end=20500, svlen=500, gt="1/1"
)
# Annotate and query
sv.add_annotation("clinical_significance", "pathogenic")
significance = sv.query_annotation("clinical_significance")
TandemRepeatVariant
Description:
Represents tandem repeat variants, including repeat motif, allele length, and sample-specific metrics.
Public Methods & Usage:
from benchmate.variant.variant import TandemRepeatVariant
# Create a tandem repeat variant
tr = TandemRepeatVariant(
    chrom="3", pos=30000, end=30020, motif="CAG", al=10, gt="0/1"
)
# Annotate and query
tr.add_annotation("repeat_expansion", True)
is_expanded = tr.query_annotation("repeat_expansion")
You can convert these variants to HGVS format using the to_hgvs function. It can be used on its own, and its output is also useful as input to other methods such as api.ensemble.Ensembl.vep:
from benchmate.variant.variant import SequenceVariant
from benchmate.variant.utils import to_hgvs
# Convert to HGVS format
seq_var = SequenceVariant(
    chrom="1", pos=12345, ref="A", alt="T", qual=99.0, gt="0/1", dp=30
)
hgvs_variant = to_hgvs(seq_var)
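To give a feel for what an HGVS-style string looks like, here is a minimal standalone sketch for the simplest case, a single-nucleotide substitution. The function `snv_to_hgvs_g` is hypothetical and is not the implementation of to_hgvs; the exact output of to_hgvs may differ. Strict HGVS uses a reference sequence accession where this sketch uses the bare chromosome name.

```python
def snv_to_hgvs_g(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Format a single-base substitution in HGVS-style genomic notation.

    Hypothetical helper for illustration only; indels and other
    variant types need different HGVS syntax and are rejected here.
    """
    if len(ref) != 1 or len(alt) != 1:
        raise ValueError("this sketch only handles single-nucleotide substitutions")
    return f"{chrom}:g.{pos}{ref}>{alt}"


hgvs_string = snv_to_hgvs_g("1", 12345, "A", "T")
print(hgvs_string)  # 1:g.12345A>T
```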
How the data is stored:
The variants are stored in a knowledge base database, which allows for efficient querying and retrieval of variant information. Each variant type has its own table with fields corresponding to the attributes defined in the classes. Annotations are stored in a separate table linked to the variant tables, allowing for flexible and extensible annotation of variants. Annotations are basic JSON columns, which Postgres converts to its binary JSONB representation and which are loaded back as dictionaries.
The database is currently not structured to support the billions of variants and annotations that you would get from a large-scale GWAS. If you plan to use this as a variant store, use it for your filtered variants only. Also keep in mind that there is no column for distinguishing samples. You will need a secondary table to track the variant->sample connection.
Please create an issue if you think this is a desirable feature.
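The secondary table mentioned above could look roughly like the following sketch. It uses an in-memory SQLite database from the Python standard library as a stand-in for Postgres, and all table and column names (`variant`, `variant_annotation`, `sample_variant`) are assumptions made for illustration, not the actual knowledge base schema. Annotations are stored as JSON text here, whereas in Postgres they would sit in a JSON-typed column.

```python
import json
import sqlite3

# In-memory SQLite stands in for the real Postgres database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE variant (
        id INTEGER PRIMARY KEY,
        chrom TEXT NOT NULL,
        pos INTEGER NOT NULL,
        ref TEXT NOT NULL,
        alt TEXT NOT NULL
    );
    -- Annotations live in their own table, linked to the variant table.
    CREATE TABLE variant_annotation (
        variant_id INTEGER NOT NULL REFERENCES variant(id),
        data TEXT NOT NULL
    );
    -- Hypothetical link table tracking which sample carries which variant.
    CREATE TABLE sample_variant (
        sample_id TEXT NOT NULL,
        variant_id INTEGER NOT NULL REFERENCES variant(id),
        gt TEXT,
        PRIMARY KEY (sample_id, variant_id)
    );
    """
)

cur = conn.execute(
    "INSERT INTO variant (chrom, pos, ref, alt) VALUES (?, ?, ?, ?)",
    ("1", 12345, "A", "T"),
)
variant_id = cur.lastrowid

# Serialize the annotation dictionary to JSON before storing it.
conn.execute(
    "INSERT INTO variant_annotation (variant_id, data) VALUES (?, ?)",
    (variant_id, json.dumps({"gene": "BRCA1", "impact": "HIGH"})),
)

conn.executemany(
    "INSERT INTO sample_variant (sample_id, variant_id, gt) VALUES (?, ?, ?)",
    [("sample_A", variant_id, "0/1"), ("sample_B", variant_id, "1/1")],
)

# Find every sample carrying the variant, together with its annotations.
rows = conn.execute(
    """
    SELECT s.sample_id, s.gt, a.data
    FROM sample_variant AS s
    JOIN variant AS v ON v.id = s.variant_id
    JOIN variant_annotation AS a ON a.variant_id = v.id
    WHERE v.chrom = ? AND v.pos = ?
    ORDER BY s.sample_id
    """,
    ("1", 12345),
).fetchall()

# Annotations come back as plain dictionaries after json.loads.
carriers = [(sample_id, gt, json.loads(data)) for sample_id, gt, data in rows]
```

Keeping the genotype on the link table rather than on the variant row lets one stored variant be shared by many samples, which is the main reason for a separate table in this sketch.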