Variants

Module genomvar.variant contains classes representing genomic alterations.

The hierarchy of the classes used in the package is the following:

                  VariantBase
                  /      |   \
 AmbigIndel <-Indel      V   |
   |     |      / \     MNP  V
   | ----+--- Del  Ins   |  Haplotype
   | |   |          |    V
   V V   |          V   SNP
AmbigDel -----> AmbigIns

All variants have start and end attributes, which define the range they act on and allow variants to be searched for by overlap.
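To illustrate what an overlap search over start and end means, here is a minimal sketch. It assumes ranges are half-open (end excluded), as stated for Del further down. The helper ranges_overlap is hypothetical and is not part of genomvar; how the package itself performs the search is not shown here.

```python
def ranges_overlap(start_a, end_a, start_b, end_b):
    """Return True if half-open ranges [start_a, end_a) and [start_b, end_b) share a position."""
    return start_a < end_b and start_b < end_a


# A variant on 576-577 falls inside the query range 570-580 ...
print(ranges_overlap(576, 577, 570, 580))  # True
# ... but not inside 577-590, because position 577 itself is excluded.
print(ranges_overlap(576, 577, 577, 590))  # False
```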

To test whether a variant is an instance of some type, the is_variant_instance() method can be used. Variant equality can be tested using edit_equal().

Objects can be instantiated directly, e.g.:

>>> from genomvar import variant
>>> vrt = variant.MNP('chr1',154678,'GT')
>>> print(vrt)
<MNP chr1:154678-154680 NN/GT>

This will create an MNP which replaces the nucleotides at positions 154678 and 154679 on chromosome 1 with GT.

Alternatively, variants can be created using VariantFactory objects, which can work with VCF-like notation. For example:

>>> from genomvar.variant import VariantFactory
>>> fac = VariantFactory()
>>> vrt = fac.from_edit('chr15',575,'TA','T')
>>> print(vrt)
<Del chr15:576-577 A/->

Positions are 0-based, so this creates a deletion of the single nucleotide at 0-based position 576, i.e. the 577th nucleotide of chromosome 15.
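The step from the VCF-like edit (575, 'TA', 'T') to the range 576-577 can be sketched as stripping the prefix shared by ref and alt. The function trim_edit below is a hypothetical illustration, not the code used by VariantFactory; it only trims a common prefix and ignores common suffixes.

```python
def trim_edit(pos, ref, alt):
    """Strip the prefix shared by ref and alt and classify what remains.

    Returns (kind, start, end, ref_rest, alt_rest) with 0-based, half-open
    coordinates. Hypothetical helper, not part of genomvar.
    """
    shared = 0
    while shared < min(len(ref), len(alt)) and ref[shared] == alt[shared]:
        shared += 1
    ref_rest, alt_rest = ref[shared:], alt[shared:]
    start = pos + shared
    if ref_rest and not alt_rest:
        # Only reference nucleotides remain: a deletion of exactly those.
        return ("Del", start, start + len(ref_rest), ref_rest, "-")
    if alt_rest and not ref_rest:
        # Only alternative nucleotides remain: an insertion before `start`.
        return ("Ins", start, start + 1, "-", alt_rest)
    return ("other", start, start + len(ref_rest), ref_rest, alt_rest)


print(trim_edit(575, "TA", "T"))  # ('Del', 576, 577, 'A', '-')
```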

Alternatively, a limited subset of HGVS notation is supported (numbering in HGVS strings is 1-based, following the spec):

>>> vrt = fac.from_hgvs('chr1:g.15C>A')
>>> print(vrt)
<SNP chr1:14 C/A>
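The shift from the 1-based HGVS position 15 to the 0-based position 14 can be sketched with a small parser for simple substitutions. The regular expression and the function parse_hgvs_substitution are assumptions made for illustration; they do not reflect how from_hgvs is implemented or which HGVS forms it accepts.

```python
import re

# Matches only simple genomic substitutions such as "chr1:g.15C>A".
HGVS_SUBSTITUTION = re.compile(
    r"^(?P<chrom>[^:]+):g\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_hgvs_substitution(hgvs):
    """Return (chrom, 0-based position, ref, alt) for a simple substitution."""
    match = HGVS_SUBSTITUTION.match(hgvs)
    if match is None:
        raise ValueError(f"unsupported HGVS string: {hgvs!r}")
    # HGVS counts from 1, so subtract 1 to get the 0-based position.
    return match["chrom"], int(match["pos"]) - 1, match["ref"], match["alt"]


print(parse_hgvs_substitution("chr1:g.15C>A"))  # ('chr1', 14, 'C', 'A')
```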

Variant sets defined in genomvar.varset use class GenomVariant. Objects of this class contain a genomic alteration (attribute base) and, optionally, a genotype (attribute GT) and other attributes commonly found in VCF files (attribute attrib). Attribute base is an object of some VariantBase subclass (SNP, Del, etc.).

Variant classes

The basic classes are SNP, representing a single-nucleotide polymorphism, and its generalization MNP, a multiple-nucleotide polymorphism.

class genomvar.variant.SNP(chrom, start, alt, end=None, ref=None)

Single-nucleotide polymorphism.

For instantiation it requires chromosome, position and alternative sequence.

>>> from genomvar import variant
>>> variant.SNP('chr1',154678,'T')
SNP("chr1",154678,"T")

class genomvar.variant.MNP(chrom, start, alt, end=None, ref=None)

Multiple-nucleotide polymorphism. Substitute N nucleotides of the reference for N other nucleotides.

For instantiation it requires chromosome, position and alternative sequence. end will be inferred from start and alt. ref is also optional.

>>> from genomvar import variant
>>> variant.MNP('chr1',154678,'GT')
MNP("chr1",154678,"GT")
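How end follows from start and alt can be sketched as below, under the assumption (consistent with the printed range 154678-154680 for GT) that an MNP covers one reference position per alternative nucleotide. The helpers mnp_range and apply_mnp are hypothetical and work on plain strings rather than genomvar objects.

```python
def mnp_range(start, alt):
    """Half-open range replaced by an MNP: one reference position per alt nucleotide."""
    return start, start + len(alt)


def apply_mnp(seq, start, alt):
    """Substitute len(alt) nucleotides of seq, beginning at 0-based start, with alt."""
    start, end = mnp_range(start, alt)
    return seq[:start] + alt + seq[end:]


print(mnp_range(154678, "GT"))       # (154678, 154680)
print(apply_mnp("AACCA", 1, "GT"))   # AGTCA
```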

There are separate classes for insertions and deletions, both inheriting from the abstract class Indel.

class genomvar.variant.Ins(chrom, start, alt, end=None, ref=None)

Insertion of nucleotides. For instantiation chrom, start and inserted sequence (alt) are required.

Start and end denote the nucleotide after the inserted sequence, i.e. start is the 0-based number of the nucleotide after the insertion, and end is start+1 by definition.

>>> from genomvar.variant import Ins
# Insertion of TA before position chr2:100543.
>>> print(Ins('chr2',100543,'TA'))
<Ins chr2:100543 -/TA>

class genomvar.variant.Del(chrom, start, end, ref=None, alt=None)

Deletion of nucleotides. For instantiation chrom, start (0-based), and end (position end is excluded) are required.

>>> from genomvar.variant import Del
# Deletion of 3 nucleotides starting at chr3:7843488 (0-based)
>>> print(Del('chr3',7843488,7843488+3))
<Del chr3:7843488-7843491 NNN/->
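The coordinate conventions of Ins and Del can be made concrete on a toy string. This is only a sketch of the semantics described above (the insertion goes before the nucleotide at start, the deletion removes start up to but excluding end). The functions apply_ins and apply_del are hypothetical and are not genomvar API.

```python
def apply_ins(seq, start, alt):
    """Insert alt before the nucleotide at 0-based position start."""
    return seq[:start] + alt + seq[start:]


def apply_del(seq, start, end):
    """Delete nucleotides from 0-based start up to, but excluding, end."""
    return seq[:start] + seq[end:]


toy = "ACGTACGT"
print(apply_ins(toy, 4, "TA"))  # ACGTTAACGT
print(apply_del(toy, 2, 5))     # ACCGT (3 nucleotides removed)
```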

There is a special flavor of indels for cases when an indel can be applied in several places resulting in the same alternate sequence, termed ambiguous indels. They are AmbigDel and AmbigIns, which on top of the regular deletion or insertion attributes contain information about the region they can be applied to. For instantiation, a VariantFactory with a reference is needed.

class genomvar.variant.AmbigIns(chrom, start, end, alt, ref=None)

Class representing an insertion whose position is ambiguous. Ambiguity means the insertion could be applied at any position of some region, resulting in the same alternative sequence.

Let the reference file test.fasta contain a toy sequence:

>seq1
TTTAATA

Consider a variant extending the 3 Ts in the beginning by one more T. It can be done in several places, so the corresponding insertion can be given as an AmbigIns object:

>>> from genomvar import Reference
>>> from genomvar.variant import VariantFactory
>>> fac = VariantFactory(Reference('test.fasta'),normindel=True)
>>> print( fac.from_edit('seq1',0,'T','TT') )
<AmbigIns seq1:0-4(1-2) -/T>

Positions 1 and 2 are the actual start and end, meaning that T is inserted before the nucleotide occupying range 1-2. Positions 0-4 indicate that start and end can be extended to these values, resulting in the same alteration.
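The range 0-4 can be reproduced on the toy sequence with a brute-force sketch: shift the insertion point left and right for as long as the resulting sequence stays the same. The function insertion_ambiguity_range is a hypothetical illustration and says nothing about how VariantFactory normalizes indels internally.

```python
def insertion_ambiguity_range(seq, start, alt):
    """Return (start, end) of the region where inserting alt gives the same result.

    Positions are 0-based; end is the last equivalent insertion point plus one,
    following the convention that an insertion spans start to start+1.
    """
    def insert_at(pos):
        return seq[:pos] + alt + seq[pos:]

    target = insert_at(start)
    left = start
    while left > 0 and insert_at(left - 1) == target:
        left -= 1
    right = start
    while right < len(seq) and insert_at(right + 1) == target:
        right += 1
    return left, right + 1


print(insertion_ambiguity_range("TTTAATA", 1, "T"))  # (0, 4)
```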

class genomvar.variant.AmbigDel(chrom, start, end, ref=None, alt=None)

Class representing a deletion whose position is ambiguous. Ambiguity means the same number of positions could be deleted anywhere in some range, resulting in the same alternative sequence.

Let the reference file test.fasta contain a toy sequence:

>seq1
TCTTTTTGACTGG

>>> fac = VariantFactory(Reference('test.fasta'),normindel=True)
>>> print( fac.from_edit('seq1',1,'CTTTTTGAC','C') )
<AmbigDel seq1:1-11(2-10) TTTTTGAC/->

The deletion of TTTTTGAC starts at position 2 and ends at position 9 inclusive, which gives the range 2-10 because end is excluded. The range 1-11 indicates that start and end can be extended to these values, resulting in the same alteration.
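The range 1-11 can likewise be checked by brute force on the toy sequence: slide a deletion window of the same length left and right while the remaining sequence is unchanged. The function deletion_ambiguity_range is a hypothetical sketch and is not the algorithm genomvar uses.

```python
def deletion_ambiguity_range(seq, start, end):
    """Return (start, end) of the region over which a deletion window can slide.

    Coordinates are 0-based and half-open; the window keeps its length end - start.
    """
    length = end - start

    def delete_at(pos):
        return seq[:pos] + seq[pos + length:]

    target = delete_at(start)
    left = start
    while left > 0 and delete_at(left - 1) == target:
        left -= 1
    right = start
    while right + 1 + length <= len(seq) and delete_at(right + 1) == target:
        right += 1
    return left, right + length


print(deletion_ambiguity_range("TCTTTTTGACTGG", 2, 10))  # (1, 11)
```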

There is a separate class Haplotype which can hold any combination of objects of variant classes above.

class genomvar.variant.Haplotype(chrom, variants)

An object representing genome variants on the same chromosome (or contig).

Can be instantiated from a list of GenomVariant objects using the Haplotype.from_variants() class method.

Additionally, the module contains two technically driven types: Null (no variant at all, i.e. reference) and Asterisk (for * in the ALT field of VCF files).