Package contents

Module variant.py

Module genomvar.variant contains classes representing genomic alterations.

The hierarchy of the classes used in the package is the following:

                  VariantBase
                  /      |   \
 AmbigIndel <-Indel      V   |
   |     |      / \     MNP  V
   | ----+--- Del  Ins   |  Haplotype
   | |   |          |    V
   V V   |          V   SNP
AmbigDel -----> AmbigIns

All variants have start and end attributes, which define the range they act on and make it possible to search for overlaps.

To test whether a variant is an instance of some type, the is_variant_instance() method can be used. Variant equality can be tested using edit_equal().

Objects can be instantiated directly, e.g.:

>>> vrt = variant.MNP('chr1',154678,'GT')
>>> print(vrt)
<MNP chr1:154678-154680 NN/GT>

This will create an MNP which substitutes positions 154678 and 154679 on chromosome 1 for GT.
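The range arithmetic behind this example can be sketched in plain Python. The helpers below are hypothetical and are not part of genomvar; they only assume, as the printed range 154678-154680 suggests, that start is 0-based and end is excluded.

```python
def mnp_range(start, alt):
    """Return (start, end) of a substitution; end is excluded (half-open range)."""
    return start, start + len(alt)


def affected_positions(start, alt):
    """List the 0-based positions replaced by the alternative sequence."""
    first, last_excluded = mnp_range(start, alt)
    return list(range(first, last_excluded))


print(mnp_range(154678, "GT"))           # (154678, 154680)
print(affected_positions(154678, "GT"))  # [154678, 154679]
```

Under this convention the length of a substitution is always end - start.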

Alternatively, variants can be created using VariantFactory objects. This class can work with VCF-like notation. For example:

>>> fac = VariantFactory()
>>> vrt = fac.from_edit('chr15',575,'TA','T')
>>> print(vrt)
<Del chr15:576-577 A/->

The position is 0-based, so this creates a deletion of the nucleotide at 0-based position 576 (1-based position 577) of chromosome 15.

Alternatively, a limited subset of HGVS notation is supported (numbering in HGVS strings is 1-based, following the spec):

>>> vrt = fac.from_hgvs('chr1:g.15C>A')
>>> print(vrt)
<SNP chr1:14 C/A>
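The shift from 1-based HGVS numbering to 0-based coordinates can be illustrated with a small stand-alone parser. This is only a sketch: the function name and the regular expression are assumptions made for illustration, it handles nothing beyond simple substitutions, and it is not how genomvar parses HGVS strings.

```python
import re

# Matches strings such as 'chr1:g.15C>A' (simple genomic substitutions only).
HGVS_SNV = re.compile(
    r"^(?P<chrom>[^:]+):g\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_hgvs_snv(st):
    """Return (chrom, 0-based start, ref, alt) for a substitution string."""
    m = HGVS_SNV.match(st)
    if m is None:
        raise ValueError("unsupported HGVS string: %r" % st)
    # HGVS positions are 1-based, so subtract 1 to get a 0-based start.
    return m.group("chrom"), int(m.group("pos")) - 1, m.group("ref"), m.group("alt")


print(parse_hgvs_snv("chr1:g.15C>A"))  # ('chr1', 14, 'C', 'A')
```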

Variant sets defined in genomvar.varset use class GenomVariant. Objects of this class contain a genomic alteration (attribute base) and, optionally, a genotype (attribute GT) and other attributes commonly found in VCF files (attribute attrib). Attribute base is an object of some VariantBase subclass (SNPs, deletions etc.).

class genomvar.variant.VariantBase(chrom, start, end, ref, alt)

Base class for the other genomic variant classes.

edit_equal(other)

Returns True if self represents the same alteration as the other

classmethod is_variant_subclass(other)

Checks whether the variant class is a subclass of other. See module genomvar.variant documentation for class hierarchy.

is_variant_instance(cls)

Returns True if object’s variant class is subclass of cls. See module genomvar.variant documentation for class hierarchy.

property vtp

Variant class

property key

Tuple representation of variant.

class genomvar.variant.MNP(chrom, start, alt, end=None, ref=None)

Multiple-nucleotide polymorphism. Substitute N nucleotides of the reference for N other nucleotides.

For instantiation it requires chromosome, position and alternative sequence. end will be inferred from start and alt. ref is also optional.

>>> from genomvar import variant
>>> variant.MNP('chr1',154678,'GT')
MNP("chr1",154678,"GT")

edit_equal(other)

Returns True if self represents the same alteration as the other

is_variant_instance(cls)

Returns True if object’s variant class is subclass of cls. See module genomvar.variant documentation for class hierarchy.

classmethod is_variant_subclass(other)

Checks whether the variant class is a subclass of other. See module genomvar.variant documentation for class hierarchy.

property key

Tuple representation of variant.

property vtp

Variant class

class genomvar.variant.SNP(chrom, start, alt, end=None, ref=None)

Single-nucleotide polymorphism.

For instantiation it requires chromosome, position and alternative sequence.

>>> from genomvar import variant
>>> variant.SNP('chr1',154678,'T')
SNP("chr1",154678,"T")

edit_equal(other)

Returns True if self represents the same alteration as the other

is_variant_instance(cls)

Returns True if object’s variant class is subclass of cls. See module genomvar.variant documentation for class hierarchy.

classmethod is_variant_subclass(other)

Checks whether the variant class is a subclass of other. See module genomvar.variant documentation for class hierarchy.

property key

Tuple representation of variant.

property vtp

Variant class

class genomvar.variant.Indel(chrom, start, end, ref, alt)

Abstract class to accommodate insertion/deletion subclasses.

ambig_equal(other)

Returns true if two variants are equal up to indel ambiguity.

edit_equal(other)

Returns True if self represents the same alteration as the other

is_variant_instance(cls)

Returns True if object’s variant class is subclass of cls. See module genomvar.variant documentation for class hierarchy.

classmethod is_variant_subclass(other)

Checks whether the variant class is a subclass of other. See module genomvar.variant documentation for class hierarchy.

property key

Tuple representation of variant.

property vtp

Variant class

class genomvar.variant.Ins(chrom, start, alt, end=None, ref=None)

Insertion of nucleotides. For instantiation chrom, start and inserted sequence (alt) are required.

Start and end denote the nucleotide after the inserted sequence, i.e. start is the 0-based position of the nucleotide following the insertion, and end is start+1 by definition.

>>> from genomvar.variant import Ins
# Insertion of TA before position chr2:100543.
>>> print(Ins('chr2',100543,'TA'))
<Ins chr2:100543 -/TA>

ambig_equal(other)

Returns true if two variants are equal up to indel ambiguity.

edit_equal(other)

Returns True if self represents the same alteration as the other

is_variant_instance(cls)

Returns True if object’s variant class is subclass of cls. See module genomvar.variant documentation for class hierarchy.

classmethod is_variant_subclass(other)

Checks whether the variant class is a subclass of other. See module genomvar.variant documentation for class hierarchy.

property key

Tuple representation of variant.

property vtp

Variant class

class genomvar.variant.Del(chrom, start, end, ref=None, alt=None)

Deletion of nucleotides. For instantiation chrom, start (0-based), and end (position end is excluded) are required.

>>> from genomvar.variant import Del
# Deletion of 3 nucleotides starting at chr3:7843488 (0-based)
>>> print(Del('chr3',7843488,7843488+3))
<Del chr3:7843488-7843491 NNN/->

ambig_equal(other)

Returns true if two variants are equal up to indel ambiguity.

edit_equal(other)

Returns True if self represents the same alteration as the other

is_variant_instance(cls)

Returns True if object’s variant class is subclass of cls. See module genomvar.variant documentation for class hierarchy.

classmethod is_variant_subclass(other)

Checks whether the variant class is a subclass of other. See module genomvar.variant documentation for class hierarchy.

property key

Tuple representation of variant.

property vtp

Variant class

class genomvar.variant.AmbigIndel(chrom, start, end, ref, alt)

Class representing an indel whose position is ambiguous. Ambiguity means the indel could be applied at any position of some region, resulting in the same alternative sequence.

ambig_equal(other)

Returns true if two variants are equal up to indel ambiguity.

edit_equal(other)

Returns True if self represents the same alteration as the other

is_variant_instance(cls)

Returns True if object’s variant class is subclass of cls. See module genomvar.variant documentation for class hierarchy.

classmethod is_variant_subclass(other)

Checks whether the variant class is a subclass of other. See module genomvar.variant documentation for class hierarchy.

property key

Tuple representation of variant.

property vtp

Variant class

class genomvar.variant.AmbigIns(chrom, start, end, alt, ref=None)

Class representing an insertion whose position is ambiguous. Ambiguity means the insertion could be applied at any position of some region, resulting in the same alternative sequence.

Let the reference file test.fasta contain a toy sequence:

>seq1
TTTAATA

Consider a variant extending the 3 Ts at the beginning by one more T. This can be done in several places, so the corresponding insertion can be given as an AmbigIns object:

>>> from genomvar import Reference
>>> from genomvar.variant import VariantFactory
>>> fac = VariantFactory(Reference('test.fasta'),normindel=True)
>>> print( fac.from_edit('seq1',0,'T','TT') )
<AmbigIns seq1:0-4(1-2) -/T>

Positions 1 and 2 are the actual start and end, meaning that T is inserted before the nucleotide occupying range 1-2. Positions 0-4 indicate that start and end can be extended to these values while resulting in the same alteration.
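The ambiguity range for this toy insertion can be reproduced with a few lines of plain Python. The function below is a hypothetical sketch of the general idea (shift the insertion left and right while the result stays the same); it is not genomvar's implementation, and the returned upper bound follows the convention above that end is start+1.

```python
def insertion_ambiguity(ref_seq, pos, inserted):
    """Return (lo, hi) so that inserting before any 0-based position p with
    lo <= p < hi yields the same sequence (the inserted unit is rotated
    accordingly when it is longer than one nucleotide)."""
    # Shift left: allowed while the base before the insertion point equals
    # the last base of the (rotated) inserted unit.
    left, unit = pos, inserted
    while left > 0 and ref_seq[left - 1] == unit[-1]:
        unit = unit[-1] + unit[:-1]
        left -= 1
    # Shift right: allowed while the base at the insertion point equals
    # the first base of the (rotated) inserted unit.
    right, unit = pos, inserted
    while right < len(ref_seq) and ref_seq[right] == unit[0]:
        unit = unit[1:] + unit[0]
        right += 1
    return left, right + 1


print(insertion_ambiguity("TTTAATA", 1, "T"))  # (0, 4)
```

Inserting T before position 0, 1, 2 or 3 of TTTAATA gives TTTTAATA in every case, which is why the printed variant carries the range 0-4.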

property seq

Inserted sequence.

ambig_equal(other)

Returns true if two variants are equal up to indel ambiguity.

edit_equal(other)

Returns True if self represents the same alteration as the other

is_variant_instance(cls)

Returns True if object’s variant class is subclass of cls. See module genomvar.variant documentation for class hierarchy.

classmethod is_variant_subclass(other)

Checks whether the variant class is a subclass of other. See module genomvar.variant documentation for class hierarchy.

property key

Tuple representation of variant.

property vtp

Variant class

class genomvar.variant.AmbigDel(chrom, start, end, ref=None, alt=None)

Class representing a deletion whose position is ambiguous. Ambiguity means the same number of positions could be deleted anywhere in some range, resulting in the same alternative sequence.

Let the reference file test.fasta contain a toy sequence:

>seq1
TCTTTTTGACTGG

>>> fac = VariantFactory(Reference('test.fasta'),normindel=True)
>>> print( fac.from_edit('seq1',1,'CTTTTTGAC','C') )
<AmbigDel seq1:1-11(2-10) TTTTTGAC/->

The deletion of TTTTTGAC starts at position 2 and ends at position 9 inclusive; because end is excluded, the range is 2-10. The range 1-11 denotes that start and end can be extended to these values while resulting in the same alteration.
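The 1-11 range can be checked with a stand-alone sketch that slides the deleted window left and right for as long as the resulting sequence is unchanged. The function is hypothetical and only illustrates the concept; it makes no claim about how genomvar normalizes indels.

```python
def deletion_ambiguity(ref_seq, start, end):
    """Return (lo, hi): the outermost bounds within which a window of length
    end - start can be deleted with an identical result (end is excluded)."""
    # Slide left while the base entering the window equals the base leaving it.
    left, right = start, end
    while left > 0 and ref_seq[left - 1] == ref_seq[right - 1]:
        left -= 1
        right -= 1
    lo = left
    # Slide right under the mirrored condition.
    left, right = start, end
    while right < len(ref_seq) and ref_seq[left] == ref_seq[right]:
        left += 1
        right += 1
    return lo, right


print(deletion_ambiguity("TCTTTTTGACTGG", 2, 10))  # (1, 11)
```

Deleting 8 nucleotides starting at position 1, 2 or 3 of the toy sequence gives TCTGG each time.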

property seq

Deleted sequence.

ambig_equal(other)

Returns true if two variants are equal up to indel ambiguity.

edit_equal(other)

Returns True if self represents the same alteration as the other

is_variant_instance(cls)

Returns True if object’s variant class is subclass of cls. See module genomvar.variant documentation for class hierarchy.

classmethod is_variant_subclass(other)

Checks whether the variant class is a subclass of other. See module genomvar.variant documentation for class hierarchy.

property key

Tuple representation of variant.

property vtp

Variant class

class genomvar.variant.Mixed(chrom, start, end, alt, ref=None)

Combination of Indel and MNP. Use of this class is discouraged; it exists for compatibility.

edit_equal(other)

Returns True if self represents the same alteration as the other

is_variant_instance(cls)

Returns True if object’s variant class is subclass of cls. See module genomvar.variant documentation for class hierarchy.

classmethod is_variant_subclass(other)

Checks whether the variant class is a subclass of other. See module genomvar.variant documentation for class hierarchy.

property key

Tuple representation of variant.

property vtp

Variant class

class genomvar.variant.Haplotype(chrom, variants)

An object representing genome variants on the same chromosome (or contig).

Can be instantiated from a list of GenomVariant objects using Haplotype.from_variants() class method.

find_vrt(start=0, end=2147483647)

Finds variants in specified region of a haplotype.

Parameters
  • start (int) – Start of search interval. Default: 0

  • end (int) – End of search interval. Defaults to MAX_END

Yields

vrt (variants)

classmethod from_variants(variants)

Create haplotype from a list of variants.

Parameters

variants (list of variants) – Variants to instantiate haplotype from.

Returns

hap – Haplotype variant object.

Return type

Haplotype

edit_equal(other)

Returns True if self represents the same alteration as the other

is_variant_instance(cls)

Returns True if object’s variant class is subclass of cls. See module genomvar.variant documentation for class hierarchy.

classmethod is_variant_subclass(other)

Checks whether the variant class is a subclass of other. See module genomvar.variant documentation for class hierarchy.

property key

Tuple representation of variant.

property vtp

Variant class

class genomvar.variant.Null(chrom, start, end, ref, alt)

edit_equal(other)

Returns True if self represents the same alteration as the other

is_variant_instance(cls)

Returns True if object’s variant class is subclass of cls. See module genomvar.variant documentation for class hierarchy.

classmethod is_variant_subclass(other)

Checks whether the variant class is a subclass of other. See module genomvar.variant documentation for class hierarchy.

property key

Tuple representation of variant.

property vtp

Variant class

class genomvar.variant.Asterisk(chrom, start, end, ref, alt)

edit_equal(other)

Returns True if self represents the same alteration as the other

is_variant_instance(cls)

Returns True if object’s variant class is subclass of cls. See module genomvar.variant documentation for class hierarchy.

classmethod is_variant_subclass(other)

Checks whether the variant class is a subclass of other. See module genomvar.variant documentation for class hierarchy.

property key

Tuple representation of variant.

property vtp

Variant class

class genomvar.variant.GenomVariant(base, attrib=None)

This is a variant class holding a genomic alteration (a VariantBase subclass object) and extra attributes. The underlying variant is accessible as the base attribute. On top of that, it has the genotype as the GT attribute, and attrib containing extra attributes parsed from VCF files. All attributes of the underlying base (start, end etc.) are accessible from the GenomVariant object.

edit_equal(other)

Check if GenomVariant holds the same genome alteration as other.

Parameters

variant (GenomVariant or any VariantBase) – variant to compare

Returns

equality – True if the same genomic alteration.

Return type

bool

class genomvar.variant.VariantFactory(reference=None, normindel=False)

Factory class used to create variant objects. It can be instantiated without arguments, or an instance of genomvar.Reference can be given. Additionally, if normindel=True (a reference should then be given), indels will be checked for ambiguity, and if an indel is ambiguous, an instance of genomvar.variant.AmbigIndel will be returned.

>>> from genomvar import Reference
>>> from genomvar.variant import VariantFactory
>>> reference = Reference('test.fasta')
>>> fac = VariantFactory(reference,normindel=True)

from_hgvs(st)

Returns a variant from an HGVS notation string. Only some of the possible HGVS notations are supported, as shown below:

>>> fac = VariantFactory()
>>> print(fac.from_hgvs('chr1:g.15C>A'))
<SNP chr1:14 C/A>
>>> print(fac.from_hgvs('chrW:g.19_21del'))
<Del chrW:18-21 NNN/->
>>> print(fac.from_hgvs('chr24:g.10_11insCCT'))
<Ins chr24:10 -/CCT>
>>> print(fac.from_hgvs('chr24:g.10delinsGA'))
<Mixed chr24:9-10 -/GA>
>>> print(fac.from_hgvs('chr23:g.145_147delinsTGG'))
<MNP chr23:144-147 NNN/TGG>

from_edit(chrom, start, ref, alt)

Takes chrom, start, ref and alt (similar to a row in VCF file, but start is 0-based) and returns a variant object.

>>> vrt = fac.from_edit('chr15',575,'TA','T')
>>> print(vrt)
<Del chr15:576-577 A/->

The method attempts to strip nucleotides shared by the ref and alt sequences, first from the left and then from the right (in this order), for example:

>>> vrt = fac.from_edit('chr15',start=65,ref='CTG',alt='CTC')
>>> print(vrt)
<SNP chr15:67-68 G/C>
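The left-then-right stripping can be sketched independently of the library. The helper below is hypothetical (its name and return shape are assumptions) and only mirrors the behaviour described above: shared leading nucleotides are removed first, advancing start, and shared trailing nucleotides are removed afterwards.

```python
def strip_edit(start, ref, alt):
    """Strip the common prefix, then the common suffix, of ref and alt.

    Returns (start, ref, alt) where start has been advanced by the number
    of nucleotides stripped from the left."""
    # Strip from the left first.
    n = 0
    while n < len(ref) and n < len(alt) and ref[n] == alt[n]:
        n += 1
    start, ref, alt = start + n, ref[n:], alt[n:]
    # Then strip from the right.
    m = 0
    while m < len(ref) and m < len(alt) and ref[-1 - m] == alt[-1 - m]:
        m += 1
    if m:
        ref, alt = ref[:-m], alt[:-m]
    return start, ref, alt


print(strip_edit(65, "CTG", "CTC"))  # (67, 'G', 'C')
print(strip_edit(575, "TA", "T"))    # (576, 'A', '')
```

An empty alt after stripping corresponds to a deletion and an empty ref to an insertion, which matches the variant types printed in the examples above.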

Module varset.py

Module varset defines variant set classes which are containers of variants supporting set-like operations, searching variants, export to NumPy…

There are two classes depending on whether all the variants are loaded into memory or not:

class genomvar.varset.VariantSet(variants, attrib=None, vcf_notation=None, info=None, sampdata=None, reference=None, dtype=None, samples=None)

Immutable class for storing variants in-memory.

Supports random access by variant location. Can be instantiated from a VCF file (see VariantSet.from_vcf()) or from iterable containing variants (see VariantSet.from_variants()).

reference

Genome reference.

Type

Reference

ctg_len

dict mapping chromosome names to their lengths.

This keeps track of the rightmost position of variants in the set if reference was not provided. If reference was given ctg_len is taken from length of chromosomes in the reference.

Type

dict

sample(size)

Returns a random sample of variants.

Haplotypes (if present) may be returned with the same likelihood as non-haplotype variants and their subvariants.

Parameters

size (int) – size of the sample

Returns

marray – List of variants.

Return type

list

iter_vrt(expand=False)

Iterate over all variants in the set.

Parameters

expand (bool, optional) – If True variants in encountered haplotypes are yielded. Default: False

Yields

genomvar.variant.GenomVariant

copy()

Returns a copy of variant set

nof_unit_vrt()

Returns number of simple genomic variations in a variant set.

SNPs and indels count as 1. MNPs count as the number of affected nucleotides. Mixed variants (if present) also count as 1.
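A toy calculation makes the counting rule concrete. The tuples below are stand-ins invented for illustration (variant type and number of affected reference nucleotides); they are not genomvar objects.

```python
# Hypothetical stand-ins: (variant type, number of affected nucleotides).
variants = [("SNP", 1), ("MNP", 3), ("Del", 5), ("Ins", 0), ("Mixed", 4)]


def unit_count(vtype, n_affected):
    """MNPs contribute one unit per substituted nucleotide; all others count once."""
    return n_affected if vtype == "MNP" else 1


total = sum(unit_count(vtype, n) for vtype, n in variants)
print(total)  # 1 + 3 + 1 + 1 + 1 = 7
```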

drop_duplicates(return_dropped=False)

Remove non-unique variants.

Returns a variant set where no pair of variants is edit-equal. Haplotypes are left as is.

Parameters

return_dropped (bool) – Whether to return a list of dropped variants. Default: False

Returns

  • vs (VariantSet) – Variant set with unique variants.

  • dropped (list of variants, optional) – if return_dropped is True dropped variants are also returned.
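The semantics of keeping the first of several equal variants can be sketched with plain tuples. Here a (chrom, start, end, ref, alt) tuple stands in for edit-equality, which is an assumption made for illustration; the function is hypothetical and does not reproduce the library's handling of haplotypes.

```python
def drop_duplicate_keys(variants, return_dropped=False):
    """Keep the first occurrence of every tuple, preserving input order."""
    seen = set()
    unique, dropped = [], []
    for vrt in variants:
        if vrt in seen:
            dropped.append(vrt)
        else:
            seen.add(vrt)
            unique.append(vrt)
    return (unique, dropped) if return_dropped else unique


vrts = [
    ("chr1", 14, 15, "C", "A"),
    ("chr1", 20, 21, "G", "T"),
    ("chr1", 14, 15, "C", "A"),  # same alteration as the first entry
]
unique, dropped = drop_duplicate_keys(vrts, return_dropped=True)
print(len(unique), len(dropped))  # 2 1
```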

classmethod from_vcf(vcf, reference=None, parse_info=False, parse_samples=False, normindel=False, duplicates='ignore', parse_null=False)

Parse VCF variant file and return VariantSet object

Parameters
  • vcf (str) – VCF file to read data from.

  • reference (genomvar.Reference, optional) – Reference genome. Default: None

  • parse_info (bool, optional) – If True INFO fields will be parsed. Default: False.

  • parse_samples (bool, str or list of str, optional) – If True all sample data will be parsed. If string, sample with this name will be parsed. If list of strings only these samples will be parsed. Default: False

  • normindel (bool, optional) – If True indels will be normalized. reference should also be provided. Default: False

  • duplicates ({'ignore','keepfirst','raise'}) – How to treat duplicate variants. If keepfirst, only the first is taken. If raise, an error is raised. Default: ignore, i.e. duplicates are not checked for.

  • parse_null (bool, optional) – If True null variants will also be parsed (experimental). Default: False

Returns

Return type

VariantSet

classmethod from_variants(variants, reference=None)

Instantiate VariantSet from iterable of variants.

Parameters
  • variants (iterable) – Variants to add.

  • reference (Reference, optional) – Genome reference. Default: None

Returns

Return type

VariantSet

to_records(nested=True)

Export self to NumPy ndarray.

Parameters

nested (bool, optional) – Matters only if self was instantiated with INFO or SAMPDATA. When True, the dtype of the returned array will be nested. If False, no recursive fields will be present and “info_” or “SAMP_%SAMP%” will be prepended to the corresponding field names. Default: True

Returns

Return type

NumPy ndarray with structured dtype

property chroms

Chromosome set

sort_chroms(key=None)

Sorts chromosomes in the set

This can influence order in which methods return variants and variant set comparison.

Parameters

key (callable, optional) – key to use for chromosome sorting. If not given will sort lexicographically. Default: None

Returns

Return type

None
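As an illustration of what a sorting key might look like, the sketch below orders numeric chromosomes by number and places all other names after them. It assumes that the key callable receives chromosome names, in the same way as the key argument of Python's sorted(); the function name is hypothetical.

```python
import re


def chrom_key(name):
    """Sort key: numbered chromosomes first, in numeric order, then the rest."""
    m = re.fullmatch(r"(?:chr)?(\d+)", name)
    if m:
        return (0, int(m.group(1)), "")
    return (1, 0, name)


names = ["chr10", "chr2", "chrX", "chr1"]
print(sorted(names))                 # ['chr1', 'chr10', 'chr2', 'chrX']
print(sorted(names, key=chrom_key))  # ['chr1', 'chr2', 'chr10', 'chrX']
```

The first line shows the lexicographic order used when no key is given; the second shows the order produced by the custom key.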

diff(other, match_partial=True, match_ambig=False)

Returns a new variant set object which has variants present in self but absent in the other.

Parameters
  • match_partial (bool) – If False common positions won’t be searched in non-identical MNPs. Default: True

  • match_ambig (bool) – If False, ambiguous indels will be treated as regular indels. Default: False

Returns

diff – different variants

Return type

VariantSet

Notes

Here is an example of diff operation on two variant sets s1 and s2. Positions identical to REF are empty:

REF         GATTGGTAC
---------------------
s1          C  CCC  T
               CCC  G
---------------------
s2                  T
              AC    T
=====================
s1.diff(s2) C   CC
                CC  G
---------------------
s2.diff(s1)   A

comm(other, match_partial=True, match_ambig=False)

Returns a new variant set object which has variants present in both self and other.

Parameters
  • match_partial (bool) – If False common positions won’t be searched in non-identical MNPs. Default: True

  • match_ambig (bool) – If False, ambiguous indels will be treated as regular indels. Default: False

Returns

comm – Common variants.

Return type

VariantSet

Notes

Here is an example of comm operation on two variant sets s1 and s2. Positions identical to REF are empty:

REF         GATTGGTAC
---------------------
s1          C  CCC  T
               CCC  G
---------------------
s2                  T
              AC    T
=====================
s1.comm(s2)    C    T
               C
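The diagrams for diff and comm can be reproduced with a toy model in which every row is an aligned string and every non-blank character is a single-nucleotide change. This corresponds to the behaviour with match_partial=True, where MNPs are compared position by position. The functions are hypothetical illustrations working on strings, not genomvar code.

```python
def unit_edits(row):
    """Turn one aligned row into (position, base) pairs; blanks match REF."""
    return [(pos, base) for pos, base in enumerate(row) if base != " "]


def all_edits(rows):
    """Collect every single-nucleotide change present in a set of rows."""
    return {edit for row in rows for edit in unit_edits(row)}


def diff(a, b):
    """Changes of each row in a that are absent from b."""
    other = all_edits(b)
    return [[e for e in unit_edits(row) if e not in other] for row in a]


def comm(a, b):
    """Changes of each row in a that are also present in b."""
    other = all_edits(b)
    return [[e for e in unit_edits(row) if e in other] for row in a]


# Rows aligned to REF = "GATTGGTAC", as in the diagrams.
s1 = ["C  CCC  T", "   CCC  G"]
s2 = ["        T", "  AC    T"]

print(diff(s1, s2))  # [[(0, 'C'), (4, 'C'), (5, 'C')], [(4, 'C'), (5, 'C'), (8, 'G')]]
print(diff(s2, s1))  # [[], [(2, 'A')]]
print(comm(s1, s2))  # [[(3, 'C'), (8, 'T')], [(3, 'C')]]
```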
comm_vrt(other, match_partial=True, match_ambig=False)

Generate variants common between self and other.

This creates a lazily evaluated object of class CmpSet. This object can be further inspected with .iter_vrt() or .region()

Parameters
  • other (variant set) – Variant set to compare against self

  • match_partial (bool, optional) – If False common positions won’t be searched in non-identical MNPs. Default: True

  • match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels. Default: False

Returns

Transient comparison set

Return type

CmpSet

diff_vrt(other, match_partial=True, match_ambig=False)

Generate variants different between self and other.

This creates a lazily evaluated object of class CmpSet. This object can be further inspected with .iter_vrt() or .region()

Parameters
  • other (variant set) – Variant set to compare against self

  • match_partial (bool, optional) – If False common positions won’t be searched in non-identical MNPs. Default: True

  • match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels. Default: False

Returns

Transient comparison set

Return type

CmpSet

find_vrt(chrom=None, start=0, end=2147483647, rgn=None, expand=False, check_order=False)

Finds variants in specified region.

If neither the chrom nor the rgn parameter is given, it traverses all variants.

Parameters
  • chrom (str, optional) – Chromosome to search variants on. If not given, all chroms are traversed and start and end parameters ignored.

  • start (int, optional) – Start of search region (0-based). Default: 0

  • end (int, optional) – End of search region (position end is excluded). Default: max chrom length

  • rgn (str, optional) – If given chrom, start and end parameters are ignored. A string specifying search region, e.g. chr15:365,008-365,958.

  • expand (bool, optional) – Indicates whether to traverse haplotypes. Default: False

Returns

Return type

List of genomvar.variant.VariantBase subclass objects
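The region string format shown for rgn can be split into its parts with a short helper. This sketch is hypothetical: it only strips thousands separators and converts the numbers, and it deliberately applies no coordinate shift because the text above does not state how genomvar interprets the numbers in such a string.

```python
import re

REGION = re.compile(r"^(?P<chrom>[^:]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_region(rgn):
    """Split 'chrom:start-end' into (chrom, start, end); commas are ignored."""
    m = REGION.match(rgn)
    if m is None:
        raise ValueError("cannot parse region: %r" % rgn)
    start = int(m.group("start").replace(",", ""))
    end = int(m.group("end").replace(",", ""))
    return m.group("chrom"), start, end


print(parse_region("chr15:365,008-365,958"))  # ('chr15', 365008, 365958)
```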

get_factory(normindel=False)

Returns a VariantFactory object. It will inherit the same reference as was used for instantiation of self

iter_vrt_by_chrom(**kwds)

Iterates variants grouped by chromosome.

match(vrt, match_partial=True, match_ambig=False)

Find variants matching vrt.

The method checks whether variant vrt is present in the set.

Parameters
  • vrt (VariantBase-like) – Variant to search match for.

  • match_partial (bool, optional) – If False common positions won’t be searched in non-identical MNPs (relevant if vrt is MNP or SNP or haplotype, containing these). Default: True

  • match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels (relevant if vrt is an AmbigIndel or a haplotype containing one). Default: False

Returns

dict is returned only if vrt is a Haplotype

Return type

list or dict with matching variants

ovlp(vrt0, match_ambig=False)

Returns variants altering the same positions as vrt0.

Parameters
  • vrt0 (VariantBase-like) – Variant to search overlap with.

  • match_ambig (bool) – If False, ambiguous indels are treated as regular indels. Default: False

Returns

Return type

list of variants

to_vcf(out, reference=None, info_spec=None, format_spec=None, samples=None)

Writes a minimal VCF with variants from the set.

INFO and SAMPLE data from source data is not preserved.

Parameters
  • out (handle or str) – If string then it’s path to file, otherwise a handle to write variants.

  • reference (Reference or str, optional) – If string then it’s path to reference FASTA. Otherwise object of genomvar.Reference is expected. Not necessary if self was instantiated with a reference. If given, this argument takes precedence over the ref used at instantiation.

  • info_spec (list of tuples) – Each tuple contains the specification of an INFO field, in the order given by the VCF spec, i.e. NAME, NUMBER, TYPE, DESCRIPTION, SOURCE, VERSION. Only the first three fields (name, number and type) are required. If the variant set was instantiated from a VCF file, the spec obtained from that file will be used. Providing an empty sequence will omit INFO fields in the output.

  • format_spec (spec of FORMAT fields, similar to info_spec) –

Returns

Return type

None

Examples

>>> with open(fname, 'wt') as fh:
...     vs.to_vcf(fh, info_spec=[('DP4', 4, 'Integer'),
...                              ('NSV', 1, 'Integer')])

class genomvar.varset.CmpSet(left, right, op, match_partial=True, match_ambig=False)

Class CmpSet is a lazy implementation of variant set comparison which is returned by methods diff_vrt and comm_vrt of any variant set class instance. Useful for comparisons including large VCF files while keeping memory profile low.

region(chrom=None, start=0, end=2147483647, rgn=None)

Generates variants corresponding to comparison within a region.

Region can be specified using chrom,start,end or rgn with notation like chr15:1000-2000

Parameters
  • chrom (str) – Chromosome

  • start (int, optional) – Start of region. Default: 0

  • end (int) – End of region. Default: max chrom length

  • rgn (str) – Region to yield variants from

Yields

genomvar.variant.GenomVariant

iter_vrt(callback=None)

Generates variants corresponding to comparison.

Parameters

callback (callable) – A function to be called on a match in right variants set for every variant of left variant set. Result will be stored in vrt.attrib['cmp'] of yielded variant

Yields

variants (genomvar.variant.GenomVariant or tuple (int, genomvar.variant.GenomVariant)) – if action is diff or comm function yields instances of genomvar.variant.GenomVariant. If action is all function yields tuples where int is 0 if variant is present in first set; 1 if in second; 2 if in both.

class genomvar.varset.VariantSetFromFile(file, reference=None, index=None, parse_info=False, parse_samples=False)

Variant set representing variants contained in underlying VCF file specified on initialization.

file

Variant file storing variants.

Type

str

get_factory()

Returns a VariantFactory object. It will inherit the same reference as was used for instantiation of self

iter_vrt_by_chrom(check_order=False)

Iterates variants grouped by chromosome.

comm_vrt(other, match_partial=True, match_ambig=False)

Generate variants common between self and other.

This creates a lazily evaluated object of class CmpSet. This object can be further inspected with .iter_vrt() or .region()

Parameters
  • other (variant set) – Variant set to compare against self

  • match_partial (bool, optional) – If False common positions won’t be searched in non-identical MNPs. Default: True

  • match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels. Default: False

Returns

Transient comparison set

Return type

CmpSet

diff_vrt(other, match_partial=True, match_ambig=False)

Generate variants different between self and other.

This creates a lazily evaluated object of class CmpSet. This object can be further inspected with .iter_vrt() or .region()

Parameters
  • other (variant set) – Variant set to compare against self

  • match_partial (bool, optional) – If False common positions won’t be searched in non-identical MNPs. Default: True

  • match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels. Default: False

Returns

Transient comparison set

Return type

CmpSet

find_vrt(chrom=None, start=0, end=2147483647, rgn=None, expand=False, check_order=False)

Finds variants in specified region.

If neither the chrom nor the rgn parameter is given, it traverses all variants.

Parameters
  • chrom (str, optional) – Chromosome to search variants on. If not given, all chroms are traversed and start and end parameters ignored.

  • start (int, optional) – Start of search region (0-based). Default: 0

  • end (int, optional) – End of search region (position end is excluded). Default: max chrom length

  • rgn (str, optional) – If given chrom, start and end parameters are ignored. A string specifying search region, e.g. chr15:365,008-365,958.

  • expand (bool, optional) – Indicates whether to traverse haplotypes. Default: False

Returns

Return type

List of genomvar.variant.VariantBase subclass objects

iter_vrt(expand=False)

Iterate over all variants in underlying VCF.

Parameters

expand (bool, optional) – If True variants in encountered haplotypes are yielded. Default: False

Yields

variant (genomvar.variant.GenomVariant)

match(vrt, match_partial=True, match_ambig=False)

Find variants matching vrt.

The method checks whether variant vrt is present in the set.

Parameters
  • vrt (VariantBase-like) – Variant to search match for.

  • match_partial (bool, optional) – If False common positions won’t be searched in non-identical MNPs (relevant if vrt is MNP or SNP or haplotype, containing these). Default: True

  • match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels (relevant if vrt is an AmbigIndel or a haplotype containing one). Default: False

Returns

dict is returned only if vrt is a Haplotype

Return type

list or dict with matching variants

ovlp(vrt0, match_ambig=False)

Returns variants altering the same positions as vrt0.

Parameters
  • vrt0 (VariantBase-like) – Variant to search overlap with.

  • match_ambig (bool) – If False, ambiguous indels are treated as regular indels. Default: False

Returns

Return type

list of variants

to_vcf(out, reference=None, info_spec=None, format_spec=None, samples=None)

Writes a minimal VCF with variants from the set.

INFO and SAMPLE data from source data is not preserved.

Parameters
  • out (handle or str) – If string then it’s path to file, otherwise a handle to write variants.

  • reference (Reference or str, optional) – If string then it’s path to reference FASTA. Otherwise object of genomvar.Reference is expected. Not necessary if self was instantiated with a reference. If given, this argument takes precedence over the ref used at instantiation.

  • info_spec (list of tuples) – Each tuple contains the specification of an INFO field, in the order given by the VCF spec, i.e. NAME, NUMBER, TYPE, DESCRIPTION, SOURCE, VERSION. Only the first three fields (name, number and type) are required. If the variant set was instantiated from a VCF file, the spec obtained from that file will be used. Providing an empty sequence will omit INFO fields in the output.

  • format_spec (spec of FORMAT fields, similar to info_spec) –

Returns

Return type

None

Examples

>>> with open(fname, 'wt') as fh:
...     vs.to_vcf(fh, info_spec=[('DP4', 4, 'Integer'),
...                              ('NSV', 1, 'Integer')])

property chroms

Chromosome set

Module vcf.py

Module contains classes needed for VCF parsing.

The main class is VCFReader, which is instantiated from a VCF file. The VCF can be gzipped. Bgzipping and tabix-derived indexing are also supported for random coordinate-based access.

The VCFReader class can iterate over rows, which are tuple-like objects containing VCF field strings as attributes without conversion (except POS, the position, which is converted to int).
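A tuple-like row of this kind can be pictured with a named tuple. The sketch below is an assumption about the general shape only; the class and function names are invented for illustration and it is not the row type genomvar uses.

```python
from collections import namedtuple

# The eight fixed VCF columns; all kept as strings except POS.
Row = namedtuple("Row", "CHROM POS ID REF ALT QUAL FILTER INFO")


def parse_row(line):
    """Split one tab-separated VCF data line; only POS is converted to int."""
    fields = line.rstrip("\n").split("\t")
    return Row(fields[0], int(fields[1]), *fields[2:8])


row = parse_row("chr24\t2094\t.\tTGG\tCCC\t.\t.\tDP=10\n")
print(row.CHROM, row.POS, row.INFO)  # chr24 2094 DP=10
```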

Alternatively iteration of variants is supported. It yields genomvar.variant.GenomVariant objects.

>>> reader = VCFReader('example.vcf.gz')
>>> vrt = next(reader.iter_vrt(parse_info=True, parse_samples=True))
>>> print(vrt)
<GenomVariant: Del chr24:23-24 G/->
>>> print(vrt.attrib['info'])
{'NSV': 2, 'AF': 0.5, 'DP4': (11, 22, 33, 44), 'ECNT': 1, 'pl': 3, 'mt': 'SUBSTITUTE', 'RECN': 18, 'STR': None}

If the parse_info and parse_samples parameters are True, the INFO and SAMPLE fields contained in the VCF are parsed and split according to the allele captured by the variant object. For performance reasons these parameters default to False, in which case these fields are not parsed.
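
What "split corresponding to an allele" means can be sketched with the Number semantics of the VCF specification: Number=A fields carry one value per ALT allele and Number=R fields one value per allele including REF. The functions below are hypothetical illustrations using only the standard library; they do not convert value types, and genomvar's actual parsing may differ in detail.

```python
def parse_info_string(text):
    """Split a raw INFO column into a dict of name -> list of raw values."""
    info = {}
    for item in text.split(";"):
        key, sep, value = item.partition("=")
        # A key without "=" is a flag; represent it as a single True value.
        info[key] = value.split(",") if sep else [True]
    return info


def split_info_for_allele(info, numbers, alt_index):
    """Keep only the values that belong to the ALT allele at alt_index (0-based).

    numbers maps a field name to its VCF "Number" ('A', 'R' or anything else).
    """
    result = {}
    for key, values in info.items():
        number = numbers.get(key)
        if number == "A":
            # One value per ALT allele: keep this allele's value.
            result[key] = values[alt_index]
        elif number == "R":
            # One value per allele including REF: keep (REF, this ALT).
            result[key] = (values[0], values[alt_index + 1])
        else:
            # Not allele-specific: keep unchanged.
            result[key] = values[0] if len(values) == 1 else tuple(values)
    return result


info = parse_info_string("AF=0.5,0.25;AD=10,6,3;DP=19;DB")
numbers = {"AF": "A", "AD": "R", "DP": "1", "DB": "0"}
print(split_info_for_allele(info, numbers, alt_index=1))
```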

The BCF format (.bcf) is also supported; use BCFReader.

class genomvar.vcf.VCFWriter(reference=None, info_spec=None, format_spec=None, samples=None)

Class for writing variants in VCF format.

get_row(vrt, **kwds)

Formats a variant to a genomvar.vcf_utils.VCFRow instance.

For indels, a writer with a reference might be needed to construct the correct REF field.

Parameters
  • vrt (Variant instance) – variant to get row for.

  • kwds (VCF fields) –

    optional. If given, these and only these parameters are used to populate corresponding VCF fields: id, qual, filter, info, sampdata. These parameters are taken as is and converted to string before returning a VCFRow. If the keyword is not given corresponding field will be populated from attrib attribute if possible.

    Any other keyword arguments are ignored.

Returns

row – genomvar.vcf_utils.VCFRow object instance

Return type

VCFRow

Examples

>>> factory = variant.VariantFactory()
>>> writer = VCFWriter()
>>> v1 = factory.from_edit('chr24', 2093, 'TGG', 'CCC')
>>> row = writer.get_row(v1)
>>> row
<VCFRow chr24:2094 TGG->CCC>
>>> print(row)
chr24   2094    .       TGG     CCC     .       .       .
>>> print(writer.get_row(v1, id=123, info='DP=10'))
chr24   2094    123     TGG     CCC     .       .       DP=10
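
Why a reference can be needed for indels can be sketched as follows. A deletion stored as a 0-based half-open range knows only the deleted bases, while VCF requires REF and ALT to include the base preceding the event. The function below is a hypothetical standard-library illustration that takes the chromosome sequence as a plain string; it is not genomvar's implementation.

```python
def deletion_to_vcf_fields(ref_seq, start, end):
    """Return (POS, REF, ALT) for a deletion of ref_seq[start:end].

    start and end are 0-based, half-open; POS is 1-based as in VCF.
    """
    if start < 1:
        # At the very first base there is no preceding base to anchor on;
        # this sketch does not handle that case.
        raise ValueError("deletion at sequence start is not handled here")
    anchor = ref_seq[start - 1]          # base just before the deleted range
    pos = start                          # 1-based position of the anchor base
    ref = anchor + ref_seq[start:end]    # anchor plus the deleted bases
    alt = anchor                         # only the anchor remains
    return pos, ref, alt


# Deleting the "A" at 0-based position 4 of a toy sequence.
print(deletion_to_vcf_fields("ACGTAC", 4, 5))  # (4, 'TA', 'T')
```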
class genomvar.vcf.VCFReader(vcf, index=None, reference=None)

Class to read VCF files.

get_factory(normindel=False)

Returns a factory based on normindel parameter.

find_rows(chrom=None, start=None, end=None, rgn=None)

Yields rows of variant file

iter_rows(check_order=False)

Yields rows of variant file

iter_rows_by_chrom(check_order=False)

Yields rows grouped by chromosome

get_records(parse_info=False, parse_samples=False, normindel=False)

Returns parsed variant data as a dict of NumPy arrays with structured dtype.

iter_vrt(check_order=False, parse_info=False, normindel=False, parse_samples=False)

Yields variant objects.

Parameters
  • parse_info (bool) – Whether INFO fields should be parsed. Default: False

  • parse_samples (bool) – Whether SAMPLE data should be parsed. Default: False

  • check_order (bool) – If True, an exception is raised on unsorted VCF rows. Default: False

  • normindel (bool) – If True, insertions and deletions will be left-normalized. Requires a reference at VCFReader instantiation.

Yields

vrt (genomvar.variant.GenomVariant) – Variant object
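
The idea behind left normalization, and the reason it requires a reference, can be sketched for a deletion. The deleted range is shifted left for as long as the shifted deletion produces the same resulting sequence. This is a generic standard-library illustration on a plain string with 0-based half-open coordinates; the algorithm genomvar actually uses is not described here and may differ.

```python
def left_normalize_deletion(ref_seq, start, end):
    """Shift a deletion of ref_seq[start:end] as far left as possible.

    Moving the range one base to the left leaves the resulting sequence
    unchanged exactly when the base entering the range on the left equals
    the base leaving it on the right.
    """
    while start > 0 and ref_seq[start - 1] == ref_seq[end - 1]:
        start -= 1
        end -= 1
    return start, end


ref_seq = "GCACACAT"
# Deleting the last "CA" repeat unit is equivalent to deleting the first one.
print(left_normalize_deletion(ref_seq, 5, 7))  # (1, 3)
```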

iter_vrt_by_chrom(parse_info=False, parse_samples=False, normindel=False, check_order=False)

Generates variants grouped by chromosome

Parameters
  • parse_info (bool) – Indicates whether INFO fields should be parsed. Default: False

  • parse_samples (bool) – Indicates whether SAMPLE data should be parsed. Default: False

  • normindel (bool) – If True, indels will be normalized. VCFReader should have been instantiated with a reference. Default: False

  • check_order (bool) – If True, the VCF will be checked for sorting. Default: False

Yields

(chrom, it) (tuple of str and iterator) – it yields variant.GenomVariant objects
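
The (chrom, it) shape of the yielded values can be sketched with itertools.groupby. The function below is a hypothetical illustration operating on plain (chrom, pos) tuples rather than variant objects; it assumes that rows of one chromosome are contiguous in the input, and its check_order behaviour is an assumption modelled on the parameter description above.

```python
from itertools import groupby
from operator import itemgetter


def iter_by_chrom(records, check_order=False):
    """Yield (chrom, iterator) pairs from records of the form (chrom, pos)."""

    def checked(group):
        # Optionally verify that positions do not decrease within a chromosome.
        last = None
        for rec in group:
            if check_order and last is not None and rec[1] < last:
                raise ValueError("rows are not sorted by position")
            last = rec[1]
            yield rec

    for chrom, group in groupby(records, key=itemgetter(0)):
        yield chrom, checked(group)


records = [("chr1", 10), ("chr1", 25), ("chr2", 5)]
for chrom, it in iter_by_chrom(records):
    # Each inner iterator must be consumed before advancing to the next one.
    print(chrom, [pos for _, pos in it])
```

As with groupby itself, the inner iterator is only valid until the outer loop advances, so each group should be consumed inside the loop body.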

find_vrt(chrom=None, start=0, end=2147483647, check_order=False, parse_info=False, normindel=False, parse_samples=False)

Yields variant objects from a specified region
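
The defaults start=0 and end=2147483647 (the largest signed 32-bit integer) effectively select a whole chromosome. A minimal sketch of the region test follows. It assumes half-open ranges, as suggested by the variant examples in the module genomvar.variant documentation; overlaps is a hypothetical helper, not part of genomvar.

```python
MAX_END = 2**31 - 1  # 2147483647, the default `end` in the signature above


def overlaps(v_start, v_end, start=0, end=MAX_END):
    """Return True if the half-open range [v_start, v_end) overlaps [start, end)."""
    return v_start < end and start < v_end


# Toy (start, end) ranges standing in for variants on one chromosome.
variants = [(100, 101), (576, 577), (2093, 2096)]
print([v for v in variants if overlaps(*v, start=500, end=1000)])  # [(576, 577)]
print([v for v in variants if overlaps(*v)])  # all three with the defaults
```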

get_chroms(allow_no_index=False)

Returns the ChromSet corresponding to the VCF. If the file is indexed, the index is used for faster access. Otherwise, if allow_no_index is True, the whole file is parsed to get the chromosome ordering.
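
The fallback without an index amounts to a full scan of the file that records chromosomes in order of first appearance. A minimal standard-library sketch on uncompressed VCF text follows; chrom_order is a hypothetical helper and returns a plain list rather than a ChromSet.

```python
import io


def chrom_order(handle):
    """Return chromosome names in order of first appearance in VCF text."""
    seen = {}  # dicts preserve insertion order
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom = line.split("\t", 1)[0]
        seen.setdefault(chrom, None)
    return list(seen)


vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr2\t15\t.\tA\tT\t.\t.\t.\n"
    "chr2\t40\t.\tG\tC\t.\t.\t.\n"
    "chr10\t7\t.\tC\tG\t.\t.\t.\n"
)
print(chrom_order(io.StringIO(vcf_text)))  # ['chr2', 'chr10']
```

The cost of this scan grows with file size, which is consistent with it being opt-in through allow_no_index.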

class genomvar.vcf.BCFReader(bcf, index=False, reference=None)
iter_rows(check_order=None)

Yields rows of variant file

find_rows(chrom=None, start=None, end=None, rgn=None)

Yields rows of variant file

find_vrt(chrom=None, start=0, end=2147483647, check_order=False, parse_info=False, normindel=False, parse_samples=False)

Yields variant objects from a specified region

get_chroms(allow_no_index=False)

Returns the ChromSet corresponding to the VCF. If the file is indexed, the index is used for faster access. Otherwise, if allow_no_index is True, the whole file is parsed to get the chromosome ordering.

get_factory(normindel=False)

Returns a factory based on normindel parameter.

get_records(parse_info=False, parse_samples=False, normindel=False)

Returns parsed variant data as a dict of NumPy arrays with structured dtype.

iter_rows_by_chrom(check_order=False)

Yields rows grouped by chromosome

iter_vrt(check_order=False, parse_info=False, normindel=False, parse_samples=False)

Yields variant objects.

Parameters
  • parse_info (bool) – Whether INFO fields should be parsed. Default: False

  • parse_samples (bool) – Whether SAMPLE data should be parsed. Default: False

  • check_order (bool) – If True, an exception is raised on unsorted VCF rows. Default: False

  • normindel (bool) – If True, insertions and deletions will be left-normalized. Requires a reference at reader instantiation.

Yields

vrt (genomvar.variant.GenomVariant) – Variant object

iter_vrt_by_chrom(parse_info=False, parse_samples=False, normindel=False, check_order=False)

Generates variants grouped by chromosome

Parameters
  • parse_info (bool) – Indicates whether INFO fields should be parsed. Default: False

  • parse_samples (bool) – Indicates whether SAMPLE data should be parsed. Default: False

  • normindel (bool) – If True, indels will be normalized. The reader should have been instantiated with a reference. Default: False

  • check_order (bool) – If True, the VCF will be checked for sorting. Default: False

Yields

(chrom, it) (tuple of str and iterator) – it yields variant.GenomVariant objects

Module vcf_utils.py

class genomvar.vcf_utils.VCF_INFO_OR_FORMAT_SPEC(NAME, NUMBER, TYPE, DESCRIPTION, SOURCE, VERSION)
DESCRIPTION

Alias for field number 3

NAME

Alias for field number 0

NUMBER

Alias for field number 1

SOURCE

Alias for field number 4

TYPE

Alias for field number 2

VERSION

Alias for field number 5

count(value, /)

Return number of occurrences of value.

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.
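
The "Alias for field number N" attributes together with count() and index() are the interface of a named tuple. The sketch below builds a comparable type with collections.namedtuple. The name Spec and the None defaults for the last three fields are assumptions made for illustration; the signature above does not show whether the genomvar class defines defaults.

```python
from collections import namedtuple

# Hypothetical stand-in for VCF_INFO_OR_FORMAT_SPEC; the defaults for
# DESCRIPTION, SOURCE and VERSION are an assumption, not documented behaviour.
Spec = namedtuple(
    "Spec",
    ["NAME", "NUMBER", "TYPE", "DESCRIPTION", "SOURCE", "VERSION"],
    defaults=(None, None, None),
)

spec = Spec("DP4", 4, "Integer")
print(spec.NUMBER, spec[1])   # attribute and index access give the same field
print(spec.index("Integer"))  # 2, the position of TYPE
print(spec.DESCRIPTION)       # None, filled in by the assumed default
```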

class genomvar.vcf_utils.VCFRow(CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT=None, SAMPLES=None, rnum=None)

Tuple-like object storing variant information in VCF-like form.

str() returns a string formatted as a row in a VCF file.

genomvar.vcf_utils.issequence(seq)

Is seq a sequence (ndarray, list or tuple)?