Package contents
Module variant.py
Module genomvar.variant contains classes representing genomic alterations.
The hierarchy of the classes used in the package is the following:
VariantBase
/ | \
AmbigIndel <-Indel V |
| | / \ MNP V
| ----+--- Del Ins | Haplotype
| | | | V
V V | V SNP
AmbigDel -----> AmbigIns
All variants have start and end attributes defining the range they act on; variants can be searched for by overlap with this range.
To test whether a variant is an instance of some type, use the is_variant_instance() method. Variant equality can be tested with edit_equal().
Objects can be instantiated directly, e.g.:
>>> vrt = variant.MNP('chr1',154678,'GT')
>>> print(vrt)
<MNP chr1:154678-154680 NN/GT>
This creates an MNP that replaces the nucleotides at positions 154678 and 154679 of chromosome 1 with GT.
Alternatively, variants can be created using VariantFactory objects. This class works with VCF-like notation. For example,
>>> fac = VariantFactory()
>>> vrt = fac.from_edit('chr15',575,'TA','T')
>>> print(vrt)
<Del chr15:576-577 A/->
The position is 0-based, so this creates a deletion of the single nucleotide at 0-based position 576 (position 577 in 1-based coordinates) of chromosome 15.
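How a VCF-like ref/alt pair reduces to a minimal edit can be sketched in plain Python. This is only an illustration and not genomvar's implementation: trim_edit is a hypothetical helper that strips the bases shared by ref and alt, first from the left and then from the right, and reports a 0-based half-open range.

```python
def trim_edit(start, ref, alt):
    """Hypothetical helper: reduce (start, ref, alt) to a minimal edit.

    Returns (start, end, ref, alt) with a 0-based, half-open [start, end).
    """
    # Strip the common prefix and move start past it.
    n = 0
    while n < min(len(ref), len(alt)) and ref[n] == alt[n]:
        n += 1
    start, ref, alt = start + n, ref[n:], alt[n:]
    # Strip the common suffix of what is left.
    m = 0
    while m < min(len(ref), len(alt)) and ref[-1 - m] == alt[-1 - m]:
        m += 1
    if m:
        ref, alt = ref[:-m], alt[:-m]
    return start, start + len(ref), ref, alt


print(trim_edit(575, 'TA', 'T'))     # (576, 577, 'A', '')
print(trim_edit(65, 'CTG', 'CTC'))   # (67, 68, 'G', 'C')
```

The first call corresponds to the deletion shown above: the shared T is dropped, leaving a one-base deletion at 576-577.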
Alternatively, a limited subset of HGVS notation is supported (numbering in HGVS strings is 1-based, following the spec):
>>> vrt = fac.from_hgvs('chr1:g.15C>A')
>>> print(vrt)
<SNP chr1:14 C/A>
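The 1-based to 0-based conversion can be illustrated with a small standalone parser. This is a sketch under the assumption that only simple genomic substitutions are handled; parse_hgvs_substitution and its regular expression are hypothetical and are not part of genomvar.

```python
import re

# Hypothetical pattern: matches only simple substitutions such as "chr1:g.15C>A".
_SUBST = re.compile(
    r'^(?P<chrom>[^:]+):g\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$'
)


def parse_hgvs_substitution(text):
    """Return (chrom, start, ref, alt) with a 0-based start."""
    m = _SUBST.match(text)
    if m is None:
        raise ValueError(f'unsupported HGVS string: {text!r}')
    start = int(m.group('pos')) - 1  # 1-based HGVS position -> 0-based start
    return m.group('chrom'), start, m.group('ref'), m.group('alt')


print(parse_hgvs_substitution('chr1:g.15C>A'))  # ('chr1', 14, 'C', 'A')
```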
Variant sets defined in genomvar.varset use the class GenomVariant. Objects of this class contain a genomic alteration (attribute base) and, optionally, a genotype (attribute GT) and other attributes commonly found in VCF files (attribute attrib). The attribute base is an object of some VariantBase subclass (SNP, Del, etc.).
-
class genomvar.variant.VariantBase(chrom, start, end, ref, alt)
Base class for the other genomic variant classes.
-
edit_equal(other)
Returns True if self represents the same alteration as other.
-
classmethod is_variant_subclass(other)
Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
is_variant_instance(cls)
Returns True if the object's variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
property vtp
Variant class.
-
property key
Tuple representation of the variant.
-
-
class genomvar.variant.MNP(chrom, start, alt, end=None, ref=None)
Multiple-nucleotide polymorphism. Substitutes N nucleotides of the reference with N other nucleotides.
Instantiation requires chromosome, position and alternative sequence. end will be inferred from start and alt. ref is also optional.
>>> from genomvar import variant
>>> variant.MNP('chr1',154678,'GT')
MNP("chr1",154678,"GT")
-
edit_equal(other)
Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)
Returns True if the object's variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod is_variant_subclass(other)
Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property key
Tuple representation of the variant.
-
property vtp
Variant class.
-
-
class genomvar.variant.SNP(chrom, start, alt, end=None, ref=None)
Single-nucleotide polymorphism.
Instantiation requires chromosome, position and alternative sequence.
>>> from genomvar import variant
>>> variant.SNP('chr1',154678,'T')
SNP("chr1",154678,"T")
-
edit_equal(other)
Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)
Returns True if the object's variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod is_variant_subclass(other)
Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property key
Tuple representation of the variant.
-
property vtp
Variant class.
-
-
class genomvar.variant.Indel(chrom, start, end, ref, alt)
Abstract class to accommodate insertion/deletion subclasses.
-
ambig_equal(other)
Returns True if two variants are equal up to indel ambiguity.
-
edit_equal(other)
Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)
Returns True if the object's variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod is_variant_subclass(other)
Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property key
Tuple representation of the variant.
-
property vtp
Variant class.
-
-
class genomvar.variant.Ins(chrom, start, alt, end=None, ref=None)
Insertion of nucleotides. For instantiation chrom, start and the inserted sequence (alt) are required.
Start and end denote the nucleotide after the inserted sequence, i.e. start is the 0-based number of the nucleotide after the insertion, and end is start+1 by definition.
>>> from genomvar.variant import Ins
>>> # Insertion of TA before position chr2:100543.
>>> print(Ins('chr2',100543,'TA'))
<Ins chr2:100543 -/TA>
-
ambig_equal(other)
Returns True if two variants are equal up to indel ambiguity.
-
edit_equal(other)
Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)
Returns True if the object's variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod is_variant_subclass(other)
Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property key
Tuple representation of the variant.
-
property vtp
Variant class.
-
-
class genomvar.variant.Del(chrom, start, end, ref=None, alt=None)
Deletion of nucleotides. For instantiation chrom, start (0-based) and end (position end is excluded) are required.
>>> from genomvar.variant import Del
>>> # Deletion of 3 nucleotides starting at chr3:7843488 (0-based)
>>> print(Del('chr3',7843488,7843488+3))
<Del chr3:7843488-7843491 NNN/->
-
ambig_equal(other)
Returns True if two variants are equal up to indel ambiguity.
-
edit_equal(other)
Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)
Returns True if the object's variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod is_variant_subclass(other)
Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property key
Tuple representation of the variant.
-
property vtp
Variant class.
-
-
class genomvar.variant.AmbigIndel(chrom, start, end, ref, alt)
Class representing an indel whose position is ambiguous. Ambiguity means the indel could be applied at any position of some region, resulting in the same alternative sequence.
-
ambig_equal(other)
Returns True if two variants are equal up to indel ambiguity.
-
edit_equal(other)
Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)
Returns True if the object's variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod is_variant_subclass(other)
Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property key
Tuple representation of the variant.
-
property vtp
Variant class.
-
-
class genomvar.variant.AmbigIns(chrom, start, end, alt, ref=None)
Class representing an indel whose position is ambiguous. Ambiguity means the indel could be applied at any position of some region, resulting in the same alternative sequence.
Let the reference file test.fasta contain a toy sequence:
>seq1
TTTAATA
Consider a variant extending the 3 Ts at the beginning by one more T. This can be done in several places, so the corresponding insertion can be given as an AmbigIns object:
>>> from genomvar import Reference
>>> from genomvar.variant import VariantFactory
>>> fac = VariantFactory(Reference('test.fasta'),normindel=True)
>>> print( fac.from_edit('seq1',0,'T','TT') )
<AmbigIns seq1:0-4(1-2) -/T>
Positions 1 and 2 are the actual start and end, meaning that T is inserted before the nucleotide located at 1-2. Positions 0-4 indicate that start and end can be extended to these values, resulting in the same alteration.
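The 0-4 range in this example can be reproduced with a short standalone sketch. It is not genomvar's normalization code; insertion_ambiguity is a hypothetical helper that slides the insertion point left and right for as long as the resulting sequence stays the same.

```python
def insertion_ambiguity(ref, pos, seq):
    """Hypothetical helper: region in which inserting `seq` gives the same result.

    `pos` is 0-based and `seq` is inserted before ref[pos].
    Returns (start, end), where end is the rightmost insertion point plus one.
    """
    # Slide the insertion point to the left while the result is unchanged.
    left, s = pos, seq
    while left > 0 and ref[left - 1] == s[-1]:
        s = s[-1] + s[:-1]  # rotate the inserted sequence to the right
        left -= 1
    # Slide the insertion point to the right in the same way.
    right, s = pos, seq
    while right < len(ref) and ref[right] == s[0]:
        s = s[1:] + s[0]    # rotate the inserted sequence to the left
        right += 1
    return left, right + 1


ref = 'TTTAATA'
print(insertion_ambiguity(ref, 1, 'T'))  # (0, 4)
```

Inserting T before position 0, 1, 2 or 3 of the toy sequence yields TTTTAATA in every case, which is why the reported range extends to 0-4.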
-
property seq
Inserted sequence.
-
ambig_equal(other)
Returns True if two variants are equal up to indel ambiguity.
-
edit_equal(other)
Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)
Returns True if the object's variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod is_variant_subclass(other)
Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property key
Tuple representation of the variant.
-
property vtp
Variant class.
-
-
class genomvar.variant.AmbigDel(chrom, start, end, ref=None, alt=None)
Class representing a deletion whose position is ambiguous. Ambiguity means the same number of positions could be deleted in some range, resulting in the same alternative sequence.
Let the reference file test.fasta contain a toy sequence:
>seq1
TCTTTTTGACTGG
>>> fac = VariantFactory(Reference('test.fasta'),normindel=True)
>>> print( fac.from_edit('seq1',1,'CTTTTTGAC','C') )
<AmbigDel seq1:1-11(2-10) TTTTTGAC/->
The deletion of TTTTTGAC starts at position 2 and ends at the 9th nucleotide (inclusive, giving the range 2-10). 1-11 denotes that start and end can be extended to these values, resulting in the same alteration.
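The 1-11 range can be checked with a standalone sketch along the same lines. This is an illustration rather than the library's algorithm; deletion_ambiguity is a hypothetical helper that shifts the deleted window while the remaining sequence is unchanged.

```python
def deletion_ambiguity(ref, start, end):
    """Hypothetical helper: widest range covered by equivalent deletions.

    The deletion removes ref[start:end] (0-based, half-open).
    Returns (min_start, max_end) over all shifts that leave the same sequence.
    """
    # Shift the window left while the base entering it equals the base leaving it.
    left, left_end = start, end
    while left > 0 and ref[left - 1] == ref[left_end - 1]:
        left -= 1
        left_end -= 1
    # Shift the window right under the mirrored condition.
    right_start, right = start, end
    while right < len(ref) and ref[right_start] == ref[right]:
        right_start += 1
        right += 1
    return left, right


ref = 'TCTTTTTGACTGG'
print(deletion_ambiguity(ref, 2, 10))  # (1, 11)
```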
-
property seq
Deleted sequence.
-
ambig_equal(other)
Returns True if two variants are equal up to indel ambiguity.
-
edit_equal(other)
Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)
Returns True if the object's variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod is_variant_subclass(other)
Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property key
Tuple representation of the variant.
-
property vtp
Variant class.
-
-
class genomvar.variant.Mixed(chrom, start, end, alt, ref=None)
Combination of Indel and MNP. This class exists for compatibility; its use is discouraged.
-
edit_equal(other)
Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)
Returns True if the object's variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod is_variant_subclass(other)
Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property key
Tuple representation of the variant.
-
property vtp
Variant class.
-
-
class genomvar.variant.Haplotype(chrom, variants)
An object representing genome variants on the same chromosome (or contig).
Can be instantiated from a list of GenomVariant objects using the Haplotype.from_variants() class method.
-
find_vrt(start=0, end=2147483647)
Finds variants in the specified region of a haplotype.
- Parameters
start (int) – Start of search interval. Default: 0
end (int) – End of search interval. Defaults to MAX_END
- Yields
vrt (variants)
-
classmethod from_variants(variants)
Create a haplotype from a list of variants.
- Parameters
variants (list of variants) – Variants to instantiate the haplotype from.
- Returns
hap – Haplotype variant object.
- Return type
-
edit_equal(other)
Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)
Returns True if the object's variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod is_variant_subclass(other)
Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property key
Tuple representation of the variant.
-
property vtp
Variant class.
-
-
class genomvar.variant.Null(chrom, start, end, ref, alt)
-
edit_equal(other)
Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)
Returns True if the object's variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod is_variant_subclass(other)
Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property key
Tuple representation of the variant.
-
property vtp
Variant class.
-
-
class genomvar.variant.Asterisk(chrom, start, end, ref, alt)
-
edit_equal(other)
Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)
Returns True if the object's variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod is_variant_subclass(other)
Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property key
Tuple representation of the variant.
-
property vtp
Variant class.
-
-
class genomvar.variant.GenomVariant(base, attrib=None)
This is a variant class holding a genomic alteration (a VariantBase subclass object) and extra attributes. The underlying variant is accessible as the base attribute. On top of that, it has the genotype as the GT attribute and attrib containing extra attributes parsed from VCF files. All attributes of the underlying base (start, end etc.) are accessible from the GenomVariant object.
-
edit_equal(other)
Check if GenomVariant holds the same genome alteration as other.
- Parameters
other (GenomVariant or any VariantBase) – variant to compare
- Returns
equality – True if the same genomic alteration.
- Return type
bool
-
-
class genomvar.variant.VariantFactory(reference=None, normindel=False)
Factory class used to create Variant objects. Can be instantiated without arguments, or an instance of genomvar.Reference can be given. Additionally, if normindel=True (a reference should be given), indels will be checked for ambiguity, and if an indel is ambiguous an instance of genomvar.variant.AmbigIndel will be returned.
>>> from genomvar import Reference
>>> from genomvar.variant import VariantFactory
>>> reference = Reference('test.fasta')
>>> fac = VariantFactory(reference,normindel=True)
-
from_hgvs(st)
Returns a variant from an HGVS notation string. Only some of the possible HGVS notations are supported, as listed below:
>>> fac = VariantFactory()
>>> print(fac.from_hgvs('chr1:g.15C>A'))
<SNP chr1:14 C/A>
>>> print(fac.from_hgvs('chrW:g.19_21del'))
<Del chrW:18-21 NNN/->
>>> print(fac.from_hgvs('chr24:g.10_11insCCT'))
<Ins chr24:10 -/CCT>
>>> print(fac.from_hgvs('chr24:g.10delinsGA'))
<Mixed chr24:9-10 -/GA>
>>> print(fac.from_hgvs('chr23:g.145_147delinsTGG'))
<MNP chr23:144-147 NNN/TGG>
-
from_edit(chrom, start, ref, alt)
Takes chrom, start, ref and alt (similar to a row in a VCF file, but start is 0-based) and returns a variant object.
>>> vrt = fac.from_edit('chr15',575,'TA','T')
>>> print(vrt)
<Del chr15:576-577 A/->
The method attempts to strip matching nucleotides from the ref and alt sequences, first from the left and then from the right, for example,
>>> vrt = fac.from_edit('chr15',start=65,ref='CTG',alt='CTC')
>>> print(vrt)
<SNP chr15:67-68 G/C>
-
Module varset.py
Module varset defines variant set classes, which are containers of variants supporting set-like operations, variant search and export to NumPy.
There are two classes, depending on whether all the variants are loaded into memory or not:
VariantSet: in-memory
VariantSetFromFile: can read VCF files on demand; needs an index for random access
-
class genomvar.varset.VariantSet(variants, attrib=None, vcf_notation=None, info=None, sampdata=None, reference=None, dtype=None, samples=None)
Immutable class for storing variants in-memory.
Supports random access by variant location. Can be instantiated from a VCF file (see VariantSet.from_vcf()) or from an iterable containing variants (see VariantSet.from_variants()).
-
reference
Genome reference.
- Type
Reference
-
ctg_len
dict mapping chromosome names to corresponding lengths.
This keeps track of the rightmost position of variants in the set if a reference was not provided. If a reference was given, ctg_len is taken from the lengths of the chromosomes in the reference.
- Type
dict
-
sample(size)
Returns a random sample of variants.
Haplotypes (if present) may be returned with the same probability as non-haplotype variants and their subvariants.
- Parameters
size (int) – size of the sample
- Returns
marray – List of variants.
- Return type
list
-
iter_vrt(expand=False)
Iterate over all variants in the set.
- Parameters
expand (bool, optional) – If True, variants in encountered haplotypes are yielded. Default: False
- Yields
-
copy
()¶ Returns a copy of variant set
-
nof_unit_vrt()
Returns the number of simple genomic variations in a variant set.
SNPs and indels count as 1, MNPs as the number of affected nucleotides. Mixed variants (if present) also count as 1.
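The counting rule can be written out as a small standalone sketch. The (kind, n_affected) pairs and the count_unit_variants name are hypothetical stand-ins used only for illustration; the library operates on its own variant objects.

```python
def count_unit_variants(variants):
    """Hypothetical helper: apply the counting rule described above.

    `variants` is an iterable of (kind, n_affected) pairs.
    """
    total = 0
    for kind, n_affected in variants:
        # MNPs contribute one unit per affected nucleotide, everything else one.
        total += n_affected if kind == 'MNP' else 1
    return total


toy = [('SNP', 1), ('Del', 3), ('MNP', 2), ('Mixed', 4)]
print(count_unit_variants(toy))  # 5
```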
-
drop_duplicates(return_dropped=False)
Remove non-unique variants.
Returns a variant set where no pair of variants is edit-equal. Haplotypes are left as is.
- Parameters
return_dropped (bool) – Whether to return a list of dropped variants. Default: False
- Returns
vs (VariantSet) – Variant set with unique variants.
dropped (list of variants, optional) – If return_dropped is True, dropped variants are also returned.
-
classmethod from_vcf(vcf, reference=None, parse_info=False, parse_samples=False, normindel=False, duplicates='ignore', parse_null=False)
Parse a VCF variant file and return a VariantSet object.
- Parameters
vcf (str) – VCF file to read data from.
reference (genomvar.Reference, optional) – Reference genome. Default: None
parse_info (bool, optional) – If True, INFO fields will be parsed. Default: False
parse_samples (bool, str or list of str, optional) – If True, all sample data will be parsed. If a string, the sample with this name will be parsed. If a list of strings, only these samples will be parsed. Default: False
normindel (bool, optional) – If True, indels will be normalized; reference should also be provided. Default: False
duplicates ({'ignore','keepfirst','raise'}) – How to treat duplicate variants. If keepfirst, only the first is taken. If raise, an error is raised. Default: ignore (do not check for duplicates).
parse_null (bool, optional) – If True, null variants will also be parsed (experimental). Default: False
- Returns
- Return type
-
classmethod from_variants(variants, reference=None)
Instantiate a VariantSet from an iterable of variants.
- Parameters
variants (iterable) – Variants to add.
reference (Reference, optional) – Genome reference. Default: None
- Returns
- Return type
-
to_records(nested=True)
Export self to a NumPy ndarray.
- Parameters
nested (bool, optional) – Matters only if self was instantiated with INFO or SAMPDATA. When True, the dtype of the returned array will be nested. If False, no recursive fields will be present and "info_" or "SAMP_%SAMP%" will be prepended to the corresponding fields. Default: True
- Returns
- Return type
NumPy ndarray with structured dtype
-
property chroms
Chromosome set
-
sort_chroms(key=None)
Sorts chromosomes in the set.
This can influence the order in which methods return variants, as well as variant set comparison.
- Parameters
key (callable, optional) – Key to use for chromosome sorting. If not given, chromosomes are sorted lexicographically. Default: None
- Returns
- Return type
None
-
diff(other, match_partial=True, match_ambig=False)
Returns a new variant set object containing the variants present in self but absent in other.
- Parameters
match_partial (bool) – If False, common positions won't be searched in non-identical MNPs. Default: True
match_ambig (bool) – If False, ambiguous indels will be treated as regular indels. Default: False
- Returns
diff – different variants
- Return type
Notes
Here is an example of the diff operation on two variant sets s1 and s2. Positions identical to REF are empty:
REF GATTGGTAC
---------------------
s1 C CCC T CCC G
---------------------
s2 T AC T
=====================
s1.diff(s2) C CC CC G
---------------------
s2.diff(s1) A
-
comm(other, match_partial=True, match_ambig=False)
Returns a new variant set object containing the variants present in both self and other.
- Parameters
match_partial (bool) – If False, common positions won't be searched in non-identical MNPs. Default: True
match_ambig (bool) – If False, ambiguous indels will be treated as regular indels. Default: False
- Returns
comm – Common variants.
- Return type
Notes
Here is an example of the comm operation on two variant sets s1 and s2. Positions identical to REF are empty:
REF GATTGGTAC
---------------------
s1 C CCC T CCC G
---------------------
s2 T AC T
=====================
s1.comm(s2) C T C
-
comm_vrt(other, match_partial=True, match_ambig=False)
Generate variants common between self and other.
This creates a lazily evaluated object of class CmpSet. This object can be further inspected with .iter_vrt() or .region().
- Parameters
other (variant set) – Variant set to compare against self
match_partial (bool, optional) – If False, common positions won't be searched in non-identical MNPs. Default: True
match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels. Default: False
- Returns
Transient comparison set
- Return type
-
diff_vrt(other, match_partial=True, match_ambig=False)
Generate variants different between self and other.
This creates a lazily evaluated object of class CmpSet. This object can be further inspected with .iter_vrt() or .region().
- Parameters
other (variant set) – Variant set to compare against self
match_partial (bool, optional) – If False, common positions won't be searched in non-identical MNPs. Default: True
match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels. Default: False
- Returns
Transient comparison set
- Return type
-
find_vrt(chrom=None, start=0, end=2147483647, rgn=None, expand=False, check_order=False)
Finds variants in the specified region.
If neither chrom nor rgn is given, it traverses all variants.
- Parameters
chrom (str, optional) – Chromosome to search variants on. If not given, all chroms are traversed and the start and end parameters are ignored.
start (int, optional) – Start of search region (0-based). Default: 0
end (int, optional) – End of search region (position end is excluded). Default: max chrom length
rgn (str, optional) – If given, the chrom, start and end parameters are ignored. A string specifying the search region, e.g. chr15:365,008-365,958.
expand (bool, optional) – Indicates whether to traverse haplotypes. Default: False
- Returns
- Return type
List of genomvar.variant.VariantBase subclass objects
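The region string and the half-open search interval can be illustrated with a standalone sketch. parse_region and overlaps are hypothetical helpers, not genomvar functions, and the sketch passes the numbers through unchanged because the text here does not state whether the library reads rgn coordinates as 0- or 1-based.

```python
def parse_region(rgn):
    """Hypothetical helper: split 'chrom:start-end' into parts; commas are ignored."""
    chrom, _, span = rgn.rpartition(':')
    start, _, end = span.replace(',', '').partition('-')
    return chrom, int(start), int(end)


def overlaps(vrt_start, vrt_end, start, end):
    """True if half-open ranges [vrt_start, vrt_end) and [start, end) intersect."""
    return vrt_start < end and start < vrt_end


chrom, start, end = parse_region('chr15:365,008-365,958')
print(chrom, start, end)                      # chr15 365008 365958
print(overlaps(365957, 365958, start, end))   # True
print(overlaps(365958, 365959, start, end))   # False, position `end` is excluded
```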
-
get_factory(normindel=False)
Returns a VariantFactory object. It will inherit the same reference that was used for instantiation of self.
-
iter_vrt_by_chrom(**kwds)
Iterates variants grouped by chromosome.
-
match(vrt, match_partial=True, match_ambig=False)
Find variants matching vrt.
The method checks whether variant vrt is present in the set.
- Parameters
vrt (VariantBase-like) – Variant to search a match for.
match_partial (bool, optional) – If False, common positions won't be searched in non-identical MNPs (relevant if vrt is an MNP, SNP or a haplotype containing these). Default: True
match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels (relevant if vrt is an AmbigIndel or a haplotype containing such indels). Default: False
- Returns
A dict is returned only if vrt is a Haplotype
- Return type
list or dict with matching variants
-
ovlp(vrt0, match_ambig=False)
Returns variants altering the same positions as vrt0.
- Parameters
vrt0 (VariantBase-like) – Variant to search overlap with.
match_ambig (bool) – If False, ambiguous indels are treated as regular indels. Default: False
- Returns
- Return type
list of variants
-
to_vcf(out, reference=None, info_spec=None, format_spec=None, samples=None)
Writes a minimal VCF with variants from the set.
INFO and SAMPLE data from source data is not preserved.
- Parameters
out (handle or str) – If a string, it is the path to the output file; otherwise a handle to write variants to.
reference (Reference or str, optional) – If a string, it is the path to the reference FASTA. Otherwise an object of genomvar.Reference is expected. Not necessary if self was instantiated with a reference. If given, this argument takes precedence over the reference used at instantiation.
info_spec (list of tuples) – Each tuple contains the specification of an INFO field, in the order per the VCF spec, i.e. NAME, NUMBER, TYPE, DESCRIPTION, SOURCE, VERSION. Only the first three fields (name, number and type) are required. If the variant set was instantiated from a VCF file, the spec obtained from it will be used. Providing an empty sequence will omit INFO fields in the output.
format_spec (spec of FORMAT fields, similar to info_spec) –
- Returns
- Return type
None
Examples
>>> with open(fname, 'wt') as fh:
...     vs.to_vcf(fh, info_spec=[('DP4', 4, 'Integer'),
...                              ('NSV', 1, 'Integer')])
-
-
class genomvar.varset.CmpSet(left, right, op, match_partial=True, match_ambig=False)
Class CmpSet is a lazy implementation of variant set comparison, returned by the methods diff_vrt and comm_vrt of any variant set class instance. Useful for comparisons involving large VCF files while keeping the memory footprint low.
-
region(chrom=None, start=0, end=2147483647, rgn=None)
Generates variants corresponding to the comparison within a region.
The region can be specified using chrom, start, end or rgn with notation like chr15:1000-2000.
- Parameters
chrom (str) – Chromosome
start (int, optional) – Start of region. Default: 0
end (int) – End of region. Default: max chrom length
rgn (str) – Region to yield variants from
- Yields
-
iter_vrt(callback=None)
Generates variants corresponding to the comparison.
- Parameters
callback (callable) – A function to be called on a match in the right variant set for every variant of the left variant set. The result will be stored in vrt.attrib['cmp'] of the yielded variant.
- Yields
variants (genomvar.variant.GenomVariant or tuple (int, genomvar.variant.GenomVariant)) – If action is diff or comm, the function yields instances of genomvar.variant.GenomVariant. If action is all, the function yields tuples where int is 0 if the variant is present in the first set, 1 if in the second, 2 if in both.
-
-
class genomvar.varset.VariantSetFromFile(file, reference=None, index=None, parse_info=False, parse_samples=False)
Variant set representing variants contained in the underlying VCF file specified on initialization.
-
file
Variant file storing variants.
- Type
str
-
get_factory()
Returns a VariantFactory object. It will inherit the same reference that was used for instantiation of self.
-
iter_vrt_by_chrom(check_order=False)
Iterates variants grouped by chromosome.
-
comm_vrt(other, match_partial=True, match_ambig=False)
Generate variants common between self and other.
This creates a lazily evaluated object of class CmpSet. This object can be further inspected with .iter_vrt() or .region().
- Parameters
other (variant set) – Variant set to compare against self
match_partial (bool, optional) – If False, common positions won't be searched in non-identical MNPs. Default: True
match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels. Default: False
- Returns
Transient comparison set
- Return type
-
diff_vrt(other, match_partial=True, match_ambig=False)
Generate variants different between self and other.
This creates a lazily evaluated object of class CmpSet. This object can be further inspected with .iter_vrt() or .region().
- Parameters
other (variant set) – Variant set to compare against self
match_partial (bool, optional) – If False, common positions won't be searched in non-identical MNPs. Default: True
match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels. Default: False
- Returns
Transient comparison set
- Return type
-
find_vrt(chrom=None, start=0, end=2147483647, rgn=None, expand=False, check_order=False)
Finds variants in the specified region.
If neither chrom nor rgn is given, it traverses all variants.
- Parameters
chrom (str, optional) – Chromosome to search variants on. If not given, all chroms are traversed and the start and end parameters are ignored.
start (int, optional) – Start of search region (0-based). Default: 0
end (int, optional) – End of search region (position end is excluded). Default: max chrom length
rgn (str, optional) – If given, the chrom, start and end parameters are ignored. A string specifying the search region, e.g. chr15:365,008-365,958.
expand (bool, optional) – Indicates whether to traverse haplotypes. Default: False
- Returns
- Return type
List of genomvar.variant.VariantBase subclass objects
-
iter_vrt(expand=False)
Iterate over all variants in the underlying VCF.
- Parameters
expand (bool, optional) – If True, variants in encountered haplotypes are yielded. Default: False
- Yields
variant (genomvar.variant.GenomVariant)
-
match(vrt, match_partial=True, match_ambig=False)
Find variants matching vrt.
The method checks whether variant vrt is present in the set.
- Parameters
vrt (VariantBase-like) – Variant to search a match for.
match_partial (bool, optional) – If False, common positions won't be searched in non-identical MNPs (relevant if vrt is an MNP, SNP or a haplotype containing these). Default: True
match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels (relevant if vrt is an AmbigIndel or a haplotype containing such indels). Default: False
- Returns
A dict is returned only if vrt is a Haplotype
- Return type
list or dict with matching variants
-
ovlp(vrt0, match_ambig=False)
Returns variants altering the same positions as vrt0.
- Parameters
vrt0 (VariantBase-like) – Variant to search overlap with.
match_ambig (bool) – If False, ambiguous indels are treated as regular indels. Default: False
- Returns
- Return type
list of variants
-
to_vcf(out, reference=None, info_spec=None, format_spec=None, samples=None)
Writes a minimal VCF with variants from the set.
INFO and SAMPLE data from source data is not preserved.
- Parameters
out (handle or str) – If a string, it is the path to the output file; otherwise a handle to write variants to.
reference (Reference or str, optional) – If a string, it is the path to the reference FASTA. Otherwise an object of genomvar.Reference is expected. Not necessary if self was instantiated with a reference. If given, this argument takes precedence over the reference used at instantiation.
info_spec (list of tuples) – Each tuple contains the specification of an INFO field, in the order per the VCF spec, i.e. NAME, NUMBER, TYPE, DESCRIPTION, SOURCE, VERSION. Only the first three fields (name, number and type) are required. If the variant set was instantiated from a VCF file, the spec obtained from it will be used. Providing an empty sequence will omit INFO fields in the output.
format_spec (spec of FORMAT fields, similar to info_spec) –
- Returns
- Return type
None
Examples
>>> with open(fname, 'wt') as fh:
...     vs.to_vcf(fh, info_spec=[('DP4', 4, 'Integer'),
...                              ('NSV', 1, 'Integer')])
-
property chroms
Chromosome set
-
Module vcf.py
The module contains classes needed for VCF parsing.
The main class is VCFReader, which is instantiated from a VCF file. The VCF can be gzipped. Bgzipping and tabix-derived indexing are also supported for random coordinate-based access.
The VCFReader class can iterate over rows, which are tuple-like objects containing VCF field strings as attributes without conversion (except POSition, which is converted to int).
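A tuple-like row of this kind can be pictured with a namedtuple. This is a minimal sketch, not the library's row class: Row and parse_row are hypothetical names, only the eight fixed VCF columns are kept, and the input line is made up for illustration.

```python
from collections import namedtuple

# The eight fixed VCF columns; trailing FORMAT/sample columns are ignored here.
Row = namedtuple('Row', 'CHROM POS ID REF ALT QUAL FILTER INFO')


def parse_row(line):
    """Hypothetical helper: split one tab-separated VCF data line into a Row."""
    fields = line.rstrip('\n').split('\t')[:8]
    fields[1] = int(fields[1])  # POS is the only field converted from str
    return Row(*fields)


row = parse_row('chr24\t24\t.\tAG\tA\t.\tPASS\tDP=10\n')
print(row.POS, row.REF, row.ALT)  # 24 AG A
```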
Alternatively, iteration over variants is supported. It yields genomvar.variant.GenomVariant objects.
>>> reader = VCFReader('example.vcf.gz')
>>> vrt = next(reader.iter_vrt(parse_info=True, parse_samples=True))
>>> print(vrt)
<GenomVariant: Del chr24:23-24 G/->
>>> print(vrt.attrib['info'])
{'NSV': 2, 'AF': 0.5, 'DP4': (11, 22, 33, 44), 'ECNT': 1, 'pl': 3, 'mt': 'SUBSTITUTE', 'RECN': 18, 'STR': None}
If the parse_info and parse_samples parameters are True, the INFO and SAMPLE fields contained in the VCF are parsed and split according to the allele captured by the variant object. For performance reasons these parameters are False by default, and these fields are not parsed.
The .bcf format is supported; use BCFReader.
-
class
genomvar.vcf.
VCFWriter
(reference=None, info_spec=None, format_spec=None, samples=None)¶ Class for writing variant to VCF format.
-
get_row
(vrt, **kwds)¶ Formats a variant to a
genomvar.vcf_utils.VCFRow
instance. For indels, a writer with a reference might be needed to construct the correct REF field.
- Parameters
vrt (Variant instance) – variant to get row for.
kwds (VCF fields) –
optional. If given, these and only these parameters are used to populate the corresponding VCF fields:
id, qual, filter, info, sampdata. These parameters are taken as is and converted to strings before a VCFRow is returned. If a keyword is not given, the corresponding field will be populated from the attrib attribute if possible. Any other keyword arguments are ignored.
- Returns
row –
genomvar.vcf_utils.VCFRow
object instance
- Return type
Notes
>>> factory = variant.VariantFactory()
>>> writer = VCFWriter()
>>> v1 = factory.from_edit('chr24', 2093, 'TGG', 'CCC')
>>> row = writer.get_row(v1)
>>> row
<VCFRow chr24:2094 TGG->CCC>
>>> print(row)
chr24 2094 . TGG CCC . . .
>>> print(writer.get_row(v1, id=123, info='DP=10'))
chr24 2094 123 TGG CCC . . DP=10
>>> # Fields in the printed rows are tab-separated in actual VCF output.
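In the example above the variant is created at 0-based position 2093 and the row is written with POS 2094, since VCF positions are 1-based. The standalone sketch below reproduces only that coordinate shift and the tab-separated layout of the eight fixed VCF columns. The function format_vcf_row is hypothetical and is not how VCFWriter is implemented; it handles no indels, references or sample columns.

```python
def format_vcf_row(chrom, start0, ref, alt, vid='.', qual='.', filt='.', info='.'):
    """Format the eight fixed VCF columns as one tab-separated line.

    start0 is a 0-based position; VCF POS is 1-based, so 1 is added.
    Hypothetical helper for illustration only.
    """
    pos = start0 + 1
    fields = [chrom, pos, vid, ref, alt, qual, filt, info]
    return '\t'.join(str(field) for field in fields)


plain = format_vcf_row('chr24', 2093, 'TGG', 'CCC')
annotated = format_vcf_row('chr24', 2093, 'TGG', 'CCC', vid=123, info='DP=10')
```

Missing values are written as a dot, mirroring the placeholder used in the rows printed above.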
-
-
class
genomvar.vcf.
VCFReader
(vcf, index=None, reference=None)¶ Class to read VCF files.
-
get_factory
(normindel=False)¶ Returns a factory based on normindel parameter.
-
find_rows
(chrom=None, start=None, end=None, rgn=None)¶ Yields rows of variant file
-
iter_rows
(check_order=False)¶ Yields rows of variant file
-
iter_rows_by_chrom
(check_order=False)¶ Yields rows grouped by chromosome
-
get_records
(parse_info=False, parse_samples=False, normindel=False)¶ Returns parsed variant data as a dict of NumPy arrays with structured dtype.
-
iter_vrt
(check_order=False, parse_info=False, normindel=False, parse_samples=False)¶ Yields variant objects.
- Parameters
parse_info (bool) – Whether INFO fields should be parsed. Default: False
parse_samples (bool) – Whether SAMPLE data should be parsed. Default: False
check_order (bool) – If True, an exception is raised on unsorted VCF rows. Default: False
normindel (bool) – If True, insertions and deletions will be left-normalized. Requires a reference on
VCFReader
instantiation.
- Yields
vrt (
genomvar.variant.GenomVariant
) – Variant object
-
iter_vrt_by_chrom
(parse_info=False, parse_samples=False, normindel=False, check_order=False)¶ Generates variants grouped by chromosome
- Parameters
parse_info (bool) – Indicates whether INFO fields should be parsed. Default: False
parse_samples (bool) – Indicates whether SAMPLE data should be parsed. Default: False
normindel (bool) – If True, indels will be normalized.
VCFReader
should have been instantiated with a reference. Default: False
check_order (bool) – If True, the VCF will be checked for sorting. Default: False
- Yields
(chrom, it) (tuple of str and iterator) – it yields variant.GenomVariant objects
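The (chrom, it) pairs yielded here follow the same pattern as itertools.groupby. The sketch below shows that pattern on plain (chrom, pos) tuples, together with a sort check in the spirit of check_order. It is a standalone illustration under the assumption that rows of one chromosome are contiguous; the function iter_by_chrom is hypothetical and says nothing about how genomvar implements grouping.

```python
from itertools import groupby
from operator import itemgetter


def iter_by_chrom(rows, check_order=False):
    """Yield (chrom, iterator) pairs from rows of (chrom, pos) tuples.

    Rows of the same chromosome are assumed to be contiguous. With
    check_order=True a ValueError is raised while iterating if the
    positions within a chromosome decrease.
    """
    def checked(group):
        prev = None
        for row in group:
            if check_order and prev is not None and row[1] < prev:
                raise ValueError('unsorted rows at {}:{}'.format(row[0], row[1]))
            prev = row[1]
            yield row

    for chrom, group in groupby(rows, key=itemgetter(0)):
        yield chrom, checked(group)


rows = [('chr1', 5), ('chr1', 9), ('chr2', 3)]
grouped = [(chrom, list(it)) for chrom, it in iter_by_chrom(rows)]
```

As with groupby itself, each inner iterator in this sketch has to be consumed before moving on to the next chromosome.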
-
find_vrt
(chrom=None, start=0, end=2147483647, check_order=False, parse_info=False, normindel=False, parse_samples=False)¶ Yields variant objects from a specified region
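The default end of 2147483647 equals 2**31 - 1, so calling find_vrt with only chrom covers a whole chromosome. The sketch below shows a region test on half-open [start, end) ranges, the convention suggested by the MNP example in the module variant.py documentation, where chr1:154678-154680 covers positions 154678 and 154679. Whether find_vrt applies exactly this overlap rule is an assumption, and the overlaps function is hypothetical.

```python
MAX_END = 2**31 - 1  # 2147483647, the default `end` in the signature above


def overlaps(vstart, vend, start=0, end=MAX_END):
    """Return True if [vstart, vend) overlaps the query range [start, end).

    Both ranges are treated as 0-based and half-open (an assumption).
    """
    return vstart < end and start < vend


inside = overlaps(154678, 154680, start=154679, end=154700)
outside = overlaps(154678, 154680, start=154680, end=154700)
```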
-
get_chroms
(allow_no_index=False)¶ Returns ChromSet corresponding to the VCF. If the file is indexed, the index is used for faster access. Alternatively, if allow_no_index is True, the whole file is parsed to get the chromosome ordering.
-
-
class
genomvar.vcf.
BCFReader
(bcf, index=False, reference=None)¶ -
iter_rows
(check_order=None)¶ Yields rows of variant file
-
find_rows
(chrom=None, start=None, end=None, rgn=None)¶ Yields rows of variant file
-
find_vrt
(chrom=None, start=0, end=2147483647, check_order=False, parse_info=False, normindel=False, parse_samples=False)¶ Yields variant objects from a specified region
-
get_chroms
(allow_no_index=False)¶ Returns ChromSet corresponding to the VCF. If the file is indexed, the index is used for faster access. Alternatively, if allow_no_index is True, the whole file is parsed to get the chromosome ordering.
-
get_factory
(normindel=False)¶ Returns a factory based on normindel parameter.
-
get_records
(parse_info=False, parse_samples=False, normindel=False)¶ Returns parsed variant data as a dict of NumPy arrays with structured dtype.
-
iter_rows_by_chrom
(check_order=False)¶ Yields rows grouped by chromosome
-
iter_vrt
(check_order=False, parse_info=False, normindel=False, parse_samples=False)¶ Yields variant objects.
- Parameters
parse_info (bool) – Whether INFO fields should be parsed. Default: False
parse_samples (bool) – Whether SAMPLE data should be parsed. Default: False
check_order (bool) – If True, an exception is raised on unsorted VCF rows. Default: False
normindel (bool) – If True, insertions and deletions will be left-normalized. Requires a reference on
VCFReader
instantiation.
- Yields
vrt (
genomvar.variant.GenomVariant
) – Variant object
-
iter_vrt_by_chrom
(parse_info=False, parse_samples=False, normindel=False, check_order=False)¶ Generates variants grouped by chromosome
- Parameters
parse_info (bool) – Indicates whether INFO fields should be parsed. Default: False
parse_samples (bool) – Indicates whether SAMPLE data should be parsed. Default: False
normindel (bool) – If True, indels will be normalized.
VCFReader
should have been instantiated with a reference. Default: False
check_order (bool) – If True, the VCF will be checked for sorting. Default: False
- Yields
(chrom, it) (tuple of str and iterator) – it yields variant.GenomVariant objects
-
Module vcf_utils.py¶
-
class
genomvar.vcf_utils.
VCF_INFO_OR_FORMAT_SPEC
(NAME, NUMBER, TYPE, DESCRIPTION, SOURCE, VERSION)¶ -
DESCRIPTION
¶ Alias for field number 3
-
NAME
¶ Alias for field number 0
-
NUMBER
¶ Alias for field number 1
-
SOURCE
¶ Alias for field number 4
-
TYPE
¶ Alias for field number 2
-
VERSION
¶ Alias for field number 5
-
count
(value, /)¶ Return number of occurrences of value.
-
index
(value, start=0, stop=9223372036854775807, /)¶ Return first index of value.
Raises ValueError if the value is not present.
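The field aliases and the count and index methods listed above are those of a named tuple with fields NAME, NUMBER, TYPE, DESCRIPTION, SOURCE and VERSION. The sketch below builds a comparable named tuple and renders it as a VCF header line. The name FieldSpec, the None defaults for the last three fields and the header_line helper are assumptions made for illustration; they are not taken from genomvar.

```python
from collections import namedtuple

# Same field order as documented above; the last three fields default to
# None here (an assumption; requires Python 3.7+ for `defaults`).
FieldSpec = namedtuple(
    'FieldSpec',
    ['NAME', 'NUMBER', 'TYPE', 'DESCRIPTION', 'SOURCE', 'VERSION'],
    defaults=(None, None, None),
)


def header_line(kind, spec):
    """Render a spec as a '##INFO=<...>' or '##FORMAT=<...>' header line.

    Hypothetical helper. Description is always written because the VCF
    specification requires it; Source and Version only when present.
    """
    description = spec.DESCRIPTION if spec.DESCRIPTION is not None else ''
    parts = [
        'ID={}'.format(spec.NAME),
        'Number={}'.format(spec.NUMBER),
        'Type={}'.format(spec.TYPE),
        'Description="{}"'.format(description),
    ]
    if spec.SOURCE is not None:
        parts.append('Source="{}"'.format(spec.SOURCE))
    if spec.VERSION is not None:
        parts.append('Version="{}"'.format(spec.VERSION))
    return '##{}=<{}>'.format(kind, ','.join(parts))


spec = FieldSpec('DP4', 4, 'Integer')
line = header_line('INFO', spec)
```

Because a named tuple is also a plain tuple, spec[0] and spec.NAME refer to the same value, which is what "Alias for field number 0" means.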
-
-
class
genomvar.vcf_utils.
VCFRow
(CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT=None, SAMPLES=None, rnum=None)¶ Tuple-like object storing variant information in VCF-like form.
str() returns a string, formatted as a row in VCF file.
-
genomvar.vcf_utils.
issequence
(seq)¶ Is seq a sequence (ndarray, list or tuple)?
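A check of this kind could look like the sketch below. It is written from the one-line description only, so the function issequence_sketch is hypothetical and may differ from genomvar's issequence. NumPy is imported lazily so the example also runs where NumPy is not installed.

```python
def issequence_sketch(seq):
    """Return True if seq is a list, a tuple or a NumPy ndarray.

    Strings are deliberately not treated as sequences here.
    Hypothetical re-implementation for illustration only.
    """
    if isinstance(seq, (list, tuple)):
        return True
    try:
        import numpy as np
    except ImportError:
        # Without NumPy there can be no ndarray to recognise.
        return False
    return isinstance(seq, np.ndarray)


checks = [issequence_sketch([1, 2]), issequence_sketch((1, 2)), issequence_sketch('ab')]
```

Such a test is useful, for example, to tell a multi-valued INFO entry like DP4 apart from a scalar one.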