Package contents¶
Module variant.py¶
Module genomvar.variant contains classes representing genomic
alterations.
The hierarchy of the classes used in the package is the following:
VariantBase
/ | \
AmbigIndel <-Indel V |
| | / \ MNP V
| ----+--- Del Ins | Haplotype
| | | | V
V V | V SNP
AmbigDel -----> AmbigIns
All variants have start and end attributes defining the range they
act on, and can be searched for overlap.
To test whether a variant is an instance of some type, the
is_variant_instance() method can be used. Variant equality
can be tested using edit_equal().
Objects can be instantiated directly, e.g.:
>>> vrt = variant.MNP('chr1',154678,'GT')
>>> print(vrt)
<MNP chr1:154678-154680 NN/GT>
This will create an MNP which substitutes positions 154678 and 154679 on
chromosome 1 for GT.
Alternatively variants can be created using VariantFactory
objects. This class can work with VCF-like notation. For example,
>>> fac = VariantFactory()
>>> vrt = fac.from_edit('chr15',575,'TA','T')
>>> print(vrt)
<Del chr15:576-577 A/->
The position is 0-based, so this creates a deletion of the nucleotide at 0-based position 576 (1-based position 577) of chromosome 15.
Alternatively, a limited subset of HGVS notation is supported (numbering in HGVS strings is 1-based, following the spec):
>>> vrt = fac.from_hgvs('chr1:g.15C>A')
>>> print(vrt)
<SNP chr1:14 C/A>
Variant sets defined in genomvar.varset use class GenomVariant.
Objects of this class contain a genomic alteration (attribute base)
and, optionally, a genotype (attribute GT) and other attributes
commonly found in VCF files (attribute attrib). Attribute
base is an object of some VariantBase subclass (SNPs, deletions
etc.).
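The shift between 1-based HGVS numbering and the 0-based coordinates of variant objects can be illustrated with a small standalone sketch. This is not genomvar's parser: the helper name `parse_simple_hgvs` and both regular expressions are assumptions made for illustration, and only the substitution and deletion forms are handled.

```python
import re

# Hypothetical helper, not part of genomvar: handles only "g.15C>A" and "g.19_21del".
SUBST = re.compile(r"^(?P<chrom>[^:]+):g\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$")
DELETION = re.compile(r"^(?P<chrom>[^:]+):g\.(?P<first>\d+)_(?P<last>\d+)del$")

def parse_simple_hgvs(text):
    """Return (chrom, start, end, ref, alt) with 0-based, end-exclusive coordinates."""
    m = SUBST.match(text)
    if m:
        start = int(m["pos"]) - 1      # 1-based HGVS position -> 0-based start
        return (m["chrom"], start, start + 1, m["ref"], m["alt"])
    m = DELETION.match(text)
    if m:
        start = int(m["first"]) - 1    # first deleted base, 0-based
        end = int(m["last"])           # 1-based inclusive end equals 0-based exclusive end
        return (m["chrom"], start, end, None, "")
    raise ValueError(f"unsupported notation: {text!r}")

print(parse_simple_hgvs("chr1:g.15C>A"))     # start is 14
print(parse_simple_hgvs("chrW:g.19_21del"))  # range is 18-21
```

Only the start needs adjusting for a range: an inclusive 1-based end and an exclusive 0-based end are the same number.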
-
class
genomvar.variant.VariantBase(chrom, start, end, ref, alt)¶ Base class for the other genomic variant classes.
-
edit_equal(other)¶ Returns True if self represents the same alteration as other.
-
classmethod
is_variant_subclass(other)¶ Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
is_variant_instance(cls)¶ Returns True if the object’s variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
property
vtp¶ Variant class.
-
property
key¶ Tuple representation of the variant.
-
-
class
genomvar.variant.MNP(chrom, start, alt, end=None, ref=None)¶ Multiple-nucleotide polymorphism. Substitutes N nucleotides of the reference for N other nucleotides.
For instantiation it requires chromosome, position and alternative sequence. end will be inferred from start and alt. ref is also optional.
>>> from genomvar import variant
>>> variant.MNP('chr1',154678,'GT')
MNP("chr1",154678,"GT")
-
edit_equal(other)¶ Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)¶ Returns True if the object’s variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod
is_variant_subclass(other)¶ Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property
key¶ Tuple representation of variant.
-
property
vtp¶ Variant class
-
-
class
genomvar.variant.SNP(chrom, start, alt, end=None, ref=None)¶ Single-nucleotide polymorphism.
For instantiation it requires chromosome, position and alternative sequence.
>>> from genomvar import variant
>>> variant.SNP('chr1',154678,'T')
SNP("chr1",154678,"T")
-
edit_equal(other)¶ Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)¶ Returns True if the object’s variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod
is_variant_subclass(other)¶ Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property
key¶ Tuple representation of variant.
-
property
vtp¶ Variant class
-
-
class
genomvar.variant.Indel(chrom, start, end, ref, alt)¶ Abstract class to accommodate insertion/deletion subclasses.
-
ambig_equal(other)¶ Returns true if two variants are equal up to indel ambiguity.
-
edit_equal(other)¶ Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)¶ Returns True if the object’s variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod
is_variant_subclass(other)¶ Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property
key¶ Tuple representation of variant.
-
property
vtp¶ Variant class
-
-
class
genomvar.variant.Ins(chrom, start, alt, end=None, ref=None)¶ Insertion of nucleotides. For instantiation chrom, start and inserted sequence (alt) are required.
Start and end denote the nucleotide after the inserted sequence, i.e. start is the 0-based number of the nucleotide after the insertion, and end is start+1 by definition.
>>> from genomvar.variant import Ins
# Insertion of TA before position chr2:100543.
>>> print(Ins('chr2',100543,'TA'))
<Ins chr2:100543 -/TA>
-
ambig_equal(other)¶ Returns true if two variants are equal up to indel ambiguity.
-
edit_equal(other)¶ Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)¶ Returns True if the object’s variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod
is_variant_subclass(other)¶ Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property
key¶ Tuple representation of variant.
-
property
vtp¶ Variant class
-
-
class
genomvar.variant.Del(chrom, start, end, ref=None, alt=None)¶ Deletion of nucleotides. For instantiation chrom, start (0-based), and end (position end is excluded) are required.
>>> from genomvar.variant import Del
# Deletion of 3 nucleotides starting at chr3:7843488 (0-based)
>>> print(Del('chr3',7843488,7843488+3))
<Del chr3:7843488-7843491 NNN/->
-
ambig_equal(other)¶ Returns true if two variants are equal up to indel ambiguity.
-
edit_equal(other)¶ Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)¶ Returns True if the object’s variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod
is_variant_subclass(other)¶ Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property
key¶ Tuple representation of variant.
-
property
vtp¶ Variant class
-
-
class
genomvar.variant.AmbigIndel(chrom, start, end, ref, alt)¶ Class representing an indel whose position is ambiguous. Ambiguity means the indel could be applied at any position of some region, resulting in the same alternative sequence.
-
ambig_equal(other)¶ Returns true if two variants are equal up to indel ambiguity.
-
edit_equal(other)¶ Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)¶ Returns True if the object’s variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod
is_variant_subclass(other)¶ Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property
key¶ Tuple representation of variant.
-
property
vtp¶ Variant class
-
-
class
genomvar.variant.AmbigIns(chrom, start, end, alt, ref=None)¶ Class representing an indel whose position is ambiguous. Ambiguity means the indel could be applied at any position of some region, resulting in the same alternative sequence.
Let the reference file test.fasta contain a toy sequence:
>seq1
TTTAATA
Consider a variant extending the 3 Ts at the beginning by one more T. It can be done in several places, so the corresponding insertion can be given as an AmbigIns object:
>>> from genomvar import Reference
>>> from genomvar.variant import VariantFactory
>>> fac = VariantFactory(Reference('test.fasta'),normindel=True)
>>> print( fac.from_edit('seq1',0,'T','TT') )
<AmbigIns seq1:0-4(1-2) -/T>
Positions 1 and 2 are the actual start and end, meaning that T is inserted before the nucleotide located at 1-2. Positions 0-4 indicate that start and end can be extended to these values, resulting in the same alteration.
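How the 0-4 range arises can be reproduced in plain Python. The function below is only a sketch of one common way to find the range an insertion can slide within (rotate the inserted sequence while neighbouring reference bases match). The name `insertion_shift_range` is hypothetical, and nothing here is taken from genomvar's implementation.

```python
def insertion_shift_range(ref, pos, seq):
    """Return (lo, hi): the insertion point can be any position in lo..hi-1."""
    lo, left_seq = pos, seq
    # Shift left while the base before the insertion point equals the last inserted base.
    while lo > 0 and ref[lo - 1] == left_seq[-1]:
        left_seq = left_seq[-1] + left_seq[:-1]   # rotate the inserted sequence
        lo -= 1
    hi, right_seq = pos, seq
    # Shift right while the base at the insertion point equals the first inserted base.
    while hi < len(ref) and ref[hi] == right_seq[0]:
        right_seq = right_seq[1:] + right_seq[0]
        hi += 1
    return lo, hi + 1

print(insertion_shift_range("TTTAATA", 1, "T"))  # (0, 4)
```

Inserting T before position 0, 1, 2 or 3 of TTTAATA gives the same sequence TTTTAATA in every case, which is why the end of the range is 4.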
-
property
seq¶ Inserted sequence.
-
ambig_equal(other)¶ Returns true if two variants are equal up to indel ambiguity.
-
edit_equal(other)¶ Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)¶ Returns True if the object’s variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod
is_variant_subclass(other)¶ Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property
key¶ Tuple representation of variant.
-
property
vtp¶ Variant class
-
-
class
genomvar.variant.AmbigDel(chrom, start, end, ref=None, alt=None)¶ Class representing a deletion whose position is ambiguous. Ambiguity means the same number of positions could be deleted in some range, resulting in the same alternative sequence.
Let the reference file test.fasta contain a toy sequence:
>seq1
TCTTTTTGACTGG
>>> fac = VariantFactory(Reference('test.fasta'),normindel=True)
>>> print( fac.from_edit('seq1',1,'CTTTTTGAC','C') )
<AmbigDel seq1:1-11(2-10) TTTTTGAC/->
The deletion of TTTTTGAC starts at position 2 and ends at the 9th nucleotide (inclusive, resulting in the range 2-10). 1-11 denotes that start and end can be extended to these values, resulting in the same alteration.
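The 1-11 range can be checked with a short standalone sketch. It slides the deleted window one base at a time while the resulting sequence stays the same. `deletion_shift_range` is a hypothetical name used for illustration; the library's own normalization code may work differently.

```python
def deletion_shift_range(ref, start, end):
    """Return (lo, hi): the widest range the deletion ref[start:end] can slide within."""
    lo, lo_end = start, end
    # Slide left: the base entering on the left must equal the base leaving on the right.
    while lo > 0 and ref[lo - 1] == ref[lo_end - 1]:
        lo, lo_end = lo - 1, lo_end - 1
    hi_start, hi = start, end
    # Slide right: the base leaving on the left must equal the base entering on the right.
    while hi < len(ref) and ref[hi_start] == ref[hi]:
        hi_start, hi = hi_start + 1, hi + 1
    return lo, hi

ref = "TCTTTTTGACTGG"
print(deletion_shift_range(ref, 2, 10))  # (1, 11)
```

Deleting 8 nucleotides starting at position 1, 2 or 3 of the toy sequence leaves TCTGG each time, so the three placements describe the same alteration.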
-
property
seq¶ Deleted sequence.
-
ambig_equal(other)¶ Returns true if two variants are equal up to indel ambiguity.
-
edit_equal(other)¶ Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)¶ Returns True if the object’s variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod
is_variant_subclass(other)¶ Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property
key¶ Tuple representation of variant.
-
property
vtp¶ Variant class
-
-
class
genomvar.variant.Mixed(chrom, start, end, alt, ref=None)¶ Combination of Indel and MNP. Use of this class is discouraged; it exists for compatibility.
-
edit_equal(other)¶ Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)¶ Returns True if the object’s variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod
is_variant_subclass(other)¶ Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property
key¶ Tuple representation of variant.
-
property
vtp¶ Variant class
-
-
class
genomvar.variant.Haplotype(chrom, variants)¶ An object representing genome variants on the same chromosome (or contig).
Can be instantiated from a list of GenomVariant objects using the Haplotype.from_variants() class method.
-
find_vrt(start=0, end=2147483647)¶ Finds variants in the specified region of a haplotype.
- Parameters
start (int) – Start of search interval. Default: 0
end (int) – End of search interval. Defaults to MAX_END
- Yields
vrt (variants)
-
classmethod
from_variants(variants)¶ Create a haplotype from a list of variants.
- Parameters
variants (list of variants) – Variants to instantiate the haplotype from.
- Returns
hap – Haplotype variant object.
- Return type
Haplotype
-
edit_equal(other)¶ Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)¶ Returns True if the object’s variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod
is_variant_subclass(other)¶ Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property
key¶ Tuple representation of variant.
-
property
vtp¶ Variant class
-
-
class
genomvar.variant.Null(chrom, start, end, ref, alt)¶
-
edit_equal(other)¶ Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)¶ Returns True if the object’s variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod
is_variant_subclass(other)¶ Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property
key¶ Tuple representation of variant.
-
property
vtp¶ Variant class
-
-
class
genomvar.variant.Asterisk(chrom, start, end, ref, alt)¶
-
edit_equal(other)¶ Returns True if self represents the same alteration as other.
-
is_variant_instance(cls)¶ Returns True if the object’s variant class is a subclass of cls. See the genomvar.variant module documentation for the class hierarchy.
-
classmethod
is_variant_subclass(other)¶ Checks whether the variant class is a subclass of other. See the genomvar.variant module documentation for the class hierarchy.
-
property
key¶ Tuple representation of variant.
-
property
vtp¶ Variant class
-
-
class
genomvar.variant.GenomVariant(base, attrib=None)¶ This is a variant class holding a genomic alteration (a VariantBase subclass object) and extra attributes. The underlying variant is accessible as the base attribute. In addition, it has the genotype as the GT attribute, and attrib containing extra attributes parsed from VCF files. All attributes of the underlying base (start, end etc.) are accessible from the GenomVariant object.
-
edit_equal(other)¶ Check if GenomVariant holds the same genome alteration as other.
- Parameters
variant (GenomVariant or any VariantBase) – variant to compare
- Returns
equality – True if the same genomic alteration.
- Return type
bool
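The statement that attributes of the underlying base are reachable from the wrapping object describes a delegation pattern. A minimal sketch of that pattern follows; the classes `BaseVariant` and `Wrapper` are hypothetical stand-ins written for this example, and whether GenomVariant uses `__getattr__` internally is an assumption.

```python
class BaseVariant:
    """Stand-in for a VariantBase subclass object (hypothetical)."""
    def __init__(self, chrom, start, end):
        self.chrom, self.start, self.end = chrom, start, end

class Wrapper:
    """Stand-in for GenomVariant: holds `base` plus extra attributes."""
    def __init__(self, base, attrib=None):
        self.base = base
        self.attrib = attrib if attrib is not None else {}

    def __getattr__(self, name):
        # Called only when normal lookup fails, so `base` and `attrib` resolve first.
        if name == "base":
            raise AttributeError(name)  # guards against recursion if `base` is unset
        return getattr(self.base, name)

vrt = Wrapper(BaseVariant("chr1", 14, 15), attrib={"info": {"DP": 10}})
print(vrt.start, vrt.end, vrt.attrib["info"]["DP"])  # 14 15 10
```

With this arrangement code that only needs coordinates can treat the wrapper and the bare variant interchangeably.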
-
-
class
genomvar.variant.VariantFactory(reference=None, normindel=False)¶ Factory class used to create Variant objects. Can be instantiated without arguments, or an instance of genomvar.Reference can be given. Additionally, if normindel=True (a reference should be given), indels will be checked for ambiguity, and if ambiguous, an instance of genomvar.variant.AmbigIndel will be returned.
>>> from genomvar import Reference
>>> from genomvar.variant import VariantFactory
>>> reference = Reference('test.fasta')
>>> fac = VariantFactory(reference,normindel=True)
-
from_hgvs(st)¶ Returns a variant from an HGVS notation string. Only some of the possible HGVS notations are supported, as listed below:
>>> fac = VariantFactory()
>>> print(fac.from_hgvs('chr1:g.15C>A'))
<SNP chr1:14 C/A>
>>> print(fac.from_hgvs('chrW:g.19_21del'))
<Del chrW:18-21 NNN/->
>>> print(fac.from_hgvs('chr24:g.10_11insCCT'))
<Ins chr24:10 -/CCT>
>>> print(fac.from_hgvs('chr24:g.10delinsGA'))
<Mixed chr24:9-10 -/GA>
>>> print(fac.from_hgvs('chr23:g.145_147delinsTGG'))
<MNP chr23:144-147 NNN/TGG>
-
from_edit(chrom, start, ref, alt)¶ Takes chrom, start, ref and alt (similar to a row in a VCF file, but start is 0-based) and returns a variant object.
>>> vrt = fac.from_edit('chr15',575,'TA','T')
>>> print(vrt)
<Del chr15:576-577 A/->
The method attempts to strip ref and alt sequences from left and right (in this order), for example,
>>> vrt = fac.from_edit('chr15',start=65,ref='CTG',alt='CTC')
>>> print(vrt)
<SNP chr15:67-68 G/C>
-
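The left-then-right stripping described for from_edit can be sketched in plain Python. This is an illustration of the idea only: `strip_edit` is a hypothetical helper, not the library's code, and it returns a tuple rather than a variant object.

```python
def strip_edit(start, ref, alt):
    """Trim the shared prefix, then the shared suffix, of ref and alt.

    Returns (start, end, ref, alt) in 0-based, end-exclusive coordinates.
    """
    # Left side first: every shared leading base moves the start one to the right.
    while ref and alt and ref[0] == alt[0]:
        ref, alt, start = ref[1:], alt[1:], start + 1
    # Then the right side.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    return start, start + len(ref), ref, alt

print(strip_edit(575, "TA", "T"))    # (576, 577, 'A', '')
print(strip_edit(65, "CTG", "CTC"))  # (67, 68, 'G', 'C')
```

For a pure insertion this sketch returns end equal to start, whereas the Ins class defines end as start+1, so a real implementation needs one more step for that case.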
Module varset.py¶
Module varset defines variant set classes which are containers of variants
supporting set-like operations, searching variants, export to NumPy…
There are two classes depending on whether all the variants are loaded into memory or not:
VariantSet in-memory
VariantSetFromFile can read VCF files on demand, needs index for random access
-
class
genomvar.varset.VariantSet(variants, attrib=None, vcf_notation=None, info=None, sampdata=None, reference=None, dtype=None, samples=None)¶ Immutable class for storing variants in-memory.
Supports random access by variant location. Can be instantiated from a VCF file (see VariantSet.from_vcf()) or from an iterable containing variants (see VariantSet.from_variants()).
-
reference¶ Genome reference.
- Type
Reference
-
ctg_len¶ dict mapping chromosome names to corresponding lengths.
This keeps track of the rightmost position of variants in the set if a reference was not provided. If a reference was given, ctg_len is taken from the lengths of chromosomes in the reference.
- Type
dict
-
sample(size)¶ Returns a random sample of variants.
Haplotypes (if present) may be returned with the same likelihood as non-haplotype variants and their subvariants.
- Parameters
size (int) – size of the sample
- Returns
marray – List of variants.
- Return type
list
-
iter_vrt(expand=False)¶ Iterate over all variants in the set.
- Parameters
expand (bool, optional) – If True, variants in encountered haplotypes are yielded. Default: False
- Yields
-
copy()¶ Returns a copy of variant set
-
nof_unit_vrt()¶ Returns number of simple genomic variations in a variant set.
SNPs and indels count as 1, MNPs as the number of affected nucleotides. Mixed variants (if present) also count as 1.
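The counting rule can be written down as a few lines of Python. The sketch below works on plain `(kind, ref, alt)` tuples with string labels, which is an assumption made to keep it self-contained; `count_unit_variations` is a hypothetical name and does not call genomvar.

```python
def count_unit_variations(variants):
    """variants: iterable of (kind, ref, alt) tuples; kind is a plain string label."""
    total = 0
    for kind, ref, alt in variants:
        if kind == "MNP":
            total += len(alt)   # one unit per substituted nucleotide
        else:
            total += 1          # SNP, Ins, Del and Mixed each count once
    return total

example = [("SNP", "C", "A"), ("MNP", "TGG", "CCC"), ("Del", "A", ""), ("Mixed", "T", "GA")]
print(count_unit_variations(example))  # 6
```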
-
drop_duplicates(return_dropped=False)¶ Remove non-unique variants.
Returns a variant set where no pair of variants is edit-equal. Haplotypes are left as is.
- Parameters
return_dropped (bool) – Whether to return a list of dropped variants. Default: False
- Returns
vs (VariantSet) – Variant set with unique variants.
dropped (list of variants, optional) – if return_dropped is True, dropped variants are also returned.
-
classmethod
from_vcf(vcf, reference=None, parse_info=False, parse_samples=False, normindel=False, duplicates='ignore', parse_null=False)¶ Parse a VCF variant file and return a VariantSet object.
- Parameters
vcf (str) – VCF file to read data from.
reference (genomvar.Reference, optional) – Reference genome. Default: None
parse_info (bool, optional) – If True, INFO fields will be parsed. Default: False
parse_samples (bool, str or list of str, optional) – If True, all sample data will be parsed. If a string, the sample with this name will be parsed. If a list of strings, only these samples will be parsed. Default: False
normindel (bool, optional) – If True, indels will be normalized. reference should also be provided. Default: False
duplicates ({'ignore','keepfirst','raise'}) – How to treat duplicate variants. If keepfirst, only the first is taken. If raise, an error is raised. Default: ignore (do not check for duplicates).
parse_null (bool, optional) – If True, null variants will also be parsed (experimental). Default: False
- Returns
- Return type
-
classmethod
from_variants(variants, reference=None)¶ Instantiate VariantSet from iterable of variants.
- Parameters
variants (iterable) – Variants to add.
reference (Reference, optional) – Genome reference Default: None
- Returns
- Return type
-
to_records(nested=True)¶ Export self to a NumPy ndarray.
- Parameters
nested (bool, optional) – Matters only if self was instantiated with INFO or SAMPDATA. When True, the dtype of the returned array will be nested. If False, no recursive fields will be present and “info_” or “SAMP_%SAMP%” will be prepended to the corresponding field names. Default: True
- Returns
- Return type
NumPy ndarray with structured dtype
-
property
chroms¶ Chromosome set
-
sort_chroms(key=None)¶ Sorts chromosomes in the set
This can influence order in which methods return variants and variant set comparison.
- Parameters
key (callable, optional) – key to use for chromosome sorting. If not given will sort lexicographically. Default: None
- Returns
- Return type
None
-
diff(other, match_partial=True, match_ambig=False)¶ Returns a new variant set object which has variants present in self but absent in other.
- Parameters
match_partial (bool) – If False, common positions won’t be searched in non-identical MNPs. Default: True
match_ambig (bool) – If False, ambiguous indels will be treated as regular indels. Default: False
- Returns
diff – different variants
- Return type
Notes
Here is an example of the diff operation on two variant sets s1 and s2. Positions identical to REF are empty:
REF GATTGGTAC
---------------------
s1 C CCC T CCC G
---------------------
s2 T AC T
=====================
s1.diff(s2) C CC CC G
---------------------
s2.diff(s1) A
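The effect of match_partial can be illustrated with a toy model in which substitutions are plain `(start, alt)` tuples. This is a sketch under loud assumptions: the functions `to_units`, `diff` and `comm` below are written for this example, they ignore indels and haplotypes, and they return per-position pairs rather than the variant objects the library produces.

```python
def to_units(variants):
    """Split substitutions given as (start, alt) into per-position (pos, base) pairs."""
    units = set()
    for start, alt in variants:
        units.update((start + i, base) for i, base in enumerate(alt))
    return units

def diff(left, right, match_partial=True):
    if match_partial:
        return sorted(to_units(left) - to_units(right))
    return sorted(set(left) - set(right))      # whole variants must be identical

def comm(left, right, match_partial=True):
    if match_partial:
        return sorted(to_units(left) & to_units(right))
    return sorted(set(left) & set(right))

s1 = [(0, "C"), (3, "CCC")]
s2 = [(4, "C"), (7, "T")]
print(diff(s1, s2))                       # [(0, 'C'), (3, 'C'), (5, 'C')]
print(comm(s1, s2))                       # [(4, 'C')]
print(comm(s1, s2, match_partial=False))  # []
```

With partial matching the middle C of the MNP at 3-6 is recognised as shared with the SNP at position 4; without it the two variants are simply different.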
-
comm(other, match_partial=True, match_ambig=False)¶ Returns a new variant set object which has variants present in both self and other.
- Parameters
match_partial (bool) – If False, common positions won’t be searched in non-identical MNPs. Default: True
match_ambig (bool) – If False, ambiguous indels will be treated as regular indels. Default: False
- Returns
comm – Common variants.
- Return type
Notes
Here is an example of the comm operation on two variant sets s1 and s2. Positions identical to REF are empty:
REF GATTGGTAC
---------------------
s1 C CCC T CCC G
---------------------
s2 T AC T
=====================
s1.comm(s2) C T C
-
comm_vrt(other, match_partial=True, match_ambig=False)¶ Generate variants common between self and other.
This creates a lazily evaluated object of class CmpSet. This object can be further inspected with .iter_vrt() or .region()
- Parameters
other (variant set) – Variant set to compare against self
match_partial (bool, optional) – If False, common positions won’t be searched in non-identical MNPs. Default: True
match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels. Default: False
- Returns
Transient comparison set
- Return type
-
diff_vrt(other, match_partial=True, match_ambig=False)¶ Generate variants different between self and other.
This creates a lazily evaluated object of class CmpSet. This object can be further inspected with .iter_vrt() or .region()
- Parameters
other (variant set) – Variant set to compare against self
match_partial (bool, optional) – If False, common positions won’t be searched in non-identical MNPs. Default: True
match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels. Default: False
- Returns
Transient comparison set
- Return type
-
find_vrt(chrom=None, start=0, end=2147483647, rgn=None, expand=False, check_order=False)¶ Finds variants in the specified region.
If neither chrom nor rgn parameters are given, it traverses all variants.
- Parameters
chrom (str, optional) – Chromosome to search variants on. If not given, all chroms are traversed and start and end parameters are ignored.
start (int, optional) – Start of search region (0-based). Default: 0
end (int, optional) – End of search region (position end is excluded). Default: max chrom length
rgn (str, optional) – If given, chrom, start and end parameters are ignored. A string specifying the search region, e.g. chr15:365,008-365,958.
expand (bool, optional) – Indicates whether to traverse haplotypes. Default: False
- Returns
- Return type
List of genomvar.variant.VariantBase subclass objects
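The two ways of specifying a region, and what "position end is excluded" means for a hit, can be sketched without the library. Both helpers below are hypothetical. In particular, the sketch takes the numbers in the region string as they are, since the text does not say whether genomvar shifts rgn coordinates.

```python
def parse_region(rgn):
    """Split 'chr15:365,008-365,958' into (chrom, start, end); commas are ignored."""
    chrom, _, span = rgn.rpartition(":")
    start, _, end = span.replace(",", "").partition("-")
    return chrom, int(start), int(end)

def overlaps(vrt_start, vrt_end, start, end):
    """True if half-open ranges [vrt_start, vrt_end) and [start, end) share a position."""
    return vrt_start < end and start < vrt_end

chrom, start, end = parse_region("chr15:365,008-365,958")
print(chrom, start, end)                     # chr15 365008 365958
print(overlaps(365957, 365958, start, end))  # True
print(overlaps(365958, 365959, start, end))  # False: position `end` is excluded
```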
-
get_factory(normindel=False)¶ Returns a VariantFactory object. It will inherit the same reference as was used for instantiation of
self
-
iter_vrt_by_chrom(**kwds)¶ Iterates variants grouped by chromosome.
-
match(vrt, match_partial=True, match_ambig=False)¶ Find variants matching vrt.
The method checks whether variant vrt is present in the set.
- Parameters
vrt (VariantBase-like) – Variant to search match for.
match_partial (bool, optional) – If False, common positions won’t be searched in non-identical MNPs (relevant if vrt is an MNP, an SNP or a haplotype containing these). Default: True
match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels (relevant if vrt is an AmbigIndel or a haplotype containing ones). Default: False
- Returns
dict is returned only if vrt is a Haplotype
- Return type
list or dict with matching variants
-
ovlp(vrt0, match_ambig=False)¶ Returns variants altering the same positions as vrt0.
- Parameters
vrt0 (VariantBase-like) – Variant to search overlap with.
match_ambig (bool) – If False, ambiguous indels are treated as regular indels. Default: False
- Returns
- Return type
list of variants
-
to_vcf(out, reference=None, info_spec=None, format_spec=None, samples=None)¶ Writes a minimal VCF with variants from the set.
INFO and SAMPLE data from source data is not preserved.
- Parameters
out (handle or str) – If a string, it is the path to a file; otherwise a handle to write variants to.
reference (Reference or str, optional) – If a string, it is the path to a reference FASTA. Otherwise an object of genomvar.Reference is expected. Not necessary if self was instantiated with a reference. If given, this argument takes precedence over the reference used at instantiation.
info_spec (list of tuples) – Each tuple contains the specification of an INFO field in the order per VCF spec, i.e. NAME, NUMBER, TYPE, DESCRIPTION, SOURCE, VERSION. Only the first three fields (name, number and type) are required. If the variant set was instantiated from a VCF file, the spec obtained from it will be used. Providing an empty sequence will omit INFO fields in the output.
format_spec (spec of FORMAT fields, similar to info_spec) –
- Returns
- Return type
None
Examples
>>> with open(fname, 'wt') as fh:
...     vs.to_vcf(fh, info_spec=[('DP4', 4, 'Integer'),
...                              ('NSV', 1, 'Integer')])
-
-
class
genomvar.varset.CmpSet(left, right, op, match_partial=True, match_ambig=False)¶ Class CmpSet is a lazy implementation of variant set comparison, which is returned by the methods diff_vrt and comm_vrt of any variant set class instance. Useful for comparisons including large VCF files while keeping the memory profile low.
-
region(chrom=None, start=0, end=2147483647, rgn=None)¶ Generates variants corresponding to the comparison within a region.
The region can be specified using chrom, start, end or rgn with notation like chr15:1000-2000
- Parameters
chrom (str) – Chromosome
start (int, optional) – Start of region. Default: 0
end (int) – End of region. Default: max chrom length
rgn (str) – Region to yield variants from
- Yields
-
iter_vrt(callback=None)¶ Generates variants corresponding to the comparison.
- Parameters
callback (callable) – A function to be called on a match in the right variant set for every variant of the left variant set. The result will be stored in vrt.attrib['cmp'] of the yielded variant
- Yields
variants (genomvar.variant.GenomVariant or tuple (int, genomvar.variant.GenomVariant)) – if action is diff or comm, the function yields instances of genomvar.variant.GenomVariant. If action is all, the function yields tuples where int is 0 if the variant is present in the first set; 1 if in the second; 2 if in both.
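The idea of a lazy comparison that tags each item with 0, 1 or 2 can be sketched as a generator over two sorted streams. This is only an illustration of the technique: `compare_sorted` is a hypothetical function, it assumes both inputs are sorted and free of duplicates, and it compares plain tuples rather than variant objects.

```python
def compare_sorted(left, right):
    """Lazily yield (flag, item): 0 = only in left, 1 = only in right, 2 = in both."""
    left, right = iter(left), iter(right)
    missing = object()                       # sentinel for an exhausted stream
    a, b = next(left, missing), next(right, missing)
    while a is not missing or b is not missing:
        if b is missing or (a is not missing and a < b):
            yield 0, a
            a = next(left, missing)
        elif a is missing or b < a:
            yield 1, b
            b = next(right, missing)
        else:                                # equal items: present in both
            yield 2, a
            a, b = next(left, missing), next(right, missing)

s1 = [("chr1", 14, "A"), ("chr1", 30, "T")]
s2 = [("chr1", 30, "T"), ("chr2", 5, "G")]
print(list(compare_sorted(s1, s2)))
```

Because only one item from each stream is held at a time, memory use does not grow with the size of the inputs, which is the property the class description highlights.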
-
-
class
genomvar.varset.VariantSetFromFile(file, reference=None, index=None, parse_info=False, parse_samples=False)¶ Variant set representing variants contained in underlying VCF file specified on initialization.
-
file¶ Variant file storing variants.
- Type
str
-
get_factory()¶ Returns a VariantFactory object. It will inherit the same reference as was used for instantiation of
self
-
iter_vrt_by_chrom(check_order=False)¶ Iterates variants grouped by chromosome.
-
comm_vrt(other, match_partial=True, match_ambig=False)¶ Generate variants common between self and other.
This creates a lazily evaluated object of class CmpSet. This object can be further inspected with .iter_vrt() or .region()
- Parameters
other (variant set) – Variant set to compare against self
match_partial (bool, optional) – If False, common positions won’t be searched in non-identical MNPs. Default: True
match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels. Default: False
- Returns
Transient comparison set
- Return type
-
diff_vrt(other, match_partial=True, match_ambig=False)¶ Generate variants different between self and other.
This creates a lazily evaluated object of class CmpSet. This object can be further inspected with .iter_vrt() or .region()
- Parameters
other (variant set) – Variant set to compare against self
match_partial (bool, optional) – If False, common positions won’t be searched in non-identical MNPs. Default: True
match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels. Default: False
- Returns
Transient comparison set
- Return type
-
find_vrt(chrom=None, start=0, end=2147483647, rgn=None, expand=False, check_order=False)¶ Finds variants in the specified region.
If neither chrom nor rgn parameters are given, it traverses all variants.
- Parameters
chrom (str, optional) – Chromosome to search variants on. If not given, all chroms are traversed and start and end parameters are ignored.
start (int, optional) – Start of search region (0-based). Default: 0
end (int, optional) – End of search region (position end is excluded). Default: max chrom length
rgn (str, optional) – If given, chrom, start and end parameters are ignored. A string specifying the search region, e.g. chr15:365,008-365,958.
expand (bool, optional) – Indicates whether to traverse haplotypes. Default: False
- Returns
- Return type
List of genomvar.variant.VariantBase subclass objects
-
iter_vrt(expand=False)¶ Iterate over all variants in underlying VCF.
- Parameters
expand (bool, optional) – If True, variants in encountered haplotypes are yielded. Default: False
- Yields
variant (genomvar.variant.GenomVariant)
-
match(vrt, match_partial=True, match_ambig=False)¶ Find variants matching vrt.
The method checks whether variant vrt is present in the set.
- Parameters
vrt (VariantBase-like) – Variant to search match for.
match_partial (bool, optional) – If False, common positions won’t be searched in non-identical MNPs (relevant if vrt is an MNP, an SNP or a haplotype containing these). Default: True
match_ambig (bool, optional) – If False, ambiguous indels will be treated as regular indels (relevant if vrt is an AmbigIndel or a haplotype containing ones). Default: False
- Returns
dict is returned only if vrt is a Haplotype
- Return type
list or dict with matching variants
-
ovlp(vrt0, match_ambig=False)¶ Returns variants altering the same positions as vrt0.
- Parameters
vrt0 (VariantBase-like) – Variant to search overlap with.
match_ambig (bool) – If False, ambiguous indels are treated as regular indels. Default: False
- Returns
- Return type
list of variants
-
to_vcf(out, reference=None, info_spec=None, format_spec=None, samples=None)¶ Writes a minimal VCF with variants from the set.
INFO and SAMPLE data from source data is not preserved.
- Parameters
out (handle or str) – If a string, it is the path to a file; otherwise a handle to write variants to.
reference (Reference or str, optional) – If a string, it is the path to a reference FASTA. Otherwise an object of genomvar.Reference is expected. Not necessary if self was instantiated with a reference. If given, this argument takes precedence over the reference used at instantiation.
info_spec (list of tuples) – Each tuple contains the specification of an INFO field in the order per VCF spec, i.e. NAME, NUMBER, TYPE, DESCRIPTION, SOURCE, VERSION. Only the first three fields (name, number and type) are required. If the variant set was instantiated from a VCF file, the spec obtained from it will be used. Providing an empty sequence will omit INFO fields in the output.
format_spec (spec of FORMAT fields, similar to info_spec) –
- Returns
- Return type
None
Examples
>>> with open(fname, 'wt') as fh:
...     vs.to_vcf(fh, info_spec=[('DP4', 4, 'Integer'),
...                              ('NSV', 1, 'Integer')])
-
property
chroms¶ Chromosome set
-
Module vcf.py¶
The module contains classes needed for VCF parsing.
The main class is VCFReader, which is instantiated from a VCF file.
The VCF can be gzipped. Bgzipping and tabix-derived indexing are also
supported for random coordinate-based access.
The VCFReader class can iterate over rows, which are tuple-like objects
containing VCF field strings as attributes without conversion (except
POS, which is converted to int).
Alternatively, iteration over variants is supported. It yields
genomvar.variant.GenomVariant objects.
>>> reader = VCFReader('example.vcf.gz')
>>> vrt = next(reader.iter_vrt(parse_info=True, parse_samples=True))
>>> print(vrt)
<GenomVariant: Del chr24:23-24 G/->
>>> print(vrt.attrib['info'])
{'NSV': 2, 'AF': 0.5, 'DP4': (11, 22, 33, 44), 'ECNT': 1, 'pl': 3, 'mt': 'SUBSTITUTE', 'RECN': 18, 'STR': None}
If the parse_info and parse_samples parameters are True, then the
INFO and SAMPLE fields contained in the VCF are parsed and split
according to the allele captured by the variant object. For performance
reasons these parameters default to False, in which case these fields are not
parsed.
The .bcf format is also supported; use BCFReader.
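What "split according to the allele" can mean for an INFO string is sketched below in plain Python. The function is hypothetical and simplified: the `per_allele` argument stands in for the Number=A declarations a real parser would read from the VCF header, and values are kept as strings instead of being converted to their declared types.

```python
def split_info(info, n_alt, per_allele=("AF",)):
    """Parse 'K=V;K2=a,b;FLAG' and return one dict per ALT allele.

    Keys listed in `per_allele` (an assumption standing in for Number=A fields)
    carry one value per ALT allele; all other keys are copied to every allele.
    """
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value.split(",") if sep else True   # a bare key is a flag
    result = []
    for i in range(n_alt):
        allele = {}
        for key, value in parsed.items():
            if key in per_allele:
                allele[key] = value[i]        # pick the value for this allele
            elif value is True:
                allele[key] = True
            else:
                allele[key] = ",".join(value)
        result.append(allele)
    return result

print(split_info("AF=0.5,0.1;DP=10;DB", n_alt=2))
```

Each resulting dict then describes a single alternative allele, which matches the one-variant-per-allele view that variant objects give of a multi-allelic VCF row.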
-
class
genomvar.vcf.VCFWriter(reference=None, info_spec=None, format_spec=None, samples=None)¶ Class for writing variant to VCF format.
-
get_row(vrt, **kwds)¶ Formats a variant to a
genomvar.vcf_utils.VCFRowinstance.For indels writer with reference might be needeed to costruct correct REF field.
- Parameters
vrt (Variant instance) – variant to get row for.
kwds (VCF fields) –
optional. If given, these and only these parameters are used to populate the corresponding VCF fields:
id, qual, filter, info, sampdata. These parameters are taken as is and converted to string before returning a VCFRow. If a keyword is not given, the corresponding field will be populated from the attrib attribute if possible. Any other keyword arguments are ignored.
- Returns
row –
genomvar.vcf_utils.VCFRow object instance
- Return type
Notes
>>> factory = variant.VariantFactory()
>>> writer = VCFWriter()
>>> v1 = factory.from_edit('chr24', 2093, 'TGG', 'CCC')
>>> row = writer.get_row(v1)
>>> row
<VCFRow chr24:2094 TGG->CCC>
>>> print(row)
chr24 2094 . TGG CCC . . .
>>> print(writer.get_row(v1, id=123, info='DP=10'))
chr24 2094 123 TGG CCC . . DP=10
>>> 
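In the example above the variant is created at 0-based position 2093 and the row reports POS 2094, since VCF positions are 1-based. The coordinate shift and the tab-separated layout can be sketched without genomvar as follows; format_vcf_row is a hypothetical helper whose keyword names merely mirror the ones documented for get_row.

```python
def format_vcf_row(chrom, start, ref, alt, id=".", qual=".", filter=".", info="."):
    """Format a VCF data line from a 0-based start coordinate.

    Hypothetical helper; keyword names follow the get_row documentation,
    which is why the builtins `id` and `filter` are shadowed here.
    """
    pos = start + 1  # VCF POS is 1-based
    fields = [chrom, pos, id, ref, alt, qual, filter, info]
    # Every value is converted to a string before joining with tabs.
    return "\t".join(str(f) for f in fields)


print(format_vcf_row("chr24", 2093, "TGG", "CCC"))
print(format_vcf_row("chr24", 2093, "TGG", "CCC", id=123, info="DP=10"))
```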
-
-
class
genomvar.vcf.VCFReader(vcf, index=None, reference=None)¶ Class to read VCF files.
-
get_factory(normindel=False)¶ Returns a factory based on normindel parameter.
-
find_rows(chrom=None, start=None, end=None, rgn=None)¶ Yields rows of variant file
-
iter_rows(check_order=False)¶ Yields rows of variant file
-
iter_rows_by_chrom(check_order=False)¶ Yields rows grouped by chromosome
-
get_records(parse_info=False, parse_samples=False, normindel=False)¶ Returns parsed variant data as a dict of NumPy arrays with structured dtype.
-
iter_vrt(check_order=False, parse_info=False, normindel=False, parse_samples=False)¶ Yields variant objects.
- Parameters
parse_info (bool) – Whether INFO fields should be parsed. Default: False
parse_samples (bool) – Whether SAMPLE data should be parsed. Default: False
check_order (bool) – If True, an exception is raised on unsorted VCF rows. Default: False
normindel (bool) – If True, insertions and deletions will be left-normalized. Requires a reference at
VCFReader instantiation.
- Yields
vrt (genomvar.variant.GenomVariant) – Variant object
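The normindel option is described as left-normalizing insertions and deletions, which is why a reference is required. One common way to left-align a deletion is sketched here on a plain string reference, assuming 0-based, half-open coordinates; left_align_deletion is a hypothetical name, and genomvar's own normalization may differ in detail.

```python
def left_align_deletion(ref, start, end):
    """Shift the deletion of ref[start:end] as far left as possible.

    Moving the deleted window one base to the left yields the same edited
    sequence exactly when the base entering the window on the left equals
    the base leaving it on the right.
    """
    while start > 0 and ref[start - 1] == ref[end - 1]:
        start -= 1
        end -= 1
    return start, end


reference = "GCACACAT"
start, end = left_align_deletion(reference, 5, 7)
print(start, end, reference[start:end])

# Both placements produce the same edited sequence.
assert reference[:5] + reference[7:] == reference[:start] + reference[end:]
```

In a repeat such as CACACA many placements of the same deletion are equivalent, so picking the leftmost one gives a canonical form that makes variants comparable.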
-
iter_vrt_by_chrom(parse_info=False, parse_samples=False, normindel=False, check_order=False)¶ Generates variants grouped by chromosome
- Parameters
parse_info (bool) – Indicates whether INFO fields should be parsed. Default: False
parse_samples (bool) – Indicates whether SAMPLE data should be parsed. Default: False
normindel (bool) – If True, indels will be normalized.
VCFReader should have been instantiated with a reference. Default: False
check_order (bool) – If True, the VCF will be checked for sorting. Default: False
- Yields
(chrom, it) (tuple of str and iterator) –
it yields variant.GenomVariant objects
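The (chrom, it) pairs described above follow the same pattern as itertools.groupby over chromosome-sorted records. The sketch below uses plain tuples instead of variant objects; iter_by_chrom is a hypothetical name and is not how genomvar implements grouping.

```python
from itertools import groupby
from operator import itemgetter


def iter_by_chrom(records):
    """Yield (chrom, iterator) pairs from records sorted by chromosome.

    Each record is assumed to be a tuple whose first item is the chromosome.
    """
    for chrom, group in groupby(records, key=itemgetter(0)):
        yield chrom, group


records = [("chr1", 10), ("chr1", 25), ("chr2", 7)]
for chrom, it in iter_by_chrom(records):
    print(chrom, [pos for _, pos in it])
```

With groupby each inner iterator has to be consumed before moving on to the next chromosome, because advancing the outer loop invalidates the previous group.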
-
find_vrt(chrom=None, start=0, end=2147483647, check_order=False, parse_info=False, normindel=False, parse_samples=False)¶ Yields variant objects from a specified region
-
get_chroms(allow_no_index=False)¶ Returns
ChromSet corresponding to VCF. If indexed then index is used for faster access. Alternatively if allow_no_index is True the whole file is parsed to get chromosome ordering.
-
-
class
genomvar.vcf.BCFReader(bcf, index=False, reference=None)¶ -
iter_rows(check_order=None)¶ Yields rows of variant file
-
find_rows(chrom=None, start=None, end=None, rgn=None)¶ Yields rows of variant file
-
find_vrt(chrom=None, start=0, end=2147483647, check_order=False, parse_info=False, normindel=False, parse_samples=False)¶ Yields variant objects from a specified region
-
get_chroms(allow_no_index=False)¶ Returns
ChromSet corresponding to VCF. If indexed then index is used for faster access. Alternatively if allow_no_index is True the whole file is parsed to get chromosome ordering.
-
get_factory(normindel=False)¶ Returns a factory based on normindel parameter.
-
get_records(parse_info=False, parse_samples=False, normindel=False)¶ Returns parsed variant data as a dict of NumPy arrays with structured dtype.
-
iter_rows_by_chrom(check_order=False)¶ Yields rows grouped by chromosome
-
iter_vrt(check_order=False, parse_info=False, normindel=False, parse_samples=False)¶ Yields variant objects.
- Parameters
parse_info (bool) – Whether INFO fields should be parsed. Default: False
parse_samples (bool) – Whether SAMPLE data should be parsed. Default: False
check_order (bool) – If True, an exception is raised on unsorted VCF rows. Default: False
normindel (bool) – If True, insertions and deletions will be left-normalized. Requires a reference at
VCFReader instantiation.
- Yields
vrt (genomvar.variant.GenomVariant) – Variant object
-
iter_vrt_by_chrom(parse_info=False, parse_samples=False, normindel=False, check_order=False)¶ Generates variants grouped by chromosome
- Parameters
parse_info (bool) – Indicates whether INFO fields should be parsed. Default: False
parse_samples (bool) – Indicates whether SAMPLE data should be parsed. Default: False
normindel (bool) – If True, indels will be normalized.
VCFReader should have been instantiated with a reference. Default: False
check_order (bool) – If True, the VCF will be checked for sorting. Default: False
- Yields
(chrom, it) (tuple of str and iterator) –
it yields variant.GenomVariant objects
-
Module vcf_utils.py¶
-
class
genomvar.vcf_utils.VCF_INFO_OR_FORMAT_SPEC(NAME, NUMBER, TYPE, DESCRIPTION, SOURCE, VERSION)¶ -
DESCRIPTION¶ Alias for field number 3
-
NAME¶ Alias for field number 0
-
NUMBER¶ Alias for field number 1
-
SOURCE¶ Alias for field number 4
-
TYPE¶ Alias for field number 2
-
VERSION¶ Alias for field number 5
-
count(value, /)¶ Return number of occurrences of value.
-
index(value, start=0, stop=9223372036854775807, /)¶ Return first index of value.
Raises ValueError if the value is not present.
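The field aliases and the count and index methods listed above are the standard interface of a named tuple. A stand-in with the same field order behaves as follows; Spec is a hypothetical name, and the None defaults for the last three fields are an assumption made for this sketch rather than documented behaviour of VCF_INFO_OR_FORMAT_SPEC.

```python
from collections import namedtuple

# Stand-in mirroring the documented field order.
# The defaults for DESCRIPTION, SOURCE and VERSION are an assumption.
Spec = namedtuple(
    "Spec",
    ["NAME", "NUMBER", "TYPE", "DESCRIPTION", "SOURCE", "VERSION"],
    defaults=(None, None, None),
)

spec = Spec("DP4", 4, "Integer")
print(spec.NAME, spec[1], spec.TYPE)  # access by alias or by field number
print(spec.index("Integer"))          # first index holding this value
print(spec.count(None))               # number of fields left unset
```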
-
-
class
genomvar.vcf_utils.VCFRow(CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT=None, SAMPLES=None, rnum=None)¶ Tuple-like object storing variant information in VCF-like form.
str() returns a string, formatted as a row in VCF file.
-
genomvar.vcf_utils.issequence(seq)¶ Is seq a sequence (ndarray, list or tuple)?
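A check of this kind can be sketched as below. This is not genomvar's source: is_sequence is a hypothetical name, NumPy is treated as optional, and the choice to reject strings is inferred from the docstring listing only ndarray, list and tuple.

```python
def is_sequence(seq):
    """Return True for a list or tuple, and for numpy.ndarray if NumPy is installed."""
    types = (list, tuple)
    try:
        import numpy as np  # optional dependency in this sketch
        types += (np.ndarray,)
    except ImportError:
        pass
    return isinstance(seq, types)


print(is_sequence([1, 2]), is_sequence((1, 2)), is_sequence("12"))
```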