xyalign.variants module
-
class xyalign.variants.VCFFile(filepath, bgzip='bgzip', tabix='tabix', no_initial_compress=False)
A class for working with external vcf files.
Attributes
filepath (str) Full path to external vcf file
bgzip (str) Full path to bgzip. Default = 'bgzip'
tabix (str) Full path to tabix. Default = 'tabix'
-
is_bgzipped()
Checks whether the vcf file is gzipped, simply by looking for a .gz or .bgz ending. If a .gz or .bgz ending exists, assumes the file is compressed using bgzip.
Returns: bool
True if the file name ends in .gz or .bgz, False otherwise
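The extension check described above can be sketched as follows. This is a minimal illustration, not the module's own code; `looks_bgzipped` is a hypothetical name, and like the method it only inspects the file name, never the file contents.

```python
def looks_bgzipped(filepath):
    """Return True if the path ends in .gz or .bgz (file name check only)."""
    # str.endswith accepts a tuple of suffixes
    return filepath.endswith((".gz", ".bgz"))


print(looks_bgzipped("sample.vcf.gz"))   # compressed by name
print(looks_bgzipped("sample.vcf"))      # plain text by name
```

A file that was compressed with plain gzip would also pass this check, which is why the index_vcf note insists on bgzip.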
-
compress_vcf()
Compresses vcf file using bgzip.
Returns: bool
True if successful
Raises: RuntimeError
If return code from external call is not 0
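The return-code contract above (True on success, RuntimeError on a non-zero exit) can be sketched with the standard library. This is an assumption about the general pattern, not the module's actual implementation; `run_external` is a hypothetical helper, and the Python interpreter is used as a stand-in command so the sketch runs without bgzip installed.

```python
import subprocess
import sys


def run_external(command):
    """Run an external command; return True on success, raise otherwise."""
    return_code = subprocess.call(command)
    if return_code != 0:
        raise RuntimeError(
            "Command {} exited with code {}".format(command, return_code))
    return True


# Stand-in for a real call such as ["bgzip", "-f", "sample.vcf"]
# (that bgzip invocation is an assumption, not taken from the source).
ok = run_external([sys.executable, "-c", "pass"])
print(ok)
```

Raising on failure rather than returning False means a caller such as index_vcf cannot silently continue with an uncompressed file.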
-
index_vcf()
Indexes vcf file using tabix. If file does not end in .gz, will compress with bgzip (by calling self.compress_vcf).
Note: Files MUST be compressed using bgzip.
Returns: bool
True if successful.
Raises: RuntimeError
If return code from external call is not 0.
-
parse_platypus_VCF(site_qual, genotype_qual, depth, chrom)
Parse vcf generated by Platypus to grab read balance. Note that this is hard-coded to Platypus (version 0.8.1) and will not generalize to vcfs generated with other programs (and, potentially, other versions of Platypus).
Parameters: site_qual : int
Minimum (PHRED) site quality at which sites should be included
genotype_qual : int
Minimum (PHRED) genotype quality at which sites should be included
depth : int
Minimum depth at which sites should be included
chrom : str
Name of the chromosome to include
Returns: tuple
Five corresponding arrays of the same length:
(position across the chromosome, site quality, read balance, genotype quality, and depth)
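To make the five returned values concrete, here is a rough sketch of extracting them from one single-sample VCF record. It assumes that the sample column carries GQ, NR (reads covering the site) and NV (reads supporting the variant), and that read balance is NV / NR; those field names are an assumption about Platypus output and are not confirmed by this page. The function names are hypothetical, and multi-allelic sites (comma-separated NR/NV) are not handled.

```python
def parse_record(line):
    """Return (position, site quality, read balance, genotype quality, depth)."""
    fields = line.rstrip("\n").split("\t")
    position = int(fields[1])
    site_quality = float(fields[5])
    # Pair FORMAT keys with the first sample's values
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    depth = int(sample["NR"])        # assumed: total reads at the site
    support = int(sample["NV"])      # assumed: reads carrying the variant
    genotype_quality = float(sample["GQ"])
    balance = float(support) / depth if depth > 0 else 0.0
    return position, site_quality, balance, genotype_quality, depth


def passes_filters(record, site_qual, genotype_qual, depth):
    """Apply the three minimum thresholds (inclusive) to a parsed record."""
    _, qual, _, gq, dp = record
    return qual >= site_qual and gq >= genotype_qual and dp >= depth


# Made-up example record, for illustration only
example = "\t".join(["chrX", "1000", ".", "A", "G", "200", "PASS", ".",
                     "GT:GL:GOF:GQ:NR:NV", "0/1:-10,0,-10:1:99:20:5"])
record = parse_record(example)
print(record, passes_filters(record, 30, 30, 4))
```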
-
plot_variants_per_chrom(chrom_list, sampleID, output_prefix, site_qual, genotype_qual, depth, MarkerSize, MarkerAlpha, bamfile_obj, variant_caller, homogenize, dataframe_out, min_count, window_size, x_scale=1000000, target_file=None, include_fixed=False)
Parses a vcf file and plots read balance in separate plots for each chromosome in the input list.
Parameters: chrom_list : list
Chromosomes to include
sampleID : str
Sample ID (for plot titles)
output_prefix : str
Full path to and prefix of desired output plots
site_qual : int
Minimum (PHRED) site quality at which sites should be included
genotype_qual : int
Minimum (PHRED) genotype quality at which sites should be included
depth : int
Minimum depth at which sites should be included
MarkerSize : float
Size of markers (matplotlib sizes) to use in the figure
MarkerAlpha : float
Transparency (matplotlib values, 0 to 1) of markers
bamfile_obj : BamFile() object
Used to get chromosome lengths only
variant_caller : str
Variant caller used to generate vcf - currently only “platypus” supported
homogenize : bool
If True, all read balance values less than 0.5 will be transformed by subtracting the value from 1. For example, the values 0.25 and 0.75 would be treated as equivalent.
dataframe_out : str
Full path of file to write pandas dataframe to. Will overwrite if exists
min_count : int
Minimum number of variants a window must contain to be included in plotting.
window_size : int or None
If int, the window size to use for sliding window analyses. If None, intervals are taken from target_file
x_scale : int
Divide all x values (including Xlim) by this value. Default is 1000000 (1 Mb)
target_file : str
Path to bed_file containing regions to analyze instead of windows of a fixed size. Will only be engaged if window_size is None
include_fixed : bool
If False, only plots histogram for values between 0.05 and 1.0. If True, plots histogram of all variants.
Returns: int
0 if there are variants to analyze; 1 if there are no variants to analyze on any chromosome
-
-
xyalign.variants.read_balance_per_window(chrom, positions, readBalance, sampleID, homogenize, chr_len, window_size, target_file=None)
Calculates mean read balance per genomic window (defined by size or by an external target bed file) for a given chromosome. Takes as input an array of positions and an array of read balances, whose order must correspond exactly. In addition, the positions are expected to ALL BE ON THE SAME CHROMOSOME and to be in numerically sorted order (i.e., the output of parse_platypus_VCF()).
Parameters: chrom : str
Name of the chromosome
positions : numpy array
Positions along the chromosome (same length as readBalance)
readBalance : numpy array
Read balance corresponding with the positions in the positions array
sampleID : str
Sample name or id to include in the plot title
homogenize : bool
If True, all read balance values less than 0.5 will be transformed by subtracting the value from 1. For example, the values 0.25 and 0.75 would be treated as equivalent.
chr_len : int
Length of chromosome. Ignored if target_file is provided.
window_size : int or None
If int, the window size to use for sliding window analyses. If None, intervals are taken from target_file
target_file : str
Path to bed file containing regions to analyze instead of windows of a fixed size. Will only be engaged if window_size is None
Returns: pandas dataframe
With columns: “chrom”, “start”, “stop”, “balance”, and “count”
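The fixed-size case can be sketched in plain Python. This is an illustration under stated assumptions, not the function's actual code: windows are taken to be non-overlapping and half-open [start, stop), empty windows get a balance of None, and `window_means` is a hypothetical name. The rows carry the same five fields as the documented dataframe columns.

```python
def window_means(chrom, positions, balances, chr_len, window_size):
    """Return (chrom, start, stop, mean balance, count) for each window."""
    rows = []
    for start in range(0, chr_len, window_size):
        stop = min(start + window_size, chr_len)
        # Collect balances whose position falls inside this window
        in_window = [b for p, b in zip(positions, balances)
                     if start <= p < stop]
        count = len(in_window)
        mean = sum(in_window) / float(count) if count else None
        rows.append((chrom, start, stop, mean, count))
    return rows


rows = window_means("chrX", [5, 15, 25, 95], [0.2, 0.4, 0.6, 1.0], 100, 50)
for row in rows:
    print(row)
```

The rows can be handed to pandas.DataFrame with columns=["chrom", "start", "stop", "balance", "count"] to get a table shaped like the documented return value.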
-
xyalign.variants.plot_read_balance(chrom, positions, readBalance, sampleID, output_prefix, MarkerSize, MarkerAlpha, homogenize, chrom_len, x_scale=1000000)
Plots read balance at each SNP along a chromosome.
Parameters: chrom : str
Name of the chromosome
positions : numpy array
Positions along the chromosome (same length as readBalance)
readBalance : numpy array
Read balance corresponding with the positions in the positions array
sampleID : str
Sample name or id to include in the plot title
output_prefix : str
Desired prefix (including full path) of the output files
MarkerSize : float
Size of markers (matplotlib sizes) to use in the figure
MarkerAlpha : float
Transparency (matplotlib values) of markers for the figure
homogenize : bool
If True, all read balance values less than 0.5 will be transformed by subtracting the value from 1. For example, the values 0.25 and 0.75 would be treated as equivalent.
chrom_len : int
Length of chromosome
x_scale : int
Divide all x values (including Xlim) by this value. Default is 1000000 (1 Mb)
Returns: int
0
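The homogenize transformation described in the parameter list is simple enough to show directly. The sketch below follows the documented rule (values below 0.5 are replaced by 1 minus the value); `homogenize_balance` is a hypothetical name, not a function in this module.

```python
def homogenize_balance(values):
    """Fold read balance values below 0.5 onto the 0.5 to 1.0 range."""
    return [1.0 - v if v < 0.5 else v for v in values]


folded = homogenize_balance([0.25, 0.75, 0.5])
print(folded)  # [0.75, 0.75, 0.5]
```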
-
xyalign.variants.hist_read_balance(chrom, readBalance, sampleID, homogenize, output_prefix, include_fixed=False)
Plots a histogram of read balance values between 0.05 and 1.0 (non-inclusive).
Parameters: chrom : str
Name of the chromosome
readBalance : list or numpy array
Read balance values
sampleID : str
Sample name or id to include in the plot title
homogenize : bool
If True, all read balance values less than 0.5 will be transformed by subtracting the value from 1. For example, the values 0.25 and 0.75 would be treated as equivalent.
output_prefix : str
Desired prefix (including full path) of the output files
include_fixed : bool
If False, only plots histogram for values between 0.05 and 1.0. If True, plots histogram of all variants.
Returns: int
0 if plotting successful, 1 otherwise.
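The effect of include_fixed on which values reach the histogram can be sketched as a filter. This assumes both bounds are exclusive, as the "non-inclusive" wording suggests; `select_for_histogram` is a hypothetical name used only for illustration.

```python
def select_for_histogram(values, include_fixed=False):
    """Keep all values, or only those strictly between 0.05 and 1.0."""
    if include_fixed:
        return list(values)
    return [v for v in values if 0.05 < v < 1.0]


balances = [0.0, 0.05, 0.3, 0.75, 1.0]
print(select_for_histogram(balances))
print(select_for_histogram(balances, include_fixed=True))
```

Dropping values at 1.0 removes fixed sites, which would otherwise dominate the histogram.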