xyalign.variants module

class xyalign.variants.VCFFile(filepath, bgzip='bgzip', tabix='tabix', no_initial_compress=False)

A class for working with external vcf files.

Attributes

filepath (str) Full path to external vcf file
bgzip (str) Full path to bgzip. Default = 'bgzip'
tabix (str) Full path to tabix. Default = 'tabix'

is_bgzipped()

Checks whether the vcf file is compressed by looking only at the filename for a .gz or .bgz ending; the file contents are not inspected. If a .gz or .bgz ending exists, assumes the file is compressed using bgzip.

Returns:

bool

True if the filename ends in .gz or .bgz, False otherwise
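
The suffix check described above can be sketched as follows. This is a minimal illustration, not the module's source; the function name looks_bgzipped is hypothetical. Note that a suffix check cannot tell bgzip output apart from plain gzip output.

```python
def looks_bgzipped(filepath):
    """Return True if the filename ends in .gz or .bgz.

    Only the name is examined; the file is never opened.
    """
    # str.endswith accepts a tuple of candidate suffixes
    return filepath.endswith((".gz", ".bgz"))


if __name__ == "__main__":
    for name in ("sample.vcf", "sample.vcf.gz", "sample.vcf.bgz"):
        print(name, looks_bgzipped(name))
```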

compress_vcf()

Compresses vcf file using bgzip.

Returns:

bool

True if successful

Raises:

RuntimeError

If return code from external call is not 0

index_vcf()

Indexes vcf file using tabix. If file does not end in .gz, will compress with bgzip (by calling self.compress_vcf).

Note: Files MUST be compressed using bgzip.

Returns:

bool

True if successful.

Raises:

RuntimeError

If return code from external call is not 0.
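
Both compress_vcf() and index_vcf() are documented as calling an external program and raising RuntimeError on a non-zero return code. A rough sketch of that pattern follows. The helper names (run_external, bgzip_command, tabix_command) and the exact flags are assumptions for illustration and may differ from what the module passes.

```python
import subprocess


def run_external(command):
    """Run a command (list of str); raise RuntimeError on non-zero exit."""
    return_code = subprocess.call(command)
    if return_code != 0:
        raise RuntimeError(
            "Command exited with code {}: {}".format(
                return_code, " ".join(command)))
    return True


def bgzip_command(filepath, bgzip="bgzip"):
    # Assumed flags: -f overwrites an existing compressed file
    return [bgzip, "-f", filepath]


def tabix_command(filepath, tabix="tabix"):
    # Assumed flags: -f overwrites an existing index, -p vcf selects the VCF preset
    return [tabix, "-f", "-p", "vcf", filepath]


# Example (requires bgzip and tabix on the PATH):
#     run_external(bgzip_command("sample.vcf"))
#     run_external(tabix_command("sample.vcf.gz"))
```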

parse_platypus_VCF(site_qual, genotype_qual, depth, chrom)

Parses a vcf generated by Platypus to extract read balance. Note that this is hard-coded to Platypus (version 0.8.1) and will not generalize to vcfs generated with other programs (and, potentially, other versions of Platypus).

Parameters:

site_qual : int

Minimum (PHRED) site quality at which sites should be included

genotype_qual : int

Minimum (PHRED) genotype quality at which sites should be included

depth : int

Minimum depth at which sites should be included

chrom : str

Name of the chromosome to include

Returns:

tuple

five corresponding arrays of the same length:

(position across the chromosome, site quality, read balance, genotype quality, and depth)
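
The filtering described above can be sketched on already-parsed records. Everything here is an assumption for illustration: the record layout (a list of dicts with the keys chrom, pos, qual, gq, alt_reads, total_reads), the function name filter_sites, the treatment of thresholds as inclusive minimums, and the definition of read balance as variant-supporting reads divided by total reads. Plain lists stand in for the arrays the method returns.

```python
def filter_sites(records, site_qual, genotype_qual, depth, chrom):
    """Filter pre-parsed variant records and compute read balance.

    `records` is a hypothetical list of dicts, not a real VCF reader.
    Returns five parallel lists: position, site quality, read balance,
    genotype quality, depth.
    """
    positions, quals, balances, gquals, depths = [], [], [], [], []
    for rec in records:
        if rec["chrom"] != chrom:
            continue
        total = rec["total_reads"]
        # Skip zero-depth sites to avoid dividing by zero
        if total == 0 or total < depth:
            continue
        if rec["qual"] < site_qual or rec["gq"] < genotype_qual:
            continue
        positions.append(rec["pos"])
        quals.append(rec["qual"])
        balances.append(rec["alt_reads"] / float(total))
        gquals.append(rec["gq"])
        depths.append(total)
    return positions, quals, balances, gquals, depths
```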

plot_variants_per_chrom(chrom_list, sampleID, output_prefix, site_qual, genotype_qual, depth, MarkerSize, MarkerAlpha, bamfile_obj, variant_caller, homogenize, dataframe_out, min_count, window_size, x_scale=1000000, target_file=None, include_fixed=False)

Parses a vcf file and plots read balance in separate plots for each chromosome in the input list

Parameters:

chrom_list : list

Chromosomes to include

sampleID : str

Sample ID (for plot titles)

output_prefix : str

Full path to and prefix of desired output plots

site_qual : int

Minimum (PHRED) site quality at which sites should be included

genotype_qual : int

Minimum (PHRED) genotype quality at which sites should be included

depth : int

Minimum depth at which sites should be included

MarkerSize : float

Size of markers (matplotlib sizes) to use in the figure

MarkerAlpha : float

Transparency (matplotlib values, 0 to 1) of markers

bamfile_obj : BamFile() object

Used to get chromosome lengths only

variant_caller : str

Variant caller used to generate vcf - currently only “platypus” supported

homogenize : bool

If True, all read balance values less than 0.5 will be transformed by subtracting the value from 1. For example, the values 0.25 and 0.75 would be treated as equivalent.

dataframe_out : str

Full path of file to write pandas dataframe to. Will overwrite the file if it exists.

min_count : int

Minimum number of variants to include a window for plotting.

window_size : int or None

If int, the window size to use for sliding window analyses. If None, intervals are taken from target_file instead.

x_scale : int

Divide all x values (including Xlim) by this value. Default is 1000000 (1 Mb)

target_file : str

Path to bed file containing regions to analyze instead of windows of a fixed size. Only used if window_size is None

include_fixed : bool

If False, only plots histogram for values between 0.05 and 1.0. If True, plots histogram of all variants.

Returns:

int

0 if there are variants to analyze; 1 if there are no variants to analyze on any chromosome

xyalign.variants.read_balance_per_window(chrom, positions, readBalance, sampleID, homogenize, chr_len, window_size, target_file=None)

Calculates mean read balance per genomic window (defined by size or an external target bed file) for a given chromosome. Takes as input an array of positions and an array of read balances - the order of which must correspond exactly. In addition, the positions are expected to ALL BE ON THE SAME CHROMOSOME and be in numerically sorted order (i.e., the output of parse_platypus_VCF())

Parameters:

chrom : str

Name of the chromosome

positions : numpy array

Positions along the chromosome (same length as readBalance)

readBalance : numpy array

Read balance corresponding with the positions in the positions array

sampleID : str

Sample name or id to include in the plot title

homogenize : bool

If True, all read balance values less than 0.5 will be transformed by subtracting the value from 1. For example, the values 0.25 and 0.75 would be treated as equivalent.

chr_len : int

Length of chromosome. Ignored if target_file is provided.

window_size : int or None

If int, the window size to use for sliding window analyses. If None, intervals are taken from target_file instead.

target_file : str

Path to bed file containing regions to analyze instead of windows of a fixed size. Only used if window_size is None

Returns:

pandas dataframe

With columns: “chrom”, “start”, “stop”, “balance”, and “count”
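
To make the windowed calculation concrete, here is a minimal sketch for the fixed-size case only. It is not the module's implementation: it assumes non-overlapping, half-open windows starting at position 0, returns a list of dicts rather than a pandas dataframe, and expects read balance values that have already been homogenized if that is wanted. The function name mean_balance_per_window is hypothetical.

```python
def mean_balance_per_window(chrom, positions, read_balance, chr_len, window_size):
    """Mean read balance and variant count per fixed-size window.

    Assumes windows [start, stop) that do not overlap. Each returned dict
    has the keys chrom, start, stop, balance and count.
    """
    rows = []
    for start in range(0, chr_len, window_size):
        stop = min(start + window_size, chr_len)
        # Simple scan per window; fine for a sketch, slow for large inputs
        values = [b for p, b in zip(positions, read_balance)
                  if start <= p < stop]
        mean = sum(values) / len(values) if values else float("nan")
        rows.append({"chrom": chrom, "start": start, "stop": stop,
                     "balance": mean, "count": len(values)})
    return rows


if __name__ == "__main__":
    for row in mean_balance_per_window(
            "chrX", [100, 900, 1500], [0.75, 0.25, 0.5], 3000, 1000):
        print(row)
```

Windows without any variants get a count of 0 and a balance of NaN in this sketch, which makes them easy to drop with a minimum-count threshold afterwards.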

xyalign.variants.plot_read_balance(chrom, positions, readBalance, sampleID, output_prefix, MarkerSize, MarkerAlpha, homogenize, chrom_len, x_scale=1000000)

Plots read balance at each SNP along a chromosome

Parameters:

chrom : str

Name of the chromosome

positions : numpy array

Positions along the chromosome (same length as readBalance)

readBalance : numpy array

Read balance corresponding with the positions in the positions array

sampleID : str

Sample name or id to include in the plot title

output_prefix : str

Desired prefix (including full path) of the output files

MarkerSize : float

Size of markers (matplotlib sizes) to use in the figure

MarkerAlpha : float

Transparency (matplotlib values) of markers for the figure

homogenize : bool

If True, all read balance values less than 0.5 will be transformed by subtracting the value from 1. For example, the values 0.25 and 0.75 would be treated as equivalent.

chrom_len : int

Length of chromosome

x_scale : int

Divide all x values (including Xlim) by this value. Default is 1000000 (1 Mb)

Returns:

int

0
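
The homogenize and x_scale parameters describe two small data transformations that happen before plotting. A sketch of both, leaving out the matplotlib calls, might look like this. The helper names are hypothetical and plain lists stand in for numpy arrays.

```python
def homogenize_balance(values):
    """Fold values below 0.5 onto the upper half (0.25 becomes 0.75)."""
    return [1.0 - v if v < 0.5 else v for v in values]


def scale_for_plot(positions, chrom_len, x_scale=1000000):
    """Divide positions and the x-axis limit by x_scale."""
    scaled = [p / float(x_scale) for p in positions]
    x_max = chrom_len / float(x_scale)
    return scaled, x_max


if __name__ == "__main__":
    print(homogenize_balance([0.25, 0.75, 0.5]))
    print(scale_for_plot([500000, 2000000], 3000000))
```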

xyalign.variants.hist_read_balance(chrom, readBalance, sampleID, homogenize, output_prefix, include_fixed=False)

Plots a histogram of read balance values between 0.05 and 1.0 (non-inclusive)

Parameters:

chrom : str

Name of the chromosome

readBalance : list or numpy array

Read balance values

sampleID : str

Sample name or id to include in the plot title

homogenize : bool

If True, all read balance values less than 0.5 will be transformed by subtracting the value from 1. For example, the values 0.25 and 0.75 would be treated as equivalent.

output_prefix : str

Desired prefix (including full path) of the output files

include_fixed : bool

If False, only plots histogram for values between 0.05 and 1.0. If True, plots histogram of all variants.

Returns:

int

0 if plotting successful, 1 otherwise.
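
The effect of include_fixed on which values reach the histogram can be sketched as below. This is an illustration under the assumption that both bounds are exclusive, as the description above states; select_for_histogram is a hypothetical name.

```python
def select_for_histogram(values, include_fixed=False):
    """Pick the read balance values to histogram.

    With include_fixed=False, keep only values strictly between 0.05 and
    1.0, which drops fixed sites (read balance of exactly 1.0).
    """
    if include_fixed:
        return list(values)
    return [v for v in values if 0.05 < v < 1.0]


if __name__ == "__main__":
    balances = [0.0, 0.05, 0.3, 0.5, 1.0]
    print(select_for_histogram(balances))
    print(select_for_histogram(balances, include_fixed=True))
```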