xyalign.xyalign module

xyalign.xyalign.parse_args()

Parse command-line arguments

Returns:	Parser argument namespace

xyalign.xyalign.ref_prep(ref_obj, ref_mask, ref_dir, xx, xy, y_chromosome, samtools_path, bwa_path, bwa_index)

Reference prep part of XYalign pipeline.

Creates two reference fasta files. Both will include masks provided with ref_mask. One will additionally have the entire Y chromosome hard masked.

Indexes both new references (.fai, .dict, and optionally bwa indices).

Parameters:

ref_obj : RefFasta() object

A reftools.RefFasta() object of a fasta reference file to be processed

ref_mask : list or None

List of files to use to hard-mask references. None will ignore masking.

ref_dir : str

Path to output directory

xx : str

Path to XX output reference

xy : str

Path to XY output reference

y_chromosome : str

Name of Y chromosome in fasta

samtools_path : str

The path to samtools (e.g., “samtools” if it is on the PATH)

bwa_path : str

The path to bwa (e.g., “bwa” if it is on the PATH)

bwa_index : bool

If True, create bwa indices. Don’t if False.

Returns:

tuple

Paths to two masked references (y_masked, y_unmasked)
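
The two-reference idea described above can be illustrated with a minimal sketch. It is not the XYalign implementation: hard_mask and mask_reference are hypothetical helpers, sequences are held in a plain dict rather than a reftools.RefFasta() object, and mask intervals are assumed to be 0-based and half-open, as in bed files.

```python
def hard_mask(sequence, intervals):
    """Replace every 0-based, half-open (start, end) interval with N."""
    masked = list(sequence)
    for start, end in intervals:
        # Clamp the interval to the sequence bounds before masking.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


def mask_reference(records, mask_intervals, y_chromosome=None):
    """Apply shared masks; optionally hard mask one whole chromosome.

    records: dict of sequence name -> sequence (hypothetical stand-in
        for a fasta file).
    mask_intervals: dict of sequence name -> list of (start, end).
    y_chromosome: name of the sequence to mask entirely, or None.
    """
    out = {}
    for name, seq in records.items():
        seq = hard_mask(seq, mask_intervals.get(name, []))
        if name == y_chromosome:
            seq = "N" * len(seq)
        out[name] = seq
    return out


records = {"chrX": "ACGTACGT", "chrY": "GGCCTTAA"}
masks = {"chrX": [(2, 4)], "chrY": [(0, 2)]}

y_unmasked = mask_reference(records, masks)
y_masked = mask_reference(records, masks, y_chromosome="chrY")
```

Both outputs carry the shared masks; only the second replaces the whole Y sequence with N, which mirrors the (y_masked, y_unmasked) pair returned by ref_prep.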

xyalign.xyalign.chrom_stats(bam_obj_list, chrom_list, use_counts)

Runs chrom stats module.

Calculates mean depth and mapq across entire scaffolds for a list of bam files

Returns:

tuple

Tuple containing two dictionaries with results for depth and mapq, respectively. Or, if use_counts is True, returns a tuple containing the count dictionary and None.

xyalign.xyalign.bam_analysis(input_bam_obj, platypus_calling, platypus_path, vcf_log, ref_obj, input_chroms, cpus, out_vcf, no_variant_plots, window_size, target_bed, sample_id, readbalance_prefix, variant_site_quality, variant_genotype_quality, variant_depth, marker_size, marker_transparency, homogenize_read_balance, data_frame_readbalance, min_variant_count, no_bam_analysis, ignore_duplicates, exact_depth, whole_genome_threshold, mapq_cutoff, min_depth_filter, max_depth_filter, depth_mapq_prefix, bam_data_frame, output_bed_high, output_bed_low, use_bed_for_platypus, coordinate_scale, fixed)

Runs the bam analysis part of the XYalign pipeline on a bam file.

(Optionally) calls variants using Platypus
(Optionally) parses and filters Platypus vcf, and plots read balance
(Optionally) calculates window-based metrics from the bam file: depth and mapq
(Optionally) plots window-based metrics
Outputs two bed files: high quality windows, and low quality windows.

Parameters:

input_bam_obj : bam.BamFile() object

platypus_calling : bool

If True, will call and analyze variants

platypus_path : str

Command to call platypus (e.g., “platypus”)

vcf_log : str

Path to file for platypus log

ref_obj : reftools.RefFasta() object

input_chroms : list

Chromosomes to analyze

cpus : int

Number of threads/cpus

out_vcf : str

Output vcf path/name

no_variant_plots : bool

If True, will not plot read balance

window_size : int or None

Window size for sliding window analyses (both bam and vcf). If None, will use regions in target_bed

target_bed : str or None

Path to bed file containing targets to use in sliding window analyses

sample_id : str

readbalance_prefix : str

Prefix, including full path, to use for output files for readbalance analyses

variant_site_quality : int

Minimum site quality (PHRED) for a site to be included in readbalance analyses

variant_genotype_quality : int

Minimum genotype quality for a site to be included in read balance analyses

variant_depth : int

Minimum depth for a site to be included in read balance analyses

marker_size : float

Marker size for plotting genome scatter plots

marker_transparency : float

Value to use for marker transparency in genome scatter plots

homogenize_read_balance : bool

If True, values less than 0.5 are replaced with 1 minus the value, so 0.25 and 0.75 are treated equivalently

data_frame_readbalance : str

Path of output file for full read balance dataframe

min_variant_count : int

Minimum number of variants in a given window for the window to be plotted in window-based read balance analyses

no_bam_analysis : bool

If True, no bam analyses will take place

ignore_duplicates : bool

If True, duplicates are excluded from bam analyses

exact_depth : bool

If True, exact depth is calculated in each window. Else, a much faster approximation will be used

whole_genome_threshold : bool

If True, values for depth filters will be calculated using mean from across all chromosomes included in analyses. Else, mean will be taken per chromosome

min_depth_filter : float

Minimum depth threshold for a window to be considered high quality. Calculated as mean depth * min_depth_filter.

max_depth_filter : float

Maximum depth threshold for a window to be considered high quality. Calculated as mean depth * max_depth_filter.

depth_mapq_prefix : str

Prefix, including full path, to be used for files output from depth and mapq analyses

bam_data_frame : str

Full path to output file for dataframe containing all data from bam analyses

output_bed_high : str

Full path to output bed containing high quality (i.e., passing filters) windows

output_bed_low : str

Full path to output bed containing low quality (i.e., failing filters) windows

use_bed_for_platypus : bool

If True, use output_bed_high as regions for Platypus calling

coordinate_scale : int

Divide all coordinates by this value for plotting. In most cases, 1000000 will be ideal for eukaryotic genomes.

fixed : bool

If False, only plots histogram for values between 0.05 and 1.0 (non-inclusive). If True, plots histogram of all variants.

Returns:

tuple

(list of pandas dataframes with passing windows, list of pandas dataframes with failing windows)
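
Two of the parameters above, homogenize_read_balance and the min_depth_filter/max_depth_filter pair, describe simple arithmetic that a short sketch may make concrete. The helper names below are hypothetical and this is not XYalign's code. It assumes a per-chromosome mean (whole_genome_threshold False), inclusive bounds, and an upper bound that scales by max_depth_filter. It also ignores the mapq cutoff that is part of the real filtering.

```python
def homogenize(read_balance):
    """Fold read balance values below 0.5 onto the 0.5 to 1.0 range."""
    return 1.0 - read_balance if read_balance < 0.5 else read_balance


def depth_bounds(mean_depth, min_depth_filter, max_depth_filter):
    """Return (lower, upper) depth thresholds as multiples of the mean."""
    return mean_depth * min_depth_filter, mean_depth * max_depth_filter


def split_windows(depths, min_depth_filter, max_depth_filter):
    """Split window depths into (passing, failing) lists.

    Assumption: a window passes when its depth lies within the bounds,
    inclusive. The inclusive comparison is illustrative only.
    """
    mean_depth = sum(depths) / len(depths)
    lower, upper = depth_bounds(mean_depth, min_depth_filter, max_depth_filter)
    passing = [d for d in depths if lower <= d <= upper]
    failing = [d for d in depths if not lower <= d <= upper]
    return passing, failing


# Mean depth is 40, so the bounds are 20 and 80.
passing, failing = split_windows([10, 20, 30, 100], 0.5, 2.0)
```

With these inputs the windows at depth 20 and 30 pass, while 10 and 100 fail, matching the high and low quality split that the two output bed files represent.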

xyalign.xyalign.ploidy_analysis(passing_df, failing_df, no_perm_test, no_ks_test, no_bootstrap, input_chroms, x_chromosome, y_chromosome, results_dir, num_permutations, num_bootstraps, sample_id)

Runs the ploidy analysis part of XYalign.

Runs permutation test to systematically compare means between every possible pair of chromosomes

Runs K-S two sample test to systematically compare distributions between every possible pair of chromosomes

Bootstraps the mean depth ratio for every possible pair of chromosomes

Parameters:

passing_df : list

Passing pandas dataframes, one per chromosome

failing_df : list

Failing pandas dataframes, one per chromosome

no_perm_test : bool

If False, permutation test will be run

no_ks_test : bool

If False, KS test will be run

no_bootstrap : bool

If False, bootstrap analysis will be run

input_chroms : list

Chromosomes/scaffolds to analyze

x_chromosome : list

X-linked scaffolds

y_chromosome : list

Y-linked scaffolds

results_dir : str

Full path to directory to output results

num_permutations : int

Number of permutations

num_bootstraps : int

Number of bootstrap replicates

sample_id : str

Returns:

dictionary

Results for each test. Keys: perm, ks, boot.
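
For readers unfamiliar with the tests named above, the following is a minimal standard-library sketch of a two-sided permutation test on the difference of means and a percentile bootstrap of the mean depth ratio for one pair of chromosomes. It is an illustration under stated assumptions, not XYalign's implementation: the function names, the absolute-difference statistic, and the 95% percentile interval are all choices made here. The K-S test is omitted because it would normally come from a statistics library.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(a, b, num_permutations, rng):
    """Estimate a two-sided p-value for a difference in means."""
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(num_permutations):
        # Shuffle the pooled depths and re-split at the original sizes.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / num_permutations


def bootstrap_mean_ratio(a, b, num_bootstraps, rng):
    """Return (ratio, lower, upper) using a 95% percentile interval."""
    ratios = []
    for _ in range(num_bootstraps):
        # Resample each chromosome's windows with replacement.
        resample_a = [rng.choice(a) for _ in a]
        resample_b = [rng.choice(b) for _ in b]
        ratios.append(mean(resample_a) / mean(resample_b))
    ratios.sort()
    lower = ratios[int(0.025 * num_bootstraps)]
    upper = ratios[min(int(0.975 * num_bootstraps), num_bootstraps - 1)]
    return mean(a) / mean(b), lower, upper


rng = random.Random(42)  # fixed seed so the sketch is repeatable
chrom_a = [20.0, 21.0, 19.0, 22.0, 20.0, 21.0]  # made-up window depths
chrom_b = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0]

p_value = permutation_test(chrom_a, chrom_b, 1000, rng)
ratio, lower, upper = bootstrap_mean_ratio(chrom_a, chrom_b, 1000, rng)
```

In the pipeline these comparisons are repeated for every pair of chromosomes, with num_permutations and num_bootstraps controlling the number of replicates.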

xyalign.xyalign.remapping(input_bam_obj, y_pres, masked_references, samtools_path, sambamba_path, repairsh_path, shufflesh_path, bwa_path, bwa_flags, single_end, bam_dir, fastq_dir, sample_id, x_chromosome, y_chromosome, cpus, xmx, fastq_compression, cleanup, read_group_id)

Runs remapping steps of XYalign.

Strips, sorts, and re-pairs reads from the sex chromosomes (collecting read group information)

Maps (with sorting) reads (with read group information) to the appropriate reference based on presence (or not) of a Y chromosome

Merges bam files (if more than one read group)

Parameters:

input_bam_obj : bam.BamFile() object

y_pres : bool

True if Y chromosome present in individual

masked_references : tuple

Masked reference objects (xx, xy)

samtools_path : str

Path/command to call samtools

sambamba_path : str

Path/command to call sambamba

repairsh_path : str

Path/command to call repair.sh

shufflesh_path : str

Path/command to call shuffle.sh

bwa_path : str

Path/command to call bwa

bwa_flags : str

Flags to use for bwa mapping

single_end : bool

If True, reads treated as single end

bam_dir : str

Path to output directory for bam files

fastq_dir : str

Path to output directory for fastq files

sample_id : str

x_chromosome : list

X-linked scaffolds

y_chromosome : list

Y-linked scaffolds

cpus : int

Number of threads/cpus

xmx : str

Value to be combined with -Xmx for java programs (e.g., 4g would result in -Xmx4g)

fastq_compression : int

Compression level for fastq files. 0 leaves fastq files uncompressed. Otherwise values should be between 1 and 9 (inclusive), with larger values indicating more compression

cleanup : bool

If True, will delete temporary files

read_group_id : str

ID to use to add read group information

Returns:

str

Path to bam containing remapped sex chromosomes
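
Two small decisions described by the parameters above can be sketched as follows. The helper names are hypothetical and the logic is inferred only from the parameter descriptions: masked_references is assumed to be ordered (xx, xy), the xy reference is assumed to be used when y_pres is True, and xmx is assumed to be appended directly to -Xmx.

```python
def choose_reference(y_pres, masked_references):
    """Pick the reference to map against from an (xx, xy) tuple."""
    xx_ref, xy_ref = masked_references
    # Assumption: the XY reference is used when a Y chromosome is present.
    return xy_ref if y_pres else xx_ref


def java_heap_flag(xmx):
    """Build the java maximum heap flag from a value such as '4g'."""
    return "-Xmx{}".format(xmx)


# Placeholder file names, used only for illustration.
references = ("xx.masked.fa", "xy.masked.fa")
reference_for_xy_sample = choose_reference(True, references)
reference_for_xx_sample = choose_reference(False, references)
heap_flag = java_heap_flag("4g")
```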

xyalign.xyalign.swap_sex_chroms(input_bam_obj, new_bam_obj, samtools_path, sambamba_path, x_chromosome, y_chromosome, bam_dir, sample_id, cpus, xyalign_params)

Replaces the sex chromosomes in the original bam file with those from the new bam file (new_bam_obj)

Parameters:

input_bam_obj : bam.BamFile() object

Original input bam file object

new_bam_obj : bam.BamFile() object

Bam file object containing newly mapped sex chromosomes (to insert)

samtools_path : str

Path/command to call samtools

sambamba_path : str

Path/command to call sambamba

x_chromosome : list

X-linked scaffolds

y_chromosome : list

Y-linked scaffolds

bam_dir : str

Path to bam output directory

sample_id : str

cpus : int

Number of threads/cpus

xyalign_params : dict

Dictionary of xyalign_params to add to bam header

Returns:

str

Path to new bam file containing original autosomes and new sex chromosomes

xyalign.xyalign.main()