xyalign.xyalign module

xyalign.xyalign.parse_args()

Parse command-line arguments

Returns:

Parser argument namespace

xyalign.xyalign.ref_prep(ref_obj, ref_mask, ref_dir, xx, xy, y_chromosome, samtools_path, bwa_path, bwa_index)

Reference prep part of XYalign pipeline.

  • Creates two reference fasta files. Both will include masks provided with ref_mask. One will additionally have the entire Y chromosome hard masked.

  • Indexes (.fai, .dict, and optionally bwa indices) both new references
Parameters:

ref_obj : RefFasta() object

A reftools.RefFasta() object of a fasta reference file to be processed

ref_mask : list or None

List of files to use to hard-mask references. None will ignore masking.

ref_dir : str

Path to output directory

xx : str

Path to XX output reference

xy : str

Path to XY output reference

y_chromosome : str

Name of Y chromosome in fasta

samtools_path : str

The path to samtools (e.g., “samtools” if in path)

bwa_path : str

The path to bwa (e.g., “bwa” if in path)

bwa_index : bool

If True, create bwa indices. Don’t if False.

Returns:

tuple

Paths to two masked references (y_masked, y_unmasked)
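
The hard-masking step described above can be illustrated with a minimal sketch. This is not XYalign's implementation: the function name hard_mask, the in-memory dictionary of sequences, and the example chromosome names are assumptions made for illustration. It replaces bases with N inside BED-style (0-based, half-open) intervals and can optionally mask one chromosome entirely, as is done for the Y chromosome in the XX reference.

```python
def hard_mask(sequences, intervals=None, mask_whole=None):
    """Return a copy of `sequences` with masked bases replaced by 'N'.

    sequences  : dict mapping chromosome name -> sequence string
    intervals  : iterable of (chrom, start, end), 0-based half-open (BED style)
    mask_whole : chromosome name to mask entirely (e.g. the Y chromosome)
    """
    masked = {name: list(seq) for name, seq in sequences.items()}
    for chrom, start, end in intervals or []:
        if chrom not in masked:
            continue  # ignore intervals on scaffolds absent from the fasta
        seq = masked[chrom]
        start = max(start, 0)
        end = min(end, len(seq))
        if start < end:
            seq[start:end] = "N" * (end - start)
    if mask_whole is not None and mask_whole in masked:
        masked[mask_whole] = ["N"] * len(masked[mask_whole])
    return {name: "".join(seq) for name, seq in masked.items()}


# Hypothetical toy reference and mask, for illustration only
ref = {"chrX": "ACGTACGTAC", "chrY": "GGGTTTCCCA"}
mask = [("chrX", 2, 5)]

xy_ref = hard_mask(ref, mask)                     # masks from ref_mask only
xx_ref = hard_mask(ref, mask, mask_whole="chrY")  # Y additionally hard masked
```

Real references are far too large to handle as Python lists of characters; the sketch only shows which bases end up masked in each of the two outputs.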

xyalign.xyalign.chrom_stats(bam_obj_list, chrom_list, use_counts)

Runs chrom stats module.

Calculates mean depth and mapq across entire scaffolds for a list of bam files

Returns:

tuple

Tuple containing two dictionaries with results for depth and mapq, respectively. Or, if use_counts is True, returns a tuple containing the count dictionary and None.

xyalign.xyalign.bam_analysis(input_bam_obj, platypus_calling, platypus_path, vcf_log, ref_obj, input_chroms, cpus, out_vcf, no_variant_plots, window_size, target_bed, sample_id, readbalance_prefix, variant_site_quality, variant_genotype_quality, variant_depth, marker_size, marker_transparency, homogenize_read_balance, data_frame_readbalance, min_variant_count, no_bam_analysis, ignore_duplicates, exact_depth, whole_genome_threshold, mapq_cutoff, min_depth_filter, max_depth_filter, depth_mapq_prefix, bam_data_frame, output_bed_high, output_bed_low, use_bed_for_platypus, coordinate_scale, fixed)

Runs the bam analysis part of the XYalign pipeline on a bam file.

  • (Optionally) calls variants using Platypus
  • (Optionally) parses and filters Platypus vcf, and plots read balance
  • (Optionally) calculates window-based metrics from the bam file: depth and mapq
  • (Optionally) plots window-based metrics
  • Outputs two bed files: high quality windows, and low quality windows.
Parameters:

input_bam_obj : bam.BamFile() object

platypus_calling : bool

If True, will call and analyze variants

platypus_path : str

Command to call platypus (e.g., “platypus”)

vcf_log : str

Path to file for platypus log

ref_obj : reftools.RefFasta() object

input_chroms : list

Chromosomes to analyze

cpus : int

Number of threads/cpus

out_vcf : str

Output vcf path/name

no_variant_plots : bool

If True, will not plot read balance

window_size : int or None

Window size for sliding window analyses (both bam and vcf). If None, will use regions in target_bed

target_bed : str or None

Path to bed file containing targets to use in sliding window analyses

sample_id : str

readbalance_prefix : str

Prefix, including full path, to use for output files for readbalance analyses

variant_site_quality : int

Minimum site quality (PHRED) for a site to be included in readbalance analyses

variant_genotype_quality : int

Minimum genotype quality for a site to be included in read balance analyses

variant_depth : int

Minimum depth for a site to be included in read balance analyses

marker_size : float

Marker size for plotting genome scatter plots

marker_transparency: float

Value to use for marker transparency in genome scatter plots

homogenize_read_balance : bool

If True, will subtract values less than 0.5 from 1, so that, e.g., 0.25 and 0.75 are treated equivalently

data_frame_readbalance: str

Path of output file for full read balance dataframe

min_variant_count : int

Minimum number of variants in a given window for the window to be plotted in window-based read balance analyses

no_bam_analysis : bool

If True, no bam analyses will take place

ignore_duplicates : bool

If True, duplicates are excluded from bam analyses

exact_depth : bool

If True, exact depth is calculated in each window. Else, a much faster approximation will be used

whole_genome_threshold : bool

If True, values for depth filters will be calculated using mean from across all chromosomes included in analyses. Else, mean will be taken per chromosome

min_depth_filter : float

Minimum depth threshold for a window to be considered high. Calculated as mean depth * min_depth_filter.

max_depth_filter : float

Maximum depth threshold for a window to be considered high. Calculated as mean depth * max_depth_filter.

depth_mapq_prefix : str

Prefix, including full path, to be used for files output from depth and mapq analyses

bam_data_frame : str

Full path to output file for dataframe containing all data from bam analyses

output_bed_high : str

Full path to output bed containing high quality (i.e., passing filters) windows

output_bed_low : str

Full path to output bed containing low quality (i.e., failing filters) windows

use_bed_for_platypus : bool

If True, use output_bed_high as regions for Platypus calling

coordinate_scale : int

Divide all coordinates by this value for plotting. In most cases, 1000000 will be ideal for eukaryotic genomes.

fixed : bool

If False, only plots histogram for values between 0.05 and 1.0 (non-inclusive). If True, plots histogram of all variants.

Returns:

tuple

(list of pandas dataframes with passing windows, list of pandas dataframes with failing windows)
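
The read balance folding (homogenize_read_balance) and the depth filters (min_depth_filter, max_depth_filter, whole_genome_threshold) described in the parameters above can be sketched as follows. This is a simplified illustration, not XYalign's code: the function names, the plain-list inputs, the inclusive bounds, and the assumption that the upper bound is mean depth * max_depth_filter are all choices made for the example.

```python
from statistics import mean


def homogenize(read_balance):
    """Fold read balance values below 0.5 onto their complement (0.25 -> 0.75)."""
    return [1.0 - v if v < 0.5 else v for v in read_balance]


def depth_thresholds(window_depths, min_depth_filter, max_depth_filter,
                     whole_genome_threshold):
    """Return {chrom: (low, high)} depth bounds for high quality windows.

    window_depths : dict mapping chromosome -> list of per-window depths
    """
    if whole_genome_threshold:
        # One mean across all windows of all chromosomes
        genome_mean = mean(d for depths in window_depths.values() for d in depths)
        means = {chrom: genome_mean for chrom in window_depths}
    else:
        # A separate mean for every chromosome
        means = {chrom: mean(depths) for chrom, depths in window_depths.items()}
    return {chrom: (m * min_depth_filter, m * max_depth_filter)
            for chrom, m in means.items()}


def split_windows(depths, low, high):
    """Partition depths into (passing, failing); bounds assumed inclusive."""
    passing = [d for d in depths if low <= d <= high]
    failing = [d for d in depths if not low <= d <= high]
    return passing, failing


# Hypothetical per-window depths, for illustration only
depths = {"chr1": [10, 20, 30], "chrX": [5, 5, 5]}

genome_wide = depth_thresholds(depths, 0.5, 2.0, whole_genome_threshold=True)
per_chrom = depth_thresholds(depths, 0.5, 2.0, whole_genome_threshold=False)
passing, failing = split_windows(depths["chr1"], *genome_wide["chr1"])
```

With the genome-wide mean, every chromosome shares the same bounds; with per-chromosome means, a low-depth chromosome such as chrX in this toy data gets its own, narrower bounds.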

xyalign.xyalign.ploidy_analysis(passing_df, failing_df, no_perm_test, no_ks_test, no_bootstrap, input_chroms, x_chromosome, y_chromosome, results_dir, num_permutations, num_bootstraps, sample_id)

Runs the ploidy analysis part of XYalign.

  • Runs permutation test to systematically compare means between every possible pair of chromosomes
  • Runs K-S two sample test to systematically compare distributions between every possible pair of chromosomes

  • Bootstraps the mean depth ratio for every possible pair of chromosomes
Parameters:

passing_df : list

Passing pandas dataframes, one per chromosome

failing_df : list

Failing pandas dataframes, one per chromosome

no_perm_test : bool

If False, permutation test will be run

no_ks_test : bool

If False, KS test will be run

no_bootstrap : bool

If False, bootstrap analysis will be run

input_chroms : list

Chromosomes/scaffolds to analyze

x_chromosome : list

X-linked scaffolds

y_chromosome : list

Y-linked scaffolds

results_dir : str

Full path to directory to output results

num_permutations : int

Number of permutations

num_bootstraps : int

Number of bootstrap replicates

sample_id : str

Returns:

dictionary

Results for each test. Keys: perm, ks, boot.
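
For readers unfamiliar with the tests named above, the following is a minimal standard-library sketch of a permutation test on the difference in means and a percentile bootstrap of the mean depth ratio for one pair of chromosomes. It is an illustration under stated assumptions, not XYalign's implementation: the function names, the two-sided test statistic, the percentile interval, and the toy depth values are all hypothetical. The K-S test is omitted.

```python
import random
from statistics import mean


def permutation_test(a, b, num_permutations, rng):
    """Two-sided permutation test on the difference in means of a and b.

    Returns the fraction of shuffles whose absolute mean difference is at
    least as large as the observed one (an empirical p-value).
    """
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / num_permutations


def bootstrap_ratio(a, b, num_bootstraps, rng, alpha=0.05):
    """Percentile bootstrap interval for mean(a) / mean(b)."""
    ratios = []
    for _ in range(num_bootstraps):
        resampled_a = rng.choices(a, k=len(a))  # sample with replacement
        resampled_b = rng.choices(b, k=len(b))
        ratios.append(mean(resampled_a) / mean(resampled_b))
    ratios.sort()
    lower = ratios[int((alpha / 2) * num_bootstraps)]
    upper = ratios[int((1 - alpha / 2) * num_bootstraps) - 1]
    return mean(a) / mean(b), lower, upper


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical per-window depths for an X scaffold and an autosome
chr_x = [14.8, 15.2, 15.1, 14.9, 15.0, 15.3, 14.7, 15.0]
chr_19 = [30.1, 29.8, 30.3, 29.9, 30.0, 30.2, 29.7, 30.0]

p_value = permutation_test(chr_x, chr_19, 1000, rng)
ratio, low, high = bootstrap_ratio(chr_x, chr_19, 1000, rng)
```

In this toy data the X scaffold has about half the autosomal depth, which is the pattern expected for a single X copy; the bootstrap interval indicates how tightly that ratio is estimated.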

xyalign.xyalign.remapping(input_bam_obj, y_pres, masked_references, samtools_path, sambamba_path, repairsh_path, shufflesh_path, bwa_path, bwa_flags, single_end, bam_dir, fastq_dir, sample_id, x_chromosome, y_chromosome, cpus, xmx, fastq_compression, cleanup, read_group_id)

Runs remapping steps of XYalign.

  • Strips, sorts, and re-pairs reads from the sex chromosomes (collecting read group information)
  • Maps (with sorting) reads (with read group information) to the appropriate reference based on the presence or absence of the Y chromosome
  • Merges bam files (if more than one read group)
Parameters:

input_bam_obj : bam.BamFile() object

y_pres : bool

True if Y chromosome present in individual

masked_references : tuple

Masked reference objects (xx, xy)

samtools_path : str

Path/command to call samtools

sambamba_path : str

Path/command to call sambamba

repairsh_path : str

Path/command to call repair.sh

shufflesh_path : str

Path/command to call shuffle.sh

bwa_path : str

Path/command to call bwa

bwa_flags : str

Flags to use for bwa mapping

single_end : bool

If True, reads are treated as single-end

bam_dir : str

Path to output directory for bam files

fastq_dir : str

Path to output directory for fastq files

sample_id : str

x_chromosome : list

X-linked scaffolds

y_chromosome : list

Y-linked scaffolds

cpus : int

Number of threads/cpus

xmx : str

Value to be combined with -Xmx for java programs (e.g., 4g would result in -Xmx4g)

fastq_compression : int

Compression level for fastq files. 0 leaves fastq files uncompressed. Otherwise values should be between 1 and 9 (inclusive), with larger values indicating more compression

cleanup : bool

If True, will delete temporary files

read_group_id : str

ID to use to add read group information

Returns:

str

Path to bam containing remapped sex chromosomes
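
Two of the smaller decisions in this step, choosing the reference from y_pres and building the Java heap flag from xmx, can be sketched as below. The function names are hypothetical, and the sketch assumes masked_references is ordered (xx, xy) as documented above, with the XX reference being the one that has the Y chromosome hard masked.

```python
def choose_reference(y_pres, masked_references):
    """Pick the reference to remap against.

    masked_references is assumed to be ordered (xx, xy): the XX reference
    has the Y chromosome hard masked, the XY reference does not.
    """
    xx_ref, xy_ref = masked_references
    return xy_ref if y_pres else xx_ref


def java_heap_flag(xmx):
    """Combine the xmx value with -Xmx, e.g. '4g' -> '-Xmx4g'."""
    return "-Xmx{}".format(xmx)


# Hypothetical reference paths, for illustration only
references = ("ref/xx.masked.fa", "ref/xy.masked.fa")

ref_for_xy_sample = choose_reference(True, references)
ref_for_xx_sample = choose_reference(False, references)
heap_flag = java_heap_flag("4g")
```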

xyalign.xyalign.swap_sex_chroms(input_bam_obj, new_bam_obj, samtools_path, sambamba_path, x_chromosome, y_chromosome, bam_dir, sample_id, cpus, xyalign_params)

Replaces the sex chromosomes in the original bam file with those from the new bam file.

Parameters:

input_bam_obj : bam.BamFile() object

Original input bam file object

new_bam_obj : bam.BamFile() object

Bam file object containing newly mapped sex chromosomes (to insert)

samtools_path : str

Path/command to call samtools

sambamba_path : str

Path/command to call sambamba

x_chromosome : list

X-linked scaffolds

y_chromosome : str

Y-linked scaffolds

bam_dir : str

Path to bam output directory

sample_id : str

cpus : int

Number of threads/cpus

xyalign_params : dict

Dictionary of xyalign_params to add to bam header

Returns:

str

Path to new bam file containing original autosomes and new sex chromosomes

xyalign.xyalign.main()