xyalign.bam module

class xyalign.bam.BamFile(filepath, samtools='samtools', no_initial_index=False)[source]

A class for working with external bam files

Attributes

filepath (str) Full path to external bam file.
samtools (str) Full path to samtools. Default = ‘samtools’
is_indexed()[source]

Checks that bam index exists, is not empty, and is newer than bam.

Returns:

bool

True if bam index exists and is newer than bam, False otherwise.

index_bam()[source]

Indexes a bam using samtools (‘samtools index file.bam’).

Returns:

bool

True if successful.

Raises:

RuntimeError

If return code from external call is not 0.

get_chrom_length(chrom)[source]

Extract chromosome length from BAM header.

Parameters:

chrom : str

The name of the chromosome or scaffold.

Returns:

length : int

The length (integer) of the chromsome/scaffold

Raises:

RuntimeError

If chromosome name not present in bam header

chromosome_lengths()[source]
Returns:

tuple

chromosome lengths ordered by sequence order in bam header

chromosome_names()[source]
Returns:

tuple

chromosome names ordered by sequence order in bam header

chromosome_bed(output_file, chromosome_list)[source]

Takes list of chromosomes and outputs a bed file with the length of each chromosome on each line (e.g., chr1 0 247249719).

Parameters:

output_file : str

Name of (including full path to) desired output file

chromosome_list : list

Chromosome/scaffolds to include

Returns:

str

output_file

Raises:

RuntimeError

If chromosome name is not in bam header.

check_chrom_in_bam(chromosome_list)[source]

Checks to see if all chromosomes in chromosome_list are in bam file

Parameters:

chromosome_list : list

Chromosomes/scaffolds to check

Returns:

list

List of chromosomes not in bam file

sort_bam(sorted_bam, query_name=False)[source]

Sorts bam file by coordinate (query_name=False) or query name (query_name=True)

Parameters:

sorted_bam : str

Full path to (including desired name of) output bam file

query_name : bool

If True, sort by query name (read ID), else sort by coordinate

Returns:

BamFile() object

BamFile() object of new, sorted bam file

extract_regions(regions, new_bam, rg_id=None)[source]

Extracts regions from a bam file into new bam file.

Parameters:

regions : list

regions from which reads will be stripped

new_bam : str

Full path to and desired name of output bam file

rg_id : str or None

Path to text file containing read group ids to use when isolating regions. If None, all reads from regions will be extracted.

Returns:

BamFile() object

BamFile() object of new bam file (containing extracted regions)

extract_read_group(new_bam, rg_id)[source]

Extracts all reads belonging to a given RG ID from a bam file into new bam file.

Parameters:

new_bam : str

Full path to and desired name of output bam file

rg_id : str

Path to text file containing read group ids to use when isolating regions.

Returns:

BamFile() object

BamFile() object of new bam file (containing extracted regions)

strip_reads(repairsh, shufflesh, single, output_directory, output_prefix, regions, repair_xmx, compression, cleanup=True, default_rg='None')[source]

Strips reads from a bam or cram file in provided regions and outputs sorted fastqs containing reads, one set of fastq files per read group.

Parameters:

repairsh : str

Path to repair.sh (from BBmap)

shufflesh : str

Path to shuffle.sh (from BBmap)

single : bool

If true output single-end fastq, otherwise output paired-end fastqs

output_directory : str

The directory for ALL output (including temporary files)

output_prefix : str

The name (without path) to use for prefix to output fastqs

regions : list

regions from which reads will be stripped

repair_xmx : str

If “None”, repair.sh will allocate its own memory. Otherwise value will be provided in the form of -Xmx4g, where 4g is the value provided as repair_xmx

compression : int

Desired compression level (0-9) for output fastqs. If 0, fastqs will be uncompressed.

cleanup : bool

If true, will clean up temporary files.

default_rg : str

If “None”, no default read group will be created. Otherwise, default read group will be string provided. This read group will consist exclusively of an ID.

Returns:

list

A two-item list containing the path to a text file pairing read group names with associated output fastqs, and a text file containing a list of @RG lines associated with each read group

analyze_bam(chrom, duplicates, exact, window_size, target_file=None)[source]

Analyze BAM (or CRAM) file for depth and mapping quality across genomic windows.

Parameters:

chrom : str

The name of the chromosome to analyze

duplicates : bool

If True, duplicates included in analyses.

exact : bool

If True, mean depth is calculated exactly within each window. If False, an accurate (and much faster) approximation is used

window_size

If int, the window size to use for sliding window analyses, if None intervals from target_file

target_file : str

Path to bed_file containing regions to analyze instead of windows of a fixed size. Will only be engaged if window_size is None

Returns:

pandas dataframe

pandas data frame with the columns: “chrom”, “start”, “stop”, “depth”, “mapq”

chrom_stats(chrom, duplicates)[source]

Analyze BAM (or CRAM) file for depth and mapping quality across a single chromosome.

Parameters:

chrom : str

The name of the chromosome to analyze

duplicates : bool

If True, duplicates included in analyses.

Returns:

tuple

(mean_depth, mean_mapq)

chrom_counts()[source]

Get read counts per chrom from a bamfile

platypus_caller(platypus_path, log_path, ref, chroms, cpus, output_file, regions_file=None)[source]

Uses platypus to make variant calls on provided bam file

Parameters:

platypus_path : str

Path to platypus

log_path : str

Path to and name of desired log file for platypus

ref : str

Path to reference sequence

chroms : list

Chromosomes to call variants on, e.g., [“chrX”, “chrY”, “chr19”]

cpus : int

Number of threads/cores to use

output_file : path

Path to and name of the output vcf

regions_file : {str, None}

If not None, must be path to bed file containing regions to call variants in. If None, calls in call regions of provided chromosomes. Default = None.

Returns:

int

Exit code of the platypus call

xyalign.bam.switch_sex_chromosomes_sambamba(samtools_path, sambamba_path, bam_orig, bam_new, sex_chroms, output_directory, output_prefix, threads, pg_header_dict, cram=False)[source]

Removes sex chromosomes from original bam and merges in remmapped sex chromosomes, while retaining the original bam header (and adding new @PG line)

Parameters:

samtools_path : str

The path to samtools

sambamba_path :

The path to sambamba

bam_orig : str

The path to the original full bam file

bam_new : str

The path to the bam file containing the remapped sex chromosomes

sex_chroms : list

Sex chromosomes (to be removed from bam_orig)

output_directory : str

The path to directory where all files (inc. temp) will be output

output_prefix : str

The name (without path) to use as prefix for all files

threads : int

The number of threads/cpus to use

pg_header_dict : dict

dictionary with information to be included in the new @PG line
  • must contain:
    Key = ‘CL’, value = list of command line values Key = ‘ID’, value = string of program ID
  • optional:
    Key = ‘VN’, value = string of program version

cram : bool

If True, will treat input as cram files and output cram files. Otherwise, will treat input as bam. Defaule is False. True is currently unsupported.

Returns:

str

Bam or cram file path with new sex chromosomes, but all others intact.

Raises:

RuntimeError

If cram is not False.

xyalign.bam.samtools_merge(samtools_path, bam_list, output_prefix, threads)[source]

Merges bam files using samtools.

Parameters:

samtools_path : str

The path to samtools

bam_list : list

Bam files to be merged. Merging order will match order of this list.

output_prefix : str

Returns:

str

path to merged bam