xyalign.bam module¶
-
class
xyalign.bam.
BamFile
(filepath, samtools='samtools', no_initial_index=False)[source]¶ A class for working with external bam files
Attributes
filepath (str) Full path to external bam file. samtools (str) Full path to samtools. Default = ‘samtools’ -
is_indexed
()[source]¶ Checks that bam index exists, is not empty, and is newer than bam.
Returns: bool
True if bam index exists and is newer than bam, False otherwise.
-
index_bam
()[source]¶ Indexes a bam using samtools (‘samtools index file.bam’).
Returns: bool
True if successful.
Raises: RuntimeError
If return code from external call is not 0.
-
get_chrom_length
(chrom)[source]¶ Extract chromosome length from BAM header.
Parameters: chrom : str
The name of the chromosome or scaffold.
Returns: length : int
The length (integer) of the chromsome/scaffold
Raises: RuntimeError
If chromosome name not present in bam header
-
chromosome_lengths
()[source]¶ Returns: tuple
chromosome lengths ordered by sequence order in bam header
-
chromosome_bed
(output_file, chromosome_list)[source]¶ Takes list of chromosomes and outputs a bed file with the length of each chromosome on each line (e.g., chr1 0 247249719).
Parameters: output_file : str
Name of (including full path to) desired output file
chromosome_list : list
Chromosome/scaffolds to include
Returns: str
output_file
Raises: RuntimeError
If chromosome name is not in bam header.
-
check_chrom_in_bam
(chromosome_list)[source]¶ Checks to see if all chromosomes in chromosome_list are in bam file
Parameters: chromosome_list : list
Chromosomes/scaffolds to check
Returns: list
List of chromosomes not in bam file
-
sort_bam
(sorted_bam, query_name=False)[source]¶ Sorts bam file by coordinate (query_name=False) or query name (query_name=True)
Parameters: sorted_bam : str
Full path to (including desired name of) output bam file
query_name : bool
If True, sort by query name (read ID), else sort by coordinate
Returns: BamFile() object
BamFile() object of new, sorted bam file
-
extract_regions
(regions, new_bam, rg_id=None)[source]¶ Extracts regions from a bam file into new bam file.
Parameters: regions : list
regions from which reads will be stripped
new_bam : str
Full path to and desired name of output bam file
rg_id : str or None
Path to text file containing read group ids to use when isolating regions. If None, all reads from regions will be extracted.
Returns: BamFile() object
BamFile() object of new bam file (containing extracted regions)
-
extract_read_group
(new_bam, rg_id)[source]¶ Extracts all reads belonging to a given RG ID from a bam file into new bam file.
Parameters: new_bam : str
Full path to and desired name of output bam file
rg_id : str
Path to text file containing read group ids to use when isolating regions.
Returns: BamFile() object
BamFile() object of new bam file (containing extracted regions)
-
strip_reads
(repairsh, shufflesh, single, output_directory, output_prefix, regions, repair_xmx, compression, cleanup=True, default_rg='None')[source]¶ Strips reads from a bam or cram file in provided regions and outputs sorted fastqs containing reads, one set of fastq files per read group.
Parameters: repairsh : str
Path to repair.sh (from BBmap)
shufflesh : str
Path to shuffle.sh (from BBmap)
single : bool
If true output single-end fastq, otherwise output paired-end fastqs
output_directory : str
The directory for ALL output (including temporary files)
output_prefix : str
The name (without path) to use for prefix to output fastqs
regions : list
regions from which reads will be stripped
repair_xmx : str
If “None”, repair.sh will allocate its own memory. Otherwise value will be provided in the form of -Xmx4g, where 4g is the value provided as repair_xmx
compression : int
Desired compression level (0-9) for output fastqs. If 0, fastqs will be uncompressed.
cleanup : bool
If true, will clean up temporary files.
default_rg : str
If “None”, no default read group will be created. Otherwise, default read group will be string provided. This read group will consist exclusively of an ID.
Returns: list
A two-item list containing the path to a text file pairing read group names with associated output fastqs, and a text file containing a list of @RG lines associated with each read group
-
analyze_bam
(chrom, duplicates, exact, window_size, target_file=None)[source]¶ Analyze BAM (or CRAM) file for depth and mapping quality across genomic windows.
Parameters: chrom : str
The name of the chromosome to analyze
duplicates : bool
If True, duplicates included in analyses.
exact : bool
If True, mean depth is calculated exactly within each window. If False, an accurate (and much faster) approximation is used
window_size
If int, the window size to use for sliding window analyses, if None intervals from target_file
target_file : str
Path to bed_file containing regions to analyze instead of windows of a fixed size. Will only be engaged if window_size is None
Returns: pandas dataframe
pandas data frame with the columns: “chrom”, “start”, “stop”, “depth”, “mapq”
-
chrom_stats
(chrom, duplicates)[source]¶ Analyze BAM (or CRAM) file for depth and mapping quality across a single chromosome.
Parameters: chrom : str
The name of the chromosome to analyze
duplicates : bool
If True, duplicates included in analyses.
Returns: tuple
(mean_depth, mean_mapq)
-
platypus_caller
(platypus_path, log_path, ref, chroms, cpus, output_file, regions_file=None)[source]¶ Uses platypus to make variant calls on provided bam file
Parameters: platypus_path : str
Path to platypus
log_path : str
Path to and name of desired log file for platypus
ref : str
Path to reference sequence
chroms : list
Chromosomes to call variants on, e.g., [“chrX”, “chrY”, “chr19”]
cpus : int
Number of threads/cores to use
output_file : path
Path to and name of the output vcf
regions_file : {str, None}
If not None, must be path to bed file containing regions to call variants in. If None, calls in call regions of provided chromosomes. Default = None.
Returns: int
Exit code of the platypus call
-
-
xyalign.bam.
switch_sex_chromosomes_sambamba
(samtools_path, sambamba_path, bam_orig, bam_new, sex_chroms, output_directory, output_prefix, threads, pg_header_dict, cram=False)[source]¶ Removes sex chromosomes from original bam and merges in remmapped sex chromosomes, while retaining the original bam header (and adding new @PG line)
Parameters: samtools_path : str
The path to samtools
sambamba_path :
The path to sambamba
bam_orig : str
The path to the original full bam file
bam_new : str
The path to the bam file containing the remapped sex chromosomes
sex_chroms : list
Sex chromosomes (to be removed from bam_orig)
output_directory : str
The path to directory where all files (inc. temp) will be output
output_prefix : str
The name (without path) to use as prefix for all files
threads : int
The number of threads/cpus to use
pg_header_dict : dict
- dictionary with information to be included in the new @PG line
- must contain:
- Key = ‘CL’, value = list of command line values Key = ‘ID’, value = string of program ID
- optional:
- Key = ‘VN’, value = string of program version
cram : bool
If True, will treat input as cram files and output cram files. Otherwise, will treat input as bam. Defaule is False. True is currently unsupported.
Returns: str
Bam or cram file path with new sex chromosomes, but all others intact.
Raises: RuntimeError
If cram is not False.
-
xyalign.bam.
samtools_merge
(samtools_path, bam_list, output_prefix, threads)[source]¶ Merges bam files using samtools.
Parameters: samtools_path : str
The path to samtools
bam_list : list
Bam files to be merged. Merging order will match order of this list.
output_prefix : str
Returns: str
path to merged bam