constants module
Module responsible for small utility functions and constants used throughout the structural_variant package.
-
mavis.constants.CALL_METHOD = <vocab.vocab.Vocab object>
Vocab – holds controlled vocabulary for allowed call methods
- CONTIG: a contig was assembled and aligned across the breakpoints
- SPLIT: the event was called by split read
- FLANK: the event was called by flanking read pair
- SPAN: the event was called by spanning read
-
mavis.constants.CIGAR = <vocab.vocab.Vocab object>
Vocab – Enum-like. For readable cigar values
- M: alignment match (can be a sequence match or mismatch)
- I: insertion to the reference
- D: deletion from the reference
- N: skipped region from the reference
- S: soft clipping (clipped sequences present in SEQ)
- H: hard clipping (clipped sequences NOT present in SEQ)
- P: padding (silent deletion from padded reference)
- EQ: sequence match (=)
- X: sequence mismatch
note: descriptions are taken from the samfile documentation
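The operations above can be illustrated with a minimal sketch. The `Cigar` enum below is a hypothetical stand-in, not the `Vocab` object itself; its integer codes are assumed to follow the SAM specification, and `reference_length` is an illustrative helper that is not part of this module.

```python
from enum import IntEnum


class Cigar(IntEnum):
    """Hypothetical stand-in for CIGAR; codes assumed from the SAM specification."""
    M = 0   # alignment match (sequence match or mismatch)
    I = 1   # insertion to the reference
    D = 2   # deletion from the reference
    N = 3   # skipped region from the reference
    S = 4   # soft clipping
    H = 5   # hard clipping
    P = 6   # padding
    EQ = 7  # sequence match (=)
    X = 8   # sequence mismatch


# operations that advance the position on the reference
REFERENCE_CONSUMING = {Cigar.M, Cigar.D, Cigar.N, Cigar.EQ, Cigar.X}


def reference_length(cigar_tuples):
    """Sum the lengths of the operations that consume reference bases."""
    return sum(length for op, length in cigar_tuples if op in REFERENCE_CONSUMING)


example = [(Cigar.S, 5), (Cigar.EQ, 20), (Cigar.D, 3), (Cigar.X, 1), (Cigar.I, 2), (Cigar.EQ, 10)]
print(reference_length(example))  # 34
```

Soft clips and insertions are skipped because they do not advance the position on the reference.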
-
mavis.constants.COLUMNS = <vocab.vocab.Vocab object>
Vocab – Column names for i/o files used throughout the pipeline
- annotation_figure
- File path to the svg drawing representing the annotation
- annotation_figure_legend
JSON
- JSON data for the figure legend
- annotation_id
- Identifier for the annotation step
- break1_chromosome
str
- The name of the chromosome on which breakpoint 1 is situated
- break1_ewindow
- Window where evidence was gathered for the first breakpoint
- break1_ewindow_count
int
- Number of reads processed/looked-at in the first evidence window
- break1_ewindow_practical_coverage
float
- break1_ewindow_count / len(break1_ewindow). Not the actual coverage, as bins are sampled within the window and there is a read limit cutoff
- break1_homologous_seq
str
- Sequence in common at the first breakpoint and other side of the second breakpoint
- break1_orientation
ORIENT
- The side of the breakpoint wrt the positive/forward strand that is retained.
- break1_position_end
int
- End (inclusive, 1-based) of the range representing breakpoint 1
- break1_position_start
int
- Start (inclusive, 1-based) of the range representing breakpoint 1
- break1_seq
str
- The sequence up to and including the breakpoint. Always given wrt the positive/forward strand
- break1_split_reads
int
- Number of split reads that call the exact breakpoint given
- break1_split_reads_forced
int
- Number of split reads which were aligned to the opposite breakpoint window using a targeted alignment
- break1_strand
STRAND
- The strand wrt the reference positive/forward strand at this breakpoint.
- break2_chromosome
- The name of the chromosome on which breakpoint 2 is situated
- break2_ewindow
- Window where evidence was gathered for the second breakpoint
- break2_ewindow_count
int
- Number of reads processed/looked-at in the second evidence window
- break2_ewindow_practical_coverage
float
- break2_ewindow_count / len(break2_ewindow). Not the actual coverage, as bins are sampled within the window and there is a read limit cutoff
- break2_homologous_seq
str
- Sequence in common at the second breakpoint and other side of the first breakpoint
- break2_orientation
ORIENT
- The side of the breakpoint wrt the positive/forward strand that is retained.
- break2_position_end
int
- End (inclusive, 1-based) of the range representing breakpoint 2
- break2_position_start
int
- Start (inclusive, 1-based) of the range representing breakpoint 2
- break2_seq
str
- The sequence up to and including the breakpoint. Always given wrt the positive/forward strand
- break2_split_reads
int
- Number of split reads that call the exact breakpoint given
- break2_split_reads_forced
int
- Number of split reads which were aligned to the opposite breakpoint window using a targeted alignment
- break2_strand
STRAND
- The strand wrt the reference positive/forward strand at this breakpoint.
- call_method
CALL_METHOD
- The method used to call the breakpoints
- cdna_synon
- semi-colon delimited list of transcript ids which have an identical cdna sequence to the cdna sequence of the current fusion product
- cluster_id
- Identifier for the merging/clustering step
- cluster_size
- The number of breakpoint pair calls that were grouped in creating the cluster
- contig_alignment_cigar
- The cigar string(s) representing the contig alignment. Semi-colon delimited
- contig_alignment_query_name
- The query name for the contig alignment. Should match the ‘read’ name(s) in the .contigs.bam output file
- contig_alignment_reference_start
- The reference start(s) <chr>:<position> of the contig alignment. Semi-colon delimited
- contig_alignment_score
float
- A rank based on the alignment tool (blat, etc.) of the alignment being used. An average if split alignments were used. Lower numbers indicate a better alignment. If it was the best alignment possible then this would be zero.
- contig_build_score
int
- Score representing the edge weights of all edges used in building the sequence
- contig_remap_coverage
float
- Fraction of the contig sequence which is covered by the remapped reads
- contig_remap_score
float
- Score representing the number of sequences from the set of sequences given to the assembly algorithm that were aligned to the resulting contig with an acceptable scoring based on user-set thresholds. For any sequence its contribution to the score is divided by the number of mappings to give less weight to multimaps
- contig_remapped_read_names
- read query names for the reads that were remapped. A -1 or -2 has been appended to the end of the name to indicate if this is the first or second read in the pair
- contig_remapped_reads
int
- the number of reads from the input bam that map to the assembled contig
- contig_seq
str
- Sequence of the current contig wrt the positive/forward strand if not strand specific
- contig_strand_specific
bool
- A flag to indicate if it was possible to resolve the strand for this contig
- contigs_aligned
int
- Number of contigs that were able to align
- contigs_assembled
int
- Number of contigs that were built from split read sequences
- event_type
SVTYPE
- The classification of the event
- flanking_median_fragment_size
int
- The median fragment size of the flanking reads being used as evidence
- flanking_pairs
int
- Number of read-pairs where one read aligns to the first breakpoint window and the second read aligns to the other. The count here is based on the number of unique query names
- flanking_pairs_compatible
int
- Number of flanking pairs of a compatible orientation type. This applies to insertions and duplications. Flanking pairs supporting an insertion will be compatible to a duplication and flanking pairs supporting a duplication will be compatible to an insertion (possibly indicating an internal translocation)
- flanking_stdev_fragment_size
float
- The standard deviation in fragment size of the flanking reads being used as evidence
- fusion_cdna_coding_end
- Position wrt the 5’ end of the fusion transcript where coding ends (last base of the stop codon)
- fusion_cdna_coding_start
- Position wrt the 5’ end of the fusion transcript where coding begins (first base of the Met amino acid).
- fusion_mapped_domains
JSON
- List of domains in JSON format where each domain start and end positions are given wrt the fusion transcript and the mapping quality is the number of matching amino acid positions over the total number of amino acids. The sequence is the amino acid sequence of the domain on the reference/original transcript
- fusion_sequence_fasta_file
- Path to the corresponding fasta output file
- fusion_sequence_fasta_id
- The sequence identifier for the cdna sequence output fasta file
- fusion_splicing_pattern
SPLICE_TYPE
- Type of splicing pattern used to create the fusion cDNA.
- gene1
- Gene for the current annotation at the first breakpoint
- gene1_aliases
- Other gene names associated with the current annotation at the first breakpoint
- gene1_direction
PRIME
- The direction/prime of the gene
- gene2
- Gene for the current annotation at the second breakpoint
- gene2_aliases
- Other gene names associated with the current annotation at the second breakpoint
- gene2_direction
PRIME
- The direction/prime of the gene. The possible values are those of PRIME
- gene_product_type
GENE_PRODUCT_TYPE
- Describes if the putative fusion product will be sense or anti-sense
- genes_encompassed
- Applies to intrachromosomal events only. List of genes which overlap any region that occurs between both breakpoints. For example in a deletion event these would be deleted genes.
- genes_overlapping_break1
- list of genes which overlap the first breakpoint
- genes_overlapping_break2
- list of genes which overlap the second breakpoint
- genes_proximal_to_break1
- list of genes near the breakpoint and the distance away from the breakpoint
- genes_proximal_to_break2
- list of genes near the breakpoint and the distance away from the breakpoint
- library
- Identifier for the library/source
- linking_split_reads
int
- Number of split reads that align to both breakpoints
- opposing_strands
bool
- Specifies if breakpoints are on opposite strands wrt the reference. Expects a boolean
- pairing
- A semi-colon delimited list of event identifiers i.e. <annotation_id>_<splicing pattern>_<cds start>_<cds end>
- product_id
- Unique identifier of the final fusion including splicing and ORF decision from the annotation step
- protein_synon
- semi-colon delimited list of transcript ids which produce a translation with an identical amino-acid sequence to the current fusion product
- protocol
PROTOCOL
- Specifies the type of library
- raw_break1_split_reads
int
- Number of split reads before calling the breakpoint
- raw_break2_split_reads
int
- Number of split reads before calling the breakpoint
- raw_flanking_pairs
int
- Number of flanking reads before calling the breakpoint. The count here is based on the number of unique query names
- raw_spanning_reads
int
- Number of spanning reads collected during evidence collection before calling the breakpoint
- spanning_read_names
- read query names of the spanning reads which support the current event
- spanning_reads
int
- the number of spanning reads which support the event
- stranded
bool
- Specifies if the sequencing protocol was strand specific or not. Expects a boolean
- tools
- The tools that called the event originally from the cluster step. Should be a semi-colon delimited list of <tool name>_<tool version>
- transcript1
- Transcript for the current annotation at the first breakpoint
- transcript2
- Transcript for the current annotation at the second breakpoint
- untemplated_seq
str
- The untemplated/novel sequence between the breakpoints
- validation_id
- Identifier for the validation step
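The practical coverage columns in the list above are described as a read count divided by the window length. A minimal sketch of that arithmetic follows; it assumes the evidence window can be represented as an inclusive `(start, end)` pair, which is an illustrative choice and not necessarily how the pipeline stores it. `practical_coverage` is a hypothetical helper name.

```python
def practical_coverage(ewindow, ewindow_count):
    """Reads counted in the evidence window divided by the window length.

    ewindow: hypothetical (start, end) pair, assumed inclusive on both ends
    ewindow_count: number of reads processed in that window
    """
    start, end = ewindow
    window_length = end - start + 1  # inclusive range, an assumption of this sketch
    if window_length <= 0:
        raise ValueError('window end must not precede window start')
    return ewindow_count / window_length


# illustrative values only
row = {'break1_ewindow': (1000, 1499), 'break1_ewindow_count': 250}
coverage = practical_coverage(row['break1_ewindow'], row['break1_ewindow_count'])
print(coverage)  # 0.5
```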
-
mavis.constants.DISEASE_STATUS = <vocab.vocab.Vocab object>
Vocab – holds controlled vocabulary for allowed disease status
- DISEASED: diseased
- NORMAL: normal
-
mavis.constants.GENE_PRODUCT_TYPE = <vocab.vocab.Vocab object>
Vocab – controlled vocabulary for gene products
- SENSE: the gene product is a sense fusion
- ANTI_SENSE: the gene product is anti-sense
-
mavis.constants.GIESMA_STAIN = <vocab.vocab.Vocab object>
Vocab – holds controlled vocabulary relating to stains of chromosome bands
-
mavis.constants.NA_MAPPING_QUALITY = 255
int – mapping quality value to indicate mapping was not performed/calculated
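As a minimal sketch of how such a sentinel might be handled, the snippet below excludes the value 255 before averaging mapping qualities. The constant is redefined locally so the example is self-contained, and `mean_mapping_quality` is a hypothetical helper, not part of this module.

```python
NA_MAPPING_QUALITY = 255  # redefined here so the sketch runs on its own


def mean_mapping_quality(qualities):
    """Average the mapping qualities, ignoring the not-available sentinel.

    Returns None when no usable quality remains.
    """
    known = [q for q in qualities if q != NA_MAPPING_QUALITY]
    if not known:
        return None
    return sum(known) / len(known)


print(mean_mapping_quality([60, 255, 40]))  # 50.0
```

Treating 255 as a real score would inflate the average, which is why the sentinel is dropped first.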
-
mavis.constants.ORIENT = <vocab.vocab.Vocab object>
Vocab – holds controlled vocabulary for allowed orientation values
- LEFT: left wrt the positive/forward strand
- RIGHT: right wrt the positive/forward strand
- NS: orientation is not specified
-
mavis.constants.PRIME = <vocab.vocab.Vocab object>
Vocab – holds controlled vocabulary
- FIVE: five prime
- THREE: three prime
-
mavis.constants.PROTOCOL = <vocab.vocab.Vocab object>
Vocab – holds controlled vocabulary for allowed protocol values
- GENOME: genome
- TRANS: transcriptome
-
mavis.constants.PYSAM_READ_FLAGS = <vocab.vocab.Vocab object>
Vocab – Enum-like. For readable PYSAM flag constants
- MULTIMAP: template having multiple segments in sequencing
- UNMAPPED: segment unmapped
- MATE_UNMAPPED: next segment in the template unmapped
- REVERSE: SEQ being reverse complemented
- MATE_REVERSE: SEQ of the next segment in the template being reverse complemented
- FIRST_IN_PAIR: the first segment in the template
- LAST_IN_PAIR: the last segment in the template
- SECONDARY: secondary alignment
- SUPPLEMENTARY: supplementary alignment
note: descriptions are taken from the samfile documentation
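The flags listed above are bit values that combine into a single integer per read. The sketch below decodes such an integer into names. `ReadFlag` is a hypothetical stand-in for the `Vocab` object, and its bit values are assumed from the SAM specification rather than read from this module.

```python
from enum import IntFlag


class ReadFlag(IntFlag):
    """Hypothetical stand-in for PYSAM_READ_FLAGS; bits assumed from the SAM specification."""
    MULTIMAP = 0x1
    UNMAPPED = 0x4
    MATE_UNMAPPED = 0x8
    REVERSE = 0x10
    MATE_REVERSE = 0x20
    FIRST_IN_PAIR = 0x40
    LAST_IN_PAIR = 0x80
    SECONDARY = 0x100
    SUPPLEMENTARY = 0x800


def decode_flag(flag):
    """Return the names of all listed flag bits that are set in the integer flag."""
    return [member.name for member in ReadFlag if flag & member.value]


print(decode_flag(97))    # ['MULTIMAP', 'MATE_REVERSE', 'FIRST_IN_PAIR']
print(decode_flag(2064))  # ['REVERSE', 'SUPPLEMENTARY']
```

Bits defined by the SAM specification but absent from the list above (for example the properly-paired bit) are ignored by this sketch.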
-
mavis.constants.STRAND = <vocab.vocab.Vocab object>
Vocab – holds controlled vocabulary for allowed strand values
- POS: the positive/forward strand
- NEG: the negative/reverse strand
- NS: strand is not specified
-
mavis.constants.SVTYPE = <vocab.vocab.Vocab object>
Vocab – holds controlled vocabulary for acceptable structural variant classifications
- DEL: deletion
- TRANS: translocation
- ITRANS: inverted translocation
- INV: inversion
- INS: insertion
- DUP: duplication
-
mavis.constants.reverse_complement(s)
wrapper for the Bio.Seq reverse_complement method
Parameters: s (str) – the input DNA sequence
Returns: the reverse complement of the input sequence
Return type: str
Warning
assumes the input is a DNA sequence
Example
>>> reverse_complement('ATCCGGT')
'ACCGGAT'
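For illustration, the same operation can be sketched without Biopython. This is a simplified stand-in under the assumption that the input contains only A, C, G and T (upper or lower case); it is not the documented implementation, which wraps Bio.Seq. The name `reverse_complement_sketch` is hypothetical.

```python
# translation table mapping each DNA base to its complement
_COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def reverse_complement_sketch(s):
    """Complement each base, then reverse the sequence.

    Characters outside ACGT/acgt are passed through unchanged.
    """
    return s.translate(_COMPLEMENT)[::-1]


print(reverse_complement_sketch('ATCCGGT'))  # ACCGGAT
```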