Glossary¶

General Terms¶

2bit: File format specification. See https://genome.ucsc.edu/FAQ/FAQformat#format7.
BAM: File format specification. See https://genome.ucsc.edu/FAQ/FAQformat#format5.1.
bed: File format specification. See https://genome.ucsc.edu/FAQ/FAQformat#format1.
blat: Alignment tool. See https://genome.ucsc.edu/FAQ/FAQblat.html#blat3 for download and installation instructions.
BreakDancer: BreakDancer is an SV caller. Source for BreakDancer can be found here [Chen-2009]
breakpoint: A breakpoint is a genomic position (interval) on some reference/template/chromosome which has a strand and orientation. The orientation describes the portion of the reference that is retained.
breakpoint pair: Basic definition of a structural variant. Does not automatically imply a classification/type.
BreakSeq: BreakSeq is an SV caller. Source for BreakSeq can be found here [Abyzov-2015]
BWA: BWA is an alignment tool. See https://github.com/lh3/bwa for install instructions.
Chimerascan: Chimerascan is an SV caller. Source for Chimerascan can be found here [Iyer-2011]
CNVnator: CNVnator is an SV caller. Source for CNVnator can be found here [Abyzov-2011]
DeFuse: DeFuse is an SV caller. Source for DeFuse can be found here [McPherson-2011]
DELLY: DELLY is an SV caller. Source for DELLY can be found here [Rausch-2012]
event: Used interchangeably with structural variant.
event type: Classification for a structural variant. See event_type.
fasta: File format specification. See https://genome.ucsc.edu/FAQ/FAQformat#format18.
flanking read pair: A pair of reads where one read maps to one side of a set of breakpoints and its mate maps to the other.
half-mapped read: A read whose mate is unaligned. Generally this refers to reads in the evidence stage that are mapped next to a breakpoint.
HGVS: Community-based standard of recommendations for variant notation. See http://varnomen.hgvs.org/
IGV: Integrative Genomics Viewer is a visualization tool. See http://software.broadinstitute.org/software/igv.
IGV batch file: A file format defined by IGV. See the IGV documentation on running IGV with a batch file.
JSON: JSON (JavaScript Object Notation) is a data file format. See https://www.w3schools.com/js/js_json_intro.asp.
Manta: Manta is an SV caller. Source for Manta can be found here [Chen-2016]
Pindel: Pindel is an SV caller. Source for Pindel can be found here [Ye-2009]
psl: File format specification. See https://genome.ucsc.edu/FAQ/FAQformat#format2.
pslx: Extended format of a psl.
SGE: Sun Grid Engine (SGE) is a job scheduling system for cluster management. See http://star.mit.edu/cluster/docs/0.93.3/guides/sge.html.
SLURM: SLURM is a job scheduling system for cluster management. See https://slurm.schedmd.com/archive/slurm-17.02.1
spanning read: A read which spans both breakpoints. Applies primarily to small structural variants.
split read: A read which aligns next to a breakpoint and is soft-clipped on one or both sides.
STAR-Fusion: STAR-Fusion is an SV caller. Source for STAR-Fusion can be found here [Haas-2017]
Strelka: Strelka is an SNV and small indel caller. Only the small indels can be processed, since SNVs are not currently supported. Source for Strelka can be found here [Saunders-2012]
structural variant: A genomic alteration that can be described by a pair of breakpoints and an event type. The two breakpoints represent regions in the genome that are broken apart and reattached together.
SV: Structural Variant
SVG: SVG (Scalable Vector Graphics) is an image format. See https://www.w3schools.com/graphics/svg_intro.asp.
TORQUE: TORQUE is a job scheduling system for cluster management. See http://www.adaptivecomputing.com/products/open-source/torque/.
Trans-ABySS: Trans-ABySS is an SV caller. Source for Trans-ABySS can be found here [Robertson-2010]
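
The relationship between the terms breakpoint, breakpoint pair, event type, and structural variant can be sketched as a pair of small data classes. This is an illustrative sketch only, not the classes used by the software: the class names, field names, the 'L'/'R' orientation codes, the '+'/'-' strand codes, and the 'deletion' label are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Breakpoint:
    """A genomic interval on a reference with a strand and an orientation."""

    chromosome: str
    start: int        # interval start (assumed 1-based, inclusive)
    end: int          # interval end (assumed 1-based, inclusive)
    orientation: str  # assumed codes: 'L' or 'R', the side that is retained
    strand: str       # assumed codes: '+' or '-'

    def __post_init__(self):
        if self.start > self.end:
            raise ValueError("breakpoint start must not exceed its end")


@dataclass(frozen=True)
class BreakpointPair:
    """Two breakpoints; a classification (event type) is optional."""

    break1: Breakpoint
    break2: Breakpoint
    event_type: Optional[str] = None  # hypothetical label, e.g. 'deletion'

    def is_classified(self) -> bool:
        # A breakpoint pair alone does not imply a type; a structural
        # variant is a pair together with an event type.
        return self.event_type is not None


# Usage: the same pair of breakpoints, without and with a classification.
left = Breakpoint("1", 100, 110, "L", "+")
right = Breakpoint("1", 500, 500, "R", "+")
unclassified = BreakpointPair(left, right)
classified = BreakpointPair(left, right, event_type="deletion")
```

Keeping the event type optional mirrors the glossary wording: a breakpoint pair is the basic definition, and the classification is added separately.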

Configurable Settings¶

aligner: SUPPORTED_ALIGNER - The aligner to use to map the contigs/reads back to the reference, e.g. blat or bwa. The corresponding environment variable is MAVIS_ALIGNER and the default value is 'blat'. Accepted values include: 'bwa mem', 'blat'
aligner_reference: filepath - Path to the aligner reference file used for aligning the contig sequences. The corresponding environment variable is MAVIS_ALIGNER_REFERENCE and the default value is None
annotation_filters: str - A comma separated list of filters to apply to putative annotations. The corresponding environment variable is MAVIS_ANNOTATION_FILTERS and the default value is 'choose_more_annotated,choose_transcripts_by_priority'
annotation_memory: int - Default memory limit (mb) for the annotation stage. The corresponding environment variable is MAVIS_ANNOTATION_MEMORY and the default value is 12000
annotations: filepath - Path to the reference annotations of genes, transcripts, exons, domains, etc. The corresponding environment variable is MAVIS_ANNOTATIONS and the default value is []
assembly_kmer_size: float_fraction - The fraction of the read length to use as the kmer size for assembly. The corresponding environment variable is MAVIS_ASSEMBLY_KMER_SIZE and the default value is 0.74
assembly_max_paths: int - The maximum number of paths to resolve. This limits the work done when the assembly graph is messy: the assembly pre-calculates the number of paths (or putative assemblies) and stops if it is greater than this setting. The corresponding environment variable is MAVIS_ASSEMBLY_MAX_PATHS and the default value is 8
assembly_min_edge_trim_weight: int - Used to simplify the de Bruijn graph before path finding. Edges with less than this frequency will be discarded if they are non-cutting, at a fork, or at the end of a path. The corresponding environment variable is MAVIS_ASSEMBLY_MIN_EDGE_TRIM_WEIGHT and the default value is 3
assembly_min_exact_match_to_remap: int - The minimum length of exact matches to initiate remapping a read to a contig. The corresponding environment variable is MAVIS_ASSEMBLY_MIN_EXACT_MATCH_TO_REMAP and the default value is 15
assembly_min_remap_coverage: float_fraction - Minimum fraction of the contig sequence which the remapped sequences must align over. The corresponding environment variable is MAVIS_ASSEMBLY_MIN_REMAP_COVERAGE and the default value is 0.9
assembly_min_remapped_seq: int - The minimum input sequences that must remap for an assembled contig to be used. The corresponding environment variable is MAVIS_ASSEMBLY_MIN_REMAPPED_SEQ and the default value is 3
assembly_min_uniq: float_fraction - Minimum fraction of unique sequence required to keep assembled contigs separate. If contigs are more similar than this, the lower-scoring (then shorter) contig is dropped. The corresponding environment variable is MAVIS_ASSEMBLY_MIN_UNIQ and the default value is 0.1
assembly_strand_concordance: float_fraction - When the number of remapped reads from each strand are compared, the ratio must be above this number to decide on the strand. The corresponding environment variable is MAVIS_ASSEMBLY_STRAND_CONCORDANCE and the default value is 0.51
blat_limit_top_aln: int - Number of results to return from blat (ranking based on score). The corresponding environment variable is MAVIS_BLAT_LIMIT_TOP_ALN and the default value is 10
blat_min_identity: float_fraction - The minimum percent identity match required for blat results when aligning contigs. The corresponding environment variable is MAVIS_BLAT_MIN_IDENTITY and the default value is 0.9
breakpoint_color: str - Breakpoint outline color. The corresponding environment variable is MAVIS_BREAKPOINT_COLOR and the default value is '#000000'
call_error: int - Buffer zone for the evidence window. The corresponding environment variable is MAVIS_CALL_ERROR and the default value is 10
clean_aligner_files: bool - Remove the aligner output files after the validation stage is complete. These files are not required for subsequent steps but can be useful in debugging and deep investigation of events. The corresponding environment variable is MAVIS_CLEAN_ALIGNER_FILES and the default value is False
cluster_initial_size_limit: int - The maximum cumulative size of both breakpoints for breakpoint pairs to be used in the initial clustering phase (combining based on overlap). The corresponding environment variable is MAVIS_CLUSTER_INITIAL_SIZE_LIMIT and the default value is 25
cluster_radius: int - Maximum distance allowed between paired breakpoint pairs. The corresponding environment variable is MAVIS_CLUSTER_RADIUS and the default value is 100
concurrency_limit: int - The concurrency limit for tasks in any given job array or the number of concurrent processes allowed for a local run. The corresponding environment variable is MAVIS_CONCURRENCY_LIMIT and the default value is None
contig_aln_max_event_size: int - Relates to determining breakpoints when pairing contig alignments. For any given read in a putative pair, the soft clipping is extended to include any events of greater than this size. The soft clipping is added to the side of the alignment as indicated by the breakpoint we are assigning pairs to. The corresponding environment variable is MAVIS_CONTIG_ALN_MAX_EVENT_SIZE and the default value is 50
contig_aln_merge_inner_anchor: int - The minimum number of consecutive exact match base pairs to not merge events within a contig alignment. The corresponding environment variable is MAVIS_CONTIG_ALN_MERGE_INNER_ANCHOR and the default value is 20
contig_aln_merge_outer_anchor: int - Minimum consecutively aligned exact matches to anchor an end for merging internal events. The corresponding environment variable is MAVIS_CONTIG_ALN_MERGE_OUTER_ANCHOR and the default value is 15
contig_aln_min_anchor_size: int - The minimum number of aligned bases for a contig (M or = in the CIGAR string) in order to simplify. The bases do not have to be consecutive. The corresponding environment variable is MAVIS_CONTIG_ALN_MIN_ANCHOR_SIZE and the default value is 50
contig_aln_min_extend_overlap: int - Minimum number of bases the query coverage interval must be extended by in order to pair alignments as a single split alignment. The corresponding environment variable is MAVIS_CONTIG_ALN_MIN_EXTEND_OVERLAP and the default value is 10
contig_aln_min_query_consumption: float_fraction - Minimum fraction of the original query sequence that must be used by the read(s) of the alignment. The corresponding environment variable is MAVIS_CONTIG_ALN_MIN_QUERY_CONSUMPTION and the default value is 0.9
contig_aln_min_score: float_fraction - Minimum score for a contig to be used as evidence in a call by contig. The corresponding environment variable is MAVIS_CONTIG_ALN_MIN_SCORE and the default value is 0.9
contig_call_distance: int - The maximum distance allowed between breakpoint pairs (called by contig) in order for them to pair. The corresponding environment variable is MAVIS_CONTIG_CALL_DISTANCE and the default value is 10
dgv_annotation: filepath - Path to the dgv reference processed to look like the cytoband file. The corresponding environment variable is MAVIS_DGV_ANNOTATION and the default value is []
domain_color: str - Domain fill color. The corresponding environment variable is MAVIS_DOMAIN_COLOR and the default value is '#ccccb3'
domain_mismatch_color: str - Domain fill color on 0% match. The corresponding environment variable is MAVIS_DOMAIN_MISMATCH_COLOR and the default value is '#b2182b'
domain_name_regex_filter: str - The regular expression used to select domains to be displayed (filtered by name). The corresponding environment variable is MAVIS_DOMAIN_NAME_REGEX_FILTER and the default value is '^PF\\d+$'
domain_scaffold_color: str - The color of the domain scaffold. The corresponding environment variable is MAVIS_DOMAIN_SCAFFOLD_COLOR and the default value is '#000000'
draw_fusions_only: bool - Flag to indicate if events which do not produce a fusion transcript should produce illustrations. The corresponding environment variable is MAVIS_DRAW_FUSIONS_ONLY and the default value is True
draw_non_synonymous_cdna_only: bool - Flag to indicate if events which are synonymous at the cdna level should produce illustrations. The corresponding environment variable is MAVIS_DRAW_NON_SYNONYMOUS_CDNA_ONLY and the default value is True
drawing_width_iter_increase: int - The amount (in pixels) by which to increase the drawing width upon failure to fit. The corresponding environment variable is MAVIS_DRAWING_WIDTH_ITER_INCREASE and the default value is 500
exon_min_focus_size: int - Minimum size of an exon for it to be granted a label or min exon width. The corresponding environment variable is MAVIS_EXON_MIN_FOCUS_SIZE and the default value is 10
fetch_min_bin_size: int - The minimum size of any bin for reading from a bam file. Increasing this number will result in smaller bins being merged or fewer bins being created (depending on the fetch method). The corresponding environment variable is MAVIS_FETCH_MIN_BIN_SIZE and the default value is 50
fetch_reads_bins: int - Number of bins to split an evidence window into to ensure more even sampling of high coverage regions. The corresponding environment variable is MAVIS_FETCH_READS_BINS and the default value is 5
fetch_reads_limit: int - Maximum number of reads, cap, to loop over for any given evidence window. The corresponding environment variable is MAVIS_FETCH_READS_LIMIT and the default value is 3000
filter_cdna_synon: bool - Filter all annotations synonymous at the cdna level. The corresponding environment variable is MAVIS_FILTER_CDNA_SYNON and the default value is True
filter_min_complexity: float_fraction - Filter event calls based on call sequence complexity. The corresponding environment variable is MAVIS_FILTER_MIN_COMPLEXITY and the default value is 0.2
filter_min_flanking_reads: int - Minimum number of flanking pairs for a call by flanking pairs. The corresponding environment variable is MAVIS_FILTER_MIN_FLANKING_READS and the default value is 10
filter_min_linking_split_reads: int - Minimum number of linking split reads for a call by split reads. The corresponding environment variable is MAVIS_FILTER_MIN_LINKING_SPLIT_READS and the default value is 1
filter_min_remapped_reads: int - Minimum number of remapped reads for a call by contig. The corresponding environment variable is MAVIS_FILTER_MIN_REMAPPED_READS and the default value is 5
filter_min_spanning_reads: int - Minimum number of spanning reads for a call by spanning reads. The corresponding environment variable is MAVIS_FILTER_MIN_SPANNING_READS and the default value is 5
filter_min_split_reads: int - Minimum number of split reads for a call by split reads. The corresponding environment variable is MAVIS_FILTER_MIN_SPLIT_READS and the default value is 5
filter_protein_synon: bool - Filter all annotations synonymous at the protein level. The corresponding environment variable is MAVIS_FILTER_PROTEIN_SYNON and the default value is False
filter_secondary_alignments: bool - Filter secondary alignments when gathering read evidence. The corresponding environment variable is MAVIS_FILTER_SECONDARY_ALIGNMENTS and the default value is True
filter_trans_homopolymers: bool - Filter all single bp ins/del/dup events that are in a homopolymer region of at least 3 bps and are not paired to a genomic event. The corresponding environment variable is MAVIS_FILTER_TRANS_HOMOPOLYMERS and the default value is True
flanking_call_distance: int - The maximum distance allowed between breakpoint pairs (called by flanking pairs) in order for them to pair. The corresponding environment variable is MAVIS_FLANKING_CALL_DISTANCE and the default value is 50
fuzzy_mismatch_number: int - The number of events/mismatches allowed to be considered a fuzzy match. The corresponding environment variable is MAVIS_FUZZY_MISMATCH_NUMBER and the default value is 1
gene1_color: str - The color of genes near the first gene. The corresponding environment variable is MAVIS_GENE1_COLOR and the default value is '#657e91'
gene1_color_selected: str - The color of the first gene. The corresponding environment variable is MAVIS_GENE1_COLOR_SELECTED and the default value is '#518dc5'
gene2_color: str - The color of genes near the second gene. The corresponding environment variable is MAVIS_GENE2_COLOR and the default value is '#325556'
gene2_color_selected: str - The color of the second gene. The corresponding environment variable is MAVIS_GENE2_COLOR_SELECTED and the default value is '#4c9677'
import_env: bool - Flag to import environment variables. The corresponding environment variable is MAVIS_IMPORT_ENV and the default value is True
input_call_distance: int - The maximum distance allowed between breakpoint pairs (called by input tools, not validated) in order for them to pair. The corresponding environment variable is MAVIS_INPUT_CALL_DISTANCE and the default value is 20
label_color: str - The label color. The corresponding environment variable is MAVIS_LABEL_COLOR and the default value is '#000000'
limit_to_chr: str - A list of chromosome names to use. Breakpoint pairs on other chromosomes will be filtered out. For example, ‘1 2 3 4’ would filter out events/breakpoint pairs on any chromosomes but 1, 2, 3, and 4. The corresponding environment variable is MAVIS_LIMIT_TO_CHR and the default value is ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y']
mail_type: MAIL_TYPE - When to notify the mail_user (if given). The corresponding environment variable is MAVIS_MAIL_TYPE and the default value is 'NONE'. Accepted values include: 'BEGIN', 'END', 'FAIL', 'ALL', 'NONE'
mail_user: str - User(s) to send notifications to. The corresponding environment variable is MAVIS_MAIL_USER and the default value is ''
mask_fill: str - Color of mask (for deleted region etc.). The corresponding environment variable is MAVIS_MASK_FILL and the default value is '#ffffff'
mask_opacity: float_fraction - Opacity of the mask layer. The corresponding environment variable is MAVIS_MASK_OPACITY and the default value is 0.7
masking: filepath - File containing regions for which input events overlapping them are dropped prior to validation. The corresponding environment variable is MAVIS_MASKING and the default value is []
max_drawing_retries: int - The maximum number of retries for attempting a drawing. The width is extended on each iteration. If it is still insufficient after this number of retries, a gene-level only drawing will be output. The corresponding environment variable is MAVIS_MAX_DRAWING_RETRIES and the default value is 5
max_files: int - The maximum number of files to output from clustering/splitting. The corresponding environment variable is MAVIS_MAX_FILES and the default value is 200
max_orf_cap: int - The maximum number of orfs to return (best putative orfs will be retained). The corresponding environment variable is MAVIS_MAX_ORF_CAP and the default value is 3
max_proximity: int - The maximum distance away from an annotation before the region is considered to be uninformative. The corresponding environment variable is MAVIS_MAX_PROXIMITY and the default value is 5000
max_sc_preceeding_anchor: int - When remapping a softclipped read this determines the amount of softclipping allowed on the side opposite of where we expect it. For example, for a softclipped read on a breakpoint with a left orientation this limits the amount of softclipping that is allowed on the right. If this is set to None then there is no limit on softclipping. The corresponding environment variable is MAVIS_MAX_SC_PRECEEDING_ANCHOR and the default value is 6
memory_limit: int - The maximum number of megabytes (mb) any given job is allowed. The corresponding environment variable is MAVIS_MEMORY_LIMIT and the default value is 16000
min_anchor_exact: int - Applies to re-aligning softclipped reads to the opposing breakpoint. The minimum number of consecutive exact matches to anchor a read to initiate targeted realignment. The corresponding environment variable is MAVIS_MIN_ANCHOR_EXACT and the default value is 6
min_anchor_fuzzy: int - Applies to re-aligning softclipped reads to the opposing breakpoint. The minimum length of a fuzzy match to anchor a read to initiate targeted realignment. The corresponding environment variable is MAVIS_MIN_ANCHOR_FUZZY and the default value is 10
min_anchor_match: float_fraction - Minimum percent match for a read to be kept as evidence. The corresponding environment variable is MAVIS_MIN_ANCHOR_MATCH and the default value is 0.9
min_call_complexity: float_fraction - The minimum complexity score for a call sequence. This is an average for non-contig calls. Low complexity contigs are filtered before alignment. See contig_complexity. The corresponding environment variable is MAVIS_MIN_CALL_COMPLEXITY and the default value is 0.1
min_clusters_per_file: int - The minimum number of breakpoint pairs to output to a file. The corresponding environment variable is MAVIS_MIN_CLUSTERS_PER_FILE and the default value is 50
min_domain_mapping_match: float_fraction - A number between 0 and 1 representing the minimum percent match a domain must map to the fusion transcript to be displayed. The corresponding environment variable is MAVIS_MIN_DOMAIN_MAPPING_MATCH and the default value is 0.9
min_double_aligned_to_estimate_insertion_size: int - The minimum number of reads which map soft-clipped to both breakpoints to assume the size of the untemplated sequence between the breakpoints is at most the read length - 2 * min_softclipping. The corresponding environment variable is MAVIS_MIN_DOUBLE_ALIGNED_TO_ESTIMATE_INSERTION_SIZE and the default value is 2
min_flanking_pairs_resolution: int - The minimum number of flanking reads required to call a breakpoint by flanking evidence. The corresponding environment variable is MAVIS_MIN_FLANKING_PAIRS_RESOLUTION and the default value is 10
min_linking_split_reads: int - The minimum number of split reads which aligned to both breakpoints. The corresponding environment variable is MAVIS_MIN_LINKING_SPLIT_READS and the default value is 2
min_mapping_quality: int - The minimum mapping quality of reads to be used as evidence. The corresponding environment variable is MAVIS_MIN_MAPPING_QUALITY and the default value is 5
min_non_target_aligned_split_reads: int - The minimum number of split reads aligned to a breakpoint by the input bam, and not forced by local alignment to the target region, required to call a breakpoint by split read evidence. The corresponding environment variable is MAVIS_MIN_NON_TARGET_ALIGNED_SPLIT_READS and the default value is 1
min_orf_size: int - The minimum length (in base pairs) to retain a putative open reading frame (orf). The corresponding environment variable is MAVIS_MIN_ORF_SIZE and the default value is 300
min_sample_size_to_apply_percentage: int - Minimum number of aligned bases to compute a match percent. If there are fewer than this number of aligned bases (match or mismatch) the percent comparator is not used. The corresponding environment variable is MAVIS_MIN_SAMPLE_SIZE_TO_APPLY_PERCENTAGE and the default value is 10
min_softclipping: int - Minimum number of soft-clipped bases required for a read to be used as soft-clipped evidence. The corresponding environment variable is MAVIS_MIN_SOFTCLIPPING and the default value is 6
min_spanning_reads_resolution: int - Minimum number of spanning reads required to call an event by spanning evidence. The corresponding environment variable is MAVIS_MIN_SPANNING_READS_RESOLUTION and the default value is 5
min_splits_reads_resolution: int - Minimum number of split reads required to call a breakpoint by split reads. The corresponding environment variable is MAVIS_MIN_SPLITS_READS_RESOLUTION and the default value is 3
novel_exon_color: str - Novel exon fill color. The corresponding environment variable is MAVIS_NOVEL_EXON_COLOR and the default value is '#5D3F6A'
outer_window_min_event_size: int - The minimum size of an event in order for flanking read evidence to be collected. The corresponding environment variable is MAVIS_OUTER_WINDOW_MIN_EVENT_SIZE and the default value is 125
queue: str - The queue jobs are to be submitted to. The corresponding environment variable is MAVIS_QUEUE and the default value is ''
reference_genome: filepath - Path to the human reference genome fasta file. The corresponding environment variable is MAVIS_REFERENCE_GENOME and the default value is []
remote_head_ssh: str - Ssh target for remote scheduler commands. The corresponding environment variable is MAVIS_REMOTE_HEAD_SSH and the default value is ''
scaffold_color: str - The color used for the gene/transcripts scaffolds. The corresponding environment variable is MAVIS_SCAFFOLD_COLOR and the default value is '#000000'
scheduler: SCHEDULER - The scheduler being used. The corresponding environment variable is MAVIS_SCHEDULER and the default value is 'SLURM'. Accepted values include: 'SGE', 'SLURM', 'TORQUE', 'LOCAL'
spanning_call_distance: int - The maximum distance allowed between breakpoint pairs (called by spanning reads) in order for them to pair. The corresponding environment variable is MAVIS_SPANNING_CALL_DISTANCE and the default value is 20
splice_color: str - Splicing lines color. The corresponding environment variable is MAVIS_SPLICE_COLOR and the default value is '#000000'
split_call_distance: int - The maximum distance allowed between breakpoint pairs (called by split reads) in order for them to pair. The corresponding environment variable is MAVIS_SPLIT_CALL_DISTANCE and the default value is 20
stdev_count_abnormal: float - The number of standard deviations away from the normal that is still considered expected; read pairs within this range do not qualify as flanking reads. The corresponding environment variable is MAVIS_STDEV_COUNT_ABNORMAL and the default value is 3.0
strand_determining_read: int - 1 or 2. The read in the pair which determines if (assuming a stranded protocol) the first or second read in the pair matches the strand sequenced. The corresponding environment variable is MAVIS_STRAND_DETERMINING_READ and the default value is 2
template_metadata: filepath - File containing the cytoband template information. Used for illustrations only. The corresponding environment variable is MAVIS_TEMPLATE_METADATA and the default value is []
time_limit: int - The time in seconds any given job is allowed. The corresponding environment variable is MAVIS_TIME_LIMIT and the default value is 57600
trans_fetch_reads_limit: int - Related to fetch_reads_limit. Overrides fetch_reads_limit for transcriptome libraries when set. If this has a value of None then fetch_reads_limit will be used for transcriptome libraries instead. The corresponding environment variable is MAVIS_TRANS_FETCH_READS_LIMIT and the default value is 12000
trans_min_mapping_quality: int - Related to min_mapping_quality. Overrides min_mapping_quality if the library is a transcriptome and this is set to any number (not None). If this value is None, min_mapping_quality is used for transcriptomes as well as genomes. The corresponding environment variable is MAVIS_TRANS_MIN_MAPPING_QUALITY and the default value is 0
trans_validation_memory: int - Default memory limit (mb) for the validation stage (for transcriptomes). The corresponding environment variable is MAVIS_TRANS_VALIDATION_MEMORY and the default value is 18000
uninformative_filter: bool - Flag that determines if breakpoint pairs which are not within max_proximity to any annotations are filtered out prior to clustering. The corresponding environment variable is MAVIS_UNINFORMATIVE_FILTER and the default value is False
validation_memory: int - Default memory limit (mb) for the validation stage. The corresponding environment variable is MAVIS_VALIDATION_MEMORY and the default value is 16000
width: int - The drawing width in pixels. The corresponding environment variable is MAVIS_WIDTH and the default value is 1000
write_evidence_files: bool - Write the intermediate bam and bed files containing the raw evidence collected and contigs aligned. These files are not required for subsequent steps but can be useful in debugging and deep investigation of events. The corresponding environment variable is MAVIS_WRITE_EVIDENCE_FILES and the default value is True
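
Each setting above pairs a name with a type, a default, and an environment variable named MAVIS_ followed by the upper-cased setting name. A minimal sketch of that lookup pattern follows. It is not the tool's own loader: the `SETTINGS` table, the `get_setting` helper, and the accepted spellings of a true flag are assumptions made for illustration, and only four of the settings are included.

```python
import os

# (type, default) pairs copied from the settings list; only a small subset is shown.
SETTINGS = {
    "aligner": (str, "blat"),
    "cluster_radius": (int, 100),
    "assembly_kmer_size": (float, 0.74),
    "draw_fusions_only": (bool, True),
}

# Assumed spellings of a true flag; anything else is treated as False.
TRUE_STRINGS = {"1", "true", "t", "yes", "y"}


def get_setting(name, environ=None):
    """Return the value from MAVIS_<NAME> if it is set, else the default."""
    cast, default = SETTINGS[name]
    environ = os.environ if environ is None else environ
    raw = environ.get("MAVIS_" + name.upper())
    if raw is None:
        return default
    if cast is bool:
        # bool("False") is True in Python, so flags are parsed explicitly.
        return raw.strip().lower() in TRUE_STRINGS
    return cast(raw)


# Usage with an explicit mapping instead of the real process environment.
example_env = {"MAVIS_CLUSTER_RADIUS": "250", "MAVIS_DRAW_FUSIONS_ONLY": "False"}
radius = get_setting("cluster_radius", example_env)        # overridden: 250
fusions_only = get_setting("draw_fusions_only", example_env)  # overridden: False
aligner = get_setting("aligner", example_env)              # not set: default 'blat'
```

Passing the environment as a mapping keeps the sketch easy to test without modifying `os.environ`.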

Column Names¶

List of column names and their definitions. The types indicated here are the expected types in a row for a given column name.

annotation_figure: FILEPATH - File path to the svg drawing representing the annotation
annotation_figure_legend: JSON - JSON data for the figure legend
annotation_id: Identifier for the annotation step
break1_chromosome: str - The name of the chromosome on which breakpoint 1 is situated
break1_ewindow: int-int - Window where evidence was gathered for the first breakpoint
break1_ewindow_count: int - Number of reads processed/looked-at in the first evidence window
break1_ewindow_practical_coverage: float - Computed as break1_ewindow_count / len(break1_ewindow). Not the actual coverage as bins are sampled within and there is a read limit cutoff
break1_homologous_seq: str - Sequence in common at the first breakpoint and other side of the second breakpoint
break1_orientation: ORIENT - The side of the breakpoint wrt the positive/forward strand that is retained.
break1_position_end: int - End integer inclusive 1-based of the range representing breakpoint 1
break1_position_start: int - Start integer inclusive 1-based of the range representing breakpoint 1
break1_seq: str - The sequence up to and including the breakpoint. Always given with respect to the positive/forward strand
break1_split_reads: int - Number of split reads that call the exact breakpoint given
break1_split_reads_forced: int - Number of split reads which were aligned to the opposite breakpoint window using a targeted alignment
break1_strand: STRAND - The strand with respect to the reference positive/forward strand at this breakpoint.
break2_chromosome: str - The name of the chromosome on which breakpoint 2 is situated
break2_ewindow: int-int - Window where evidence was gathered for the second breakpoint
break2_ewindow_count: int - Number of reads processed/looked-at in the second evidence window
break2_ewindow_practical_coverage: float - Computed as break2_ewindow_count / len(break2_ewindow). Not the actual coverage as bins are sampled within and there is a read limit cutoff
break2_homologous_seq: str - Sequence in common at the second breakpoint and other side of the first breakpoint
break2_orientation: ORIENT - The side of the breakpoint wrt the positive/forward strand that is retained.
break2_position_end: int - End integer inclusive 1-based of the range representing breakpoint 2
break2_position_start: int - Start integer inclusive 1-based of the range representing breakpoint 2
break2_seq: str - The sequence up to and including the breakpoint. Always given with respect to the positive/forward strand
break2_split_reads: int - Number of split reads that call the exact breakpoint given
break2_split_reads_forced: int - Number of split reads which were aligned to the opposite breakpoint window using a targeted alignment
break2_strand: STRAND - The strand with respect to the reference positive/forward strand at this breakpoint.
call_method: CALL_METHOD - The method used to call the breakpoints
call_sequence_complexity: float - The minimum proportion of the call sequence that any two bases account for. An average for non-contig calls
cdna_synon: Semi-colon delimited list of transcript ids which have an identical cdna sequence to the cdna sequence of the current fusion product
cluster_id: Identifier for the merging/clustering step
cluster_size: int - The number of breakpoint pair calls that were grouped in creating the cluster
contig_alignment_cigar: The cigar string(s) representing the contig alignment. Semi-colon delimited
contig_alignment_query_name: The query name for the contig alignment. Should match the ‘read’ name(s) in the .contigs.bam output file
contig_alignment_reference_start: The reference start(s) <chr>:<position> of the contig alignment. Semi-colon delimited
contig_alignment_score: float - A rank of the alignment being used, based on the alignment tool (blat etc.). An average if split alignments were used. Lower numbers indicate a better alignment. If it was the best alignment possible then this would be zero.
contig_build_score: int - Score representing the edge weights of all edges used in building the sequence
contig_remap_coverage: float - Fraction of the contig sequence which is covered by the remapped reads
contig_remap_score: float - Score representing the number of sequences from the set of sequences given to the assembly algorithm that were aligned to the resulting contig with an acceptable scoring based on user-set thresholds. For any sequence its contribution to the score is divided by the number of mappings to give less weight to multimaps
contig_remapped_read_names: read query names for the reads that were remapped. A -1 or -2 has been appended to the end of the name to indicate if this is the first or second read in the pair
contig_remapped_reads: int - the number of reads from the input bam that map to the assembled contig
contig_seq: str - Sequence of the current contig wrt the positive/forward strand if not strand specific
contig_strand_specific: bool - A flag to indicate if it was possible to resolve the strand for this contig
contigs_aligned: int - Number of contigs that were able to align
contigs_assembled: int - Number of contigs that were built from split read sequences
event_type: SVTYPE - The classification of the event
flanking_median_fragment_size: int - The median fragment size of the flanking reads being used as evidence
flanking_pairs: int - Number of read-pairs where one read aligns to the first breakpoint window and the second read aligns to the other. The count here is based on the number of unique query names
flanking_pairs_compatible: int - Number of flanking pairs of a compatible orientation type. This applies to insertions and duplications. Flanking pairs supporting an insertion will be compatible to a duplication and flanking pairs supporting a duplication will be compatible to an insertion (possibly indicating an internal translocation)
flanking_stdev_fragment_size: float - The standard deviation in fragment size of the flanking reads being used as evidence
fusion_cdna_coding_end: int - Position wrt the 5’ end of the fusion transcript where coding ends (last base of the stop codon)
fusion_cdna_coding_start: int - Position wrt the 5’ end of the fusion transcript where coding begins (first base of the Met amino acid)
fusion_mapped_domains: JSON - List of domains in JSON format where each domain's start and end positions are given wrt the fusion transcript and the mapping quality is the number of matching amino acid positions over the total number of amino acids. The sequence is the amino acid sequence of the domain on the reference/original transcript
fusion_protein_hgvs: str - Describes the fusion protein in HGVS notation. Will be None if the change is not an indel or is synonymous
fusion_sequence_fasta_file: FILEPATH - Path to the corresponding fasta output file
fusion_sequence_fasta_id: The sequence identifier for the cdna sequence output fasta file
fusion_splicing_pattern: SPLICE_TYPE - Type of splicing pattern used to create the fusion cDNA.
gene1: Gene for the current annotation at the first breakpoint
gene1_aliases: Other gene names associated with the current annotation at the first breakpoint
gene1_direction: PRIME - The direction/prime of the gene
gene2: Gene for the current annotation at the second breakpoint
gene2_aliases: Other gene names associated with the current annotation at the second breakpoint
gene2_direction: PRIME - The direction/prime of the gene
gene_product_type: GENE_PRODUCT_TYPE - Describes if the putative fusion product will be sense or anti-sense
genes_encompassed: Applies to intrachromosomal events only. List of genes which overlap any region that occurs between both breakpoints. For example in a deletion event these would be deleted genes.
genes_overlapping_break1: list of genes which overlap the first breakpoint
genes_overlapping_break2: list of genes which overlap the second breakpoint
genes_proximal_to_break1: list of genes near the first breakpoint and their distance from the breakpoint
genes_proximal_to_break2: list of genes near the second breakpoint and their distance from the breakpoint
inferred_pairing: A semi-colon delimited list of event identifiers, i.e. <annotation_id>_<splicing pattern>_<cds start>_<cds end>, which were paired to the current event based on predicted products
library: Identifier for the library/source
linking_split_reads: int - Number of split reads that align to both breakpoints
net_size: int-int - The net size of an event. For translocations and inversions this will always be 0. For indels it will be negative for deletions and positive for insertions. It is a range to accommodate non-specific events.
opposing_strands: bool - Specifies if breakpoints are on opposite strands wrt the reference. Expects a boolean
pairing: A semi-colon delimited list of event identifiers, i.e. <annotation_id>_<splicing pattern>_<cds start>_<cds end>, which were paired to the current event based on breakpoint positions
product_id: Unique identifier of the final fusion including splicing and ORF decision from the annotation step
protein_synon: semi-colon delimited list of transcript ids which produce a translation with an identical amino-acid sequence to the current fusion product
protocol: PROTOCOL - Specifies the type of library
raw_break1_split_reads: int - Number of split reads before calling the breakpoint
raw_break2_split_reads: int - Number of split reads before calling the breakpoint
raw_flanking_pairs: int - Number of flanking reads before calling the breakpoint. The count here is based on the number of unique query names
raw_spanning_reads: int - Number of spanning reads collected during evidence collection before calling the breakpoint
spanning_read_names: read query names of the spanning reads which support the current event
spanning_reads: int - the number of spanning reads which support the event
stranded: bool - Specifies if the sequencing protocol was strand specific or not. Expects a boolean
supplementary_call: bool - Flag to indicate if the current event was a supplementary call, meaning a call that was found as a result of validating another event.
tools: The tools that called the event originally from the cluster step. Should be a semi-colon delimited list of <tool name>_<tool version>
tracking_id: column used to store input identifiers from the original SV calls. Used to track calls from the input files to the final outputs.
transcript1: Transcript for the current annotation at the first breakpoint
transcript2: Transcript for the current annotation at the second breakpoint
untemplated_seq: str - The untemplated/novel sequence between the breakpoints
validation_id: Identifier for the validation step
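
Several of the columns above pack multiple values into one field: semi-colon delimited lists (e.g. tools, contig_alignment_cigar), <tool name>_<tool version> entries, and remapped read names with an appended -1 or -2. The following is a minimal Python sketch of how such values might be unpacked after reading an output row. The helper names and the sample values are hypothetical and are not part of the tool itself; the assumption that the version is whatever follows the last underscore is also ours.

```python
def split_delimited(value, sep=";"):
    """Split a delimited column value into a list, dropping empty items."""
    if not value:
        return []
    return [item.strip() for item in value.split(sep) if item.strip()]


def parse_tool(entry):
    """Split a '<tool name>_<tool version>' entry on its last underscore.

    Assumption: the version never contains an underscore. If no underscore
    is present, the whole entry is treated as the name.
    """
    name, sep, version = entry.rpartition("_")
    if not sep:
        return entry, ""
    return name, version


def parse_remapped_read_name(name):
    """Separate the appended -1/-2 suffix from a remapped read name."""
    base, sep, mate = name.rpartition("-")
    if not sep or mate not in ("1", "2"):
        raise ValueError("expected a name ending in -1 or -2: %r" % name)
    return base, int(mate)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    row = {
        "tools": "toolA_1.2.0;tool_b_0.9",
        "contig_alignment_cigar": "50M;20S30M",
    }
    print([parse_tool(t) for t in split_delimited(row["tools"])])
    print(split_delimited(row["contig_alignment_cigar"]))
    print(parse_remapped_read_name("read-group-7-2"))
```

Splitting on the last separator (rpartition) keeps tool names or read names that themselves contain an underscore or hyphen intact.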