file_io module

module which holds all functions relating to loading reference files

class mavis.annotate.file_io.ReferenceFile(file_type, *filepaths, eager_load=False, assert_exists=False, **opt)[source]

Bases: object

Parameters
  • *filepaths (str) – list of paths to load

  • file_type (str) – Type of file to load

  • eager_load (bool=False) – load the files immediately

  • assert_exists (bool=False) – check that all files exist

  • **opt – keyword arguments to be passed to the load function and used as part of the file cache key

Raises

FileNotFoundError: when assert_exists and an input does not exist

CACHE = {}
LOAD_FUNCTIONS = {'aligner_reference': None, 'annotations': <function load_annotations>, 'dgv_annotation': <function load_masking_regions>, 'masking': <function load_masking_regions>, 'reference_genome': <function load_reference_genome>, 'template_metadata': <function load_templates>}

Mapping of file types (based on ENV name) to load functions

Type

dict

files_exist(not_empty=False)[source]
is_empty()[source]
is_loaded()[source]
load(ignore_cache=False, verbose=True)[source]

load (or return) the contents of a reference file and add it to the cache if enabled
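The load-or-return behaviour described above can be illustrated with a minimal sketch. This is not the MAVIS implementation: CachedFile, its loader argument, and the shape of the cache key are hypothetical names chosen for illustration, and the sketch assumes that option values are hashable and that ignore_cache bypasses the cache for both reading and writing.

```python
class CachedFile:
    """Hypothetical stand-in showing lazy loading backed by a shared cache."""

    CACHE = {}  # shared by all instances

    def __init__(self, file_type, *filepaths, loader=None, **opt):
        self.file_type = file_type
        self.filepaths = tuple(sorted(filepaths))
        self.loader = loader  # callable(*filepaths, **opt) returning the content
        self.opt = opt
        self.content = None

    def key(self):
        # the options are part of the key, so the same file loaded with
        # different options never shares a cache entry
        return (self.file_type, self.filepaths, tuple(sorted(self.opt.items())))

    def is_loaded(self):
        return self.content is not None

    def load(self, ignore_cache=False):
        if self.is_loaded():
            return self
        key = self.key()
        if not ignore_cache and key in self.CACHE:
            self.content = self.CACHE[key]  # reuse what another instance loaded
        else:
            self.content = self.loader(*self.filepaths, **self.opt)
            if not ignore_cache:
                self.CACHE[key] = self.content
        return self


calls = []


def fake_loader(*paths, **opt):
    # records each real load so that cache hits can be observed
    calls.append(paths)
    return {'paths': paths}


first = CachedFile('masking', 'a.tab', loader=fake_loader).load()
second = CachedFile('masking', 'a.tab', loader=fake_loader).load()
print(len(calls))  # the second instance is served from the cache
```

Keeping the cache on the class rather than on the instance is what lets two separately constructed objects that point at the same files share one loaded copy.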

mavis.annotate.file_io.convert_tab_to_json(filepath, warn=<mavis.util.Log object>)[source]

given a file in the standard input format (see below), reads and returns a list of genes (and sub-objects)

column name              example                    description
ensembl_transcript_id    ENST000001
ensembl_gene_id          ENSG000001
strand                   -1                         positive or negative 1
cdna_coding_start        44                         where translation begins relative to the start of the cdna
cdna_coding_end          150                        where translation terminates
genomic_exon_ranges      100-201;334-412;779-830    semi-colon delimited exon start/ends
AA_domain_ranges         DBD:220-251,260-271        semi-colon delimited list of domains
hugo_names               KRAS                       hugo gene name

Parameters

filepath (str) – path to the input tab-delimited file

Returns

a dictionary keyed by chromosome name with values of lists of genes on the chromosome

Return type

dict of list of Gene by str

Example

>>> ref = load_reference_genes('filename')
>>> ref['1']
[Gene(), Gene(), ....]

Warning

does not load translations unless they start with ‘M’, end with ‘*’ and have a length that is a multiple of 3
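As an illustration of the two range columns in the input format, the following sketch splits the example values into integer pairs. The helper names are hypothetical and are not part of mavis.annotate.file_io. The reading of AA_domain_ranges (a colon after the domain name, commas between regions of one domain) is inferred from the single example value and should be treated as an assumption.

```python
def parse_exon_ranges(field):
    """Split a string like '100-201;334-412' into a list of (start, end) pairs."""
    exons = []
    for item in field.split(';'):
        start, end = item.split('-')
        exons.append((int(start), int(end)))
    return exons


def parse_domain_ranges(field):
    """Split a string like 'DBD:220-251,260-271' into {name: [(start, end), ...]}."""
    domains = {}
    for item in field.split(';'):
        # assumption: the last colon separates the domain name from its regions
        name, ranges = item.rsplit(':', 1)
        for region in ranges.split(','):
            start, end = region.split('-')
            domains.setdefault(name, []).append((int(start), int(end)))
    return domains


print(parse_exon_ranges('100-201;334-412;779-830'))
# [(100, 201), (334, 412), (779, 830)]
print(parse_domain_ranges('DBD:220-251,260-271'))
# {'DBD': [(220, 251), (260, 271)]}
```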

mavis.annotate.file_io.load_annotations(*filepaths, warn=<mavis.util.Log object>, reference_genome=None, best_transcripts_only=False)[source]

loads gene models from an input file. Expects a tab-delimited or JSON file.

Parameters
  • filepath (str) – path to the input file

  • verbose (bool) – output extra information to stdout

  • reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name

  • filetype (str) – json or tab/tsv. only required if the file type can’t be inferred from the path extension

Returns

lists of genes keyed by chromosome name

Return type

dict of list of Gene by str

mavis.annotate.file_io.load_masking_regions(*filepaths)[source]

reads a file of regions. The expected input format for the file is tab-delimited and the header should contain the following columns

  • chr: the chromosome

  • start: start of the region, 1-based inclusive

  • end: end of the region, 1-based inclusive

  • name: the name/label of the region

For example:

#chr    start   end     name
chr20   25600000        27500000        centromere

Parameters

filepath (str) – path to the input tab-delimited file

Returns

a dictionary keyed by chromosome name with values of lists of regions on the chromosome

Return type

dict of list of BioInterval by str

Example

>>> m = load_masking_regions('filename')
>>> m['1']
[BioInterval(), BioInterval(), ...]
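To make the file layout concrete, here is a rough sketch that reads the same four columns into a dictionary keyed by chromosome. It is not the MAVIS loader: read_regions is a hypothetical helper, plain tuples stand in for BioInterval, and stripping a leading '#' from the header is an assumption based on the example file above.

```python
import csv
import io


def read_regions(handle):
    """Group (start, end, name) tuples by chromosome from a tab-delimited handle."""
    regions = {}
    reader = csv.reader(handle, delimiter='\t')
    # assumption: the header line may begin with '#', as in the example file
    header = [column.lstrip('#') for column in next(reader)]
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(header, row))
        region = (int(record['start']), int(record['end']), record['name'])
        regions.setdefault(record['chr'], []).append(region)
    return regions


example = io.StringIO(
    '#chr\tstart\tend\tname\n'
    'chr20\t25600000\t27500000\tcentromere\n'
)
masks = read_regions(example)
print(masks['chr20'])
# [(25600000, 27500000, 'centromere')]
```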
mavis.annotate.file_io.load_reference_genes(*pos, **kwargs)[source]

Deprecated: use load_annotations() instead

mavis.annotate.file_io.load_reference_genome(*filepaths)[source]

Parameters

filepaths (list of str) – the paths to the files containing the input fasta genomes

Returns

a dictionary representing the sequences in the fasta file

Return type

dict of Bio.SeqRecord by str
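The shape of the returned mapping can be sketched with the standard library alone. This is only an illustration: read_fasta is a hypothetical helper, plain strings stand in for the Bio.SeqRecord values, and keying each sequence by the first whitespace-delimited token of its header line is an assumption.

```python
import io


def read_fasta(handle):
    """Return {sequence name: sequence string} from a FASTA-formatted handle."""
    sequences = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if name is not None:
                sequences[name] = ''.join(chunks)  # close the previous record
            name = line[1:].split()[0]  # assumption: name is the first token
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        sequences[name] = ''.join(chunks)  # close the final record
    return sequences


genome = read_fasta(io.StringIO('>1 example\nACGT\nTTGA\n>2\nGGCC\n'))
print(genome['1'])
# ACGTTTGA
```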

mavis.annotate.file_io.load_templates(*filepaths)[source]

primarily useful if template drawings are required; it is not necessary otherwise. Assumes the input file is 0-indexed with [start,end) style. Columns are expected in the following order, tab-delimited. A header should not be given

  1. name

  2. start

  3. end

  4. band_name

  5. giemsa_stain

for example

chr1    0       2300000 p36.33  gneg
chr1    2300000 5400000 p36.32  gpos25

Parameters

filename (str) – the path to the file with the cytoband template information

Returns

list of the templates loaded

Return type

list of Template
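The headerless, 0-indexed [start,end) layout can be illustrated with a short sketch that reads the five columns by position. read_cytobands and Band are hypothetical names, not the MAVIS Template class, and the shift to 1-based inclusive coordinates is shown only to make the coordinate convention explicit; whether MAVIS performs the same conversion internally is not stated here.

```python
import csv
import io
from collections import namedtuple

# hypothetical record type; coordinates are stored 1-based inclusive
Band = namedtuple('Band', ['start', 'end', 'name', 'stain'])


def read_cytobands(handle):
    """Group bands by template name from a headerless tab-delimited handle."""
    templates = {}
    for row in csv.reader(handle, delimiter='\t'):
        if not row:
            continue  # skip blank lines
        template, start, end, band_name, stain = row
        # 0-based [start, end) becomes 1-based inclusive: only the start shifts
        band = Band(int(start) + 1, int(end), band_name, stain)
        templates.setdefault(template, []).append(band)
    return templates


example = io.StringIO(
    'chr1\t0\t2300000\tp36.33\tgneg\n'
    'chr1\t2300000\t5400000\tp36.32\tgpos25\n'
)
bands = read_cytobands(example)
print(bands['chr1'][1].start)
# 2300001
```

With this conversion, adjacent bands no longer share a coordinate: the first band ends at 2300000 and the second begins at 2300001.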

mavis.annotate.file_io.parse_annotations_json(data, reference_genome=None, best_transcripts_only=False, warn=<mavis.util.Log object>)[source]

parses a json of annotation information into annotation objects