genomic module¶

class mavis.annotate.genomic.Exon(start, end, transcript=None, name=None, intact_start_splice=True, intact_end_splice=True, seq=None, strand=None)[source]¶

Bases: mavis.annotate.base.BioInterval

Parameters:

start (int) – the genomic start position
end (int) – the genomic end position
name (str) – the name of the exon
transcript (usTranscript) – the ‘parent’ transcript this exon belongs to
intact_start_splice (bool) – True if the starting splice site is intact, False if it has been abrogated
intact_end_splice (bool) – True if the end splice site is intact, False if it has been abrogated

Raises:	`AttributeError` – if the exon start > the exon end

Example

>>> Exon(15, 78)

acceptor¶: int – returns the genomic exonic position of the acceptor splice site

acceptor_splice_site¶: Interval – the genomic range describing the splice site

donor¶: int – returns the genomic exonic position of the donor splice site

donor_splice_site¶: Interval – the genomic range describing the splice site

transcript¶: usTranscript – the transcript this exon belongs to
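
The relationship between the exon coordinates and the acceptor and donor attributes above can be sketched as follows. This is a minimal illustration and not the MAVIS implementation: `ExonSketch` is a hypothetical stand-in class, and it assumes the usual convention that the acceptor is the first exonic base in the direction of transcription and the donor is the last.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExonSketch:
    """Hypothetical stand-in for Exon (assumption, not the MAVIS class)."""
    start: int
    end: int
    strand: str = '+'

    def __post_init__(self):
        # mirrors the documented behaviour: start must not exceed end
        if self.start > self.end:
            raise AttributeError('exon start must be <= exon end')

    @property
    def acceptor(self):
        # first exonic base in the direction of transcription (assumed convention)
        return self.start if self.strand == '+' else self.end

    @property
    def donor(self):
        # last exonic base in the direction of transcription (assumed convention)
        return self.end if self.strand == '+' else self.start


forward = ExonSketch(15, 78, '+')
reverse = ExonSketch(15, 78, '-')
```

Under these assumptions the forward-strand exon has its acceptor at 15 and donor at 78, and the two positions swap on the reverse strand.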

class mavis.annotate.genomic.Gene(chr, start, end, name=None, strand='?', aliases=None, seq=None)[source]¶

Bases: mavis.annotate.base.BioInterval

Parameters:

chr (str) – the chromosome
name (str) – the gene name/id e.g. ENSG0001
strand (STRAND) – the genomic strand ‘+’ or ‘-’
aliases (`list` of `str`) – a list of aliases. For example the hugo name could go here
seq (str) – genomic seq of the gene

Example

>>> Gene('X', 1, 1000, 'ENG0001', '+', ['KRAS'])

chr¶: returns the name of the chromosome that this gene resides on

get_seq(REFERENCE_GENOME, ignore_cache=False)[source]¶

the gene sequence is always given with respect to the positive (forward) strand, regardless of the gene strand

Parameters:

REFERENCE_GENOME (`dict` of `Bio.SeqRecord` by `str`) – dict of reference sequence by template/chr name
ignore_cache (bool) – if True then stored sequences will be ignored and the function will attempt to retrieve the sequence using the positions and the input REFERENCE_GENOME

Returns:	the sequence of the gene
Return type:	str
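
The lookup and caching behaviour described above can be sketched roughly as below. This is an illustration under stated assumptions, not the library code: `GeneSketch` is a hypothetical class, the reference is a plain `dict` of `str` rather than `Bio.SeqRecord` objects, and coordinates are assumed to be 1-based and inclusive.

```python
class GeneSketch:
    """Hypothetical stand-in for Gene (assumption, not the MAVIS class)."""

    def __init__(self, chrom, start, end, seq=None):
        self.chrom = chrom
        self.start = start
        self.end = end
        self.seq = seq  # optional stored (cached) sequence

    def get_seq(self, reference_genome, ignore_cache=False):
        # use the stored sequence unless the caller asks to bypass it
        if self.seq is not None and not ignore_cache:
            return self.seq
        # assumed 1-based inclusive coordinates, converted to a Python slice;
        # the result is always on the forward strand
        return reference_genome[self.chrom][self.start - 1:self.end]


reference = {'X': 'ACGTACGTTTGG'}
gene = GeneSketch('X', 3, 6, seq='NNNN')
cached = gene.get_seq(reference)
fresh = gene.get_seq(reference, ignore_cache=True)
```

Here `cached` is the stored placeholder `'NNNN'`, while `fresh` is read from the reference and is `'GTAC'`.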

key()[source]¶: see mavis.annotate.base.BioInterval.key()

spliced_transcripts¶: list of Transcript – list of transcripts

to_dict()[source]¶: see mavis.annotate.base.BioInterval.to_dict()

transcript_priority(transcript)[source]¶: prioritizes transcripts from 0 to n-1 based on best transcript flag and then alphanumeric name sort

Warning

A lower number means a higher priority. This makes the default (ascending) sort put the preferred transcript first.
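
One way to picture this ordering is the sketch below. It is only an illustration: `TranscriptStub` and this free-standing `transcript_priority` are hypothetical, and a plain string comparison stands in for the alphanumeric name sort.

```python
from collections import namedtuple

# hypothetical minimal transcript record (assumption, not the MAVIS class)
TranscriptStub = namedtuple('TranscriptStub', ['name', 'is_best_transcript'])


def transcript_priority(transcripts, transcript):
    """Return the 0-based rank of a transcript; lower means higher priority."""
    # False sorts before True, so negate the flag to put best transcripts first,
    # then break ties by name
    ordered = sorted(transcripts, key=lambda t: (not t.is_best_transcript, t.name))
    return ordered.index(transcript)


transcripts = [
    TranscriptStub('ENST0003', False),
    TranscriptStub('ENST0002', True),
    TranscriptStub('ENST0001', False),
]
priorities = {t.name: transcript_priority(transcripts, t) for t in transcripts}
```

With this input the best transcript `ENST0002` gets priority 0, followed by `ENST0001` (1) and `ENST0003` (2).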

transcripts¶: list of usTranscript – list of unspliced transcripts

translations¶: list of Translation – list of translations

class mavis.annotate.genomic.IntergenicRegion(chr, start, end, strand)[source]¶

Bases: mavis.annotate.base.BioInterval

Parameters:

chr (str) – the reference object/chromosome for this region
start (int) – the start of the IntergenicRegion
end (int) – the end of the IntergenicRegion
strand (STRAND) – the strand the region is defined on

Example

>>> IntergenicRegion('1', 1, 100, '+')

chr¶: returns the name of the chromosome that this region resides on

key()[source]¶: see mavis.annotate.base.BioInterval.key()

to_dict()[source]¶: see mavis.annotate.base.BioInterval.to_dict()

class mavis.annotate.genomic.Template(name, start, end, seq=None, bands=None)[source]¶: Bases: mavis.annotate.base.BioInterval

class mavis.annotate.genomic.Transcript(ust, splicing_patt, seq=None, translations=None)[source]¶

Bases: mavis.annotate.base.BioInterval

splicing pattern is given in genomic coordinates

Parameters:

ust (usTranscript) – the unspliced transcript
splicing_patt (`list` of `int`) – the list of splicing positions
seq (str) – the cdna sequence
translations (`list` of `Translation`) – the list of translations of this transcript

convert_cdna_to_genomic(pos)[source]¶

Parameters:	pos (int) – cdna position
Returns:	the genomic equivalent
Return type:	int

convert_genomic_to_cdna(pos)[source]¶

Parameters:	pos (int) – the genomic position to be converted
Returns:	the cdna equivalent
Return type:	int
Raises:	`IndexError` – when a genomic position not present in the cdna is attempted to be converted
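
The two coordinate conversions above can be sketched as follows. This is a simplified illustration rather than the MAVIS code: the functions are hypothetical, the retained exons are passed directly as `(start, end)` tuples instead of a splicing pattern, and both genomic and cdna positions are assumed to be 1-based.

```python
def genomic_to_cdna(exons, strand, pos):
    """Convert a genomic position to a 1-based cdna position (sketch)."""
    # walk the exons in the direction of transcription
    ordered = sorted(exons, reverse=(strand == '-'))
    offset = 0  # number of cdna bases contributed by the exons already passed
    for start, end in ordered:
        if start <= pos <= end:
            within = pos - start if strand == '+' else end - pos
            return offset + within + 1
        offset += end - start + 1
    raise IndexError('genomic position is not present in the cdna', pos)


def cdna_to_genomic(exons, strand, pos):
    """Convert a 1-based cdna position to its genomic position (sketch)."""
    if pos < 1:
        raise IndexError('cdna position must be positive', pos)
    ordered = sorted(exons, reverse=(strand == '-'))
    remaining = pos
    for start, end in ordered:
        length = end - start + 1
        if remaining <= length:
            return start + remaining - 1 if strand == '+' else end - remaining + 1
        remaining -= length
    raise IndexError('cdna position is outside the transcript', pos)


exons = [(101, 110), (201, 210)]
```

For these two exons, genomic position 205 maps to cdna position 15 on the forward strand, while an intronic position such as 150 raises `IndexError`, matching the documented behaviour.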

convert_genomic_to_nearest_cdna(pos)[source]¶

get_seq(REFERENCE_GENOME=None, ignore_cache=False)[source]¶

Parameters:

REFERENCE_GENOME (`dict` of `Bio.SeqRecord` by `str`) – dict of reference sequence by template/chr name
ignore_cache (bool) – if True then stored sequences will be ignored and the function will attempt to retrieve the sequence using the positions and the input REFERENCE_GENOME

Returns:	the sequence corresponding to the spliced cdna
Return type:	str

unspliced_transcript¶: usTranscript – the unspliced transcript this splice variant belongs to

class mavis.annotate.genomic.usTranscript(exons, gene=None, name=None, strand=None, spliced_transcripts=None, seq=None, is_best_transcript=False)[source]¶

Bases: mavis.annotate.base.BioInterval

creates a new transcript object

Parameters:

exons (list of Exon) – list of Exon that make up the transcript
genomic_start (int) – genomic start position of the transcript
genomic_end (int) – genomic end position of the transcript
gene (Gene) – the gene this transcript belongs to
name (str) – name of the transcript
strand (STRAND) – strand the transcript is on, defaults to the strand of the Gene if not specified
seq (str) – unspliced cDNA seq

convert_cdna_to_genomic(pos, splicing_pattern)[source]¶

Parameters:

pos (int) – cdna position
splicing_pattern (SplicingPattern) – list of genomic splice sites 3'5' repeating

Returns:	the genomic equivalent
Return type:	int

convert_genomic_to_cdna(pos, splicing_pattern)[source]¶

Parameters:

pos (int) – the genomic position to be converted
splicing_pattern (SplicingPattern) – list of genomic splice sites 3'5' repeating

Returns:	the cdna equivalent
Return type:	int
Raises:	`IndexError` – when a genomic position not present in the cdna is attempted to be converted

convert_genomic_to_nearest_cdna(pos, splicing_pattern)[source]¶

converts a genomic position to its cdna equivalent or (if intronic) the nearest cdna and shift

Parameters:

pos (int) – the genomic position
splicing_pattern (SplicingPattern) – the splicing pattern

Returns:

int - the exonic cdna position
int - the intronic shift

Return type:

tuple of int and int
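
The idea of returning an exonic cdna position plus an intronic shift can be sketched as below. This is an assumption-laden illustration, not the library implementation: the function is hypothetical, it handles the forward strand only, it takes exon `(start, end)` tuples instead of a splicing pattern, and it assumes a positive shift means downstream of the reported exonic base.

```python
def genomic_to_nearest_cdna(exons, pos):
    """Return (cdna position, intronic shift) for a forward-strand transcript."""
    ordered = sorted(exons)
    if pos < ordered[0][0] or pos > ordered[-1][1]:
        raise IndexError('position is outside the transcript', pos)
    boundaries = []  # (genomic position, cdna position) of every exon edge
    offset = 0
    for start, end in ordered:
        if start <= pos <= end:
            # exonic: no shift is needed
            return offset + (pos - start) + 1, 0
        boundaries.append((start, offset + 1))
        boundaries.append((end, offset + (end - start) + 1))
        offset += end - start + 1
    # intronic: report the closest exon edge and the signed distance to it
    genomic, cdna = min(boundaries, key=lambda b: abs(pos - b[0]))
    return cdna, pos - genomic


exons = [(101, 110), (201, 210)]
```

With these exons, position 113 is reported as `(10, 3)`, three bases past the last base of the first exon, and position 198 as `(11, -3)`, three bases before the first base of the second exon.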

exon_number(exon)[source]¶

exon numbering is based on the direction of translation

Parameters:	exon (Exon) – the exon to be numbered
Returns:	the exon number (1 based)
Return type:	int
Raises:	`AttributeError` – if the strand is not given or the exon does not belong to the transcript
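
A minimal sketch of strand-aware exon numbering is given below. The free-standing `exon_number` function and the tuple representation of exons are assumptions made for illustration; only the 1-based numbering and the `AttributeError` conditions come from the description above.

```python
def exon_number(exons, strand, exon):
    """Return the 1-based number of an exon, counted along the strand (sketch)."""
    if strand not in ('+', '-'):
        raise AttributeError('strand must be given to number exons')
    if exon not in exons:
        raise AttributeError('exon does not belong to the transcript')
    # on the reverse strand the exon with the largest coordinates comes first
    ordered = sorted(exons, reverse=(strand == '-'))
    return ordered.index(exon) + 1


exons = [(1, 10), (20, 30), (40, 50)]
```

Here `(1, 10)` is exon 1 on the forward strand and exon 3 on the reverse strand.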

gene¶: Gene – the gene this transcript belongs to

generate_splicing_patterns()[source]¶

returns a list of splice sites to be connected as a splicing pattern

Returns:	List of positions to be spliced together
Return type:	`list` of `SplicingPattern`

see theory - predicting splicing patterns

get_cdna_seq(splicing_pattern, REFERENCE_GENOME=None, ignore_cache=False)[source]¶

Parameters:

splicing_pattern (SplicingPattern) – the list of splicing positions
REFERENCE_GENOME (`dict` of `Bio.SeqRecord` by `str`) – dict of reference sequence by template/chr name
ignore_cache (bool) – if True then stored sequences will be ignored and the function will attempt to retrieve the sequence using the positions and the input REFERENCE_GENOME

Returns:	the spliced cDNA sequence
Return type:	str

get_seq(REFERENCE_GENOME=None, ignore_cache=False)[source]¶

Parameters:

REFERENCE_GENOME (`dict` of `Bio.SeqRecord` by `str`) – dict of reference sequence by template/chr name
ignore_cache (bool) – if True then stored sequences will be ignored and the function will attempt to retrieve the sequence using the positions and the input REFERENCE_GENOME

Returns:	the sequence of the transcript including introns (but relative to strand)
Return type:	str
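
What "relative to strand" means here can be illustrated with the sketch below. It is not the MAVIS code: `transcript_seq` and `reverse_complement` are hypothetical helpers, the reference is a plain `dict` of `str`, and coordinates are assumed to be 1-based and inclusive.

```python
# translation table for complementing unambiguous DNA bases
COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def reverse_complement(seq):
    """Complement every base and reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def transcript_seq(reference_genome, chrom, start, end, strand):
    """Return the unspliced sequence as read along the transcript's strand."""
    # assumed 1-based inclusive coordinates, converted to a Python slice
    forward = reference_genome[chrom][start - 1:end]
    return forward if strand == '+' else reverse_complement(forward)


reference = {'1': 'AACCGGTTAC'}
plus_seq = transcript_seq(reference, '1', 2, 5, '+')
minus_seq = transcript_seq(reference, '1', 2, 5, '-')
```

The same interval yields `'ACCG'` on the forward strand and its reverse complement `'CGGT'` on the reverse strand, in contrast to the gene sequence, which is always reported on the forward strand.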

transcripts¶: list of Transcript – list of spliced transcripts

translations¶: list of Translation – list of translations associated with this transcript