Reference Input Files¶
There are several reference files that are required for full functionality of the MAVIS pipeline. If the same
reference file will be reused often then the user may find it helpful to set reasonable defaults. Default values
for any of the reference file arguments can be configured through MAVIS_-prefixed environment variables.
file | file type/format | environment variable |
---|---|---|
reference genome | fasta | MAVIS_REFERENCE_GENOME |
annotations | JSON or text/tabbed | MAVIS_ANNOTATIONS |
masking | text/tabbed | MAVIS_MASKING |
template metadata | text/tabbed | MAVIS_TEMPLATE_METADATA |
If the environment variables above are set, they will be used as the default values when any step of the pipeline script is called (including generating the template config file).
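As a rough illustration of how such a default resolves, the sketch below returns an explicitly supplied path when there is one and otherwise falls back to the matching environment variable. The helper `resolve_reference` and the dictionary keys are hypothetical names chosen for this example and are not part of MAVIS; only the environment variable names come from the table above.

```python
import os

# Environment variable names taken from the table above.
# The dictionary keys are labels made up for this example.
REFERENCE_ENV_VARS = {
    "reference_genome": "MAVIS_REFERENCE_GENOME",
    "annotations": "MAVIS_ANNOTATIONS",
    "masking": "MAVIS_MASKING",
    "template_metadata": "MAVIS_TEMPLATE_METADATA",
}


def resolve_reference(name, explicit=None, environ=None):
    """Return the explicit path if given, else the env-var default, else None.

    Hypothetical helper for illustration; MAVIS performs its own lookup.
    """
    environ = os.environ if environ is None else environ
    if explicit is not None:
        return explicit
    return environ.get(REFERENCE_ENV_VARS[name])


if __name__ == "__main__":
    # Show which defaults are currently configured in this shell.
    for name in REFERENCE_ENV_VARS:
        print(name, "->", resolve_reference(name))
```

Passing `environ` as a plain dictionary makes the lookup easy to try out without changing the real environment.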
Reference Genome¶
These are the sequence files in fasta format that are used in aligning and generating the fusion sequences.
Examples:
Template Metadata¶
This file contains the band information for the chromosomes. It is only used during visualization.
Examples:
chr1 0 2300000 p36.33 gneg
chr1 2300000 5400000 p36.32 gpos25
chr1 5400000 7200000 p36.31 gneg
chr1 7200000 9200000 p36.23 gpos25
chr1 9200000 12700000 p36.22 gneg
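To make the layout of the rows above concrete, here is a minimal parsing sketch. It assumes the five columns are chromosome, start, end, band name, and stain (the field names are this example's own labels) and that columns are separated by whitespace. `parse_bands` is a hypothetical helper, not a MAVIS function.

```python
from collections import namedtuple

# Column labels are assumptions made for this example.
Band = namedtuple("Band", ["chrom", "start", "end", "name", "stain"])


def parse_bands(lines):
    """Parse whitespace-delimited band rows into Band records."""
    bands = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        chrom, start, end, name, stain = line.split()
        bands.append(Band(chrom, int(start), int(end), name, stain))
    return bands


example = """\
chr1 0 2300000 p36.33 gneg
chr1 2300000 5400000 p36.32 gpos25
"""
bands = parse_bands(example.splitlines())
print(bands[0].name, bands[0].end - bands[0].start)
```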
Masking File¶
This file lists regions in which calls should be ignored. It can be used to filter out regions with known false positives, poor mapping, centromeres, telomeres, etc. An example is shown below:
#chr start end name
chr1 0 2300000 centromere
chr1 9200000 12700000 telomere
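The following sketch shows one way a masking file like the example above could be applied to a call position. It is not how MAVIS implements masking: `load_mask` and `masked_by` are hypothetical helpers, the columns are assumed to be whitespace-separated, and region bounds are assumed to be inclusive.

```python
def load_mask(lines):
    """Read (chrom, start, end, name) tuples, skipping the '#' header line."""
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, start, end, name = line.split()[:4]
        regions.append((chrom, int(start), int(end), name))
    return regions


def masked_by(regions, chrom, pos):
    """Return the names of masked regions containing pos (inclusive bounds assumed)."""
    return [name for c, s, e, name in regions if c == chrom and s <= pos <= e]


example = """\
#chr start end name
chr1 0 2300000 centromere
chr1 9200000 12700000 telomere
"""
regions = load_mask(example.splitlines())
print(masked_by(regions, "chr1", 1000))     # position inside the first region
print(masked_by(regions, "chr1", 5000000))  # position outside both regions
```

A call would then be filtered whenever `masked_by` returns a non-empty list for one of its breakpoints.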
Annotations¶
This is a custom file format: a tabbed or JSON file which contains the gene, transcript, exon, translation, and protein domain positional information.
Warning
load_reference_genes() will only load valid translations. If the cds sequence in the annotation is not a multiple of CODON_SIZE, or if a reference genome (sequences) is given and the cds start and end are not M and * amino acids as expected, the translation is not loaded.
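The two conditions in the warning can be sketched as a small check. This is an illustration only and not the code inside load_reference_genes(): `is_loadable_translation` is a hypothetical function, and it assumes the standard genetic code, in which ATG translates to M and TAA, TAG, and TGA translate to the stop symbol *.

```python
CODON_SIZE = 3  # nucleotides per codon
START_CODON = "ATG"  # translates to M
STOP_CODONS = {"TAA", "TAG", "TGA"}  # translate to *


def is_loadable_translation(cds_length, cds_seq=None):
    """Mimic the checks described in the warning (illustrative, not MAVIS code).

    cds_length: length of the coding sequence according to the annotation.
    cds_seq: the coding sequence itself, available only when a reference
             genome was supplied; None skips the start/stop check.
    """
    if cds_length % CODON_SIZE != 0:
        return False
    if cds_seq is None:
        return True
    seq = cds_seq.upper()
    return seq[:CODON_SIZE] == START_CODON and seq[-CODON_SIZE:] in STOP_CODONS


print(is_loadable_translation(9))               # length check only
print(is_loadable_translation(9, "ATGAAATAA"))  # starts with M, ends with *
print(is_loadable_translation(9, "ATGAAAAAA"))  # no stop codon at the end
```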
An example of the JSON file structure is shown below:
[
    {
        "name": string,
        "start": int,
        "end": int,
        "aliases": [string, string, ...],
        "transcripts": [
            {
                "name": string,
                "start": int,
                "end": int,
                "exons": [
                    {"start": int, "end": int, "name": string},
                    ...
                ],
                "cdna_coding_start": int,
                "cdna_coding_end": int,
                "domains": [
                    {
                        "name": string,
                        "regions": [
                            {"start": aa_start, "end": aa_end}
                        ],
                        "desc": string
                    },
                    ...
                ]
            },
            ...
        ]
    },
    ...
]
This reference file can be generated from any database with the necessary information.
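As a sketch of how a file with the JSON structure above might be consumed, the snippet below loads a tiny made-up document and summarizes each transcript. All gene, transcript, exon, and domain values are invented for illustration, `summarize` is a hypothetical helper, and the coding length calculation assumes that cdna_coding_start and cdna_coding_end are inclusive coordinates.

```python
import json

# A made-up document following the structure shown above.
doc = """
[
  {
    "name": "GENE1",
    "start": 100,
    "end": 900,
    "aliases": ["G1"],
    "transcripts": [
      {
        "name": "GENE1-T1",
        "start": 100,
        "end": 900,
        "exons": [
          {"start": 100, "end": 300, "name": "E1"},
          {"start": 600, "end": 900, "name": "E2"}
        ],
        "cdna_coding_start": 51,
        "cdna_coding_end": 350,
        "domains": [
          {"name": "D1", "regions": [{"start": 1, "end": 40}], "desc": "made-up domain"}
        ]
      }
    ]
  }
]
"""


def summarize(genes):
    """Return (gene, transcript, exon count, coding length) for each transcript."""
    rows = []
    for gene in genes:
        for transcript in gene["transcripts"]:
            # Inclusive coordinates are an assumption of this example.
            cds_len = transcript["cdna_coding_end"] - transcript["cdna_coding_start"] + 1
            rows.append((gene["name"], transcript["name"], len(transcript["exons"]), cds_len))
    return rows


rows = summarize(json.loads(doc))
for row in rows:
    print(row)
```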
Generating the Annotations from Ensembl¶
A helper script is included with MAVIS to facilitate generating the custom annotations file from an instance of the Ensembl database. It uses the Ensembl Perl API to connect to the database and pull the information. This has been tested with both Ensembl69 and Ensembl79.
Instructions for downloading and installing the Perl API can be found on the Ensembl site.
- Make sure the Ensembl Perl API modules are added to the PERL5LIB environment variable
PERL5LIB=${PERL5LIB}:$HOME/ensembl_79/bioperl-live
PERL5LIB=${PERL5LIB}:$HOME/ensembl_79/ensembl/modules
PERL5LIB=${PERL5LIB}:$HOME/ensembl_79/ensembl-compara/modules
PERL5LIB=${PERL5LIB}:$HOME/ensembl_79/ensembl-variation/modules
PERL5LIB=${PERL5LIB}:$HOME/ensembl_79/ensembl-funcgen/modules
export PERL5LIB
- Configure the environment variables to set defaults for the perl script
# required data files
export HUGO_ENSEMBL_MAPPING=/path/to/mapping/file
export BEST_TRANSCRIPTS=/path/to/transcripts/file
# connection information for the ensembl local (or external) server
export ENSEMBL_HOST=HOSTNAME
export ENSEMBL_PASS=PASSWORD
export ENSEMBL_USER=USERNAME
export ENSEMBL_PORT=PORT_NUMBER
- Run the perl script
You can view the help menu by running
perl generate_ensembl_json.pl
You can override the default input file parameters (configured in the previous step) by providing arguments to the script itself
perl generate_ensembl_json.pl --best_transcript_file /path/to/best/transcripts/file --output /path/to/output/json/file.json
Or, if you have configured the environment variables as described in the previous step, simply provide the output path
perl generate_ensembl_json.pl --output /path/to/output/json/file.json
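Before launching the script, it can help to confirm that the environment variables from the configuration step are actually set. The sketch below is a hypothetical pre-flight check, not something shipped with MAVIS. It assumes that all six variables listed above are needed when no command-line overrides are given, and it only assembles the command shown above rather than running it.

```python
import os

# Variable names taken from the configuration step above. Treating all of
# them as mandatory is an assumption of this example.
EXPECTED_VARS = [
    "HUGO_ENSEMBL_MAPPING",
    "BEST_TRANSCRIPTS",
    "ENSEMBL_HOST",
    "ENSEMBL_PASS",
    "ENSEMBL_USER",
    "ENSEMBL_PORT",
]


def missing_vars(environ=None):
    """Return the expected variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in EXPECTED_VARS if not environ.get(name)]


def build_command(output_path):
    """Assemble the argument list for the invocation shown above."""
    return ["perl", "generate_ensembl_json.pl", "--output", output_path]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("unset variables:", ", ".join(missing))
    else:
        print("would run:", " ".join(build_command("/path/to/output/json/file.json")))
```

The resulting list could be handed to `subprocess.run` once the check passes.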