Running the Pipeline¶
Running MAVIS using a Job Scheduler¶
The setup step of MAVIS is designed to work with a job scheduler on a compute cluster. It will generate submission scripts and a wrapper bash script for the user to execute on their cluster head node.
Figure 1. The MAVIS pipeline is highly configurable. Some pipeline steps (cluster, validate) are optional and can be automatically skipped. The standard pipeline is far-left.¶
The most common use case is auto-generating a configuration file and then running the pipeline setup step. The pipeline setup step will run clustering and create scripts for running the other steps.
mavis config .... -w config.cfg
mavis setup config.cfg -o /path/to/top/output_dir
This will create the build.cfg configuration file, which is used by the scheduler to submit jobs. To use a particular scheduler, you will need to set the MAVIS_SCHEDULER environment variable. After the build configuration file has been created, you can run mavis schedule to submit your jobs
ssh cluster_head_node
mavis schedule -o /path/to/top/output_dir --submit
This will submit a series of jobs with dependencies.
Figure 2. Dependency graph of MAVIS jobs for the standard pipeline setup. The notation on the arrows indicates the SLURM setting on the job to add the dependency on the previous job.¶
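The dependency chain can be pictured as a small directed graph. The sketch below is illustrative only and is not MAVIS's scheduler code. It assumes the standard pipeline, where each library is validated and then annotated, pairing waits for every annotate job, and summary waits for pairing. The `build_stage_graph` and `submission_order` helpers and the `stage:library` labels are hypothetical names introduced for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_stage_graph(libraries):
    """Map each job label to the set of jobs it must wait for.

    Assumption: validate -> annotate per library, then pairing, then summary.
    """
    graph = {}
    for lib in libraries:
        graph[f"validate:{lib}"] = set()
        graph[f"annotate:{lib}"] = {f"validate:{lib}"}
    graph["pairing"] = {f"annotate:{lib}" for lib in libraries}
    graph["summary"] = {"pairing"}
    return graph


def submission_order(libraries):
    """Return one valid order in which the jobs could be submitted."""
    return list(TopologicalSorter(build_stage_graph(libraries)).static_order())


print(submission_order(["mock-A36971", "mock-A47933"]))
```

Any order produced this way places a job after everything it depends on, which is the property the scheduler's dependency settings enforce on the cluster.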
Configuring Scheduler Settings¶
There are multiple ways to configure the scheduler settings. Some of the configurable options are listed below
For example to set the job queue default using an environment variable
Or it can also be added to the config file manually
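To make the two configuration routes concrete, here is a minimal sketch of how a setting could be resolved. It is not taken from the MAVIS source. It assumes that a value in the config file takes precedence over the environment variable default, that the variable is named `MAVIS_QUEUE` (following the `MAVIS_SCHEDULER` naming), and that the config file keeps the option under a `[schedule]` section. All three are assumptions made for illustration.

```python
import configparser
import os


def resolve_queue(config_text, environ=None, builtin_default=""):
    """Pick the queue: config file value, else environment default, else built-in."""
    environ = os.environ if environ is None else environ
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Hypothetical section and option names, used only for this sketch.
    if parser.has_option("schedule", "queue"):
        return parser.get("schedule", "queue")
    # Hypothetical environment variable name.
    return environ.get("MAVIS_QUEUE", builtin_default)


print(resolve_queue("[schedule]\nqueue = long\n", environ={"MAVIS_QUEUE": "short"}))
print(resolve_queue("", environ={"MAVIS_QUEUE": "short"}))
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.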
Troubleshooting Dependency Failures¶
The most common error to occur when running MAVIS on the cluster is a memory or time limit exception. These can be detected by running the schedule step or looking for dependency failures reported on the cluster. The suffix of the job name will be a number and will correspond to the suffix of the job directory.
mavis schedule -o /path/to/output/dir
This will report any failed jobs. For example if this were a crash report for one of the validation jobs we might expect to see something like below in the schedule output
[2018-05-31 13:02:06] validate
MV_<library>_<batch id>-<task id> (<job id>) is FAILED
CRASH: <error from log file>
Any jobs in an error, failed, etc. state can be resubmitted by running mavis schedule with the resubmit flag
mavis schedule -o /path/to/output/dir --resubmit
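When many jobs are reported, it can help to pull the failures out of the schedule output automatically. The sketch below assumes the report lines look like the example shown above (`<job name> (<job id>) is <STATUS>`, optionally followed by a `CRASH:` line). It also uses the numeric suffix of the job name to derive the matching job directory name. The sample report text, the second job id, and the crash message are made up for the example.

```python
import re

JOB_LINE = re.compile(r"^\s*(?P<name>M[VAPS]_\S+) \((?P<job_id>[^)]+)\) is (?P<status>[A-Z ]+)$")
CRASH_LINE = re.compile(r"^\s*CRASH: (?P<message>.*)$")
TASK_SUFFIX = re.compile(r"(?P<batch_dir>batch-[A-Za-z0-9]+-\d+)$")


def failed_jobs(report):
    """Collect FAILED jobs and the CRASH message that follows each one, if any."""
    failures = []
    current = None  # most recent FAILED job, while we are still reading its lines
    for line in report.splitlines():
        job = JOB_LINE.match(line)
        if job:
            current = None
            if job["status"] == "FAILED":
                suffix = TASK_SUFFIX.search(job["name"])
                current = {
                    "name": job["name"],
                    "job_id": job["job_id"],
                    # The job name suffix matches the job directory suffix.
                    "job_dir": suffix["batch_dir"] if suffix else None,
                    "crash": None,
                }
                failures.append(current)
            continue
        crash = CRASH_LINE.match(line)
        if crash and current is not None:
            current["crash"] = crash["message"]
    return failures


report = """[2018-05-31 13:02:06] validate
    MV_mock-A47933_batch-D2nTiy9AhGye4UZNapAik6-1 (1691742) is FAILED
        CRASH: example error text
    MV_mock-A47933_batch-D2nTiy9AhGye4UZNapAik6-2 (1691743) is COMPLETED
"""
for failure in failed_jobs(report):
    print(failure)
```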
If a job has failed due to memory or time limits, editing the /path/to/output/dir/build.cfg file lets the user change that job's settings without setting up the pipeline again or rerunning the other jobs.
For example, below is the configuration for a validation job
stage = validate
job_ident = 1691742
name = MV_mock-A47933_batch-D2nTiy9AhGye4UZNapAik6
dependencies =
script = /path/to/output/dir/mock-A47933_diseased_transcriptome/validate/
status = FAILED
output_dir = /path/to/output/dir/mock-A47933_diseased_transcriptome/validate/batch-D2nTiy9AhGye4UZNapAik6-{task_ident}
stdout = /path/to/output/dir/mock-A47933_diseased_transcriptome/validate/batch-D2nTiy9AhGye4UZNapAik6-{task_ident}/job-{name}-{job_ident}-{task_ident}.log
created_at = 1527641526
status_comment =
memory_limit = 18000
queue = short
time_limit = 57600
import_env = True
mail_user =
mail_type = NONE
concurrency_limit = None
task_list = 1
The memory_limit is in MB and the time_limit is in seconds, so the job above is limited to 18000 MB and 57600 seconds (16 hours). Editing the values here will cause the job to be resubmitted with the new values.
Incorrectly editing the build.cfg file may have unanticipated results and may require setting up MAVIS again to fix.
Generally the user should ONLY edit memory_limit and time_limit.
If memory errors are frequent, it is better to adjust the default values (trans_validation_memory, validation_memory, time_limit).
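The edit described above can also be scripted. This is a minimal sketch, assuming build.cfg can be read by Python's standard `configparser` and that each job is one section with the keys shown in the example configuration. The section name `example_job` and the `raise_limits` helper are hypothetical. Only memory_limit and time_limit are changed, and only for jobs whose status is FAILED.

```python
import configparser
import io


def raise_limits(build_cfg_text, memory_factor=2, time_factor=2):
    """Scale memory_limit (MB) and time_limit (seconds) for every FAILED job."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(build_cfg_text)
    for section in parser.sections():
        job = parser[section]
        if job.get("status") != "FAILED":
            continue
        if "memory_limit" in job:
            job["memory_limit"] = str(int(job["memory_limit"]) * memory_factor)
        if "time_limit" in job:
            job["time_limit"] = str(int(job["time_limit"]) * time_factor)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical section name; the keys and values mirror the example above.
example = """[example_job]
stage = validate
status = FAILED
memory_limit = 18000
time_limit = 57600
"""
updated = raise_limits(example)
print(updated)
```

It is safest to run something like this on a copy of build.cfg and compare the result with the original before replacing it, then run mavis schedule with the resubmit flag as described above.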
MAVIS (Mini) Tutorial¶
This tutorial is based on the data included in the tests folder of MAVIS. The data files are very small and this tutorial is really only intended for testing a MAVIS install. The data here is simulated and the results are not representative of the typical events you would see reported from MAVIS. For a more complete tutorial with actual fusion gene examples, please see the MAVIS (Full) Tutorial below.
The first step is to clone or download a zip of the MAVIS repository. You will need the tests directory. The tag you check out should correspond to the MAVIS version you have installed
git clone
git -C mavis checkout v2.0.0
mv mavis/tests .
rm -r mavis
Now you should have a folder called tests in your current directory. You will need to specify the scheduler if you want to test one that is not the default. For example
Since this is a trivial example, it can easily be run locally. By default, MAVIS in local mode runs at most one fewer processes than the number of CPUs on the machine. If you are running other things on the same machine, you may find it useful to set this limit directly.
The above will limit mavis to running 2 processes concurrently.
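The default described in this section (one fewer than the CPU count) can be sketched as follows. This is not the MAVIS implementation. The `default_local_concurrency` helper is hypothetical, and the floor of one process is an assumption added so the result is usable on a single-CPU machine.

```python
import os


def default_local_concurrency(cpu_count=None):
    """One fewer than the CPU count, but never less than one process."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, cpu_count - 1)


print(default_local_concurrency())
```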
Now you are ready to run MAVIS itself. This can be done in two commands (since the config file we are going to use is already built). First set up the pipeline
mavis setup tests/data/pipeline_config.cfg -o output_dir
Now if you run the schedule step (without the submit flag, schedule acts as a checker) you should see something like
mavis schedule -o output_dir/
MAVIS: 1.8.4
[2018-06-01 12:19:31] arguments
command = 'schedule'
log = None
log_level = 'INFO'
output = 'output_dir/'
resubmit = False
submit = False
[2018-06-01 12:19:31] validate
MV_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-1 is NOT SUBMITTED
MV_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-2 is NOT SUBMITTED
MV_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-1 is NOT SUBMITTED
MV_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-2 is NOT SUBMITTED
MV_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-3 is NOT SUBMITTED
[2018-06-01 12:19:31] annotate
MA_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-1 is NOT SUBMITTED
MA_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-2 is NOT SUBMITTED
MA_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-1 is NOT SUBMITTED
MA_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-2 is NOT SUBMITTED
MA_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-3 is NOT SUBMITTED
[2018-06-01 12:19:31] pairing
MP_batch-s4W2Go4tinn49nkhSuusrE is NOT SUBMITTED
[2018-06-01 12:19:31] summary
MS_batch-s4W2Go4tinn49nkhSuusrE is NOT SUBMITTED
rewriting: output_dir/build.cfg
Adding the submit argument will start the pipeline
mavis schedule -o output_dir/ --submit
After this completes, run schedule without the submit flag again and you should see something like
MAVIS: 1.8.4
[2018-06-01 13:15:28] arguments
command = 'schedule'
log = None
log_level = 'INFO'
output = 'output_dir/'
resubmit = False
submit = False
[2018-06-01 13:15:28] validate
MV_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-1 (zQJYndSMimaoALwcSSiYwi) is COMPLETED
MV_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-2 (BHFVf3BmXVrDUA5X4GGSki) is COMPLETED
MV_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-1 (tUpx3iabCrpR9iKu9rJtES) is COMPLETED
MV_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-2 (hgmH7nqPXZ49a8yTsxSUWZ) is COMPLETED
MV_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-3 (cEoRN582An3eAGALaSKmpJ) is COMPLETED
[2018-06-01 13:15:28] annotate
MA_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-1 (tMHiVR8ueNokhBDnghXYo6) is COMPLETED
MA_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-2 (AsNpNdvUyhNtKmRZqRSPpR) is COMPLETED
MA_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-1 (k7qQiAzxfC2dnZwsGH7BzD) is COMPLETED
MA_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-2 (dqAuhhcVKejDvHGBXn22xb) is COMPLETED
MA_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-3 (eB69Ghed2xAdp2VRdaCJBf) is COMPLETED
[2018-06-01 13:15:28] pairing
MP_batch-s4W2Go4tinn49nkhSuusrE (6LfEgBtBsmGhQpLQp9rXmi) is COMPLETED
[2018-06-01 13:15:28] summary
MS_batch-s4W2Go4tinn49nkhSuusrE (HDJhXgKjRmseahcQ7mgNoD) is COMPLETED
rewriting: output_dir/build.cfg
run time (hh/mm/ss): 0:00:00
run time (s): 0
If you see the above, then MAVIS has completed correctly!
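For larger runs, reading the report by eye becomes tedious. The sketch below tallies job statuses per stage and checks whether everything is COMPLETED. It assumes the report format shown above, with a timestamped stage header followed by one line per job. The function names are hypothetical and this is not part of MAVIS.

```python
import re
from collections import Counter, defaultdict

STAGE_LINE = re.compile(r"^\[[\d\- :]+\] (?P<stage>\w+)$")
STATUS_LINE = re.compile(r"^\s*M[VAPS]_\S+ (?:\([^)]*\) )?is (?P<status>[A-Z ]+)$")


def status_by_stage(report):
    """Count job statuses under each stage header in the schedule output."""
    counts = defaultdict(Counter)
    stage = None
    for line in report.splitlines():
        header = STAGE_LINE.match(line.strip())
        if header:
            stage = header["stage"]
            continue
        job = STATUS_LINE.match(line)
        if job and stage is not None:
            counts[stage][job["status"]] += 1
    return {name: dict(counter) for name, counter in counts.items()}


def pipeline_complete(report):
    """True only if at least one job was seen and every job is COMPLETED."""
    counts = status_by_stage(report)
    return bool(counts) and all(set(c) == {"COMPLETED"} for c in counts.values())


sample = """[2018-06-01 13:15:28] validate
    MV_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-1 (zQJYndSMimaoALwcSSiYwi) is COMPLETED
[2018-06-01 13:15:28] summary
    MS_batch-s4W2Go4tinn49nkhSuusrE (HDJhXgKjRmseahcQ7mgNoD) is COMPLETED
"""
print(status_by_stage(sample))
print(pipeline_complete(sample))
```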
MAVIS (Full) Tutorial¶
The following tutorial is an introduction to running MAVIS. You will need to download the tutorial data. Additionally the instructions pertain to running MAVIS on a SLURM cluster. This tutorial will require more resources than the MAVIS (Mini) Tutorial above.
Getting the Tutorial Data¶
The tutorial data can be downloaded from the link below. Note that it may take a while as the download is ~29GB
tar -xvzf tutorial_data.tar.gz
The expected contents are
| Path | Description |
|  | Information regarding the other files in the directory |
|  | The events that we expect to find, either experimentally validated or ‘spiked’ in |
| L1522785992_normal.sorted.bam | Paired normal library BAM file |
| L1522785992_normal.sorted.bam.bai | BAM index |
| L1522785992_trans.sorted.bam | Tumour transcriptome BAM file |
| L1522785992_trans.sorted.bam.bai | BAM index file |
| L1522785992_tumour.sorted.bam | Tumour genome BAM file |
| L1522785992_tumour.sorted.bam.bai | BAM index file |
| breakdancer-1.4.5/ | Contains the BreakDancer output which was run on the tumour genome BAM file |
| breakseq-2.2/ | Contains the BreakSeq output which was run on the tumour genome BAM file |
| chimerascan-0.4.5/ | Contains the ChimeraScan output which was run on the tumour transcriptome BAM file |
| defuse-0.6.2/ | Contains the deFuse output which was run on the tumour transcriptome BAM file |
| manta-1.0.0/ | Contains the Manta output which was run on the tumour genome and paired normal genome BAM files |
Downloading the Reference Inputs¶
Run the following to download the hg19 reference files and set up the environment variables for configuring MAVIS
source reference_inputs/
Generating the Config File¶
The config command does most of the work of creating the config for you but there are a few things you need to tell it
Where your bams are and what library they belong to
--library L1522785992-normal genome normal False tutorial_data/L1522785992_normal.sorted.bam
--library L1522785992-tumour genome diseased False tutorial_data/L1522785992_tumour.sorted.bam
--library L1522785992-trans transcriptome diseased True tutorial_data/L1522785992_trans.sorted.bam
Where your SV caller output files (events) are
If they are raw tool output as in the current example you will need to use the convert argument to tell MAVIS the file type
--convert breakdancer tutorial_data/breakdancer-1.4.5/*txt breakdancer
--convert breakseq tutorial_data/breakseq-2.2/breakseq.vcf.gz breakseq
--convert chimerascan tutorial_data/chimerascan-0.4.5/chimeras.bedpe chimerascan
--convert defuse tutorial_data/defuse-0.6.2/results.classify.tsv defuse
--convert manta tutorial_data/manta-1.0.0/diploidSV.vcf.gz tutorial_data/manta-1.0.0/somaticSV.vcf manta
For older versions of MAVIS the convert command may require the path to the file(s) be quoted and the strandedness be specified (default is False)
Which events you should validate in which libraries
For this example, because we want to determine which events are germline/somatic we are going to pass all genome calls to both genomes. We can use either full file paths (if the input is already in the standard format) or the alias from a conversion (the first argument given to the convert option)
--assign L1522785992-trans chimerascan defuse
--assign L1522785992-tumour breakdancer breakseq manta
--assign L1522785992-normal breakdancer breakseq manta
Putting this all together with a name for the config file, we have the command to generate the pipeline config. You should expect this step with these inputs to use about 5GB of memory.
mavis config \
--library L1522785992-normal genome normal False tutorial_data/L1522785992_normal.sorted.bam \
--library L1522785992-tumour genome diseased False tutorial_data/L1522785992_tumour.sorted.bam \
--library L1522785992-trans transcriptome diseased True tutorial_data/L1522785992_trans.sorted.bam \
--convert breakdancer tutorial_data/breakdancer-1.4.5/*txt breakdancer \
--convert breakseq tutorial_data/breakseq-2.2/breakseq.vcf.gz breakseq \
--convert chimerascan tutorial_data/chimerascan-0.4.5/chimeras.bedpe chimerascan \
--convert defuse tutorial_data/defuse-0.6.2/results.classify.tsv defuse \
--convert manta tutorial_data/manta-1.0.0/diploidSV.vcf.gz tutorial_data/manta-1.0.0/somaticSV.vcf manta \
--assign L1522785992-trans chimerascan defuse \
--assign L1522785992-tumour breakdancer breakseq manta \
--assign L1522785992-normal breakdancer breakseq manta \
-w mavis.cfg
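With more libraries, typing these arguments by hand gets error-prone. The sketch below builds only the --library and --assign arguments from a small table and checks that every --assign refers to a declared library. It does not run MAVIS. The column labels in the comment (protocol, disease status, stranded) are this example's reading of the positional values shown above, and the helper name is hypothetical.

```python
import shlex

LIBRARIES = [
    # (name, protocol, disease status, stranded, BAM path)
    ("L1522785992-normal", "genome", "normal", False,
     "tutorial_data/L1522785992_normal.sorted.bam"),
    ("L1522785992-tumour", "genome", "diseased", False,
     "tutorial_data/L1522785992_tumour.sorted.bam"),
    ("L1522785992-trans", "transcriptome", "diseased", True,
     "tutorial_data/L1522785992_trans.sorted.bam"),
]

ASSIGNMENTS = {
    "L1522785992-trans": ["chimerascan", "defuse"],
    "L1522785992-tumour": ["breakdancer", "breakseq", "manta"],
    "L1522785992-normal": ["breakdancer", "breakseq", "manta"],
}


def library_and_assign_args(libraries, assignments):
    """Return the --library and --assign arguments as a flat list of strings."""
    args = []
    for name, protocol, disease_status, stranded, bam in libraries:
        args += ["--library", name, protocol, disease_status, str(stranded), bam]
    known = {library[0] for library in libraries}
    for name, inputs in assignments.items():
        if name not in known:
            raise ValueError(f"--assign refers to unknown library: {name}")
        args += ["--assign", name, *inputs]
    return args


print(shlex.join(library_and_assign_args(LIBRARIES, ASSIGNMENTS)))
```

The printed fragment can be pasted between `mavis config` and the remaining --convert and -w arguments from the command above.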
Setting Up the Pipeline¶
The next step is running the setup stage. This will perform conversion and clustering, and create the submission scripts for the other stages.
mavis setup mavis.cfg -o output_dir/
At this stage you should have something that looks like this. For simplicity not all files/directories have been shown.
|-- build.cfg
|-- converted_inputs
| |--
| |--
| |--
| |--
| `--
|-- L1522785992-normal_normal_genome
| |-- annotate
| | |-- batch-aUmErftiY7eEWvENfSeJwc-1/
| | `--
| |-- cluster
| | |--
| | |--
| | |-- clusters.bed
| | |--
| | `-- MAVIS-batch-aUmErftiY7eEWvENfSeJwc.COMPLETE
| `-- validate
| |-- batch-aUmErftiY7eEWvENfSeJwc-1/
| `--
|-- pairing
| `--
`-- summary
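A quick way to confirm that setup finished is to look for the stamp files in each cluster directory. The sketch below assumes the layout shown above, with one directory per library, each containing a cluster subdirectory, and it assumes a file matching `MAVIS-*.COMPLETE` marks a finished cluster step. The `cluster_status` helper and the directory names in the demo are made up for illustration.

```python
import tempfile
from pathlib import Path


def cluster_status(output_dir):
    """Map each library directory to whether its cluster step left a COMPLETE stamp."""
    status = {}
    for cluster_dir in sorted(Path(output_dir).glob("*/cluster")):
        stamps = list(cluster_dir.glob("MAVIS-*.COMPLETE"))
        status[cluster_dir.parent.name] = bool(stamps)
    return status


# Demo on a throwaway directory tree (all names below are made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "libA_normal_genome" / "cluster").mkdir(parents=True)
    (root / "libA_normal_genome" / "cluster" / "MAVIS-batch-example.COMPLETE").touch()
    (root / "libB_diseased_genome" / "cluster").mkdir(parents=True)
    demo_status = cluster_status(root)

print(demo_status)
```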
Submitting Jobs to the Cluster¶
The last step is simple: ssh to the head node of your SLURM cluster (or run locally if you have configured remote_head_ssh) and run the schedule step. This will submit the jobs and create the dependency chain
ssh head_node
mavis schedule -o output_dir --submit
The schedule step also acts as a built-in checker and can be run to check for errors or if the pipeline has completed.
mavis schedule -o output_dir
After your run has completed correctly, this should give output similar to the following (times may vary).
MAVIS: 2.0.0
[2018-06-02 19:47:56] arguments
command = 'schedule'
log = None
log_level = 'INFO'
output = 'output_dir/'
resubmit = False
submit = False
[2018-06-02 19:48:01] validate
MV_L1522785992-normal_batch-aUmErftiY7eEWvENfSeJwc (1701000) is COMPLETED
200 tasks are COMPLETED
run time: 609
MV_L1522785992-tumour_batch-aUmErftiY7eEWvENfSeJwc (1701001) is COMPLETED
200 tasks are COMPLETED
run time: 669
MV_L1522785992-trans_batch-aUmErftiY7eEWvENfSeJwc (1701002) is COMPLETED
23 tasks are COMPLETED
run time: 1307
[2018-06-02 19:48:02] annotate
MA_L1522785992-normal_batch-aUmErftiY7eEWvENfSeJwc (1701003) is COMPLETED
200 tasks are COMPLETED
run time: 622
MA_L1522785992-tumour_batch-aUmErftiY7eEWvENfSeJwc (1701004) is COMPLETED
200 tasks are COMPLETED
run time: 573
MA_L1522785992-trans_batch-aUmErftiY7eEWvENfSeJwc (1701005) is COMPLETED
23 tasks are COMPLETED
run time: 537
[2018-06-02 19:48:07] pairing
MP_batch-aUmErftiY7eEWvENfSeJwc (1701006) is COMPLETED
run time: 466
[2018-06-02 19:48:07] summary
MS_batch-aUmErftiY7eEWvENfSeJwc (1701007) is COMPLETED
run time: 465
parallel run time: 3545
rewriting: output_dir/build.cfg
run time (hh/mm/ss): 0:00:11
run time (s): 11
The parallel run time reported corresponds to the sum of the slowest job for each stage and does not include any queue time etc.
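The description above, a sum over stages of the slowest job in each stage, can be written out as a short calculation. The durations below are hypothetical and are not taken from the report above. The function is a sketch of the stated definition rather than MAVIS's own code.

```python
def parallel_run_time(stage_durations):
    """Sum, over stages, of the longest job duration in each stage (seconds)."""
    return sum(max(durations) for durations in stage_durations.values() if durations)


# Hypothetical per-job run times in seconds, grouped by stage.
example = {
    "validate": [100, 250, 200],
    "annotate": [80, 60],
    "pairing": [40],
    "summary": [20],
}
print(parallel_run_time(example))  # 250 + 80 + 40 + 20 = 390
```

Because jobs within a stage run side by side, only the slowest one in each stage contributes to the total, and time spent waiting in the queue is not counted.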