How to create a workflow

CulebrONT allows you to build a workflow using a simple configfile.yaml configuration file:

  • First, provide the data paths

  • Second, activate the requested tools for assembly and correction.

  • Third, activate the tools for quality checking of assemblies.

  • Finally, set the tool parameters.

To create this file, just run:

create_config

Create configfile.yaml for run

culebrONT create_config [OPTIONS]

Options

-c, --configfile <configfile>

Required Path to create configfile.yaml

Then, edit the relevant sections of the file to customize your flavor of a workflow.

1. Providing data

First, indicate the data path in the configfile.yaml configuration file:

DATA:
    FASTQ: '/path/to/fastq/directory/'
    REF: '/path/to/referencefile.fasta'
    GENOME_SIZE: '1m'
    FAST5: '/path/to/fast5/directory/'
    ILLUMINA: '/path/to/illumina/directory/'
    OUTPUT: '/path/to/output/directory/'
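
The GENOME_SIZE value above is a number followed by a unit letter. As a rough illustration of what such a value stands for, the Python sketch below converts it to a number of bases. The helper name genome_size_to_bp is hypothetical, and the decimal multipliers for k, m and g are an assumption made here, not something taken from the CulebrONT code.

```python
import re

# Assumed decimal multipliers; not taken from the CulebrONT source.
UNIT_TO_BASES = {"k": 10**3, "m": 10**6, "g": 10**9}


def genome_size_to_bp(value: str) -> int:
    """Convert a size string such as '1m', '500k' or '4.6m' to bases."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([kmg])b?", value.strip().lower())
    if match is None:
        raise ValueError(f"Unrecognized genome size: {value!r}")
    number, unit = match.groups()
    return round(float(number) * UNIT_TO_BASES[unit])
```

For instance, genome_size_to_bp('1m') gives 1000000 under these assumptions.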

The table below describes each input needed to run CulebrONT:

Input

Description

FASTQ

Every FASTQ file should contain the whole set of reads to be assembled per individual. Each fastq file will be assembled independently.

REF

Only one reference genome file is used in each CulebrONT run. This reference is used by several quality steps (ASSEMBLYTICS, QUAST and MAUVE).

GENOME_SIZE

The estimated genome size can be given in kilobases (Kb), megabases (Mb) or gigabases (Gb). This size is used by some assemblers (e.g. CANU) and by the QUAST quality step.

FAST5

Medaka needs FAST5 files if a model training step is requested. Please give the path of the FAST5 folder in the FAST5 DATA parameter. Inside this directory, a subdirectory with exactly the same name as the corresponding FASTQ file (the part before .fastq.gz) is required. For instance, if the FASTQ directory contains run1.fastq.gz and run2.fastq.gz, CulebrONT expects the run1/ and run2/ subdirectories in the main FAST5 directory.

ILLUMINA

Path to the directory with Illumina sequence data (in fastq or fastq.gz format), used for Pilon correction and for quality control steps such as MERQURY. Paired-end data are preferred. All fastq files must share the same extension. Please use the run1_R1 and run1_R2 naming scheme.

OUTPUT

Path to the output directory.
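
The FAST5 layout described in the table can be checked before launching a run. The sketch below is only an illustration of that rule: the function names are hypothetical, and the list of extensions is the one given in this documentation. It reports which subdirectories are missing from the FAST5 directory.

```python
from pathlib import Path
from typing import Iterable, List

# FASTQ extensions listed in this documentation.
FASTQ_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def sample_name(fastq_name: str) -> str:
    """Strip the FASTQ extension: 'run1.fastq.gz' -> 'run1'."""
    for ext in FASTQ_EXTENSIONS:
        if fastq_name.endswith(ext):
            return fastq_name[: -len(ext)]
    raise ValueError(f"Not a FASTQ file name: {fastq_name!r}")


def missing_fast5_subdirs(fastq_names: Iterable[str],
                          fast5_subdirs: Iterable[str]) -> List[str]:
    """Return the sample names that have no matching FAST5 subdirectory."""
    present = set(fast5_subdirs)
    return sorted(n for n in map(sample_name, fastq_names) if n not in present)


def check_fast5_layout(fastq_dir: str, fast5_dir: str) -> List[str]:
    """Same check, reading both directories from disk."""
    fastq_names = [p.name for p in Path(fastq_dir).iterdir()
                   if p.name.endswith(FASTQ_EXTENSIONS)]
    subdirs = [p.name for p in Path(fast5_dir).iterdir() if p.is_dir()]
    return missing_fast5_subdirs(fastq_names, subdirs)
```

An empty list means every FASTQ file has its FAST5 subdirectory.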

Warning

For FASTQ, CulebrONT accepts the following naming conventions: NAME.fastq.gz, NAME.fq.gz, NAME.fastq or NAME.fq. Prefer short names and avoid special characters, which can make the report fail. Please do not use the long names produced directly by the sequencing machine.

All fastq files must share the same extension, and can be compressed or not.

The reference fasta file must be uncompressed, with a .fasta or .fa extension.
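
The naming rules in this warning can be turned into a small pre-flight check. The following sketch is not part of CulebrONT. The function name is hypothetical, and treating anything outside letters, digits, "_" and "-" as a special character is an assumption made for the example.

```python
import re
from typing import Iterable, List

ACCEPTED_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")
# Assumption: "special characters" means anything outside this set.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def fastq_name_problems(file_names: Iterable[str]) -> List[str]:
    """Return human-readable problems found in a list of FASTQ file names."""
    problems = []
    seen_extensions = set()
    for file_name in file_names:
        ext = next((e for e in ACCEPTED_EXTENSIONS if file_name.endswith(e)), None)
        if ext is None:
            problems.append(f"{file_name}: extension not accepted")
            continue
        seen_extensions.add(ext)
        if not SAFE_NAME.fullmatch(file_name[: -len(ext)]):
            problems.append(f"{file_name}: name contains special characters")
    # All files must share one extension.
    if len(seen_extensions) > 1:
        problems.append("mixed extensions: " + ", ".join(sorted(seen_extensions)))
    return problems
```

An empty result means the names follow the conventions listed above.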

2. Choose assemblers, polishers and correctors

Activate or deactivate assemblers, polishers and correctors as you wish, using true/false boolean values. Feel free to run assembly only, assembly + polishing, or assembly + polishing + correction.

Note

If you expect your genome to include a circular replicon (e.g. in prokaryotes), it is recommended to activate the CIRCULAR step.

Example:

ASSEMBLY:
    CANU: true
    FLYE: true
    MINIASM: false
    RAVEN: false
    SMARTDENOVO: false
    SHASTA: false

POLISHING:
    RACON: true

CIRCULAR: false

CORRECTION:
    MEDAKA: false
    PILON: true

FIXSTART: false
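
To see at a glance which flavor of workflow a set of switches describes, the example above can be summarized with a few lines of Python. This is a minimal sketch: the dictionary mirrors what a YAML parser would return for the example, the function names are hypothetical, and nothing here reflects how CulebrONT reads the file internally.

```python
# The example above, written as the dictionary a YAML parser would return.
config = {
    "ASSEMBLY": {"CANU": True, "FLYE": True, "MINIASM": False,
                 "RAVEN": False, "SMARTDENOVO": False, "SHASTA": False},
    "POLISHING": {"RACON": True},
    "CIRCULAR": False,
    "CORRECTION": {"MEDAKA": False, "PILON": True},
    "FIXSTART": False,
}


def enabled_tools(section) -> list:
    """Return the tool names set to true in a section such as ASSEMBLY."""
    if not isinstance(section, dict):
        return []
    return [name for name, active in section.items() if active is True]


def summarize(config: dict) -> dict:
    """Collect the activated tools and steps of a configuration."""
    return {
        "assemblers": enabled_tools(config.get("ASSEMBLY")),
        "polishers": enabled_tools(config.get("POLISHING")),
        "correctors": enabled_tools(config.get("CORRECTION")),
        "circular": bool(config.get("CIRCULAR", False)),
        "fixstart": bool(config.get("FIXSTART", False)),
    }
```

With the example values, the summary lists CANU and FLYE as assemblers, RACON as polisher and PILON as corrector.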

3. Choose quality control tools

CulebrONT can use several quality control tools to check assemblies.

  • If BUSCO or QUAST are used, they will run on every fasta assembly generated along the various steps of the pipeline.

  • If BLOBTOOLS, ASSEMBLYTICS, FLAGSTATS, MERQURY are activated, only the fasta assembly generated after the last sequence processing step of the pipeline will be checked.

  • MERQURY can be activated only if Illumina reads are provided. These reads can be compressed or not.

# BUSCO and QUAST are launched on all activated steps (ASSEMBLY, POLISHING, CORRECTION)
QUALITY:
    BUSCO: true
    QUAST: true
#### Other quality tools are launched only on the final assemblies
    BLOBTOOLS: true
    ASSEMBLYTICS: true
#### Other quality tools that require Illumina reads
    FLAGSTATS: true
    MERQURY: true
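
The difference between tools that run on every step and tools that run only on the final assembly can be sketched as follows. This is an illustration of the rule stated above, not CulebrONT code. The function name and the FASTA file names in the usage note are hypothetical.

```python
from typing import Dict, List

# Tools described above as running on every assembly produced by the pipeline.
RUN_ON_EVERY_STEP = {"BUSCO", "QUAST"}


def qc_targets(quality: Dict[str, bool],
               assemblies: List[str]) -> Dict[str, List[str]]:
    """Map each activated QC tool to the assemblies it would check.

    `assemblies` holds the FASTA files of one sample and one assembler,
    in pipeline order (assembly, then polishing, then correction).
    """
    targets = {}
    for tool, active in quality.items():
        if not active:
            continue
        if tool in RUN_ON_EVERY_STEP:
            targets[tool] = list(assemblies)
        else:
            # Every other tool only sees the last assembly.
            targets[tool] = assemblies[-1:]
    return targets
```

For three hypothetical files assembly.fasta, polished.fasta and corrected.fasta, BUSCO would be mapped to all three and BLOBTOOLS to corrected.fasta only.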

If several assemblers are activated, a multiple alignment of the various assemblies for small genomes (<10-20Mbp) can be computed with Mauve.

  • If you want to improve the MAUVE alignment of circular molecules, it is recommended to activate the FIXSTART step.

  • Only activate MAUVE if you have more than one assembler per sample, more than one quality step and small genomes.

MSA:
    MAUVE: false
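
The three conditions for activating MAUVE can be written as a single test. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the 20 Mbp default is simply the upper bound of the "<10-20Mbp" range quoted above.

```python
def mauve_is_worth_activating(n_assemblers: int,
                              n_quality_steps: int,
                              genome_size_bp: int,
                              max_genome_size_bp: int = 20_000_000) -> bool:
    """True when more than one assembler and more than one quality step
    are activated and the genome is small enough."""
    return (n_assemblers > 1
            and n_quality_steps > 1
            and genome_size_bp <= max_genome_size_bp)
```

For example, two assemblers, two quality steps and a 5 Mbp genome satisfy the test, whereas a single assembler does not.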

4. Parameters for some specific tools

You can manage tool parameters in the params section of the configfile.yaml file.

Racon specific options:

  • Racon can be launched recursively from 1 to 9 rounds.

Medaka specific options:

  • If 'MEDAKA_TRAIN_WITH_REF' is activated, Medaka launches the training using the reference found in the 'DATA/REF' path parameter. Medaka will then ignore the other Medaka model parameters and use the resulting trained model instead.

  • If 'MEDAKA_TRAIN_WITH_REF' is deactivated, Medaka does not launch training and uses the model provided in the 'MEDAKA_MODEL_PATH' parameter instead. Give CulebrONT either the path of a Medaka model or just the model name in order to correct assemblies. This parameter cannot be empty. Medaka models are automatically downloaded by medaka.
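
The two Medaka cases above amount to a simple decision. The sketch below illustrates it with plain dictionaries standing for the MEDAKA parameters and the DATA section. The function name and the returned labels are hypothetical and are not part of CulebrONT.

```python
def medaka_model_choice(medaka_params: dict, data: dict) -> tuple:
    """Return ('train_with_reference', ref_path) or ('use_model', model)."""
    if medaka_params.get("MEDAKA_TRAIN_WITH_REF", False):
        reference = str(data.get("REF", "")).strip()
        if not reference:
            raise ValueError("Training needs the DATA/REF parameter")
        # MEDAKA_MODEL_PATH is not taken into account in this case.
        return ("train_with_reference", reference)
    model = str(medaka_params.get("MEDAKA_MODEL_PATH", "")).strip()
    if not model:
        raise ValueError("MEDAKA_MODEL_PATH cannot be empty")
    return ("use_model", model)
```

With training deactivated and MEDAKA_MODEL_PATH set to 'r941_min_high_g303', the function returns ('use_model', 'r941_min_high_g303').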

Pilon specific options:

  • The Java memory parameter is set to 8G in the Apptainer.culebront_tools image. If you need to allocate more memory, change this value in the command sed -i "s/-Xmx1g/-Xmx8g/g" /usr/local/miniconda/miniconda3/envs/pilon/bin/pilon found in the Containers/apptainer.culebront_tools.def recipe file before building the Apptainer image.
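
For readers less familiar with sed, the substitution above rewrites the -Xmx option of the Java command that starts Pilon. The sketch below shows the same idea on a string. The function name and the launcher line used in the usage note are hypothetical, and the real pilon script may look different.

```python
import re


def set_java_heap(launcher_text: str, new_size: str) -> str:
    """Replace every -Xmx<size> option, e.g. '-Xmx1g' -> '-Xmx16g'."""
    if not re.fullmatch(r"\d+[kKmMgG]", new_size):
        raise ValueError(f"Invalid heap size: {new_size!r}")
    return re.sub(r"-Xmx\d+[kKmMgG]", f"-Xmx{new_size}", launcher_text)
```

Applied to a hypothetical line such as java -Xms512m -Xmx1g -jar pilon.jar with new_size='16g', only the -Xmx1g part changes, to -Xmx16g.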

Busco specific options:

  • If BUSCO is activated, you must give CulebrONT either the path of a Busco database or only the database name (see the Busco documentation). This parameter cannot be empty.

Blobtools specific options:

  • The nodes and names files from the NCBI taxdump database can be downloaded here: https://github.com/DRL/blobtools#download-ncbi-taxdump-and-create-nodesdb. Please give CulebrONT the paths to these files.

The standard parameters used in CulebrONT are shown below. Feel free to adapt them to your own requirements.

    #### ASSEMBLY
    MINIMAP2:
        PRESET_OPTION: 'map-ont' # -x minimap2 preset option is map-pb by default (map-pb, map-ont etc)
    FLYE:
        MODE : '--nano-raw'
        OPTIONS: '' ## use --scaffold if flye>=2.9 # you can also use --resume option
    CANU:
        MODE : '-nanopore'
        OPTIONS: 'useGrid=false'
    SMARTDENOVO:
        KMER_SIZE: 16
        OPTIONS: '-J 5000'
    SHASTA:
        MEM_MODE: 'filesystem'
        MEM_BACKING: 'disk'
        OPTIONS: '--Reads.minReadLength 0 --config Nanopore-Oct2021'

    #### CIRCULAR
    CIRCLATOR:
        OPTIONS: ''


    #### POLISHING
    RACON:
        RACON_ROUNDS: 2                 #1 to 9


    #### CORRECTION

    MEDAKA:
        MEDAKA_TRAIN_WITH_REF: false    # if 'MEDAKA_TRAIN_WITH_REF' is true, training uses the reference given in the DATA REF param.

        # Medaka does not take the other parameters below into account if MEDAKA_TRAIN_WITH_REF is true.
        MEDAKA_MODEL_PATH: 'r941_min_high_g303' # use a path if you have downloaded a model (or you want to use your own trained model) OR a simple string like 'r941_min_high_g303'
        MEDAKA_FEATURES_OPTIONS: '--batch_size 10 --chunk_len 100 --chunk_ovlp 10'
        MEDAKA_TRAIN_OPTIONS: '--batch_size 10 --epochs 500 '
        MEDAKA_CONSENSUS_OPTIONS: '--batch 200 '

    PILON:
        PILON_ROUNDS: 2                 #1 to 9
        OPTIONS: ''

    #### QUALITY
    BUSCO:
        #DATABASE: "DATA_DIR/Data-Xoo-sub/bacteria_odb10"
        DATABASE: 'bacteria_odb10 --update-data ' # use a path if you have downloaded a taxonomic database from busco OR a simple string like 'bacteria_odb10'
        MODEL: 'genome'
        SP: ''                         #--augustus-specie parameter on busco

    QUAST:
        GFF: ''
        OPTIONS: '--large'

    DIAMOND:
        DATABASE: 'DATA_DIR/Data-Xoo-sub/testBacteria.dmnd'

    MUMMER:
        MINMATCH: 100                  # MUMMER -l option (default: 20)
        MINCLUSTER: 500                 # MUMMER -c option (default: 65)

    ASSEMBLYTICS:
        UNIQUE_ANCHOR_LEN: 10000
        MIN_VARIANT_SIZE: 50
        MAX_VARIANT_SIZE: 10000

    BLOBTOOLS:
        NAMES: 'DATA_DIR/Data-Xoo-sub/blobtools/names.dmp'
        NODES: 'DATA_DIR/Data-Xoo-sub/blobtools/nodes.dmp'
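
A few of the constraints stated in this section (1 to 9 rounds for Racon and Pilon, a non-empty Busco database when BUSCO is activated) can be checked before a run. The sketch below assumes the parameters and the QUALITY section are available as plain dictionaries. The function name is hypothetical, and CulebrONT may perform its own, different checks.

```python
def check_params(params: dict, quality: dict) -> list:
    """Return a list of error messages; an empty list means no problem found."""
    errors = []
    for tool, key in (("RACON", "RACON_ROUNDS"), ("PILON", "PILON_ROUNDS")):
        if tool not in params:
            continue  # nothing to check if the tool has no parameters
        rounds = params[tool].get(key)
        is_int = isinstance(rounds, int) and not isinstance(rounds, bool)
        if not is_int or not 1 <= rounds <= 9:
            errors.append(f"{tool}/{key} must be an integer from 1 to 9, got {rounds!r}")
    if quality.get("BUSCO"):
        database = str(params.get("BUSCO", {}).get("DATABASE", "")).strip()
        if not database:
            errors.append("BUSCO/DATABASE cannot be empty when BUSCO is activated")
    return errors
```

With the standard values shown above (2 rounds for both tools, a bacteria_odb10 database), the function returns an empty list.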

Warning

Please check the documentation of each tool (outside of CulebrONT) and make sure that the settings are correct!


How to run the workflow

Before attempting to run CulebrONT, please verify that you have already modified the configfile.yaml file as explained in 1. Providing data.

run

Run the snakemake command line with mandatory parameters.
SNAKEMAKE_OTHER: You can also pass additional Snakemake parameters using snakemake syntax.
These parameters take precedence over the Snakemake ones defined in the profile.
Example:
culebrONT run -c configfile.yaml --dry-run --jobs 200
culebrONT run -c configfile.yaml --threads 8 --dry-run
culebrONT run -c configfile.yaml --apptainer-args '--bind /mnt:/mnt'

culebrONT run [OPTIONS] [SNAKEMAKE_OTHER]...

Options

-c, --configfile <configfile>

Required Configuration file for run tool

-pdf, --pdf

Run snakemake with --dag, --rulegraph and --filegraph

Default:

False

Arguments

SNAKEMAKE_OTHER

Optional argument(s)
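
When the run is launched from a script, the command line can be assembled as a list of arguments. The sketch below only builds the command described in this section (the -c option followed by optional Snakemake arguments). The helper name is hypothetical, and the command is printed rather than executed.

```python
import shlex


def build_run_command(configfile: str, *snakemake_other: str) -> list:
    """Build the argument list for 'culebrONT run'."""
    return ["culebrONT", "run", "-c", str(configfile), *snakemake_other]


# Dry run with up to 200 jobs; printed here, not executed.
command = build_run_command("configfile.yaml", "--dry-run", "--jobs", "200")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once the dry run looks correct.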


Advanced run

Providing more resources

If the cluster default resources are not sufficient, you can edit the config.yaml file. See 2. Adapting config.yaml:

edit_profile

Edit the config.yaml used by the profile

culebrONT edit_profile [OPTIONS]

Providing your own tools_path.yaml

To change the tools used in a CulebrONT workflow, see 3. How to configure tools_path.yaml

edit_tools

Edit your own tool versions

culebrONT edit_tools [OPTIONS]

Options

-r, --restore

Restore default tools_config.yaml (from install)

Default:

False


Output on CulebrONT

The CulebrONT output is organized as follows:

OUTPUT_CULEBRONT_CIRCULAR/
├── SAMPLE-1
│   ├── AGGREGATED_QC
│   │   ├── DATA
│   │   ├── LOGS
│   │   ├── MAUVE_ALIGN
│   │   └── QUAST_RESULTS
│   ├── ASSEMBLERS
│   │   ├── CANU
│   │   │   ├── ASSEMBLER
│   │   │   ├── CORRECTION
│   │   │   ├── FIXSTART
│   │   │   ├── POLISHING
│   │   │   └── QUALITY
│   │   ├── FLYE
│   │   │   └── ...
│   │   ├── MINIASM
│   │   │   └── ...
│   │   ├── RAVEN
│   │   │   └── ...
│   │   ├── SHASTA
│   │   │   └── ...
│   │   └── SMARTDENOVO
│   │       └── ...
│   ├── LOGS
│   └── REPORT
├── SAMPLE-2 ...
├── ...
├── ASSEMBLIES
├── config_corrected.yaml
├── REPORT
│   ├── Book
│   └── QMD
├── report.html
└── versions.csv
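
The layout above is easy to explore from a script. The sketch below lists, for each sample, the assemblers and the step directories found under ASSEMBLERS. It assumes that a sample directory is any top-level directory containing an ASSEMBLERS subdirectory. That assumption and the function name are hypothetical and not part of CulebrONT.

```python
from pathlib import Path


def list_assembler_steps(output_dir) -> dict:
    """Map each sample to {assembler: [step directory names]}."""
    result = {}
    for sample in sorted(Path(output_dir).iterdir()):
        assemblers_dir = sample / "ASSEMBLERS"
        # Skip files and shared directories such as ASSEMBLIES or REPORT.
        if not sample.is_dir() or not assemblers_dir.is_dir():
            continue
        result[sample.name] = {
            assembler.name: sorted(step.name for step in assembler.iterdir()
                                   if step.is_dir())
            for assembler in sorted(assemblers_dir.iterdir())
            if assembler.is_dir()
        }
    return result
```

Calling list_assembler_steps on the output directory returns a nested dictionary that can be printed or compared between runs.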

Report

CulebrONT generates an HTML report that includes the versions of the tools used and, for each fastq, a summary of statistics. Please have a look at example … and enjoy!

Important

To view the report created by CulebrONT, transfer the REPORT folder to your local computer and open it in any web browser.