Step-by-step tutorial#
Set up#
Video guide#
Singularity#
module load singularity # Load Singularity
singularity run docker://sylabsio/lolcow # Test with container lolcow
singularity run docker://macadology/humann3 humann --help # Check if humann works
# Generate sif images of relevant docker containers
singularity build $HOME/singularity/humann3.sif docker://macadology/humann3
singularity build $HOME/singularity/kraken.sif docker://staphb/kraken2:2.1.2-no-db
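Once the images are built, a quick sanity check confirms they run; the bind path below is an assumption, so adjust it to wherever your data lives.
# Run tools from the pre-built images; --bind mounts a host folder into the container
singularity exec --bind $HOME/scratch:/data $HOME/singularity/humann3.sif humann --version
singularity exec $HOME/singularity/kraken.sif kraken2 --version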
Nextflow#
nextflow # Check that Nextflow is installed (prints usage information)
git clone https://github.com/macadology/metagenomic_nf.git # Clone the metagenomics pipeline
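To launch the cloned pipeline, something like the sketch below should work; the main.nf entry script is an assumption, so check the repository's README for the exact command.
# Run the pipeline; -resume reuses cached results from previous runs
cd metagenomic_nf
nextflow run main.nf -resume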
aws-cli#
#tbd
Jobs#
# Submit jobs
sbatch <job-script>
# Check jobs status
squeue -u <username>
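A minimal job script might look like the following sketch; the partition, resources, and container path are assumptions to adjust for your pipeline.
#!/bin/bash
#SBATCH --job-name=kraken2-test
#SBATCH --partition=normal
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=23:59:59

module load singularity
singularity exec $HOME/singularity/kraken.sif kraken2 --version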
Data Management#
Each individual is allocated 1.5 TB on $HOME/scratch and 200 GB in $HOME.
The cluster nodes only have access to data on scratch, so all data, including databases, should be kept on scratch. However, files in scratch folders get removed periodically, so data should be moved to the DATA folder once the analysis is complete to avoid data loss.
Note: even if you place a softlink to DATA in the scratch folder, the links break because the nodes aren't able to access DATA.
To check your quota and disk usage, use the following commands:
mmlsquota -u <username> --block-size=auto fs-hpc # Check your quota on the fs-hpc filesystem
du -ah --max-depth=1 . # Show the size of each item in the current directory
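To move completed results off scratch, a transfer like the one below works; the project paths are assumptions. Run it on the login node, since the compute nodes cannot see DATA.
# Copy results to DATA, then remove the scratch copies once the transfer succeeds
rsync -av --remove-source-files $HOME/scratch/myproject/results/ $HOME/DATA/myproject/results/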
Managing intermediate files#
Guide
To work around the space limits of ACRC, keep files in ~/DATA. Let Nextflow copy the input files to ~/scratch, run the pipeline, and delete the input/intermediate files afterwards (copy, run, delete). There are a couple of issues, however.
1) The cluster does not have access to files on ~/DATA (a.k.a. SOD).
* Therefore, the copy must be run locally, while the run and delete steps can be run on the cluster.
* Each process can only be run on a single type of executor, so copy, run, and delete must be run as separate processes.
* Because Nextflow stages files as symbolic links by default, set stageInMode to copy to copy files from SOD to scratch (see the sketch after this list).
2) Since each step is run as a separate process, parallelizing is not straightforward. The pipeline will copy all the data to scratch first and fill up the space, defeating the whole point of this step.
* There is no way to start another pass of the full pipeline (copy, run, remove) only after the current one ends. Instead, set submitRateLimit to limit how often the copy process can run; the interval should be roughly the time it takes to run the full pipeline (see the sketch after this list).
3) Nextflow only cleans up the work folder after the pipeline has completed for all samples. To delete files immediately after an individual sample run is complete, make sure the full pipeline has run for that sample, then manually search for the copied files in the work folder and replace them with empty placeholder files.
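A minimal config sketch for points 1 and 2, assuming hypothetical process names COPY and RUN and an illustrative rate limit:
// nextflow.config (sketch)
process {
    withName: 'COPY' {
        executor = 'local'       // DATA/SOD is only visible locally, so the copy runs on the login node
        stageInMode = 'copy'     // stage inputs as real copies instead of symlinks
    }
    withName: 'RUN' {
        executor = 'slurm'       // the actual analysis runs on the cluster
    }
}
executor {
    $local {
        submitRateLimit = '1/6h' // at most one copy every ~6 h, roughly one full pipeline run
    }
}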
Bugs#
Executor scope#
The following scope in the config file does not work (see https://stackoverflow.com/questions/71210622/problems-with-partitions-slurm-nextflow). To define cpus, memory, queue, and time, specify them in the process scope instead, as shown after the block below.
executor {
name = 'slurm'
queueSize = 3
cpus = 16
memory = 32.GB
queue = 'normal' //normal, express, long
time = '23:59:59' //hh:mm:ss
}
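A working equivalent (a sketch mirroring the values above) moves the resource directives into the process scope and keeps only executor-level options in the executor scope:
process {
    executor = 'slurm'
    cpus = 16
    memory = 32.GB
    queue = 'normal'   //normal, express, long
    time = '23h 59m'   //roughly 24 hours; adjust to the partition's wall-time limit
}

executor {
    queueSize = 3      //max number of jobs submitted at once
}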