Running Nextflow on AWS Batch#

Guide#

General idea#

  1. Create a job queue
  2. Create a compute environment
    1. You can specify a custom AMI or use the default AMI. The primary requirement is that Docker and the AWS CLI are installed in the AMI.
    2. Guide to creating an AMI. Follow the guide to configure Docker, set the storage size, and install Miniconda + the AWS CLI.
  3. Attach the compute environment to the queue (see the CLI sketch after this list).
    1. Each queue should only be tied to one compute environment (CE).
    2. Create a queue+CE pair for each program in the pipeline.
    3. Alternatively, create multiple queue+CE pairs with different compute capacities and link each process to the appropriate pair.
  4. Configure Nextflow (see below). Each process in the pipeline should be tied to a specific queue+CE pair.
  5. Specify files and folders on S3 the same way you would locally.
    1. You can technically mix and match local files and S3 files. All files are staged in the S3 bucket specified with the -bucket-dir option, i.e. the work folder lives on S3.
    2. To keep costs down, the Nextflow work folder should not be on a local drive, so that no files are downloaded locally; data transfer out of S3 is expensive.
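
A minimal sketch of steps 1-3 using the AWS CLI. The compute environment name, AMI ID, subnet, security group, and key pair below are placeholders for illustration; the queue name matches the one used in the configuration further down.

# step 2: create a managed EC2 compute environment (imageId points at the custom AMI)
aws batch create-compute-environment \
    --compute-environment-name jon-bioinfo-ce \
    --type MANAGED \
    --compute-resources '{"type": "EC2", "minvCpus": 0, "maxvCpus": 32,
        "instanceTypes": ["optimal"], "imageId": "ami-0123456789abcdef0",
        "subnets": ["subnet-01234567"], "securityGroupIds": ["sg-01234567"],
        "instanceRole": "ecsInstanceRole", "ec2KeyPair": "jon-keypair"}'

# steps 1 and 3: create a job queue and attach the compute environment to it
aws batch create-job-queue \
    --job-queue-name jon-bioinfo-queue \
    --priority 1 \
    --compute-environment-order order=1,computeEnvironment=jon-bioinfo-ce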

Configuration#

Load the configuration profile with -profile batch, which should load all parameters specific to S3 and AWS Batch:

// inside the profiles { } scope of nextflow.config
batch {
  process.executor = 'awsbatch'
}

// or, configure each process individually:
process {
    withName: HUMANN3 {
        queue = 'jon-bioinfo-queue'
        cpus = 8
        memory = 32.GB
    }
    withName: TEST {
        queue = 'jon-bioinfo-queue'
        cpus = 8
        memory = 32.GB
    }
    executor = 'awsbatch'
}

aws {
    batch {
        // NOTE: this setting is only required if the AWS CLI tool is installed in a custom AMI
        cliPath = '/home/ec2-user/miniconda/bin/aws'
    }
}
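
Putting the snippets together, a minimal nextflow.config could look like the sketch below; the queue name and cliPath come from the snippets above, while the profiles wrapper and the region value are assumptions for illustration.

profiles {
    batch {
        process.executor  = 'awsbatch'
        process.queue     = 'jon-bioinfo-queue'
        aws.region        = 'ap-southeast-1'   // placeholder region
        aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
    }
}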

Use the -bucket-dir option to specify the temporary work directory on S3 for Nextflow's intermediate files:

nextflow run test.nf -profile batch -bucket-dir s3://jon-nextflow-work --procdir=s3://jon-nextflow-work

Notes#

  1. Don't use the container amazon/aws-cli because it has an entrypoint that requires you to run it as a command. While you can technically override the entrypoint with docker run, this option is not available for Nextflow + AWS Batch (link).
  2. Don't use new File("pathtodir") for S3 files and directories; use file("pathtodir") instead (link). This is useful for checking whether files exist (see the sketch after this list).
  3. To SSH into a running EC2 instance, go to the EC2 console, look for running instances, find the IP address, and SSH in as user ec2-user. Make sure the compute environment is associated with an EC2 key pair.
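
A minimal sketch for note 2, showing how file() can be used to check an S3 object from a Nextflow script; the object key below is hypothetical. file() returns a Path that understands the s3:// scheme, whereas new File() only understands the local filesystem.

// works for both local paths and s3:// URIs (object key is a placeholder)
def reads = file('s3://csb5-test-reads/data/processed/sample1/test/reads_1.fq.gz')
if (!reads.exists()) {
    error "Input not found: ${reads}"
}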

Deleting files on AWS S3#

# Check which files match the filter (dry run)
aws --profile gisawsconsole s3 rm s3://csb5-test-reads/data/processed/ --dryrun --recursive --exclude "*" --include "*/test/"

# Remove the files matched by the filter
aws --profile gisawsconsole s3 rm s3://csb5-test-reads/data/processed/ --recursive --exclude "*" --include "*/test/"