Running Nextflow on AWS Batch#
General idea#
- Create a job queue
- Create a compute environment
- You can specify an AMI, or use the default AMI. The primary requirement is that Docker and the AWS CLI are installed in the AMI.
- Guide to creating an AMI. Follow the guide to configure Docker, set the storage size, and install Miniconda + the AWS CLI.
- Attach compute environment to queue.
- Each queue should only be tied to one compute environment (CE).
- Create a queue+CE pair for each program in the pipeline.
- Alternatively, create multiple queue+CE pairs with different compute strengths and link each process to the appropriate queue+CE pair.
- Configure Nextflow (see below). Each process in the pipeline should be tied to a specific queue+CE.
- Specify files and folders on S3 just as you would locally.
- You can technically mix and match local files and S3 files. All files will be staged in the S3 bucket specified with the -bucket-dir option, i.e. the work folder is on S3 (see the sketch after this list).
- For cost purposes, the Nextflow work folder should not be on a local drive, so that no files are downloaded locally. File transfer out of S3 is expensive.
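As a rough illustration of the last two points, a pipeline can take one input from S3 and another from a local disk, and Nextflow stages both through the -bucket-dir work bucket. This is a minimal sketch: the file paths and the COUNT_READS process are made up for the example, not part of the actual pipeline.

```
// main.nf -- minimal sketch; paths and process are hypothetical
params.reads_s3    = 's3://jon-nextflow-work/inputs/sample1.fastq.gz'   // S3 input (hypothetical)
params.reads_local = '/home/jon/data/sample2.fastq.gz'                  // local input (hypothetical)

process COUNT_READS {
    container 'ubuntu:22.04'    // the awsbatch executor requires every process to run in a container

    input:
    path fq

    output:
    stdout

    script:
    """
    zcat ${fq} | wc -l
    """
}

workflow {
    // Mix S3 and local inputs in one channel; Nextflow stages all of them
    // through the bucket given with -bucket-dir before the Batch jobs run.
    reads_ch = Channel.of(params.reads_s3, params.reads_local).map { file(it) }
    COUNT_READS(reads_ch).view()
}
```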
Configuration#
Load the configuration profile with -profile batch, which should load all parameters specific to S3:
batch {
    process.executor = 'awsbatch'
}

or

process {
    withName: HUMANN3 {
        queue = 'jon-bioinfo-queue'
        cpus = 8
        memory = 32.GB
    }
    withName: TEST {
        queue = 'jon-bioinfo-queue'
        cpus = 8
        memory = 32.GB
    }
    executor = 'awsbatch'
}

aws {
    batch {
        // NOTE: this setting is only required if the AWS CLI tool is installed in a custom AMI
        cliPath = '/home/ec2-user/miniconda/bin/aws'
    }
}
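If it helps to see everything in one place, the snippets above can be combined into a single profile in nextflow.config. This is a minimal sketch, not the exact config used here: the queue name, bucket, and cliPath are carried over from above, and the region value is a placeholder.

```
// nextflow.config -- minimal consolidated sketch (assumed names, placeholder region)
profiles {
    batch {
        process.executor  = 'awsbatch'
        process.queue     = 'jon-bioinfo-queue'       // default queue; per-process queues can still be set with withName
        workDir           = 's3://jon-nextflow-work'  // same effect as passing -bucket-dir on the command line
        aws.region        = 'ap-southeast-1'          // placeholder: set to the region of your compute environment
        aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
    }
}
```

With a profile like this, running with -profile batch should not need -bucket-dir, since workDir already points at the S3 bucket.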
Use the -bucket-dir option to specify the temporary work directory on S3 for Nextflow's temp files:
nextflow run test.nf -profile batch -bucket-dir s3://jon-nextflow-work --procdir=s3://jon-nextflow-work
Notes#
- Don't use the container amazon/aws-cli because it has an entrypoint that requires you to run it as a command. While you can technically override the entrypoint with docker run, this option is not available for Nextflow + AWS Batch (link).
- Don't use new File("pathtodir") for S3 files and directories. Use file("pathtodir") instead (link); this is useful for checking whether files exist (see the sketch below).
- To SSH into a running EC2 instance, go to the EC2 console, look for running instances, find the IP address, and SSH into it as user ec2-user. Make sure the compute environment is attached to an EC2 key pair.
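To illustrate the file() point, here is a small sketch with a hypothetical bucket path: file() returns a Path that Nextflow can resolve against S3, while new File() only understands the local filesystem.

```
// Works: file() returns a Path that Nextflow resolves through its S3 support
def ref = file('s3://jon-nextflow-work/reference/genome.fa')   // hypothetical path
if( !ref.exists() )
    error "Reference not found: ${ref}"

// Doesn't work for S3: java.io.File only knows about local paths
// def bad = new File('s3://jon-nextflow-work/reference/genome.fa')
```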
Deleting files on AWS S3#
# Check which files will be removed (dry run)
aws --profile gisawsconsole s3 rm s3://csb5-test-reads/data/processed/ --dryrun --recursive --exclude "*" --include "*/test/"
# Remove the files matching the filter
aws --profile gisawsconsole s3 rm s3://csb5-test-reads/data/processed/ --recursive --exclude "*" --include "*/test/"