ENA submission guide#

Upload to ENA (Jon's method)#

https://ena-docs.readthedocs.io/en/latest/submit/general-guide/webin-cli.html

Overview#

  1. Access ENA account
  2. Register a study
  3. Register samples
  4. Upload reads

Lab ENA ftp Account#

Check with members of the lab for account info. You will need it to submit files to ENA.

Register a study#

  1. Log into ENA Webin Submissions Portal
  2. Select Register Study.

Pre-Register Samples and obtain sample accession#

To note#

  1. This step is necessary to let ENA know how many samples' reads will be uploaded and the metadata associated with those samples.
  2. It essentially creates a container for reads.
  3. No project accession is assigned at this stage. It'll be added in the manifest file in the next step.

Steps#

  1. Log into ENA Webin Submissions Portal.
  2. Download a template spreadsheet to register samples.
    1. Select other checklist -> ENA default sample checklist. You can pick another checklist as well.
  3. In the Excel spreadsheet, fill in the fields tax_id, scientific_name, sample_alias, sample_title and sample_description.
    1. Let sample_alias be sample_id. You can choose to leave it blank and ENA will assign a random ID.
    2. Let sample_title be sample_id (WEB123 etc.)
    3. You can check accepted tax_id and scientific_name using this guide
    4. Some useful scientific_name include metagenome and unidentified bacteria
  4. Save the excel file as tab-separated txt and manually convert the extension to .tsv.
  5. Upload to the same page where you downloaded the template (See figure above).
  6. ENA will generate an accession number for each sample in a separate file that'll be downloaded automatically.
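
The spreadsheet rows described in the steps above could also be generated with a short script. This is a minimal sketch, not part of the original workflow: the sample IDs are hypothetical, the tax_id is left as a placeholder to be checked against ENA's taxonomy, and the template downloaded from the portal may contain additional header lines that this sketch does not reproduce.

```python
import csv
import io

# Column order follows the fields listed in the steps above.
COLUMNS = ["tax_id", "scientific_name", "sample_alias", "sample_title", "sample_description"]


def sample_rows(sample_ids, tax_id, scientific_name, description=""):
    """Build one row per sample, using the sample ID as both alias and title."""
    return [
        {
            "tax_id": tax_id,
            "scientific_name": scientific_name,
            "sample_alias": sid,
            "sample_title": sid,
            "sample_description": description,
        }
        for sid in sample_ids
    ]


def rows_to_tsv(rows):
    """Serialise the rows as tab-separated text with a single header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample IDs; "<tax_id>" is a placeholder, not a real taxonomy ID.
rows = sample_rows(["WEB123", "WEB124"], tax_id="<tax_id>", scientific_name="metagenome")
print(rows_to_tsv(rows))
```

The printed rows can be pasted beneath the header lines of the downloaded template before saving it as .tsv.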

You also have the option of registering samples via programmatic submission.

Prepare upload manifest file and webin command#

We will use webin-cli to upload the read files, which requires as input the manifest file and the command file. See the webin-cli guide linked at the top of this section.

  1. Create an aggregated folder where each subfolder corresponds to a sample and contains all the reads from that sample.
  2. The python script below generates a manifest file and command.sh file in each sample's folder. (e.g. WEB123/ENA/manifest.txt & WEB123/ENA/command.sh)
  3. Find all command.sh and run them to initiate the upload. It'll take a while.
  4. Add the -ascp flag to the webin-cli command to improve upload speed.
#!/usr/bin/env python3
import pandas as pd
import os
import re

# %% To customize
isolatefolder = "/home/jon/GIS/nea/data/processed/ENA_submission"
# sampleAccessionFile = os.path.expanduser(isolatefolder+'/checklist.tsv')
sampleAccessionFile = "/home/jon/GIS/nea/scripts/upload_to_ENA/Webin-accessions-****" # Check if the format of the results is as expected
studyAccession = "PRJEB****"
suffix = 'fastq.gz' # or fq.gz or others
# Also check column name of sampleAccessionFile

# %% Set up
ck = pd.read_csv(sampleAccessionFile, sep="\t")
# Find fastq files in folder
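def find_file(folder, suffix=suffix):
    """Return the names of the forward and reverse read files in folder (None if missing)."""
    file1 = file2 = None
    # Escape the suffix so that its dots are matched literally
    pattern1 = re.compile(r".*1\.{}$".format(re.escape(suffix)))
    pattern2 = re.compile(r".*2\.{}$".format(re.escape(suffix)))
    for root, dirs, files in os.walk(folder):
        for file in files:
            if pattern1.match(file):
                file1 = file
            elif pattern2.match(file):
                file2 = file
        # Only the top level of the sample folder is searched
        break
    return file1, file2

# %% Generate command.sh and manifest.txt
for ind, i in ck.iterrows():
    folder = isolatefolder + f"/{i['ALIAS']}"
    expanded_folder = os.path.expanduser(folder)
    if not os.path.exists(expanded_folder):
        continue
    # Use the suffix configured above instead of a hard-coded one
    file1, file2 = find_file(expanded_folder, suffix=suffix)
    if file1 is None or file2 is None:
        print(f"Skipping {i['ALIAS']}: paired {suffix} files not found")
        continue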
    manifest_file = f"{expanded_folder}/ENA/manifest.txt"
    manifest = (
        f"STUDY {studyAccession}\n"
        f"SAMPLE {i['ACCESSION']}\n"
        f"NAME {i['ALIAS']}\n"
        "INSTRUMENT Illumina HiSeq X\n"
        "LIBRARY_SOURCE METAGENOMIC\n"
        "LIBRARY_SELECTION RANDOM\n"
        "LIBRARY_STRATEGY WGS\n"
        f"FASTQ {file1}\n"
        f"FASTQ {file2}"
    )
    command_file = f"{expanded_folder}/ENA/command.sh"
    command = f"java -jar ~/GIS/code/webin-cli-7.0.1.jar -ascp -userName Webin-***** -password ***** -context reads -manifest {folder}/ENA/manifest.txt -inputDir {folder} -outputDir {folder}/ENA -submit"
    # Create the ENA subfolder if it does not exist yet, then write both files
    os.makedirs(f"{expanded_folder}/ENA", exist_ok=True)
    with open(manifest_file, "w") as f:
        f.write(manifest)
    with open(command_file, "w") as f:
        f.write(command)
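
Step 3 above (finding every command.sh and running it) could be sketched as follows. This is an assumption-laden illustration rather than the original method: the function names are hypothetical, bash is assumed to be available, and the scripts are run one at a time. The dry-run default only prints what would be executed.

```python
import subprocess
from pathlib import Path


def find_command_files(root):
    """Return every command.sh below root, sorted for a reproducible order."""
    return sorted(Path(root).rglob("command.sh"))


def run_commands(root, dry_run=True):
    """Run each command.sh with bash, one at a time.

    Returns a list of (path, exit code) pairs; the exit code is None in dry-run mode.
    """
    results = []
    for script in find_command_files(root):
        if dry_run:
            print(f"would run: bash {script}")
            results.append((script, None))
            continue
        completed = subprocess.run(["bash", str(script)])
        results.append((script, completed.returncode))
    return results


# Example (path taken from the script above):
# run_commands("/home/jon/GIS/nea/data/processed/ENA_submission", dry_run=True)
```

Checking the returned exit codes afterwards gives a quick way to see which samples need to be rerun.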

Download with aspera (Download ENA)#

  1. Set up aspera
sudo apt-get install ruby-dev
sudo gem install aspera-cli
  2. Find the folder and files
  3. Download using the following command
ascp -QT -l 300m -P 33001 -i /home/jon/.aspera/ascli/sdk/aspera_bypass_dsa.pem era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/ERR164/ERR164407/ERR164407.fastq.gz .
  4. To download multiple files, download the file report to obtain the ftp address for the files of interest.
  5. Write a simple script to parse the file
    1. Make sure that the field containing the SAM id is not in the last column, otherwise you will have to strip the trailing carriage return (shown as ^M$).
    2. Check which fields hold the SAM id and the ftp address. The script below reads them from fields 12 and 8 respectively; if your file differs, change the field numbers accordingly.
for i in $(cat downloaded_ENA_file.csv); do
    prefix=$(echo "$i" | cut -d , -f12)    # field holding the SAM id
    echo "$prefix"
    mkdir -p "$prefix"
    file=$(echo "$i" | cut -d , -f8)       # ';'-separated ftp addresses
    file1=$(echo "$file" | cut -d ';' -f1)
    file2=$(echo "$file" | cut -d ';' -f2)
    # ${file1#*/} removes everything up to the first '/', i.e. the host name
    ascp -QT -l 300m -P 33001 -i /home/jon/.aspera/ascli/sdk/aspera_bypass_dsa.pem "era-fasp@fasp.sra.ebi.ac.uk:${file1#*/}" "./$prefix"
    ascp -QT -l 300m -P 33001 -i /home/jon/.aspera/ascli/sdk/aspera_bypass_dsa.pem "era-fasp@fasp.sra.ebi.ac.uk:${file2#*/}" "./$prefix"
done
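
The address conversion done by the loop above can be illustrated in Python. This is a sketch under assumptions: the function name is hypothetical, and the example input is assumed to look like the ftp addresses in the file report (host name followed by the vol1/... path used in the single-file command above).

```python
FASP_USER_HOST = "era-fasp@fasp.sra.ebi.ac.uk"  # taken from the ascp command above


def fasp_sources(ftp_field):
    """Convert a ';'-separated list of ftp addresses into ascp source arguments."""
    sources = []
    # strip() also removes a trailing carriage return (the ^M mentioned above)
    for address in ftp_field.strip().split(";"):
        if not address:
            continue
        # Drop everything up to the first '/', i.e. the host name,
        # which is what ${file1#*/} does in the shell loop.
        _, _, path = address.partition("/")
        sources.append(f"{FASP_USER_HOST}:{path}")
    return sources


# Assumed example value for the ftp address field of a paired-end run
example = "ftp.sra.ebi.ac.uk/vol1/fastq/ERR164/ERR164407/ERR164407_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/ERR164/ERR164407/ERR164407_2.fastq.gz\r"
for source in fasp_sources(example):
    print(source)
```

Each printed value can be used as the source argument of ascp in place of the hand-built string.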

Download to S3 directly#

TBD

Ming Hao's Method#

https://ena-docs.readthedocs.io/en/latest/submit/general-guide/metadata.html (Courtesy of Chia Minghao)

Submissions made through Webin are represented using a number of different metadata objects. Before submitting data to ENA, it is important to familiarise yourself with the ENA metadata model and what parts of your research project can be represented by which metadata objects. This will determine what you need to submit.

For example, a publication is typically associated with a study (project), sequenced source material is represented using samples, and sequencing experiment details are captured by the experiment object. Note that data files are also submitted by associating them with metadata objects. Sequence read data is associated with run objects while other data files are associated with analysis objects. The full metadata model with relationships between the different types of objects is illustrated below.

Metadata Model#

  • Study: A study (project) groups together data submitted to the archive and controls its release date. A study accession is typically used when citing data submitted to ENA. Note that all associated data and other objects are made public when the study is released.
  • Sample: A sample contains information about the sequenced source material. Samples are associated with checklists, which define the fields used to annotate the samples. Samples are always associated with a taxon.
  • Experiment: An experiment contains information about a sequencing experiment including library and instrument details.
  • Run: A run is part of an experiment and refers to data files containing sequence reads.
  • Analysis: An analysis contains secondary analysis results derived from sequence reads (e.g. a genome assembly).
  • Submission: A submission contains submission actions to be performed by the archive. A submission can add more objects to the archive, update already submitted objects or make objects publicly available.
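
The relationships between these objects can be pictured with a few Python dataclasses. This is only an illustrative sketch, not ENA's actual schema: the class and field names are hypothetical, and it assumes that an experiment points to one study and one sample, and that a run belongs to one experiment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Study:
    alias: str


@dataclass
class Sample:
    alias: str
    scientific_name: str


@dataclass
class Experiment:
    study: Study        # links the sample to a study
    sample: Sample
    instrument: str


@dataclass
class Run:
    experiment: Experiment
    files: List[str] = field(default_factory=list)


def samples_in_study(runs, study):
    """Collect the aliases of samples linked to a study through its experiments."""
    return sorted({r.experiment.sample.alias for r in runs if r.experiment.study is study})


# Hypothetical objects for illustration
study = Study("my_study")
sample = Sample("WEB123", "metagenome")
experiment = Experiment(study, sample, "Illumina HiSeq X")
run = Run(experiment, ["WEB123_1.fastq.gz", "WEB123_2.fastq.gz"])
print(samples_in_study([run], study))
```

The sketch makes one point visible: a sample is only tied to a study once an experiment references both.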

Submission via ascp#

https://ena-docs.readthedocs.io/en/latest/submit/fileprep/upload.html#using-aspera-ascp-command-line-program

Steps to follow (log in using lab credentials):

  1. Register the study in ENA: https://www.ebi.ac.uk/ena/submit/sra/#home
  2. Go back to new submission -> register samples. Select an appropriate checklist like the minimum template (ERC000011)
  3. Fill in the template, save as a tsv file and upload completed tsv files into ENA. At this stage, sample registration should be complete. Note that just by registering your samples these will not be affiliated with a study or any data. The association of samples with a study happens in subsequent steps when you submit sequence data and point to your sample(s) from experiment object(s).
  4. Upload fastq files and md5 files via ascp and the expect batch script (See next steps)
  5. Create a text file containing the filename or full path of every fastq file and md5 file that you wish to upload. There are multiple ways of doing this. For example:
for k in *gz; do echo "$PWD/$k"; done > paths.txt
# OR
for i in $(cat ../unique_library_ids.txt); do ls -1 "$i"*fastp* >> ../paths.txt; done
# OR
ls ./*.gz >> ../paths.txt
  6. To use the expect batch script, first create a screen on ionode so that the uploads will not fail due to a bad connection.
  7. Within the screen, change working directory to the folder containing the files to be uploaded (to be safe) and upload files via ascp and the expect batch script, using this command:
expect path/to/ena_submit.exp paths.txt <Webin-ID> <Webin-Password>

Contents of ena_submit.exp:
#!/usr/bin/expect
#ena_submit.exp

# Arguments: file of file names, Webin ID, Webin password
set fofn [lindex $argv 0]
set dropbox [lindex $argv 1]
set pass [lindex $argv 2]
set files [open $fofn]
set subs [read $files]
close $files
# Never time out while waiting for a large upload to finish
set timeout -1

foreach line [split $subs \n] {
  if { "" != $line } {
    # Upload one file to the Webin upload area
    spawn ascp -QT -l200M $line $dropbox@webin.ebi.ac.uk:.
    expect "Password:"
    send "$pass\r"
    expect eof
    # Reap the finished ascp process so spawned processes do not accumulate
    wait
    sleep 5
  }
}
  8. The uploads should progress automatically.
  9. Under new submission, click submit sequence reads and experiments to finalize the submission with metadata and link the uploads to that metadata. Note: Novogene uses the HiSeqX10 platform. Double-check the insert size for the relevant project (e.g. 300 bp). Beware: uploading is not the same as submitting. Files that have only been uploaded are not yet submitted to ENA.
  10. ENA will send a warning email if files are corrupt. Reupload corrupted files and correct their md5 values; see: https://ena-docs.readthedocs.io/en/latest/faq/runs.html#appendix-correcting-an-md5-value
  11. To check whether files have been uploaded, run ftp webin.ebi.ac.uk on Aquila, then enter the username and password (Lab ENA ftp account). Once logged in, you can ls the files that are present.
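
The file list from step 5 could also be produced in Python. This is a minimal sketch with hypothetical function names; it assumes the fastq files end in .gz and the md5 files end in .md5, and that all of them sit in one folder.

```python
from pathlib import Path


def collect_upload_paths(folder, patterns=("*.gz", "*.md5")):
    """Return the absolute paths of all files in folder that match the patterns."""
    folder = Path(folder).resolve()
    paths = []
    for pattern in patterns:
        paths.extend(sorted(folder.glob(pattern)))
    return [str(p) for p in paths]


def write_paths_file(folder, out_file="paths.txt"):
    """Write one absolute path per line, the format read by the expect script."""
    paths = collect_upload_paths(folder)
    Path(out_file).write_text("".join(p + "\n" for p in paths))
    return paths


# Example: write_paths_file(".", "paths.txt")
```

Writing absolute paths means the expect script can be started from any working directory.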

Side note on MD5 files#

As described above, you can register your file’s MD5 value by outputting it to a second file and uploading this along with the data file. Alternatively, you can make a note of the value and enter it when prompted during the submission process.

If you make and upload your own .md5 file, be sure it contains only the 32-character MD5 value for a single file, and that its name matches the name of that file. (Use md5sum on Aquila.)
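
As an alternative to md5sum, a .md5 file of this kind could be written from Python. This is a sketch under assumptions: the function names are hypothetical, and the .md5 file is assumed to be named after the data file with ".md5" appended.

```python
import hashlib
from pathlib import Path


def md5_hex(path, chunk_size=1024 * 1024):
    """Compute the MD5 value of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_file(path):
    """Write <file name>.md5 next to the file, containing only the 32-character value."""
    path = Path(path)
    md5_path = path.with_name(path.name + ".md5")
    md5_path.write_text(md5_hex(path))
    return md5_path


# Example (file name taken from the ascp example in this section):
# write_md5_file("MHH1060_fastpdecont_1.fastq.gz")
```

Unlike the default md5sum output, the file written here holds the value alone, without the file name after it.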

An example of an ascp command to upload a fastq file would look like this, but it would be tedious to type this for 100s of files:

ascp -QT -l200M -L- MHH1060_fastpdecont_1.fastq.gz Webin-42697@webin.ebi.ac.uk:.

Misc ENA rules after uploading#

The fastq files will be removed from the ftp upload area automatically once the ENA submission is complete.

The data upload areas are provided as a temporary place in which data are held while in transit. As such, they are neither intended nor suitable for any longer-term storage of data. Such storage is provided in ENA itself. Once in ENA, data can be released immediately following submission or can be held confidential prior to analysis and literature publication if required. We expect any given data file to remain in a data upload area for no longer than 2 months before the instruction is given by the user to submit the file. While we attempt to remind users of this policy at the 2 months time point we reserve the right to routinely delete any data files that persist in them for more than 2 months.

We place no absolute limit within the 2-month period on the total volume of user data that may exist in a data upload area at any one time and are keen to accommodate the largest submissions where possible.

Misc notes on using expect to automate uploads (a bit messy)#

  • https://bergmanlab.uga.edu/blog-2/
  • https://stackoverflow.com/questions/19774016/terminating-spawn-sessions-in-expect (for future consideration)
  • https://stackoverflow.com/questions/7619731/too-many-programs-spawned-with-expect

exit [-opts] [status] causes Expect to exit or otherwise prepare to do so. The -onexit flag causes the next argument to be used as an exit handler. Without an argument, the current exit handler is returned. The -noexit flag causes Expect to prepare to exit but stop short of actually returning control to the operating system. The user-defined exit handler is run as well as Expect's own internal handlers. No further Expect commands should be executed. This is useful if you are running Expect with other Tcl extensions. The current interpreter (and main window if in the Tk environment) remain so that other Tcl extensions can clean up. If Expect's exit is called again (however this might occur), the handlers are not rerun.

Upon exiting, all connections to spawned processes are closed. Closure will be detected as an EOF by spawned processes. exit takes no other actions beyond what the normal _exit(2) procedure does. Thus, spawned processes that do not check for EOF may continue to run. (A variety of conditions are important to determining, for example, what signals a spawned process will be sent, but these are system-dependent, typically documented under exit(3).) Spawned processes that continue to run will be inherited by init.

status (or 0 if not specified) is returned as the exit status of Expect. exit is implicitly executed if the end of the script is reached.

wait [args] delays until a spawned process (or the current process if none is named) terminates.

wait normally returns a list of four integers. The first integer is the pid of the process that was waited upon. The second integer is the corresponding spawn id. The third integer is -1 if an operating system error occurred, or 0 otherwise. If the third integer was 0, the fourth integer is the status returned by the spawned process. If the third integer was -1, the fourth integer is the value of errno set by the operating system. The global variable errorCode is also set.

Additional elements may appear at the end of the return value from wait. An optional fifth element identifies a class of information. Currently, the only possible value for this element is CHILDKILLED in which case the next two values are the C-style signal name and a short textual description.

https://www.programmersought.com/article/4877445834/ The wait command is indispensable for expect scripts. It is responsible for collecting the “corpse” of child processes, to avoid accumulating zombie processes that jam up the process list.

ftp webin.ebi.ac.uk #then put username and pw

#create file of filenames
j=`pwd`
for k in `ls *gz`; do echo $j/$k; done > fofn

# OR

for i in `cat ../temp.txt`; do ls -1 "$i"*fastp* >> ../fofn; done

#Perform ASCP submission, note: replace with your ENA (Webin ID and Password). Working directory is the decont folder.
expect ena_submit.exp fofn Webin-ID Webin-Password


(base) [chiamh@n112 raw]$ ps -aux | less   #to check processes which are running
(base) [chiamh@n112 raw]$ ps -aux | grep chiamh | less