NOT-OD-20-108: Request for Information: Use of Cloud Resources and New File Formats for Sequence Read Archive Data

Request for Information: Use of Cloud Resources and New File Formats for Sequence Read Archive Data

Notice Number:

NOT-OD-20-108

Key Dates

Release Date:

May 19, 2020

Response Date:

July 17, 2020

Related Announcements

None

Issued by

Office of The Director, National Institutes of Health (OD)

Purpose

Background

The Sequence Read Archive (SRA), one of NIH's largest and most diverse datasets, is a broad collection of experimental DNA and RNA sequences that represent genome diversity across the tree of life. The archive currently contains more than 36 petabytes (PB) of data and is continually growing. Hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), the SRA was copied to Google Cloud Platform (GCP) and Amazon Web Services (AWS) cloud services in 2019 as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative. Currently, the SRA data continue to be accessible from NCBI on-premises (on-prem) storage as well.

In 2019 the SRA held nine million records in two formats, original and normalized. The original format is received by NCBI from submitters and is instrument- and experiment-specific; these data were historically stored to tape but are now stored in the cloud. NCBI transforms these original format data into a standard, SRA normalized format for redistribution; this format provides an increasingly efficient presentation of data through compression and task-optimized data formats. The normalized format contains base quality scores (BQS) that provide information about the quality of the sequence. Because numerous BQS values are possible for each base, the inclusion of BQS drastically increases file size, thus making BQS the largest cost driver for SRA storage in the cloud.

The SRA continues to experience exponential growth in submission rates, and the normalized format data are projected to grow to 43 PB by 2023. NIH remains committed to storing, preserving, and making SRA data available to the community. In considering potential strategies for reducing the quickly rising cost of SRA data storage, the size and value of BQS must be considered, as well as the availability of SRA data in "hot" or "cold" storage in the cloud. Data in hot storage are immediately accessible; therefore, hot storage provides the greatest value for data that are accessed most frequently. It is the current default for new submissions. Depending on the commercial cloud platform, data in cold storage may not be immediately accessible but are stored at a reduced cost. Cloud platforms apply charges to thaw (move) data from cold to hot storage or to access the cold storage data directly. More information about end-user costs for accessing SRA data from hot or cold storage is available.

NIH, NLM, NCBI, and the SRA Data Working Group of the NIH Council of Councils have evaluated SRA growth, cost models for storage over time, and data usage patterns to propose solutions for maintaining SRA data in one or more formats that support efficient retrieval and analysis and maximize research impact. The SRA Data Working Group presented a set of interim recommendations for storing SRA data. These recommendations include a new, hybrid model for SRA data storage and retrieval in the cloud:

BQS should be retained in original format data, and two versions of SRA normalized format data should be maintained: one with BQS and one without them.
In both AWS and GCP clouds, hot storage should contain the full set of normalized data without BQS, as well as the most actively accessed half of normalized data with BQS. Cold storage should contain the less active half of normalized data with BQS, as well as all original format data.
NCBI should determine the appropriate storage location (hot or cold storage) for each dataset deposited into SRA to be provisioned in the cloud by monitoring data usage.
NCBI should provide limits on the amount of data users can request to be thawed without approval to prevent accidental, massive overuse of NIH compute resources (see Table 2 in the interim report for estimated average costs for NIH and users for some typical workflows).
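
The placement rule in the recommendations above can be sketched in a few lines of Python. This is an illustration only, not NCBI's implementation: the function name, the accession strings, and the access counts are hypothetical, and the sketch assumes that "the most actively accessed half" means half of the records ranked by access count rather than half of the stored bytes, a point the recommendations do not specify.

```python
from typing import Dict


def assign_storage_tiers(access_counts: Dict[str, int]) -> Dict[str, Dict[str, str]]:
    """Map each dataset to a storage tier for each of its three copies.

    Assumption: "most active half" is taken by record count, rounding up,
    with ties broken by accession name so the result is deterministic.
    """
    ranked = sorted(access_counts, key=lambda acc: (-access_counts[acc], acc))
    hot_cutoff = (len(ranked) + 1) // 2
    hot_with_bqs = set(ranked[:hot_cutoff])

    plan = {}
    for acc in ranked:
        plan[acc] = {
            # All normalized data without BQS stay in hot storage.
            "normalized_without_bqs": "hot",
            # Normalized data with BQS are hot only for the more active half.
            "normalized_with_bqs": "hot" if acc in hot_with_bqs else "cold",
            # All original format data go to cold storage.
            "original": "cold",
        }
    return plan


# Hypothetical accessions and access counts, for illustration only.
example_plan = assign_storage_tiers(
    {"SRR0000001": 120, "SRR0000002": 3, "SRR0000003": 45, "SRR0000004": 0}
)
```

In practice the ranking would be recomputed as usage monitoring data accumulate, so a dataset's copy with BQS could move between tiers over time.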

The recommended hybrid storage model for SRA in the cloud emphasizes the role of SRA normalized format data without BQS but leaves open the question of how NIH will design and implement these changes. NCBI is developing a collection of approaches that will be incrementally tested, validated with the community, and applied to the corpus of SRA over time. The initial approach simply removes BQS from the SRA normalized format data, creating a format referred to as ETL-BQS. A second approach reduces size further by converting high-coverage reads into sets of compressed multiple sequence alignments and their respective consensus sequences. A detailed description of these data formats is available.
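
To make the idea of removing BQS concrete, the following Python sketch drops the quality line from four-line FASTQ records and keeps only the read name and bases. FASTQ is used here purely as a familiar stand-in: the SRA normalized and ETL-BQS formats are not FASTQ, this notice does not describe NCBI's transformation, and the function name is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def strip_quality_scores(fastq_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_name, bases) pairs from 4-line FASTQ records.

    The per-base quality string is validated for length and then discarded,
    which is the size saving that a BQS-free format aims for.
    """
    lines = [line.rstrip("\n") for line in fastq_lines if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per record")

    for i in range(0, len(lines), 4):
        header, bases, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record starting at line {i + 1}")
        if len(bases) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header}")
        yield header[1:], bases


# Hypothetical two-read input, for illustration only.
reads = list(strip_quality_scores([
    "@read1\n", "ACGT\n", "+\n", "IIII\n",
    "@read2\n", "GGCA\n", "+\n", "##II\n",
]))
```

Because the quality string carries one character per base, discarding it roughly halves the uncompressed payload of each record in this toy representation; the savings in the actual SRA formats depend on their compression and are not estimated here.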

In the future NCBI plans to provide only ETL-BQS format data from its on-prem data storage. Users who want BQS data, in either the normalized format or the original format, will need to access those data from cloud storage.

Information Requested

NIH is requesting input on the use of SRA data to understand how best to manage this resource to facilitate its use in research while controlling costs as it grows in size. NIH would like to better understand how the research community currently uses SRA data, how researchers are using or anticipate using cloud computing with SRA data, and which formats of SRA data are most valuable to the research community.

The NIH seeks comments on any or all of the following topics:

How researchers are currently engaging with SRA, considering:
1. Pipelines and tools researchers are using with SRA data.
2. Formats of SRA data required for current analyses, particularly whether and how BQS are used.
The potential usability and usefulness of SRA normalized data format without BQS (a detailed description of these data formats is available).
Possible use cases for new formats with SRA read data stored in alignments without BQS.
Specific value to the user of having original format (as submitted) SRA data available for research.
Whether SRA data users are currently using or planning to use SRA data in the cloud and the factors influencing that decision, such as tools, components, or accessories that would facilitate using the data in the cloud.
How the proposed hybrid model for SRA data storage and retrieval in the cloud would impact current or future research workflows.
Any other topics that NIH might consider to maximize the use and value of SRA in the cloud.

Submitting a Response

Comments should be submitted electronically at: https://datascience.nih.gov/sra-rfi-submission

This RFI is for planning purposes only and should not be construed as a policy, solicitation for applications, or as an obligation on the part of the government to provide support for any ideas identified in response to it. Please note that the government will not pay for the preparation of any information submitted or for its use of that information.

Responses may be compiled and shared publicly in an unedited version after the close of the comment period. Please do not include any proprietary, classified, confidential, or sensitive information in your response. The government reserves the right to use any non-proprietary technical information in summaries of the state of the science and any resultant solicitation(s). The NIH may use information gathered by this RFI to inform development of future guidance and policy directions.

We look forward to your input and hope you will share this RFI with your colleagues.

Inquiries

Please direct all inquiries to:

Christopher O'Sullivan, Ph.D.
National Center for Biotechnology Information,
National Library of Medicine

Jessica Mazerik, Ph.D.
Office of Data Science Strategy
Division of Program Coordination, Planning, and Strategic Initiatives
Office of the Director
[email protected]