Office of the Director, National Institutes of Health (OD)
Background
The Sequence Read Archive (SRA), one of NIH's largest and most diverse datasets, is a broad collection of experimental DNA and RNA sequences that represent genome diversity across the tree of life. The archive currently contains more than 36 petabytes (PB) of data and is continually growing. Hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), the SRA was copied to Google Cloud Platform (GCP) and Amazon Web Services (AWS) cloud services in 2019 as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative. Currently, the SRA data continue to be accessible from NCBI on-premises (on-prem) storage as well.
In 2019 the SRA held nine million records in two formats, original and normalized. The original format is received by NCBI from submitters and is instrument- and experiment-specific; these data were historically stored on tape but are now stored in the cloud. NCBI transforms these original format data into a standard, SRA normalized format for redistribution; this format provides an increasingly efficient presentation of data through compression and task-optimized data formats. The normalized format contains base quality scores (BQS) that provide information about the quality of the sequence. Because numerous BQS values are possible for each base, the inclusion of BQS drastically increases file size, thus making BQS the largest cost driver for SRA storage in the cloud.
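The effect of alphabet size on storage can be illustrated with a rough information-theoretic estimate. The sketch below is illustrative only: it assumes four possible bases and 41 possible quality values (Phred scores 0 to 40, a common range but not one stated in this notice), and it treats every symbol as equally likely, which real data are not. It says nothing about how the SRA normalized format actually encodes either stream.

```python
import math


def bits_per_symbol(alphabet_size: int) -> float:
    """Bits needed per symbol if every symbol in the alphabet is equally likely."""
    return math.log2(alphabet_size)


BASES = 4             # A, C, G, T
QUALITY_LEVELS = 41   # assumption: Phred scores 0..40, not taken from this notice

base_bits = bits_per_symbol(BASES)
quality_bits = bits_per_symbol(QUALITY_LEVELS)

# How many times more bits one quality score needs than one base call,
# under the uniform-distribution assumption above.
ratio = quality_bits / base_bits

print(f"bits per base:          {base_bits:.2f}")
print(f"bits per quality score: {quality_bits:.2f}")
print(f"quality / base ratio:   {ratio:.2f}")
```

Under these assumptions each quality score needs more than twice as many bits as the base it annotates, which is consistent with BQS dominating file size.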
The SRA continues to experience exponential growth in submission rates, and the normalized format data are projected to grow to 43 PB by 2023. NIH remains committed to storing, preserving, and making SRA available to the community. In considering potential strategies for reducing the quickly rising cost of SRA data storage, the size and value of BQS must be considered, as well as the availability of SRA data in "hot" or "cold" storage in the cloud. Data in hot storage are immediately accessible; therefore, hot storage provides the greatest value for data that are accessed most frequently. Hot storage is the current default for new submissions. Depending on the commercial cloud platform, data in cold storage may not be immediately accessible but are stored at a reduced cost. Cloud platforms apply charges to thaw (move) data from cold to hot storage or to access the cold storage data directly. More information about end user costs for accessing SRA data from hot or cold storage is available.
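The trade-off between hot and cold storage described above can be sketched as a simple monthly cost model. Everything in the example is hypothetical: the `Tier` class, the function names, and especially the prices are placeholders chosen for easy arithmetic, not actual GCP or AWS rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    storage_per_tb_month: float  # hypothetical USD per TB stored per month
    retrieval_per_tb: float      # hypothetical USD per TB thawed or read


# Placeholder prices for illustration only; not real cloud rates.
HOT = Tier("hot", storage_per_tb_month=20.0, retrieval_per_tb=0.0)
COLD = Tier("cold", storage_per_tb_month=4.0, retrieval_per_tb=30.0)


def monthly_cost(tier: Tier, stored_tb: float, accessed_tb: float) -> float:
    """Storage charge plus retrieval charge for one month."""
    return stored_tb * tier.storage_per_tb_month + accessed_tb * tier.retrieval_per_tb


def cheaper_tier(stored_tb: float, accessed_tb: float) -> str:
    """Name of the tier with the lower monthly cost (hot wins ties)."""
    hot = monthly_cost(HOT, stored_tb, accessed_tb)
    cold = monthly_cost(COLD, stored_tb, accessed_tb)
    return HOT.name if hot <= cold else COLD.name


# Rarely accessed data favor cold storage; heavily accessed data favor hot.
print(cheaper_tier(stored_tb=100, accessed_tb=10))
print(cheaper_tier(stored_tb=100, accessed_tb=100))
```

With these placeholder prices, 100 TB of which 10 TB is read each month is cheaper in cold storage, while the same 100 TB read in full each month is cheaper in hot storage. The break-even point depends entirely on the actual rates and access patterns.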
NIH, NLM and the NCBI, and the SRA Data Working Group of the NIH Council of Councils have evaluated SRA growth, cost models for storage over time, and data usage patterns to propose solutions for maintaining SRA data in a format or formats that support(s) efficient retrieval and analysis and maximize(s) research impact. The SRA Data Working Group presented a set of interim recommendations for storing SRA data. These recommendations include a new, hybrid model for SRA data storage and retrieval in the cloud:
The recommended hybrid storage model for SRA in the cloud emphasizes the role of SRA normalized format data without BQS but leaves open the question of how NIH will design and implement these changes. NCBI is developing a collection of approaches that will be incrementally tested, validated with the community, and applied to the corpus of SRA over time. The initial approach simply removes BQS from the SRA normalized format data, creating a format referred to as ETL-BQS. A second approach goes a step further, reducing size again by converting high coverage reads into sets of compressed multiple sequence alignments and their respective consensus sequences. A detailed description of these data formats is available.
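The two approaches can be illustrated conceptually as follows. This is a minimal sketch, not NCBI's implementation: it uses four-line FASTQ text as a stand-in for the SRA normalized format, a plain column-wise majority vote as a stand-in for consensus calling, and the hypothetical function names `strip_qualities` and `consensus`.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def strip_qualities(fastq_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, bases) from 4-line FASTQ records, dropping the quality line."""
    lines = [line.rstrip("\n") for line in fastq_lines]
    for i in range(0, len(lines) - 3, 4):
        header, bases = lines[i], lines[i + 1]
        # lines[i + 2] is the '+' separator and lines[i + 3] the quality string;
        # both are discarded here.
        yield header[1:], bases


def consensus(aligned_reads: List[str], gap: str = "-") -> str:
    """Column-wise majority vote over equal-length, already aligned reads."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned reads must all have the same length")
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        # Keep a gap only where no read has a base in this column.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


fastq = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "ACGA\n", "+\n", "IIII\n"]
records = list(strip_qualities(fastq))
print(records)
print(consensus(["ACGT", "ACGA", "ACGT"]))
```

Dropping the quality line removes one of the two data-bearing lines in each record, and a single consensus sequence can stand in for many overlapping high coverage reads, which is where the second approach gains its additional size reduction.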
In the future NCBI plans to provide only ETL-BQS format data from its on-prem data storage. Users who want BQS data, in either the normalized format or the original format, will need to access those data from cloud storage.
Information Requested
NIH is requesting input on the use of SRA data to understand how best to manage this resource to facilitate its use in research while controlling costs as it grows in size. NIH would like to better understand how the research community currently uses SRA data, how researchers are using or anticipate using cloud computing with SRA data, and which formats of SRA data are most valuable to the research community.
The NIH seeks comments on any or all of the following topics:
Submitting a Response
Comments should be submitted electronically at: https://datascience.nih.gov/sra-rfi-submission
This RFI is for planning purposes only and should not be construed as a policy, solicitation for applications, or as an obligation on the part of the government to provide support for any ideas identified in response to it. Please note that the government will not pay for the preparation of any information submitted or for its use of that information.
Responses may be compiled and shared publicly in an unedited version after the close of the comment period. Please do not include any proprietary, classified, confidential, or sensitive information in your response. The government reserves the right to use any non-proprietary technical information in summaries of the state of the science and any resultant solicitation(s). The NIH may use information gathered by this RFI to inform development of future guidance and policy directions.
We look forward to your input and hope you will share this RFI with your colleagues.
Christopher O'Sullivan, Ph.D.
National Center for Biotechnology Information,
National Library of Medicine
Jessica Mazerik, Ph.D.
Office of Data Science Strategy
Division of Program Coordination, Planning, and Strategic Initiatives
Office of the Director
[email protected]