September 24, 2021
Office of Data Science Strategy (ODSS)
NIH requests input on use cases, expectations, and capabilities for enabling data and information discovery (search) to enhance NIH-wide data discovery and reuse. In particular, NIH would like to better understand researchers’ experiences and expectations in finding data and information that would greatly enable their research capabilities, and to uncover underlying barriers or opportunities that can move the search landscape forward. Building on prior efforts, NIH seeks to identify how production-level capabilities can be provided for search, assuming that such plans are realistic from the point of view of data submitters as they fulfill the requirements of the recently issued NIH Policy for Data Management and Sharing (DMS Policy). The information gathered from this Request for Information (RFI) is intended to inform and frame the discussions for an NIH-supported workshop in January 2022. Invitations may be drawn from the larger research community, including responders to this RFI.
NIH stores and facilitates access to many datasets, both open and controlled, with the goal of accelerating scientific discovery, thereby maximizing taxpayer return on investment in the collection of these datasets. Biomedical researchers are interested in finding relevant data and documentation across data repositories to create computational models or to allow for additional analysis. Clinical researchers are interested in building and exploring new cohorts by searching patient attributes across controlled-access repositories. Researchers may have specific questions that require complex information retrieval. All of these use cases require a data-focused capability that includes data-in-context information and could include non-textual search capabilities, with an eye to user experience and expectations.
To support the sharing of data, effective January 25th, 2023, the NIH Data Management and Sharing (DMS) Policy establishes the expectations for sharing of scientific data generated from NIH-funded or conducted research. All NIH-supported research that generates scientific data will be required to include a Data Management and Sharing Plan and to carry through that plan as the award proceeds. With this increase in data availability, the demands on investigators to make data discoverable to other researchers will increase as well.
Scope of Search
Search is often equated with the capabilities provided by web search engines. However, in use cases sought from several sources (bioCADDIE, NCI Semantic Infrastructure Workshop), users identified questions that may or may not be suitable for search engines, or that require additional ‘reach-in’ to a dataset. Rather than declare these out of scope, it is more responsive to hear and accommodate the needs expressed by the biomedical community. Analysis of these responses identified the following three categories of search, all of which are in scope for this RFI.
Dataset Discovery: This type of search focuses on finding datasets or entire collections on a particular topic. Finding relevant dataset(s) is usually not an end in itself but is a key intermediate step in the integration and reuse of existing data to inform new scientific hypotheses. Dataset discovery does not extend to querying specific records within the dataset; it relies instead on attributes recorded at the dataset level, such as the age range covered, drugs or therapies tested, and races and/or genders covered.
Cohort Building: In many use case scenarios, this is the process of finding and pulling together sets of subjects with a given set of attributes that can be used for some type of comparative analysis. We also identified other ‘cohort building-like’ use cases that do not refer to patient-based cohorts but rather to pulling together data from multiple sources into one source for subsequent analysis. In these cases, searching on attributes within the dataset is necessary. Typically, this is beyond what a search engine would do, but it is within the scope of the use cases scientists articulated for search.
Knowledge-Based Search: This type of search focuses on identifying facts and pieces of connected “knowledge”, often of reference entities (e.g., molecular weight of a protein, indications for a drug) in a quick and low-effort manner that does not require additional data manipulation and analysis.
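As an illustration only, the distinction among the three categories can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of any NIH system: the catalog, dataset names, record fields, fact table, and function names below are all hypothetical. Dataset discovery matches only on dataset-level attributes, cohort building reaches into records across datasets, and knowledge-based search returns a single fact about a reference entity.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    # Dataset-level attributes (what dataset discovery can see).
    name: str
    topic: str
    age_range: tuple  # (minimum age, maximum age) covered by the dataset
    # Record-level data (what cohort building must reach into).
    records: list = field(default_factory=list)


# Hypothetical toy catalog; none of these datasets exist.
CATALOG = [
    Dataset("study_a", "diabetes", (18, 65),
            [{"id": "a1", "age": 34, "drug": "metformin"},
             {"id": "a2", "age": 61, "drug": "placebo"}]),
    Dataset("study_b", "diabetes", (40, 90),
            [{"id": "b1", "age": 72, "drug": "metformin"}]),
    Dataset("study_c", "asthma", (5, 17),
            [{"id": "c1", "age": 9, "drug": "albuterol"}]),
]

# Hypothetical fact table keyed by (entity, attribute).
FACTS = {("metformin", "indication"): "type 2 diabetes"}


def discover_datasets(catalog, topic, min_age, max_age):
    """Dataset discovery: filter on dataset-level attributes only."""
    return [d.name for d in catalog
            if d.topic == topic
            and d.age_range[0] <= max_age
            and d.age_range[1] >= min_age]


def build_cohort(catalog, predicate):
    """Cohort building: select individual records across datasets."""
    return [(d.name, r["id"])
            for d in catalog
            for r in d.records
            if predicate(r)]


def lookup_fact(facts, entity, attribute):
    """Knowledge-based search: return one fact, or None if unknown."""
    return facts.get((entity, attribute))
```

In this sketch, discover_datasets(CATALOG, "diabetes", 70, 80) never inspects a record, whereas build_cohort(CATALOG, lambda r: r["drug"] == "metformin") must read inside every dataset, which is the ‘reach-in’ that a conventional search engine typically does not provide.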
The need for and existence of specialized search beyond text or numeric data is recognized. Input on how specialist searches on, for example, image content, genetic or peptide sequences, and chemical structures could fit into an overall search scheme is encouraged as a forward-looking approach. The details of how such searches are conducted are beyond the scope of the information sought here.
NIH has previously funded several activities around data discovery and issued an earlier RFI in 2013 seeking information relevant to finding data. Pilot initiatives have been conducted, identifying and evaluating multiple solutions for indexing, metadata standards and schemas, and cross-repository discovery engines. A workshop on discoverability of data in generalist repositories was also held in 2020. This RFI builds on those prior efforts and seeks information on factors relevant to providing a production-level capability that is continuously available, updated, and scientifically reliable. Assuming the availability of technology is not a barrier, the factors behind production characteristics may be related to the kinds of skills and organizations required to provide such capability. Other factors may be concerned with uptake in the community, e.g., knowledge of the relevant catalogs, tools, and processes for managing and submitting data, and training of scientists.
Request for Information
NIH is soliciting input from all interested stakeholders, including researchers, clinicians, academic institutions, medical institutions, health and IT developers, professional organizations or societies, as well as other interested members of the public. Organizations are strongly encouraged to submit a single response that reflects the view of their organization and membership as a whole.
The NIH seeks comments on any or all of, but not limited to, the following topics:
From the perspective of a data submitter/generator, a data user, or a technology provider:
This RFI is for planning purposes only and should not be construed as a policy, solicitation for applications, or as an obligation on the part of the Government to provide support for any ideas identified in response to it. Please note that the Government will not pay for the preparation of any information submitted or for its use of that information.
Responses may be compiled and shared publicly in an unedited version after the close of the comment period. Please do not include any proprietary, classified, confidential, or sensitive information in your response. The Government reserves the right to use any non-proprietary technical information in summaries of the state of the science, and any resultant solicitation(s). The NIH may use information gathered by this RFI to inform development of future guidance and policy directions.
How to Submit a Response
All comments must be submitted electronically at https://datascience.nih.gov/search-capabilities-across-the-biomedical-landscape.
Responses must be received by 11:59:59 pm (ET) on 12/03/2021.
Responses to this RFI are voluntary and may be submitted anonymously. You may voluntarily include your name and contact information with your response. If you choose to provide NIH with this information, NIH will not share your name and contact information outside of NIH unless required by law.
Other than your name and contact information, please do not include any personally identifiable information or any information that you do not wish to make public. Proprietary, classified, confidential, or sensitive information should not be included in your response. The Government will use the information submitted in response to this RFI at its discretion. Other than your name and contact information, the Government reserves the right to use any submitted information on public websites, in reports, in summaries of the state of the science, in any possible resultant solicitation(s), grant(s), or cooperative agreement(s), or in the development of future funding opportunity announcements. This RFI is for informational and planning purposes only and is not a solicitation for applications or an obligation on the part of the Government to provide support for any ideas identified in response to it. Please note that the Government will not pay for the preparation of any information submitted or for use of that information.
We look forward to your input and hope that you will share this RFI opportunity with your colleagues.
Susan Gregurick, Ph.D.
Office of Data Science Strategy
Division of Program Coordination, Planning, and Strategic Initiatives
Office of the Director
Telephone: (301) 435-1923