December 7, 2023
National Cancer Institute (NCI)
The purpose of this Request for Information (RFI) is to solicit broad community input on the artificial intelligence-readiness (i.e., data representation in a format that enables and eases the application of artificial intelligence/machine learning (AI/ML) approaches) of data across the multiple components of NCI's Cancer Research Data Commons (CRDC). This will be achieved by identifying a) AI/ML use cases for current CRDC components/datasets, b) AI-readiness gaps in each of the CRDC components, and c) recommendations for preparing future AI-ready data.
Use cases received in response to this RFI will be used to inform a data challenge that will be held in early 2024. The data challenge will ask participants to 1) pre-process CRDC data into an AI-ready format, and 2) develop an AI model for the identified use case. This challenge will give participants an opportunity to provide real-time feedback on their experience with using CRDC data for AI applications.
There has been fast-paced growth in the application of artificial intelligence (AI) in cancer research. Despite this growth, there are challenges in both collecting and generating data in a way that makes it easily accessible and usable for AI/ML applications while maintaining security and data quality. Data that is accessible and usable for AI/ML applications is referred to as AI-ready data. AI-ready data can lead to the development of well-validated AI/ML models that can be deployed for research and improvement of healthcare. AI-readiness encompasses various characteristics, including completeness of the data (e.g., sufficient volume and managed missing values), incorporation of data standards (e.g., utilizing ontologies and terminologies whenever possible), computable formats, documentation (process and intent in generating the data), data annotations, data balance, data privacy, and data security, among other features.
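As a rough illustration of two of the characteristics listed above (managed missing values and data balance), the following minimal Python sketch summarizes a small table of records. The function name `readiness_summary`, the field names, and the toy records are hypothetical and are not drawn from any CRDC dataset or tool; a real assessment would also need to cover standards, documentation, annotations, privacy, and security.

```python
from collections import Counter


def readiness_summary(records, label_field):
    """Report per-field missing-value fractions and label balance.

    A value counts as missing when it is None or an empty string
    (an assumption made for this sketch).
    """
    if not records:
        raise ValueError("records must not be empty")

    n = len(records)
    # Collect every field name that appears in at least one record.
    fields = sorted({key for rec in records for key in rec})

    # Fraction of records in which each field is missing.
    missing = {
        f: sum(1 for rec in records if rec.get(f) in (None, "")) / n
        for f in fields
    }

    # Share of each label among the records that have a label.
    labels = Counter(
        rec[label_field]
        for rec in records
        if rec.get(label_field) not in (None, "")
    )
    total = sum(labels.values())
    balance = {lab: count / total for lab, count in labels.items()} if total else {}

    return {"n_records": n, "missing_fraction": missing, "label_balance": balance}


# Hypothetical toy records, for illustration only.
records = [
    {"case_id": "A", "age": 61, "stage": "II", "label": "malignant"},
    {"case_id": "B", "age": None, "stage": "I", "label": "benign"},
    {"case_id": "C", "age": 70, "stage": "", "label": "benign"},
    {"case_id": "D", "age": 55, "stage": "III", "label": "benign"},
]
summary = readiness_summary(records, "label")
print(summary)
```

A summary of this kind can help make an AI-readiness gap concrete, for example by showing which fields are sparsely populated or how skewed the labels are.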
The National Cancer Institute (NCI) makes a wealth of cancer research data findable, accessible, interoperable, and reusable (FAIR) to research communities through repositories and other resources, including the NCI Cancer Research Data Commons (CRDC). The CRDC is a scientific, cloud-based infrastructure comprised of eleven components that connect NCI-funded cancer research datasets with analytics tools to facilitate access to cancer research data.
Of the eleven components, five data commons house cancer research data with a particular modality, species, or general purpose. The five data commons are the Genomic Data Commons (GDC), Proteomic Data Commons (PDC), Integrated Canine Data Commons (ICDC), Imaging Data Commons (IDC), and Cancer Data Service (CDS). These data commons were developed independently and therefore have distinct data models, data dictionaries, and semantic standards, which hinder their interoperability at this time. However, efforts are underway to bring all datasets together so that they are interoperable as a unified repository. The other CRDC components include three Cloud Resources, as well as the Data Standards Services (DSS), Data Commons Framework (DCF), and Cancer Data Aggregator (CDA). The Cancer Data Aggregator enables querying across the data commons to promote discovery.
The response period for this RFI has been extended. Comments must be received no later than 11:50pm Eastern Standard Time on December 31, 2023.
NCI invites input on how the research community currently uses the CRDC for artificial intelligence/machine learning (AI/ML)-related use cases and prospective AI/ML use cases. Feedback is welcome from a diverse set of professionals, including researchers, scientists, administrators, and healthcare professionals. Respondents can be members of academia, government, or industry. Prospective users who have yet to utilize CRDC resources are also invited to provide feedback on any barriers which hinder access to or use of CRDC data as part of their AI/ML-related research plans. NCI welcomes community input on cancer AI/ML use cases through the utilization of the different CRDC components (data commons, infrastructure, and cloud resources). Responders are encouraged to provide feedback on their experience utilizing the CRDC Data Commons for research.
NCI seeks comments on the subjects listed above and offers the following questions for response:
Questions for Response:
1. (Optional) Please provide your contact information, including name and email. Contact information is not required to respond.
2. Please identify the AI/ML use case(s) for which you are currently leveraging or planning to leverage CRDC data (e.g., using National Lung Screening Image and Clinical Data from IDC to develop lung-cancer nodule detection algorithms) or other cancer data.
Use Case Definition: A specific research problem that is being addressed/investigated through an AI/ML approach using CRDC data or other cancer data. Sharing your use case(s) will help NCI understand the priority cancer-types, data types, or specific research studies that may need additional support to help the community advance their AI/ML development or algorithm assessments. Providing additional technical details and describing any required specialized infrastructure will aid our understanding of the types of analysis or annotations needed to support the different types of AI/ML implementations.
3. (Optional) Please describe any bias (e.g., data imbalance) that you are aware of in your use of CRDC data or other cancer data for your use case(s).
4. Please describe if and how you have utilized the CRDC Cloud Resources (Broad Institute FireCloud, ISB Cancer Gateway in the Cloud (ISB-CGC), and Seven Bridges (Velsera) Cancer Genomics Cloud (SB-CGC)) to access or analyze CRDC data from a single component, multiple components, and/or to upload your own data. Describe any issues that you have encountered.
5. Please describe barrier(s) or challenges you have encountered when using CRDC data or other similar cancer datasets for AI/ML applications (e.g., missing data, data integration, computing infrastructure, privacy concerns). Please elaborate on any suggested areas for improvement.
Responses to this RFI will be accepted at https://rfi.grants.nih.gov/?s=65307c3039db1473710b9432
Responders are free to address any or all of the questions listed above. The webform is the preferred mode of input, but a file with associated answers may also be sent to NCI_CRDC_AI_Feedback@nih.gov.
Responses to this RFI are entirely voluntary. Please do not include any proprietary, classified, confidential, or sensitive information in your response.
Responses will be accepted through 11:59 pm Eastern Standard Time on December 31, 2023.
The responses will be reviewed by NIH staff, and individual feedback will not be provided to any responder except as described above. The Government will use the information submitted in response to this RFI at its discretion. The Government reserves the right to use only the processed, anonymized results on public NIH websites, in reports, in summaries of the state of the science, in any possible resultant solicitation(s), grant(s), or cooperative agreement(s), or in the development of future funding opportunity announcements.
This RFI is for information and planning purposes only and shall not be construed as a solicitation, grant, or cooperative agreement, or as an obligation on the part of the Federal Government, the NIH, or individual NIH Institutes and Centers to provide support for any ideas identified in response to it. The Government will not pay for the preparation of any information submitted or for the Government's use of such information. No basis for claims against the U.S. Government shall arise as a result of a response to this RFI or from the Government's use of such information.
NIH looks forward to your input, and we hope that you will share this RFI document with your colleagues.
Emily Greenspan, Ph.D.
Center for Biomedical Informatics & Information Technology (CBIIT)
National Cancer Institute (NCI)