October 27, 2023
National Cancer Institute (NCI)
The purpose of this Request for Information (RFI) is to solicit broad community input on the artificial intelligence-readiness (i.e., data representation in a format that enables and eases the application of artificial intelligence/machine learning (AI/ML) approaches) of data across the multiple components of NCI's Cancer Research Data Commons (CRDC). This will be achieved by identifying a) AI/ML use cases for current CRDC components/datasets, b) AI-readiness gaps in each of the CRDC components, and c) recommendations for preparing future AI-ready data.
There has been fast-paced growth in the application of artificial intelligence (AI) in cancer research. Despite this growth, there are challenges in both collecting and generating data in a way that makes it easily accessible and usable for AI/ML applications while maintaining security and data quality. Data that is accessible and usable for AI/ML applications is referred to as AI-ready data. AI-ready data can lead to the development of well-validated AI/ML models that can be deployed for research and improvement of healthcare. AI-readiness encompasses various characteristics, including completeness of the data (e.g., sufficient volume and managed missing values), incorporation of data standards (e.g., utilizing ontologies and terminologies whenever possible), computable formats, documentation (process and intent in generating the data), data annotations, data balance, data privacy, and data security, among other features.
The National Cancer Institute (NCI) makes a wealth of cancer research data findable, accessible, interoperable, and reusable (FAIR) to research communities through repositories and other resources, including the NCI Cancer Research Data Commons (CRDC). The CRDC is a scientific, cloud-based infrastructure composed of eleven components that connect NCI-funded cancer research datasets with analytics tools to facilitate access to cancer research data.
Of the eleven components, five data commons house cancer research data with a particular modality, species, or general purpose. The five data commons are the Genomic Data Commons (GDC), Proteomic Data Commons (PDC), Integrated Canine Data Commons (ICDC), Imaging Data Commons (IDC), and Cancer Data Service (CDS). These data commons were developed independently and therefore have distinct data models, data dictionaries, and semantic standards, which hinder their interoperability at this time. However, efforts are underway to bring all datasets together so that they are interoperable as a unified repository. The other six CRDC components are three Cloud Resources, plus Data Standards Services (DSS), Data Commons Framework (DCF), and Cancer Data Aggregator (CDA). The Cancer Data Aggregator enables querying across the data commons to promote discovery.
Comments must be received no later than 11:59 pm Eastern Standard Time on November 30, 2023.
NCI invites input on how the research community currently uses the CRDC for artificial intelligence/machine learning (AI/ML)-related use cases and prospective AI/ML use cases. Feedback is welcome from a diverse set of professionals, including researchers, scientists, administrators, and healthcare professionals. Respondents can be members of academia, government, or industry. Prospective users who have yet to utilize CRDC resources are also invited to provide feedback on any barriers which hinder access to or use of CRDC data as part of their AI/ML-related research plans. NCI welcomes community input on cancer AI/ML use cases through the utilization of the different CRDC components (data commons, infrastructure, and cloud resources). Responders are encouraged to provide feedback on their experience utilizing the CRDC Data Commons for research.
NCI seeks comments on the subjects listed above and offers the following questions for response:
Questions for Response:
1. Please provide your contact information, including name and email.
2. Please describe how you currently interact with CRDC component(s) or other cancer data (e.g., current user, prospective user, not a user). If you do not use the CRDC, please provide your reason(s) and list the data source(s) and/or repository(ies) you are using.
3. If you are a current or prospective user of the CRDC, please list which component(s) you use (e.g., Genomic Data Commons, Cancer Data Service).
4. Please identify the AI/ML use case(s) for which you are currently leveraging or planning to leverage CRDC data (e.g., using National Lung Screening Image and Clinical Data from IDC to develop lung-cancer nodule detection algorithms) or other cancer data.
Use Case Definition: A specific research problem that is being addressed/investigated through an AI/ML approach using CRDC data or other cancer data. Sharing your use case(s) will help NCI understand the priority cancer types, data types, or specific research studies that may need additional support to help the community advance their AI/ML development or algorithm assessments. Providing additional technical details will aid our understanding of the types of analysis or annotations needed to support the different types of AI/ML implementations.
5. Please describe the technical details of the use case(s) implementation, such as learning category (supervised or unsupervised), features, algorithms, data types, etc.
6. What do you consider to be high priority data types (e.g., images in DICOM format) or data elements (e.g., cancer stage) for your AI/ML use case(s)?
7. What type of specialized infrastructure (e.g., computing, storage resources, access, etc.) is required for your AI/ML use case(s)?
8. (Optional) Are you aware of any bias (e.g., data imbalance) in your use of CRDC data or other cancer data for your use case(s)?
9. Have you utilized any of the CRDC Cloud Resources (Broad Institute FireCloud, ISB Cancer Gateway in the Cloud (ISB-CGC), and Seven Bridges (Velsera) Cancer Genomics Cloud (SB-CGC)) to access or analyze CRDC data from a single component, multiple components, and/or to upload your own data? Have you encountered any issues? Please describe.
10. What barrier(s) have you encountered when using CRDC data or other similar cancer datasets for AI/ML applications (e.g., missing data, data integration, computing infrastructure, privacy concerns)?
11. Please provide additional details on the identified barrier(s).
12. Please elaborate on any additional challenges you have faced when using CRDC data or other cancer data for AI/ML applications (e.g., algorithm development, validation, and testing).
13. Please elaborate on any suggested areas for improvement of the data by NCI or the data sources (e.g., specifics on using certain data types, challenges in multimodal datasets, integrating different datasets or metadata, data access issues, lack of data).
Responses to this RFI will be accepted at https://rfi.grants.nih.gov/?s=65307c3039db1473710b9432
Responders are free to address any or all of the questions listed above. The webform is the preferred mode of input, but a file with associated answers may also be sent to NCI_CRDC_AI_Feedback@nih.gov.
Responses will be accepted through 11:59 pm Eastern Standard Time on November 30, 2023.
Responses to this RFI are entirely voluntary, and responders are free to address any or all of the categories listed above. Please do not include any proprietary, classified, confidential, or sensitive information in your response.
The responses will be reviewed by NIH staff, and individual feedback will not be provided to any responder except as described above. The Government will use the information submitted in response to this RFI at its discretion. The Government reserves the right to use only the processed, anonymized results on public NIH websites, in reports, in summaries of the state of the science, in any possible resultant solicitation(s), grant(s), or cooperative agreement(s), or in the development of future funding opportunity announcements.
This RFI is for information and planning purposes only and shall not be construed as a solicitation, grant, or cooperative agreement, or as an obligation on the part of the Federal Government, the NIH, or individual NIH Institutes and Centers to provide support for any ideas identified in response to it. The Government will not pay for the preparation of any information submitted or for the Government's use of such information. No basis for claims against the U.S. Government shall arise as a result of a response to this RFI or from the Government's use of such information.
NIH looks forward to your input, and we hope that you will share this RFI document with your colleagues.
Emily Greenspan, Ph.D.
Center for Biomedical Informatics & Information Technology (CBIIT)
National Cancer Institute (NCI)