Notice Number: NOT-HG-13-014
Release Date: August 8, 2013
Response Due Date: September 6, 2013
National Human Genome Research Institute (NHGRI)
This Request for Information (RFI) is to solicit comments and ideas for the development of analysis methods and software tools, as part of the overall Big Data to Knowledge (BD2K) Initiative. Specifically, this RFI solicits input on needs for software and analysis methods related to data compression/reduction, data visualization, data provenance, and data wrangling.
Biomedical research is becoming more data-intensive as researchers are generating and using increasingly large, complex, and diverse datasets. This era of 'Big Data' in biomedical research taxes the ability of many researchers to release, locate, analyze, and interact with these data and associated software due to the lack of tools, accessibility, and training. In response to these new challenges in biomedical research, and in response to the recommendations of the Data and Informatics Working Group (DIWG) of the Advisory Committee to the NIH Director (http://acd.od.nih.gov/diwg.htm), NIH has launched the trans-NIH Big Data to Knowledge (BD2K) Initiative (www.bd2k.nih.gov).
The long-term goal of the NIH BD2K Initiative is to support advances in data science, other quantitative sciences, policy, and training that are needed for the effective use of Big Data in biomedical research. (The term "biomedical" is used here in the broadest sense to include biological, biomedical, behavioral, social, environmental, and clinical studies that relate to understanding health and disease.) The term "Big Data" refers to datasets that are increasingly large and complex and that exceed the ability of currently used approaches to manage and analyze them. "Big Data" is also meant to capture the opportunities and address the challenges facing all biomedical researchers in accessing, managing, analyzing, and integrating large datasets of diverse data types. Such data types may include imaging, phenotypic, molecular (including –omics), clinical, environmental, behavioral, and many other types of biological and biomedical data. "Big Data" also includes data generated for other purposes (e.g., social media, search histories, cell phone data) when they are repurposed and applied to address health research questions. Biomedical Big Data primarily emanate from three sources: (1) a small number of groups that produce very large amounts of data, usually as part of projects specifically funded to produce important resources for use by the research community at large, or as large collections of electronic health records; (2) individual investigators who produce large datasets for their own projects that might nonetheless be broadly useful to the research community at large; and (3) an even greater number of investigators who each produce small datasets whose value can be amplified by aggregating or integrating them with other data.
One of the DIWG recommendations was to support the development, implementation, evaluation, maintenance, and dissemination of informatics methods and applications. NIH supports a wide range of bioinformatics and computational science through efforts such as the Biomedical Information Science and Technology Initiative funding opportunities and through programs supported by individual NIH institutes and centers. NIH is now considering supporting the development of analytical methods and software tools and will focus initially on four targeted areas to begin to address critical current and emerging needs of the research community for using, managing, and analyzing larger and more complex datasets: data compression/reduction, visualization, provenance, and wrangling.
An NIH BD2K Working Group charged with exploring the development of informatics methods and tools seeks input from the biomedical research communities on the four targeted areas listed above to ensure that the research resources generated will have the highest impact and value to the research community. NIH has determined that guidance is needed from the broad scientific community in the following areas:
While data compression is important in BD2K because it helps reduce resource usage, most compression techniques involve trade-offs among several factors, including the degree of compression, the amount of distortion introduced, and the computational resources required to compress and decompress the data.
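As a rough illustration of these trade-offs, the following minimal sketch compares several levels of Python's standard zlib module on one input. The sample data and the function name compression_report are hypothetical and chosen only for illustration. Because zlib is lossless, no distortion is introduced here; only the compressed size and the time spent vary.

```python
import time
import zlib


def compression_report(payload: bytes, levels=(1, 6, 9)):
    """Return {level: (compressed_size, ratio, seconds)} for each zlib level."""
    report = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(payload, level)
        elapsed = time.perf_counter() - start
        # Lossless check: decompression must reproduce the input exactly.
        assert zlib.decompress(compressed) == payload
        ratio = len(payload) / len(compressed)
        report[level] = (len(compressed), ratio, elapsed)
    return report


# Hypothetical, highly repetitive sample standing in for sequence-like data.
sample = b"ACGTACGTTTGACCA" * 20000
report = compression_report(sample)
for level, (size, ratio, seconds) in report.items():
    print(f"level={level} size={size} ratio={ratio:.1f} seconds={seconds:.4f}")
```

Timings depend on the machine and the input, so the printed numbers are indicative only. A lossy method would add a third axis, distortion, which this sketch does not measure.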
Data reduction aims to reduce data volume more dramatically while also reducing the complexity or dimensionality of the data for easier analysis. It usually involves processing and/or reorganizing data to minimize redundancy, eliminate noise, and preserve signal and data integrity.
Data visualization permits researchers to communicate information through graphical and interactive means and enables them to explore the data and gain insight and knowledge from them. The challenge in the Big Data era lies in interpreting complex, high-throughput data, especially in the context of other relevant, but often orthogonal, data.
Provenance of digital scientific data is useful for determining attribution, identifying relationships between objects, tracing differences in similar results back to their source, guaranteeing the reliability of the data, and allowing researchers to determine whether a particular dataset can be used in their research (by providing lineage information about the data).
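One simple way to picture lineage information is a record that stores a checksum of each dataset together with the checksums of the datasets it was derived from. The sketch below is illustrative only and assumes a hypothetical ProvenanceRecord structure with made-up file names and tool labels; it does not follow any particular provenance standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProvenanceRecord:
    name: str          # label for the dataset (hypothetical file name)
    sha256: str        # checksum identifying the exact content
    produced_by: str   # tool or process that generated the data
    derived_from: list = field(default_factory=list)  # parent checksums


def make_record(name, data, produced_by, parents=()):
    """Build a record whose identity is the SHA-256 digest of the data."""
    digest = hashlib.sha256(data).hexdigest()
    return ProvenanceRecord(name, digest, produced_by,
                            [p.sha256 for p in parents])


def lineage(record, index):
    """Return dataset names from this record back to its original sources."""
    names = [record.name]
    for parent_hash in record.derived_from:
        names.extend(lineage(index[parent_hash], index))
    return names


raw = make_record("raw_reads.txt", b"ACGT\nTTGA\n", "sequencer-export")
filtered = make_record("filtered_reads.txt", b"ACGT\n",
                       "quality-filter v0.1", parents=[raw])
index = {r.sha256: r for r in (raw, filtered)}

print(json.dumps(asdict(filtered), indent=2))
print(lineage(filtered, index))
```

Keying records by content checksum means that two results that differ can be traced to the first ancestor whose checksum differs, which is one way to track down discrepancies between similar results.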
Data wrangling is a term applied to the conversion, formatting, and mapping of data that enables researchers to more easily submit data to a database and expose data to the internet, and that makes data more easily accessible and shareable. Researchers who generate datasets that, in aggregate, become "Big Data" often find it difficult to submit data, even when standards are well established. Specialized informatics skills are often needed, for example, to format data, apply metadata, fill gaps, use ontologies, capture provenance, annotate features, and apply other functions to reformat, manipulate, transform, or process data.
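To make the kind of work described above concrete, the following minimal sketch maps a small submitted table onto a target schema: it renames columns, trims whitespace, marks gaps explicitly, and converts types. The column names, the COLUMN_MAP mapping, and the sample data are all hypothetical; a real submission format would define its own schema.

```python
import csv
import io
import json

# Hypothetical mapping from submitter column names to a target schema.
COLUMN_MAP = {
    "Sample ID": "sample_id",
    "Age (yrs)": "age_years",
    "Dx": "diagnosis",
}


def wrangle(csv_text: str) -> list:
    """Convert submitted CSV text into records that follow the target schema."""
    records = []
    for raw_row in csv.DictReader(io.StringIO(csv_text)):
        row = {}
        for column, value in raw_row.items():
            target = COLUMN_MAP.get((column or "").strip())
            if target is None:
                continue  # drop columns that are not part of the schema
            cleaned = (value or "").strip()
            row[target] = cleaned or None  # record gaps explicitly as None
        if row.get("age_years") is not None:
            row["age_years"] = int(row["age_years"])  # enforce numeric type
        records.append(row)
    return records


submitted = "Sample ID,Age (yrs),Dx,notes\nS1, 42 ,asthma,ok\nS2,,none,\n"
records = wrangle(submitted)
print(json.dumps(records, indent=2))
```

Steps such as applying ontology terms or capturing provenance would follow the same pattern of explicit, repeatable transformations, but are outside the scope of this sketch.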
To maximize the impact of these informatics methods and tools and to facilitate their use by scientists with a broad range of expertise, we seek input from the scientific and informatics research and user communities in identifying and prioritizing needs and gaps in the four focus areas outlined above.
Submitting a Response
All responses must be submitted via email to BD2KSoftware@mail.nih.gov by Friday, September 6, 2013. Please include the Notice number in the subject line. Response to this RFI is voluntary. Responders are free to address any or all of the categories listed above. The submitted information will be reviewed by the NIH staff.
This request is for information and planning purposes only and should not be construed as a solicitation or as an obligation on the part of the Federal Government. The NIH does not intend to make any awards based on responses to this RFI or to otherwise pay for the preparation of any information submitted or for the Government's use of such information.
The NIH will use the information submitted in response to this RFI at its discretion and will not provide comments on any respondent's submission. However, responses to the RFI may be reflected in future funding opportunity announcements. The information provided will be analyzed and may appear in reports. Respondents are advised that the Government is under no obligation to acknowledge receipt of the information received or provide feedback to respondents with respect to any information submitted. No proprietary, classified, confidential, or sensitive information should be included in your response. The Government reserves the right to use any non-proprietary technical information in any resultant solicitation(s).