Public Listing of Comments on Request for Information (RFI): Input into the Deliberations of the Advisory Committee to the NIH Director Working Group on Data and Informatics

Comments requested in the January 10, 2012 NIH Guide Notice
Entire Comment Period: 01/10/2012 - 03/12/2012


Entry Date | Affiliation | Name | Organization | City, State | Comment 1 | Comment 2 | Comment 3 | Attachment text
01/11/2012 at 11:51:48 AM | Self
Several months ago the NIH put out a Request for Information concerning Neuroimaging Data Sharing, and responses were to be directed to a Dr. Bjork. I am copying and pasting the same response I wrote to that RFI into this one. I hope this is helpful. I am extremely interested in this topic, and I am personally committed to helping facilitate widespread sharing of raw scientific data, especially after publication!

Dear Dr. Bjork: I am responding to the RFI concerning Neuroimaging Data Sharing. I am a Professor of Radiology at UCSF, Principal Investigator of the Alzheimer's Disease Neuroimaging Initiative (ADNI; see adni-info.org), and an advocate of widespread sharing of scientific data (see scientificdatasharing.com). For more than 5 years I have been actively lobbying NIH Institute Directors and officials at NLM to institute practices which facilitate and encourage widespread sharing of scientific data. As you may know, ADNI shares all clinical and imaging data without embargo with the entire worldwide scientific community. Therefore, I want to respond to your RFI by making comments in the text of your announcement. My comments are inserted in TRACK CHANGES (rendered below in capital letters).

In my view, reducing barriers to the standardization and sharing of neuroimaging data should not be narrowly focused. The NIH should be taking active steps to reduce barriers to standardization and sharing of all scientific data, of which neuroimaging data is a small subset. The standardization and sharing of neuroimaging data will be greatly facilitated if NIH, especially NLM, takes steps to facilitate, encourage and improve widespread sharing of all types of data!!!

1. Describe the current barriers to the standardization and sharing of neuroimaging data for secondary analysis.
There are two major barriers to sharing of data:
1) Lack of an infrastructure for data sharing. It's not easy to share. Currently, scientists or universities need to set up their own sharing systems (we are doing this using DATAVERSE), but there should be a system put in place by NIH/NLM for widespread sharing of data. Once the systems are in place, scientists will use them.
2) Cultural barriers to data sharing: scientists believe that they own their data, and they are reluctant to share. These attitudes can be changed by: a) having systems in place for sharing data; b) leaders in the field starting to share their raw data after publication; c) NIH and universities setting up incentives for data sharing, including policies and procedures (see below).
Responses can include comments or suggestions about strategies to incentivize data sharing.
a. Mandatory (condition of award or mandated by journals): I WOULD NOT MAKE IT A CONDITION OF AWARD THIS COMING YEAR; THIS IS A GOAL TO BE ACHIEVED AFTER 5 YEARS OR SO. HOWEVER, THERE COULD BE A PLACE IN THE GRANT APPLICATION WHERE PEOPLE CAN LIST THEIR PREVIOUS DATA SHARING EXPERIENCE. DATA SHARING PLANS SHOULD BE MANDATED FOR ALL GRANTS. DATA SHARING SHOULD BE REPORTED IN NON-COMPETING RENEWALS AND IN COMPETITIVE RENEWALS.
b. Incentives to promote voluntary sharing and adoption of standard procedures: THIS IS THE BEST WAY TO GO!!! FOR EXAMPLE, UNIVERSITIES COULD CONSIDER DATA SHARING WHEN MAKING HIRING AND PROMOTION DECISIONS. THE NIH SHOULD HAVE A PLACE IN ALL GRANT APPLICATIONS AND ALL NON-COMPETING AND COMPETITIVE RENEWALS TO LIST SHARED DATA. ONCE PAPERS ARE PUBLISHED, THE RAW DATA AND THE DATA TRAIL SHOULD BE APPENDED.
c. Technical or infrastructure/IT barriers: THE MAJOR BARRIER IS LACK OF A DATA REPOSITORY.

2. Considerations of human-subjects protections: OF COURSE THESE ARE ISSUES, BUT THEY CAN BE DEALT WITH.
a. Issues and experiences related to informed consent of subjects for their data to be:
o stored indefinitely: YES
o stored in a repository other than that of the laboratory of data collection: YES
o available to third parties for secondary analysis: OF COURSE; THIS IS THE POINT
b. Issues related to subject confidentiality, with respect to:
o de-identification of brain image data
o potential for identification of subjects based on brain images (such as gyrus and sulcus morphology)
o extent of access for secondary analysis
o HIPAA compliance

3. Suitability of different image data modalities for donation and federation across labs, and utility of different image modalities for secondary analysis:
a. Resting-state or task-evoked functional MRI
b. Structural MRI (e.g., diffusion or volumetric)
c. Other image modalities like MRS or PET
d. Reconciling the need for standardization versus the value of methods innovation: STANDARDIZATION OF COURSE IS GOOD, BUT IF STANDARDIZATION IS IMPOSED, IT WILL GREATLY SLOW DOWN THE PROCESS.

4. Conditions or qualifications under which data should be shared: ALL DATA SHOULD BE SHARED AFTER PUBLICATION. THERE SHOULD BE NO REQUIREMENTS FOR STANDARDIZATION.
a. Applicant eligibility or qualifications; composition of data access committee
b. Data quality or sample size: ALL DATA
c. Embargo by collecting lab (duration) before furnishing data: ALL DATA AFTER PUBLICATION. FOR VERY LARGE GRANTS, SOME PREPUBLICATION DATA SHOULD BE SHARED.
d. Financial considerations (e.g., fee-for-access, current NIH funding status): COMPANIES COULD BE CHARGED.

5. Existing (including international) efforts, large-scale databases, clearinghouses, tools, and resources that facilitate sharing of imaging data: THESE SHOULD BE ENCOURAGED, BUT SHOULD NOT BE THE ONLY WAY THAT DATA IS SHARED.
a. Lessons learned: LARGE-SCALE, WELL-DEVELOPED DATABASES LIKE THE LABORATORY OF NEUROIMAGING (LONI) ARE REALLY GREAT. BUT SUCH A SYSTEM IS NOT GOOD FOR THE OVERALL COMMUNITY, BECAUSE THE REQUIREMENTS FOR STANDARDIZATION ARE SO HIGH THAT THEY WILL GREATLY SLOW DOWN RESEARCH AND INHIBIT DATA SHARING. THERE ALSO NEED TO BE DATA REPOSITORIES WHICH TAKE ALL TYPES OF DATA, INCLUDING NON-STANDARDIZED AND POORLY DOCUMENTED DATA. IN OTHER WORDS, ALL DATA SHOULD BE SHARED, EVEN IF THE SCIENTISTS HAVE NOT DONE A GOOD JOB DOCUMENTING THE DATA.
b. Features and suitability for up-scaling to accommodate more data, especially of different modalities or with different linked phenotypic data
c. Methods for harmonizing data from different extant or emerging extramural repositories

6. Issues in standardization, sharing, and maintenance of image data and linked phenotypic or genetic data: THERE ARE MANY ISSUES HERE, BUT THE PUSH TO ENCOURAGE AND FACILITATE WIDESPREAD SHARING SHOULD NOT BE SLOWED DOWN BY ISSUES OF STANDARDIZATION AND PROVENANCE.
a. Advantages, disadvantages, and costs of centralized repositories versus distributed hosting: THE NIH/NLM SHOULD HAVE A NATIONAL REPOSITORY FOR ALL SCIENTIFIC DATA. DISTRIBUTED HOSTING HAS THE PROBLEM OF WHAT HAPPENS IF ONE HOST RUNS OUT OF FUNDS. FOR EXAMPLE, WHAT HAPPENS IN THE FUTURE WHEN ARTHUR TOGA RETIRES AND LONI CEASES TO EXIST? BUT THE NATIONAL LIBRARY OF MEDICINE WILL ALWAYS BE WITH US (HOPEFULLY).
b. Computational infrastructure, database management, and data processing software/algorithms
c. Data and meta-data formats, common data elements
d. Challenges, costs, sources, and other issues of long-term data storage (curation): THE COST OF LONG-TERM STORAGE IS AN ISSUE, BUT WE ARE DEALING WITH THIS AT UCSF AND IT'S NOT AS BAD AS SOME MIGHT GUESS.
e. Linking image data with genetic or genomic data: OF COURSE IMAGE DATA NEEDS TO BE LINKED WITH ALL TYPES OF CLINICAL INFORMATION.
f. Interface with centralized referral hubs, the HCP, or similar NIH-initiated infrastructures

7. Issues of phenotypic harmonization or definitions with respect to:
a. Developmentally-sensitive measures (suitability at different life stages)
b. Objective behavioral metrics
c. Psychometric measures
d. Psychiatric disorders, other brain disorders, or histories of substance use
e. Family history, environmental or other psychosocial factors
f. Expansion of use (or a mandate) to adopt consensus phenotypic measures

I hope you find these comments helpful. I very much care about promoting and facilitating widespread data sharing. Imaging is part of it, but only a small part! This should be done as part of a much bigger effort on the part of NIH/NLM!!!!!

    Text of attachment same as that of comment boxes.
01/13/2012 at 03:43:57 PM | Self | BioMedware, Inc | Ann Arbor, Michigan
Within the framework of Data and Informatics, a critical area of some importance to researchers is confidentiality protection in human subjects research. The importance of privacy protection for research subjects is undeniable and widely acknowledged. The drag it imposes on the pace of many phases of research is substantial, and largely unmeasured. Would it not be wonderful if we had means of protecting confidentiality without slowing down human subjects research?

I submit that this question should perhaps be part of the committee's agenda.

I quote from an R21 proposal of mine to the NLM that received a score of 11.

** BEGIN QUOTE **

"What is most puzzling and distressing is that, in spite of our increasingly sophisticated technology and electronic data systems, researchers' direct online access to federal vital records data has become increasingly limited over time, impeding and sometimes precluding potentially valuable etiologic investigations" (Wartenberg and Thompson 2010) pg 409 Protection of confidentiality is a paramount concern in human subjects research, yet limits data sharing and access to the very information that is required to accomplish significant, rapid advances in public health (Wartenberg and Thompson 2010). The ability to share and analyze confidential data with a minimal risk of releasing private information is a major advance that would accelerate both basic and applied research. **END QUOTE**

I suggest that the accomplishment of this major advance be considered by the committee, as it would accelerate research across NIH institutes.

01/14/2012 at 11:52:40 AM | Self | Computational Biomodeling (CoBi) Core, Department of Biomedical Engineering, Cleveland Clinic | Cleveland Clinic
Data dissemination by scientists seems to be an afterthought when compared to writing grants and publications. Data dissemination should have a priority and rewards similar to scholarly publishing, so that investigators spend adequate effort to provide documented data to the public in a timely manner. This needs to be resolved in a collaborative manner between funding agencies, institutions, and scientists, as it will require changing the scientific culture. A while back, we had some discussions on how data dissemination may fit into the current culture of scientific conduct; see details at http://www.imagwiki.nibib.nih.gov/mediawiki/index.php?title=Journal_for_Dissemination

Research activities will potentially become leaner and more timely, and collaborations will flourish, if: 1) data are available to the community (one can access the data), 2) data are documented (one can use the data), and 3) data are standards compliant (one can exchange the data easily). The burden of effort on the investigator (or the community of investigators) can increase dramatically from item 1 to item 3. Nonetheless, the investigator(s) should at least accomplish item 1. NIH needs to ensure compliance with item 1 by funded investigators, and to facilitate items 2 and 3 by providing the means to share the burden.

Issues related to intellectual property (ownership of data and expected benefit to the owner through its utility without dissemination) may constrain provision of data. It may be possible for NIH to provide options to keep data private or to provide it according to some terms of embargo. Projects incorporating full and timely (without embargo, at the time of manuscript submission for example) dissemination of data may need to be rewarded. Alternatively, projects that ask for private ownership or delayed dissemination may need to provide a cost sharing plan to accommodate the lack of (or delay in) public access.

I find data access through PubMed and/or RePORTER very intriguing, and I feel it will facilitate not only easy access but also locating data in relation to the publications and projects that produced them. Nonetheless, new programs may be necessary to support technology development (or refurbishment) to enable quick search of and access to relevant data. The publishing or internet search industries may retrofit their tools to accommodate data search and to provide means for universal and unique identification of data sets.

01/17/2012 at 09:53:26 AM | Self
While sharing traditional research databases among investigators will push research forward, we will need to exploit electronic health record (EHR) data to significantly increase the size of the sample (one billion visits per year), the diversity of the population, and the length of follow-up time compared to what is currently feasible. Unfortunately, health records bring an additional critical challenge: in addition to the need for standards, accessibility, and privacy, the records carry biases related to the health care process that produced them. For example, more severely ill patients can appear to have lower risk, and causes can appear to follow effects. Therefore, the issue is that we will soon have a huge store of data available to us, but we do not yet understand how to use it effectively. (See attached letter for more information.) While I believe that standards, accessibility, and privacy are important and difficult to address, I believe that the limitations of the data and the biases that they produce will be a much more difficult problem that will require significant work and investment.

I believe that the answer lies in formal study of the EHR as a system in itself and in exploiting a broad array of techniques from fields outside of medicine. The EHR is "big" not just in the number of bytes or number of cases, but in its complexity and in the interdependence among its data elements. While there has been ongoing work studying the data content of the EHR at the concept level (terminology, data model), there is less formal characterization of the information content of the EHR and of the biases inherent in the EHR. By treating the EHR as a formal time series, for example, we may be able to exploit existing research. While it is early in this process, I can give three examples here. (1) The raw data values in the EHR are subject to bias and variability, but it may be possible to define derived properties that are less sensitive to inter-patient variability and the peculiarities of the health care process. (2) Aggregating data across patients can induce spurious associations, and mathematical methods are being developed to improve aggregation. (3) A health process model, which explicitly represents the relationships and biases in EHR data, may support the correction or avoidance of EHR biases. Many fields stand to contribute to this work. Some of the work is being done wholly outside of health care but has immediate applicability; work in applied physics is one example, but applied math, philosophy, statistics, biostatistics, computer science, and other fields stand to contribute in addition to the core field of biomedical informatics.

My recommendation is therefore to increase the visibility of this problem and potential solution, fund research in this area, and encourage collaboration among a broad array of fields. Furthermore, I believe that such work will facilitate not only research based on EHR data, but also research based on aggregating traditional research data sets because such data may also be subject to the challenges of the health care process.

I strongly endorse the work of the Advisory Committee to the NIH Director Working Group on Data and Informatics. I am currently professor and chair of the Department of Biomedical Informatics at Columbia University, I am co-chair of the Meaningful Use Workgroup of the Office of the National Coordinator for Health Information Technology, and I have served as PI of NIH-funded research on secondary use of electronic health record data for 18 consecutive years. I believe that the challenges enumerated in Notice NOT-OD-12-032 are very appropriate, but I would like to emphasize one additional issue.

While sharing traditional research databases among investigators will push research forward, we will need to exploit electronic health record (EHR) data to significantly increase the size of the sample (one billion visits per year), the diversity of the population, and the length of follow-up time compared to what is currently feasible. Unfortunately, health records bring an additional critical challenge: in addition to the need for standards, accessibility, and privacy, the records carry biases related to the health care process that produced them. For example, more severely ill patients can appear to have lower risk [1] and causes can appear to follow effects [2].

The state of the art in using EHR data is somewhat heuristic and ad hoc, and largely unchanged for the past two decades: given some clinical concept that needs to be extracted from an EHR for a research study, a knowledge engineer and domain expert iterate on a clinical definition of the concept, checking the EHR results against a manually generated gold standard until some threshold of performance is achieved. The process is slow and still subject to bias. Advances in natural language processing and ontologies have helped over the years, but there have been no major breakthroughs.

I believe that the answer lies in formal study of the EHR as a system in itself and in exploiting a broad array of techniques from fields outside of medicine. The EHR is "big" not just in the number of bytes or the number of cases, but in its complexity and in the interdependence among its data elements. While there has been ongoing work studying the data content of the EHR at the concept level (terminology, data model), there is less formal characterization of the information content of the EHR and of the biases inherent in the EHR. By treating the EHR as a formal time series, for example, we may be able to exploit existing research. While it is early in this process, I can give three examples here. (1) The raw data values in the EHR are subject to bias and variability, but it may be possible to define derived properties that are less sensitive to inter-patient variability and the peculiarities of the health care process [3,4]. (2) Aggregating data across patients can induce spurious associations, and mathematical methods are being developed to improve aggregation [5,6]. (3) A health process model, which explicitly represents the relationships and biases in EHR data [2], may support the correction or avoidance of EHR biases. Many fields stand to contribute to this work. Some of the work is being done wholly outside of health care but has immediate applicability; work in applied physics is one example [5], but applied math, philosophy, statistics, biostatistics, computer science, and other fields stand to contribute in addition to the core field of biomedical informatics.

My recommendation is therefore to increase the visibility of this problem and potential solution, fund research in this area, and encourage collaboration among a broad array of fields. Furthermore, I believe that such work will facilitate not only research based on EHR data, but also research based on aggregating traditional research data sets, because such data may also be subject to the challenges of the health care process. You have assembled an excellent committee to address these important issues, and I look forward to the outcome. Please contact me with any questions.

[1] Hripcsak G, Knirsch C, Zhou L, Wilcox A, Melton GB. Bias associated with mining electronic health records. J Biomed Discov Collab 2011;6:48-52.
[2] Hripcsak G, Albers DJ, Perotte A. Exploiting time in electronic health record correlations. J Am Med Inform Assoc 2011 Nov 23.
[3] Sperrin M, Thew S, Weatherall J, Dixon W, Buchan I. Quantifying the longitudinal value of healthcare record collections for pharmacoepidemiology. AMIA Annu Symp Proc 2011;2011:1318-25.
[4] Albers DJ, Hripcsak G. A statistical dynamics approach to the study of human health data: resolving population scale diurnal variation in laboratory data. Physics Letters A 2010;374:1159-64.
[5] Komalapriya C, Thiel M, Romano MC, Marwan N, Schwarz U, Kurths J. Reconstruction of a system's dynamics from short trajectories. Phys Rev E Stat Nonlin Soft Matter Phys 2008 Dec;78(6 Pt 2):066217.
[6] Albers DJ, Hripcsak G. Using time-delayed mutual information to discover and interpret temporal correlation structure in complex populations. Chaos, accepted for publication.
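To make the time-series idea above concrete, the following is a minimal sketch of the time-delayed mutual information technique cited in [6]. It is not the authors' implementation: the plug-in histogram estimator and the synthetic, evenly sampled series are assumptions made purely for illustration (real EHR series are irregularly sampled and would need resampling first).

```python
# Minimal sketch, not the cited authors' code: histogram-based ("plug-in")
# estimate of time-delayed mutual information between two evenly sampled series.
import numpy as np

def mutual_information(x, y, bins=16):
    """Plug-in MI estimate (in nats) from a 2-D histogram of paired samples."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0                          # avoid log(0) on empty cells
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def time_delayed_mi(x, y, max_lag):
    """MI between x(t) and y(t + lag) for lag = 0..max_lag."""
    return [mutual_information(x[:len(x) - lag or None], y[lag:])
            for lag in range(max_lag + 1)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=2000)
    y = np.roll(x, 5) + 0.5 * rng.normal(size=2000)  # y echoes x after 5 steps
    mi = time_delayed_mi(x, y, 10)
    print("MI peaks at lag", int(np.argmax(mi)))      # expect 5
```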
01/19/2012 at 11:17:12 AM | Organization | NorthShore University HealthSystem | Evanston, IL
Notably absent from the RFI areas are computational platforms certified for use of identified data, and identified data linkage across institutions.

There is the related issue that we both collect and use data simultaneously for both quality improvement and research (very often all four in the same project), under the purpose of improving health nationally and locally with maximal respect for privacy and confidentiality, while current regulations make us do all sorts of segmentation of roles (within the same people) to a degree of practical absurdity...

These two issues, computational platforms/methods for identified data linkage and analysis across institutions, are, I believe, not commonly in use due to HIPAA and Common Rule practical considerations rather than because they are not of interest to investigators or appropriate for consented research or research under waiver of consent. I imagine a distributed system for medical science across the 6 major institutions in Chicago, which often share patients; but as long as step one (imagining the data set) is 'off the table' (due to no accepted method for hosting linked data), it cannot happen.

There are many proactive methods by which this could achieve equal or, more likely, better patient confidentiality and privacy than the current 'patchwork of exceptions' approach!

(Sometimes in policy-making it's the thinking of what we don't do today that is the hardest; once identified, the policies or processes that would make those activities possible can then be attacked.)
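One concrete example of such a proactive method (purely illustrative, not a vetted protocol) is keyed-hash record linkage: each institution derives the same opaque linkage token from normalized patient identifiers, so records can be joined across sites without the identifiers themselves ever being exchanged. The shared key and field choices below are assumptions made for the sketch.

```python
# Illustrative sketch of keyed-hash (HMAC) record linkage across institutions.
# SHARED_KEY would come from a trusted third party / key escrow (assumption).
import hmac
import hashlib

SHARED_KEY = b"example-key-from-trusted-third-party"  # placeholder

def linkage_token(last_name: str, first_name: str, dob: str) -> str:
    """Derive an opaque, deterministic token from normalized identifiers."""
    normalized = "|".join(s.strip().upper() for s in (last_name, first_name, dob))
    return hmac.new(SHARED_KEY, normalized.encode(), hashlib.sha256).hexdigest()

# Two institutions normalize slightly different inputs to the same token:
print(linkage_token("Doe", "Jane", "1970-01-01") ==
      linkage_token(" doe ", "JANE", "1970-01-01"))  # True
```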

02/01/2012 at 11:44:58 AM | Self
It is nonsensical that NIH requires, and goes to great pains to enforce, diversity in sampling, yet has no coincident requirement to conduct and report on differential validities due to race, gender, age, etc. Consequently, very little of this sort of research is ever conducted despite there being sufficient data. NIH SHOULD REQUIRE SOME MINIMUM AMOUNT OF DIVERSITY ANALYSES BE REPORTED IN PROGRESS REPORTS IN ORDER TO PROD PIs INTO DOING MORE.
02/02/2012 at 04:14:01 PM | Self
The current list of areas does not identify data quality as an area of focus for this agenda. There currently exist no established data quality assessment methods, no established data quality standards, and no established data quality descriptors that could be attached to each data set. In the absence of data quality descriptors, a downstream user of the data has no ability to determine whether the data set is acceptable for the intended use. A data set that is acceptable for one use may or may not be acceptable for a different use.

The relevant "features" for describing data quality are likely to be markedly different for different data domains(sequencing, expression, imaging, clinical).

There are no standard protocols or minimal requirements for assessing data quality, for data "cleaning" procedures, and for describing post-measurement data quality alterations.
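To make the suggestion concrete, the following is a hypothetical sketch of a data quality descriptor that could be attached to a data set. Every field name and value is invented for illustration, since, as the comment notes, no established standard exists.

```python
# Hypothetical data-quality descriptor; no existing standard is implied.
quality_descriptor = {
    "dataset_id": "EXAMPLE-0001",           # placeholder identifier
    "domain": "expression",                 # sequencing | expression | imaging | clinical
    "completeness": 0.97,                   # fraction of non-missing values
    "assessment_protocol": "lab QC checklist v2 (hypothetical)",
    "cleaning_steps": ["outlier removal", "quantile normalization"],
    "post_measurement_alterations": "batch correction applied to runs 3-4",
    "known_limitations": "small sample size in condition B",
}
print(quality_descriptor["domain"])
```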

02/03/2012 at 12:24:41 AM | Organization | UW-Milwaukee | Milwaukee, Wisconsin
Critical issues/impacts on scientists: Making biomedical data, especially clinical data, available.

Critical issues for the institutes: Innovative and cost-effective technologies for integrating, managing, accessing, and making sense of the big data in health care.

Biomedical natural language processing is one of the key areas for data integration.

We have big data: genomics data, EMRs, images, video, speech, longitudinal data, etc., which, if integrated in the right way, will have a big impact on health care.

NIH should evaluate each PI's publications attributable to a grant and make such a record publicly available.
02/08/2012 at 10:35:24 AM | Organization | American Society for Biochemistry and Molecular Biology | Rockville, MD | Comments 1-3: see attachment

THE NEED FOR LONG-TERM PROTEOMIC DATA STORAGE

SUMMARY
The lack of a reliable and secure repository for raw data is a major problem facing science. While there are various repositories for 'processed' information, these have substantial limitations and thus only serve a portion of the need (and the community), and importantly cannot store 'raw' data. Therefore there is an essential need for such an entity. This can be accomplished by providing long-term fiscal support for creating an over-arching structure, actually capable of capturing not only raw data but also various forms of processed information, that would provide a central storehouse.

SUPPORTING EXAMPLE: PROTEOMICS

OVERVIEW
One of the most significant hallmarks of biomedical research in this century, and perhaps one of the most unexpected, has been the size and extent of the data sets that have been and continue to be generated by the new technologies associated with genomic, transcriptomic, proteomic and metabolomic research (collectively the bio-omic sciences or, by some definitions, systems biology). The microarray field that underpins transcriptomics led the way, but it has been supplanted by the massive outputs of next-gen nucleic acid sequencing of vast numbers of human and other genomes. However, proteomic data, mostly generated by high-throughput mass spectrometry (MS), will eventually dwarf both of these and, when coupled with metabolomic data that will likely be collected with similar technology, is destined to create an almost unimaginable amount of information. At issue, therefore, is how to deal with this onslaught.

Clearly the problems for the individual 'omic sciences are not the same, as the types of data are quite different (excepting proteomics and metabolomics). Germane to this report is the collection and interpretation of MS data. There are several issues and several levels of data, and each requires its own consideration. For ease of presentation, MS data in support of a proteomic (or metabolomic) experiment can be classified as 'raw', 'processed' or 'interpreted'. The interpreted data are suitable for publication and for inclusion in searchable web-based compendia. These are outputs of search engines, which have interpreted the processed data in the form of peak lists or spectral libraries, and can involve additional software analyses including, but not limited to, quantification and functional assessments.

Journals that publish proteomic data have various requirements for what information must be included in research articles and how much of the data from which the identifications of peptides, proteins and post-translational modifications (PTMs) were extracted must accompany the manuscript (during review and/or ultimately appearing in the journal, mainly as supplemental material). The extent to which the validity of these assignments can be assessed is accordingly equally variable. To address this issue, Molecular & Cellular Proteomics (MCP), starting in 2003 and culminating in 2005 (1), developed and adopted publication guidelines for reporting MS identifications and has subsequently updated them (2). As part of this evolution, in 2010 it announced (3) that it would require the deposition of the raw MS data in a public database as a requirement of publication for all accepted papers containing MS identifications.
While not mandating it as a requirement, other journals publishing in this area of research supported this policy. For all practical purposes, Tranche, founded and operated out of the University of Michigan, is the only repository capable of handling this type of data submission. Unfortunately, technical problems, due mainly to inadequate fiscal support and largely manifesting themselves in the past year, have substantially curtailed the usability of Tranche, and in March of the past year MCP was forced to make raw data deposition once again voluntary. Although the situation has shown signs of improving in the last six months, no sustained support for Tranche has been identified. Thus, at the moment there is not a suitable and reliable repository for raw MS proteomic data available.

Why deposit raw MS data?
There are a number of reasons why this policy should be universally adopted. First, the interpretation of MS data depends on software analyses, and there is considerable variation in the search engines, how they make their determinations, and how they decide whether a result is reliable. It is important to understand that generally less than 50% of the spectra generated in an experiment are interpreted (and sometimes considerably less than that) and that assignments are given scores that indicate the probability that the identification is right after making certain assumptions about what could be in the sample. This is compounded by errors in the databases searched and by the possibility of matching a correct sequence to an incorrect protein. This is considerably exacerbated when PTMs are involved, and localizing the modification sites correctly is clearly the most challenging analysis of all. The most effective way to re-examine an assignment is to have access to the raw data. Related to this, software for processing and interpreting MS data continues to improve, so re-analysis of datasets with newer software is likely to lead to the extraction of more information from previously acquired data. However, this can only be performed if the raw data are available.

Second, essentially all experiments are designed and executed with a purpose, i.e., there is a biological question being addressed. This means that the data will be analyzed from the orientation of this objective, and other information present in the data set will likely be ignored or simply not identified (i.e., be part of the 50% or more of the data that was not explained during the data analysis). In addition, quantitative information present in the raw data may not have been examined (only qualitative analysis, i.e., peptide and protein identification, is performed for many datasets). In fact, it may not be possible to interrogate a data set at the time it is collected for a specific question or possibility, because the requisite findings that underlie it had not been previously determined. In essence, this is a manifestation of the axiom that one "sees only what one looks for". This is particularly true for PTM analysis, as for most datasets only a very limited number of PTMs are considered during data analysis. As a result, potentially large amounts of information are not analyzed, and the information contained therein is lost if the raw data are not made available. This is enormously wasteful from both an intellectual and a financial point of view. Finally, knowledge is a continuum and all data collected add to it.
This is particularly important to the bioinformaticians and other analyzers of processed and interpreted data, who can provide the larger perspective that helps to produce the global understanding of biology and medicine, which is the real goal of the bio-omics. Not reporting the actual data collected, or not placing it where it can be used by others, defeats a major part of what experimentation is supposed to be about.

It must fairly be pointed out that not everyone is in favor of raw data deposition. Some individuals, clearly recognizing that large MS data sets have unused or undiscovered potential and not wishing to have this be exploited by others, do not want to share their raw data, rather hoping to find new things in it themselves. Others are concerned that their misinterpretations and mistakes would be made plainly (and painfully?) available for others to point out and thus for all to see. And lastly, some simply don't want to be bothered with the hassle of making the necessary uploads, which can indeed be time consuming. Although in part understandable (from the human nature point of view), none of these reasons is particularly compelling or scientifically and fiscally well justified.

What is needed?
It should be made clear that the shortcomings of Tranche are basically due to lack of support rather than any inherent design flaws. It was created as an academic exercise and was largely supported initially by grant funds. When these were ultimately not renewed, it became difficult to maintain the servers and deal with user problems. Ultimately the principal designers and creators of Tranche left the project and were not appropriately replaced, for financial reasons. Although data still flow in and out of Tranche, the reliability of these activities, and consequently the integrity of the data, is not at earlier levels (and is below the threshold that could be tolerated by MCP, leading to its decision not to make raw data storage mandatory until the situation is sufficiently rectified). While an infusion of money would certainly help (and there has recently been a small amount generated by the ProteomeXchange network, supported by a grant from the European Union), it is the consensus of a number of interested parties, expressed at several international workshops and meetings, that either permanent support for Tranche needs to be identified or a new entity needs to be created with a reliable basis of support that would ensure the long-term viability of the enterprise. The latter, which could be described as an International Repository for Proteomic Data (IRPD), would require a central facility and mirror sites placed in appropriate locales internationally, and would be staffed with network administration / IT staff to oversee its operation.

The stakeholders in such an IRPD would be of several varieties. First and foremost, the publishers of the main proteomic journals would be expected to be prime users. The American Society for Biochemistry and Molecular Biology, which publishes MCP, would be a strong supporter of such an activity, but it can be expected that other publishers would be as well. The Nature Group is on record as actively supporting raw data deposition. Based on the activities with other 'omic sciences, various private and public funding agencies are likely to be supporters as well, and instrument and software vendors have a vested interest in this process (and have actively participated in workshops and discussion panels addressing this issue).
Various government laboratories and agencies have also expressed support in the past. Finally, there are the end users: the scientists who create this data and then ultimately use it for different purposes. There seems to be no lack of support for the concept among any of these groups, only in the process of administration. Such a repository would presumably also become part of the ProteomeXchange consortium, which has international membership and would be able to provide additional advice and potentially limited financial support.

REFERENCES
1. R. A. Bradshaw, A. L. Burlingame, S. Carr and R. Aebersold (2005) "Protein Identification: The Good, the Bad, and the Ugly" Mol Cell Proteomics 4: 1221-1222.
2. R. A. Bradshaw, A. L. Burlingame, S. Carr and R. Aebersold (2006) "Reporting Protein Identification Data: The next Generation of Guidelines" Mol Cell Proteomics 5: 787-788. doi:10.1074/mcp.E600005-MCP200
3. R. A. Bradshaw and A. L. Burlingame (2010) "Technological Innovation Revisited" Mol Cell Proteomics 9: 2335-2336. doi:10.1074/mcp.E110.005447
02/08/2012 at 01:57:32 PM | Organization | University of Rochester School of Medicine and Dentistry | Rochester, NY | Comments 1-3: see attached file

Scope of the challenges/issues
o Research information lifecycle: More statisticians and bioinformaticians need to be involved at the stage of experimental design, before bioinformatics data collection. Encourage or train more bioinformaticians with sophisticated analysis skills to take leadership or initiative in designing and proposing new studies.
o Unrealized research benefits: Not enough proactive bioinformaticians and statisticians are trained to explore the huge amount of experimental data that costs a lot of money to produce, so many of those data are wasted.
o Feasibility of concrete recommendations for NIH action:
- More training grants for bioinformaticians and biostatisticians with strong scientific backgrounds
- More bioinformatician/statistician-initiated projects and programs should be funded by NIH.

Incentives for data sharing
o "Academic royalties" for data sharing (e.g., special consideration during grant review): a 6th grant review criterion could be added, i.e., "data sharing track record", which may include: 1) the number of publications that re-used the data from your lab where you serve as a coauthor of the papers; 2) the number of publications that re-used the data from your lab where you are not a coauthor of the papers.

Support needs
o Analytical and computational workforce growth: More training grants for bioinformatics, computational biology and biostatistics
o Funding for tool development, maintenance and support, and algorithm development: More grants for methodology/tool development and maintenance
02/08/2012 at 02:17:23 PM | Organization | Emory University | Atlanta, GA
My proposal/request is expected to lie a bit outside the areas of focus of your group. However, I deem it to be a very important issue that is not being discussed or tackled by any other group, and I have a fairly specific proposal. The issue I write about is not large datasets but laboratory notebooks, maintained by everyone working in a lab throughout the US. There is a clear need to switch to electronic record keeping that is searchable and easily transferable. There is also a very important need to link information coming from different sources to one lab notebook. There is also a growing problem with reagents (plasmids, cell lines, etc.) being lost from storage as a result of poor record keeping, lack of a searchable database, changes in freezers, etc. I would like to suggest/encourage/ask that the NIH develop software for laboratory notebook keeping that integrates the standard lab notebook chores of documenting all aspects of the lab work, allowing links to figures, raw data in several file formats, and storage locations. This software should be freely accessible to laboratories in an open-source format to allow changes and customizing. It should be controlled by NIH to allow updates, and not simply outsourced to a company that may go out of business shortly after completing version 1.0. Currently available electronic notebooks are very expensive, not readily modified, and only used by a few labs. Having a base package that everyone can use will save the government many times the cost of development, as significant time is currently spent trying to find or remake reagents, repeat experiments that cannot be found in the notebooks of people who have left the lab, etc. Such an electronic notebook should not be mandatory, but if the product is user-friendly and seen to be useful, I expect it will be rapidly adopted by the scientific community. Right now, I have floppy disks, CDs, DVDs, and files on our server, along with many paper notebooks from current and past lab members, that contain information that will likely be lost for a variety of reasons.
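Purely as a hypothetical sketch of the kind of record such notebook software might store, linking protocol text, raw data files in several formats, and reagent storage locations, with every field name invented for the example:

```python
# Hypothetical lab-notebook entry record; all fields are invented examples.
notebook_entry = {
    "entry_id": "2012-02-08-001",
    "author": "lab.member@example.edu",
    "title": "Transfection of construct pEX-01",          # hypothetical reagent
    "protocol": "standard lipofection; see entry 2012-01-15-003",
    "attachments": ["gel_image.tif", "raw_counts.csv"],   # raw data in several formats
    "reagents": [{"name": "pEX-01", "storage": "freezer B, box 12, slot 4"}],
    "tags": ["transfection", "pEX-01"],                   # for search
}
print(notebook_entry["reagents"][0]["storage"])
```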
02/28/2012 at 11:41:00 AM | Self
This working group is clearly addressing an issue that is both important and timely, as a range of high-throughput tools are beginning to be effectively deployed. The scope of the resultant data output will transform the entire field of biology. Issues such as standardization and data housing may be somewhat mundane, but they are obviously handled most efficiently by a central source. NIH is it. My specific point is that NIH has already had success in addressing larger-scale data collections in several contexts, the most obvious of which is genome sequences. However, there is a major breadth-versus-depth problem, and it is particularly encouraging that more focused efforts have been very successful and well received by the community. The specific programs that I am most familiar with, NURSA and BCBC, are both highly effective, ground-up efforts that have evolved into effective managers of large-scale data sets. Large-scale data must be there and be available, but its in-depth analysis will require specifically focused programs like these. Thus, it is crucial for NIH to support not only infrastructure, but also output and utilization.
02/29/2012 at 02:16:50 PM | Organization | Dana-Farber Cancer Institute, Harvard School of Public Health | Boston, MA
1. Standards development: Most importantly, meta-data consistency for public data repositories such as GEO. For expression profiling and ChIP-seq, the cells, tissues, and conditions used in the experiments are often annotated in a very ad hoc manner.
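As a hypothetical illustration of what consistent, structured (rather than ad hoc free-text) sample annotation might look like, using controlled-vocabulary fields; the accession and ontology IDs below are placeholders, not real records:

```python
# Illustrative structured sample annotation; IDs are placeholders, not real.
sample_metadata = {
    "sample_id": "GSM-EXAMPLE-001",
    "assay": "ChIP-seq",
    "cell_line": {"label": "ExampleCell-1", "ontology_id": "CLO:0000000"},
    "tissue": {"label": "breast", "ontology_id": "UBERON:0000000"},
    "antibody_target": "ESR1",
    "treatment": {"agent": "estradiol", "dose": "10 nM", "duration": "45 min"},
}
print(sample_metadata["cell_line"]["label"])
```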

2. Data accessibility: NCBI has discontinued hosting of high-throughput sequencing data (e.g., the 1000 Genomes Project), but I hope they still accept functional genomics data (expression profiling, ChIP-seq, etc.). GEO and SRA are among the most valuable resources NCBI provides to the scientific community, besides PubMed and EntrezGene.

3. Incentives for data sharing: Many laboratories do the bare minimum to submit their data to GEO so as to obtain a GSM ID to satisfy journals. There should be incentives for researchers to provide consistent and detailed meta-data annotation for the experimental data they are submitting. Special credit should be given during funding decisions to scientists who not only publish good papers, but whose data are also used by many other people. So it is also important to keep track of the source of high-throughput data in papers (through acknowledgement of the data source).

4. Support needs: There should be increased funding support for developing and, more importantly, maintaining algorithms for high-throughput data analysis. This would be extremely helpful for the greater experimental biomedical community.

The most important area is data sharing, whether through improved incentives for data generators or through support for intramural infrastructure to host the data for public access. Data sharing and reuse will increase the impact of NIH research funding, since data generated from one grant could help many scientists in their future research, in a multiplier effect.
03/02/2012 at 05:40:59 PM | Self
This is a personal series of recommendations and comments. I apologize in advance that this response is on the short side and could well be too general to be useful, but time limitations mean I am not able to provide a longer response.

In terms of critical issues that would have an impact on scientists, I feel discussion in the following areas could be useful:

1) Would it ever actually be practical to provide a flexible data and laboratory management system? At a simplistic level this should be the easiest task to perform. However, across basic and clinical science there is considerable diversity in these systems, with little reuse. There is no "one size fits all" solution, and this makes even such a simple issue become a complex problem. So the question could be: if we can't provide a system like this, would it ever be practical to provide any universally useful software? Typically, any such solution has been plagued by problems (and this is more than just caBIG). There have been a large number of proposed systems for providing these tools, and I expand upon this a little more in section 2.

2) What are the non-mainstream innovations that will or could be required? To meet some of the challenges of "population scale" analysis, we need a fundamental change in how software is being developed, the methodologies used, and the underlying hardware configurations. Such forward thinking seems to be within the remit of the group. Examples of innovations could include: considering how affordable and usable HPC can be made available (e.g., easier-to-use programmable chips or GPUs, extensions to Pig or other scripting systems for distributed processing/HDFS), or how we can develop scalable/affordable/usable software more easily without introducing constraining requirements on teams (e.g., education, reuse of open-source initiatives; see section 3).

3) Can we help develop better, more reusable tools? Basically, what is required for the development of good scientific software? These issues are well known; however, they might be worth distilling down into practical guidelines and education. It is also important to ensure they are practical for research (e.g., accepting that development is going to be ad hoc; accepting that file systems and standardized file formats generally offer a more flexible solution for scientists; recognizing that standards and services are generally better if they are derived "bottom up"; etc.). Encouraging a change in attitudes (e.g., adoption of existing well-designed software) will require a concentrated effort by NIH (e.g., good education).

While unfashionable, I would say that the one important issue is to consider why so many different solutions and initiatives in data management have not succeeded. This means that each group is constantly writing its own solutions, which leads to a large amount of waste and a series of (generally substandard) solutions.

Basically, just considering what works and what does not (both in terms of data standards and standardized software) would be useful. While this may seem a fairly negative focus, it does have a positive spin. There is a long history of data integration software and data standardization developed specifically for the life sciences. The important issue is to establish which were successful and why. There have been a number of successful ones (e.g., SRS, Galaxy, BioMart, GenePattern), and a much larger list of unsuccessful ones. The unsuccessful ones have not suffered due to lack of investment (e.g., efforts by Microsoft and NCI). Instead, they generally do not actually do what is required, or have misunderstood what is required (e.g., must be quick to learn/develop, must be flexible, etc.).

Basically, by looking to the past we can quickly identify impractical solutions that are unlikely to succeed (e.g., a comprehensive, all-inclusive framework to support all biology; standards that are too simplistic to be useful and/or too complex to use). The problem seen in many of these solutions is, rather strangely, that they have not met the required level of maturity (do they even build?), flexibility (are they componentized with multiple open interfaces?) and usability (do they require little effort to use?).

Two immediate areas where NIH policies/processes could change are the assessment criteria of specific grant calls and the attitude towards dissemination/co-development.

I feel that the more informatics-focused grants need to be both targeted and better explained. The current "go to place" for the development of software is the ubiquitous "software maintenance grant" or the computational biology ones. The former suffers because the criteria used to assess the grants are based too much on the individual whims of the reviewers. The criteria used could depend more on the software usage in question. There are basically three different types of software, and these should be assessed differently: is it a crucial bit of software in the field, so that the usage is less important than the niche it fills; is it a general, widely used piece of software, in which case the number of users and its monopoly come into consideration; or is it forward-looking, in which case looking to the potential is important.

NIH could also consider criteria associated with contributing to and supporting existing (probably non-biological) initiatives, which would be useful. These include numerous Apache projects and the like. These are useful and commonly used, but require some specialization to make them directly suitable for biomedical research (e.g., content repositories with auditing/traceability, web frameworks with pre-canned biological visualizations, annotation servers with appropriate thesauri, BAM file access in nearly everything, etc.). Providing resources to these projects, with a direct and measurable focus on making them useful for biomedical researchers, would provide enormous benefits, would even help the greater community, and would not favor any specific research group.
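As one small illustration of the "BAM file access in nearly everything" point, mature open-source components already exist for this. The following minimal sketch uses the pysam library; the file name, contig, and coordinates are placeholders, and an index file is assumed to exist.

```python
# Minimal sketch of BAM access via the open-source pysam library.
# "example.bam" and the region are placeholders; a .bai index is assumed.
import pysam

with pysam.AlignmentFile("example.bam", "rb") as bam:
    n = bam.count("chr1", 10000, 20000)   # reads overlapping chr1:10000-20000
    print(n, "reads overlap chr1:10000-20000")
```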

03/07/2012 at 10:24:37 AM | Organization | Carnegie Mellon University | Pittsburgh, PA 15213 | Comments 1-3: see attached.

On behalf of Carnegie Mellon University and the roughly 4,000 faculty and staff we represent, we write to thank you for issuing this Request for Information (RFI) and to share our perspective on policies regarding the management, integration, and analysis of biomedical datasets. Carnegie Mellon is a small, private university with over 11,000 students and 86,500 alumni. Recognized for our world-class programs in technology and the arts, interdisciplinary collaborations, and leadership in research and education, we are innovative and entrepreneurial at our core. [1] Carnegie Mellon's 2011 financial statement reports that 38.4% of our total revenue was from sponsored projects, totaling $360.9 million. Federally funded projects account for $317.59 million (88%) of this revenue. [2] Our community creates large volumes of federally funded research data. Though we do not have a medical school, we have a Biomedical Engineering Department, a Bone Tissue Engineering Center, and a Center for Bioimage Informatics. We also participate in the National Resource for Biomedical Supercomputing (NRBSC), with funding from the NIH. We strongly support broad sharing of biomedical and other datasets gathered in federally funded research projects because open data increase productivity, innovation, and commercialization. [3] Developing open data policies and standards that facilitate sharing is in the national interest and warrants careful examination. We address here several of the areas identified in the NIH RFI, specifically standards development, secondary/future use of data, incentives for data sharing, and support needs.

Standards Development
The critical issue is the lack of standards and best practices in many areas relevant to sharing data. The lack of standards makes it difficult if not impossible to share data effectively and to preserve it over time. Data dissemination and preservation are crucial to reducing redundancy, verifying results, and increasing the integrity and productivity of science. The recent report by the Committee for Economic Development provides empirical evidence of the impact of increased openness on follow-on research, innovation, commercialization, and economic growth. [4] This section provides an overview of areas in need of standards and best practices and suggestions for how the NIH could advance developments.

Data types and formats. Disciplinary differences in data types and formats must be addressed through standards and best practices. Even within the NIH, disciplinary differences exist in data formatting, documentation, and curation. The agency should facilitate the development and dissemination of standards and best practices for sharing and preservation of digital data by:
o Maintaining open access copies of relevant standards and best practices for data management (including metadata). [5]
o Requiring NIH-funded research communities that do not yet have relevant standards and best practices to develop them within a specified time frame. NIH can identify disciplines that are poorly prepared to comply with data sharing policies and encourage them to collaborate with experienced and trusted partners, e.g., university libraries and research communities well-versed in data sharing.
o Participating in and funding standards development activities. (See Support Needs below.)
Data citation. Standards for data management and re-use should include standards for acknowledgement of data used in publications and citation of data products. Ideally, the descriptive metadata bundled with the dataset will convey the licensing terms and include a list of those to be attributed. (See Incentives for Data Sharing below.) However, the attribution of credit for datasets is a relatively new field of endeavor. Many groups are working to develop best practices for data citation in the sciences and humanities. Strict guidelines for data citation cannot yet be provided, but federal agencies requiring data management and sharing can provide ongoing guidance for data citation, keeping close watch on new developments in the field [6] and keeping researchers informed. In addition, federal agencies requiring data sharing should fund research into best practices and systems for data citation to accelerate the development of guidelines for researchers in different disciplines.

Trusted data repositories. Standards and best practices must also be created for trusted repositories committed to open data and its preservation. The federal government should establish minimal service criteria to be met by such repositories, for example:
o Support for appropriate open data licenses. (See Incentives for Data Sharing below.) Trusted repositories must be prohibited from converting data deposited in an open format into a proprietary format upon retrieval or download.
o Support for relevant standards and best practices for access, interoperability, and preservation, including metadata, protocols, hardware, software, and unique persistent identifiers for datasets, researchers, and organizations. [7]
o Searchable metadata that includes the licensing terms and, if attribution is required, a list of those requiring attribution. [8] (See Incentives for Data Sharing below.)
o Verification of data integrity at ingest and retrieval/download.
o Security, redundancy, migration, disaster preparedness, and other preservation strategies, including the rights and technical metadata needed to preserve digital data.
o A mechanism for reporting problems.
o A mechanism for determining storage and preservation costs and a commitment to containing costs through cooperative agreements and economies of scale.
o Licensing agreements (between the repository and the owner of the dataset) that grant the rights necessary to preserve open data. [9]

Trusted repositories will have a commitment to long-term maintenance of digital datasets documented in a service-level agreement, and the financial resources and know-how to sustain the operation. [10] To facilitate the development of trusted repositories for open data, federal agencies with data management requirements should work with university libraries, disciplinary societies, research consortia, and other stakeholders to distribute the many responsibilities associated with establishing and maintaining a trusted repository for digital data. [11]

Secondary / Future Use of Data
The primary stakeholders in secondary use of federally funded biomedical datasets are (a) those who fund the research, [12] (b) those who conduct the research, and (c) the human research subjects who provide the data by consenting to be studied. The critical issues are protecting the rights of these three groups. From the funders' perspective, the critical issues are maximizing return on investment and accountability, interests well served by openness and secondary use of data.
From the researchers' perspective, the critical issues revolve around acknowledging and leveraging their rights to the data. From the research subjects' perspective, the critical issues revolve around privacy and informed consent. This document addresses researcher concerns in the section on Incentives for Data Sharing. This section addresses human subject concerns.

Institutional Review Boards (IRBs) are responsible for protecting human subjects participating in research studies. To Carnegie Mellon's IRB, the critical issue related to secondary/future use of federally funded biomedical datasets is the protection of the human subjects from whom the data were originally collected. Our IRB's position is documented in the Council on Governmental Relations' (COGR) [13] response to the U.S. Department of Health and Human Services' (HHS) Advance Notice of Proposed Rulemaking (ANPRM), Human Subjects Research Protections: Enhancing Protections for Research Subjects and Reducing Burden, Delay, and Ambiguity for Investigators. [14] The COGR supports a general future use provision in consent forms, but does not consider it a prerequisite for any and all future research use. The COGR does not support the informed consent requirement proposed by HHS for unanticipated future use and analysis of data or biospecimens collected for research. The current regulatory framework for future use of preexisting data and biospecimens provides human subjects with sufficient protection. Questions of privacy and confidentiality and assessments of risk should remain the responsibility of the IRB and be addressed on a case-by-case basis. Innovative secondary uses, for purposes different from what drove the original data collection, should not be discouraged. The COGR calls for an IRB to review repositories established for research purposes, specifically their policies and procedures for obtaining and disseminating data, to determine whether informed consent is required based on established criteria. When secondary uses are planned, the IRB should review the proposed use to determine whether informed consent is required. If the proposed use meets the established criteria for waiving informed consent, informed consent will not be required. If data in repositories have no identifiers, informed consent for future research use of these data should not be necessary, because the proposed research is not human subjects research. [15]

Incentives for Data Sharing
Researchers in different disciplines have different levels of understanding of the benefits of openness and different pragmatic needs. The NIH needs to understand its research communities, take steps to remove unnecessary barriers, and make appropriate concessions that facilitate both science and openness. The critical issue is negotiating a compromise that effectively meets researcher needs for attribution and advancement without unnecessarily diminishing the return on investment in federally funded research afforded by sharing data. Researchers must be willing to share their data. Many are not, either because they do not understand the benefits of sharing data, because sharing is burdensome, or because they and their institutions have a vested interest in not sharing data. Like other universities, Carnegie Mellon is heavily invested in and supportive of its research programs. We are proud of the intellectual output of our researchers, and want to protect their rights to use their intellectual output, including data.
While many datasets are not protected by copyright and are not, in the legal sense, "owned" by the researchers, the researchers' de facto rights to the data cannot be denied.16 Federal policies on open data must recognize these rights and the complex and often highly competitive environment in which they exist. The use of data to advance careers, develop patents, and contribute scholarship is a top priority for researchers across the nation. Education can address researchers' lack of understanding of the benefits of sharing data. Standards, best practices, and infrastructure support can reduce the burden of sharing data. (See Support Needs below.) Protecting researcher interests can be addressed with data citation standards and appropriate licenses and embargoes on public access. Compliance can be facilitated by reducing the burden of data sharing, funding the activity, and holding researchers accountable. Research shows that among both academic- and industry-based scientists, as the competitive value of the requested information increases, the likelihood of sharing the information decreases.17 Competition reduces openness and sharing, but it can also drive science and grow the economy. Competitive advantage must be preserved. To address the issues of researcher rights, scooping, competition, and potential commercial value, the NIH should acknowledge that different disciplines operate under different constraints and work with its research communities to specify the conditions and timeframe within which data must be made open. Depending on the discipline, this may be before or after peer-reviewed publication of research findings.

Licenses. Ideally, licenses for open data should be human- and machine-readable. Appropriate licenses for open data include:18
- Open Data Commons Attribution License, which requires only attribution and grants full use rights.
- Open Data Commons Open Database License (ODbL), which requires attribution and share-alike, meaning any derivative work must also be open.
- Open Data Commons Public Domain Dedication and License (PDDL), which waives all rights and places the data in the public domain.
- Creative Commons CC Zero, which waives all rights and places the data or content in the public domain.19

Additional licenses might need to be developed. An appropriate license will preserve rights and provide incentives for researchers to make their data publicly accessible.20 Attempts to restrict public use of federally funded research data to non-commercial purposes will stifle innovation and commercialization, unnecessarily limiting the return on taxpayer investment in research. In regard to whether products and services developed using open data must themselves be open (i.e., must the initial data be licensed under a share-alike license), this might effectively be addressed by requiring openness if and only if the subsequent use were federally funded. In all cases, however, subsequent use of open data should require attribution of the scientists and federal agency.
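As a minimal sketch of what human- and machine-readable licensing metadata might look like, the record below bundles the license, the attribution list, and (anticipating the embargo discussion that follows) the date on which the dataset becomes open. The field names are invented for this sketch and follow no particular metadata schema.

```python
import datetime
import json

# Illustrative descriptive-metadata record for a deposited dataset. The field
# names are placeholders; the license URL points at the ODC Attribution
# License discussed above.
record = {
    "title": "Example biomedical dataset",
    "license": "http://opendatacommons.org/licenses/by/",
    "attribution": ["J. Smith (Example University)", "Funding agency"],
    "open_date": "2014-01-01",  # administrative metadata: when the embargo ends
}

def is_open(rec, today=None):
    """True once the open date recorded in the administrative metadata has passed."""
    today = today or datetime.date.today()
    return today >= datetime.date.fromisoformat(rec["open_date"])

print(json.dumps(record, indent=2))
print("publicly accessible:", is_open(record))
```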
Embargoes. While the ideal is prompt public access to data, in some disciplines the goals of growing the economy and increasing the productivity of science might be achieved more effectively by granting researchers control of the data for some finite time, after which the data become open and competitors can use them for commercial or non-commercial purposes. This could be accomplished by requiring prompt deposit in a trusted repository, but allowing the data to reside in a dark archive until it is licensed for public use - something akin to an embargo on public access to scholarly publications.21 The point at which the dataset will become open should be specified in the administrative metadata.

Compliance. Reducing the burden of data management and sharing will facilitate compliance with open data policies. Federal agencies with an open data requirement should take steps to clarify, streamline, and support the process, for example, by:
- Providing a common definition of "data" across federal agencies to help clarify rights and responsibilities regarding ownership, access to and retention of data.22
- Working with research communities to develop and disseminate standards, best practices, and appropriate licenses and embargo periods.
- Providing guidance and resources to help research communities address constraints and contractual obligations (e.g., privacy, confidentiality) regarding managing and sharing data.
- Maintaining a list of trusted repositories for various types of data.23

Perhaps the most effective strategies for facilitating compliance will be making data management and sharing quid pro quo for future funding, and allocating grant funds that can only be spent on data management. (See Support Needs below.)

Support Needs

The critical issues are developing the infrastructure to support data sharing and reducing the burden of data sharing on researchers and institutions. If the infrastructure is inadequate or the burden too great, the data management requirement will not be met and the societal and economic benefits of data sharing will not be reaped. There is much work to be done to address issues surrounding data management and preservation. Federal agencies requiring data sharing should fund research and development of needed standards and tools for data curation and citation. Insofar as they represent the interests of their constituent communities, federal agencies are in a strategic position to encourage the development of international standards for digital data. They can promote effective coordination of standards by working with their communities and repository developers to identify problems that standards will solve and by participating in the standards development process. Furthermore, they should monitor significant initiatives in digital preservation and disseminate relevant information to their constituencies. For example, the Planets project has built services and tools to help ensure long-term access to digital assets.24 DataCite supports data archiving that permits verification and repurposing of the data and works to establish easier access to data.25 Federal agencies with data management requirements should also provide funding exclusively for data management and preservation, both as part of existing grant programs and as a separate set of programs for retroactive preservation projects. The requirements for the disposition of these funds should allow for support of the data management infrastructure beyond the grant period, specifically:
- Equipment and personnel at the institution receiving the grant, thereby providing resources that can carry over from one project to the next.
- Trusted repositories, providing financial support to sustain these initiatives and guarantee long-term public access to the data.
Given that important collaborative work and data sharing can happen with datasets of all sizes, researchers must be required to include meaningful data management plans in their grant proposals for awards of all sizes - not just large awards. Data management costs must be included in detailed budgets and budget justifications. These costs must not be used explicitly or implicitly to penalize grant proposals based on the amount of money required for data management. In closing, we encourage the NIH to consider carefully the new burden that data management places on researchers and institutions and the complex context in which this burden is faced - limited human and financial resources, inadequate infrastructure and understanding, disciplinary differences, and competing rights. To the extent that the benefits of open data exceed the costs (both personal and financial), openness should be pursued. The pace of requirements should be commensurate with the pace of developments needed to support compliance. Thank you for the opportunity to provide comments on this important initiative.

1 See http://www.cmu.edu/about/index.shtml.
2 Carnegie Mellon University Consolidated Financial Statements June 30, 2011 and 2010. See http://www.cmu.edu/finance/reporting-and-incoming-funds/financial-reporting/files/2011-annual-report.pdf.
3 Since 2007, the Association of University Technology Managers has ranked Carnegie Mellon first among U.S. universities without a medical school in the number of startup companies created per research dollar spent. Our 118 research institutes and centers create 15 to 20 new companies each year. Over the past 15 years, we helped start 300 companies, creating 9,000 jobs.
4 The Future of Taxpayer-Funded Research: Who Will Control Access to the Results? Washington DC: Committee for Economic Development, 2012. Available at: http://www.ced.org/images/content/issues/innovation-technology/DCCReport_Final_2_9-12.pdf.
5 If providing open access copies is not feasible, NIH should at minimum provide a list of relevant standards and best practices with links to where researchers can get the documents.
6 Two current development activities relevant to data citation should be monitored closely. The National Information Standards Organization (NISO) is developing a recommended practice for use of the International Standard Name Identifier (ISNI) to identify institutions. (See http://www.niso.org/publications/isq/2011/v23no3/gatenby.) The ORCID (Open Researcher and Contributor ID) project is developing unique identifiers for individual researchers to resolve name ambiguity problems in scholarly communication. (See http://www.orcid.org/.)
7 See Digital Research Data Sharing and Management (December 2011), pp. 4-5. Available at: http://www.nsf.gov/nsb/publications/2011/nsb1124.pdf.
8 For a brief explanation see Rubin, Richard E. Foundations of Library and Information Science. New York, NY: Neal Schuman, 2010, p. 157.
9 Trusted Digital Repositories: Attributes and Responsibilities (May 2002). An RLG-OCLC Report. Mountain View, CA, pp. 18-19. Available at: http://www.oclc.org/research/activities/past/rlg/trustedrep/repositories.pdf.
10 Trusted Digital Repositories: Attributes and Responsibilities (May 2002), p. 26.
11 See Digital Research Data Sharing and Management (December 2011), p. 6.
12 This includes the federal agencies and taxpayers who underwrite the research, and the institutions that manage the grants, provide laboratory space, and pay researcher salaries.
13 The COGR is an association of 188 research universities and their affiliated medical centers and research institutes that works to ensure that federal agencies understand how the academy operates and how proposed regulations affect the academy. Carnegie Mellon is a member of the COGR.
14 The ANPRM is available at http://www.regulations.gov/#!documentDetail;D=HHS-OPHS-2011-0005-1057. The COGR response to the ANPRM is available at http://www.cogr.edu/.
15 COGR response to HHS-OPHS-2011-0005, p. 21. The COGR opposes expanding the meaning of "human subjects" to include biospecimens without identifiers because it reverses long-standing definitions and creates a significant burden for investigators, delaying if not undermining the conduct of research, without increasing protection or reducing risk to tissue donors.
16 In the United States, some types of data are not protected by copyright. For example, numeric data are treated as facts, and therefore are not copyright protected. They are, however, proprietary. In any case, the owner of federally funded research data is either the funding agency or the institution funded to do the research, not the principal investigator(s).
17 C. Haeussler (February 2011), "Information-sharing in academia and the industry: A comparative study," Research Policy 40 (1): 105-122.
18 For details, see Open Data Commons Licenses FAQ, available at http://opendatacommons.org/faq/licenses/.
19 Data placed in the public domain (with a PDDL or CC Zero license) can be hosted for free at the Talis Connected Commons. See http://blogs.talis.com/n2/cc.
20 See Digital Research Data Sharing and Management (December 2011), p. 7.
21 The National Science Foundation acknowledges that an embargo period for open data may be necessary in some cases. See Digital Research Data Sharing and Management (December 2011), p. 6.
22 Response of the Council on Governmental Relations (COGR) to the Office of Science and Technology Policy's RFI on public access to digital data resulting from federally funded research. Available at: http://www.cogr.edu/.
23 This list could be generated from registry records such as DataCite. See http://www.datacite.org/repolist.
24 Preservation and Long-term Access through Networked Services (Planets) was a four-year project funded by the European Union. See http://www.planets-project.eu. The Planets project ended in May 2010, but the documents and deliverables are being maintained and developed by the Open Planets Foundation (OPF). Government bodies may join the OPF. See http://www.openplanetsfoundation.org/.
25 See http://datacite.org/whatisdatacite.
03/07/2012 at 01:08:42 PM Organization Pfizer, Inc. New York, NY Worldwide Research and Development Pfizer Inc. Eastern Point Road Groton, CT 06340

March 7, 2012

National Institutes of Health (NIH) Working Group on Data and Informatics 9000 Rockville Pike Bethesda, MD 20892

Re: Request for Information (RFI): Input into the Deliberations of the Advisory Committee to the NIH Director Working Group on Data and Informatics: Notice Number NOT-OD-12-032

Pfizer is a global biopharmaceutical company whose mission is to discover, develop and deliver safe and innovative medicines. Our diversified health care portfolio includes biologics, small molecule drugs, vaccines and over-the-counter therapies. At Pfizer, we set high standards for quality and safety in our discovery, development and manufacturing. We are committed to developing medicines to fulfill unmet medical needs in areas such as cancer, chronic conditions and infectious disease. We thank the National Institutes of Health (NIH) for the opportunity to provide input to the Advisory Committee to the NIH Director Working Group on Data and Informatics (ref. RFI NOT-OD-12-032).

Pfizer applauds the work of the Advisory Committee and agrees with the scope of issues identified as important to consider. The benefits to the research-based biopharmaceutical industry of optimized informatics and data strategies are great and could include the following:
- Faster and more efficient clinical trials, leading to quicker decisions on the development of new medicines for patients
- Greater activity in repurposing licensed or "failed" medications for new indications
- The potential for "personalized comparative effectiveness research," enabled by access to larger datasets and more efficient analytic techniques, to identify patient populations where certain therapies may be less effective
- Enhanced R&D efficiencies based on simplified information and data-sharing across diverse therapeutic areas and/or multiple research entities.

Given the importance of efficient data management, integration, and analysis to the greater public-private R&D enterprise, we offer the following comments on selected aspects of the important issues identified by the working group. Standards development: A challenge faced by all researchers is how best to preserve legacy data to ensure maximum utility. It is often assumed that new technologies will make legacy data obsolete, or that new experiments will naturally incorporate earlier findings. This, however, is not always the case. Old data may be useful in new research endeavors, and legacy data is not always adequately preserved for future interrogation. In an era of resource constraints, data preservation will naturally be de-prioritized. Ensuring adequate data preservation requires efficient methods based on flexible standards and platforms capable of supporting long-term data use. Therefore, standards development should take into account both future data needs and the challenges of maintaining current and older systems so that all are maximally compatible. To ensure broad standards adoption, funding and grant renewal decisions should include requirements that researchers follow agreed standards for data capture and storage.

Secondary/future use of data: Addressing the important issue of data reuse is less exciting to researchers than developing new data. Academic pressures in particular can make it more rewarding to do "something new" rather than re-use existing data at lower cost. Increased cultural emphasis on efficient use of existing data in research settings is needed. Secondary use of data relies on an infrastructure with adequate storage, indexing, and search capabilities. Without proper structural support, the existence of relevant data can be obscured, encouraging or necessitating redundant analysis. Funding to support the necessary infrastructure for data re-use should be prioritized, especially since this can reduce overall R&D costs. Similarly, sponsors and grant-making agencies should give greater consideration to projects that employ data reuse to answer important research questions.

Incentives for data sharing: Establishing meaningful incentives to encourage data sharing is a major issue in both academia and industry. A researcher has little to no incentive to prepare data for re-use because the benefits are often intangible and hypothetical. Adding metadata may enhance the utility of a dataset, but it does not guarantee that the data, once used, will contribute to questions posed by the originator. Establishing a currency of exchangeable "royalties" might be a viable incentive to encourage data re-use in academic settings. This system may be less amenable to use in the industrial private sector, as R&D incentives and incentive restrictions are different. However, if structured properly, "academic royalties" for proper preparation of data for future re-use may accelerate adoption of best practices and should be considered as part of the grant process. If such a system is developed, it must include a method for apportioning credit appropriately so that the "royalties" are proportional to the contribution to data preparation, not just based on seniority.

Data accessibility: Access to data itself is often not the issue for researchers, but rather knowing who is working on similar programs or in similar niches of a disease area. Data without adequate context may not be particularly useful, but connections and potential collaboration with researchers in the field may be. Existing social networking tools for scientists could be used to create an expansive "expertise catalog," combining personal profiles with Medical Subject Headings (MeSH) terms and grant details, providing both access to data and the context in which it was originally collected.
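A minimal sketch of how such an expertise catalog might be queried: profiles are indexed by MeSH term so a researcher can find who works in a niche. The names, MeSH terms, and grant numbers below are invented placeholders, and the indexing scheme is an assumption about how such a catalog could work.

```python
from collections import defaultdict

# Illustrative profiles; all values are placeholders.
profiles = [
    {"name": "A. Researcher", "mesh": ["Neoplasms", "Genomics"], "grant": "R01-XXXX"},
    {"name": "B. Researcher", "mesh": ["Genomics", "Epigenomics"], "grant": "U54-YYYY"},
]

# Build an inverted index from MeSH term to researcher profiles.
index = defaultdict(list)
for p in profiles:
    for term in p["mesh"]:
        index[term].append(p)

def who_works_on(term):
    """Return the names of researchers annotated with a given MeSH term."""
    return [p["name"] for p in index.get(term, [])]

print(who_works_on("Genomics"))  # ['A. Researcher', 'B. Researcher']
```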

Support needs; funding for tool and algorithm development: Beyond tool and algorithm development, validation and comparison of new and existing tools and methods are important. Tools and analytic approaches are described in the biomedical literature every month, but the relative translational value or marginal utility of the new tool, approach, or algorithm is often unknown. Developing standards for measuring the value of new tools and algorithms, including assessment of broad applicability, could be of great value to the scientific community. Support for tool or algorithm development should take into account the aggregate translational potential of each tool.

See comment 1 See comment 1 Text of attachment same as that of comment boxes.
03/08/2012 at 09:57:28 AM Organization RCSB Protein Data Bank Rutgers University, New Brunswick NJ, UCSD La Jolla CA 1. It is necessary to clearly define the data that must be archived. This requires setting standards for the data descriptions. Unless this is done, the data will have little value to the scientific community.

2. There need to be criteria established for review and oversight of data centers and archives. Many data centers are used by very large communities and it is critical that the centers are well run. There must also be mechanisms in place to prevent disruption of service.

Both are equally important. 1. When NIH requires that data be archived, it must also insist on the creation of clear standards for the data descriptions.

2. New kinds of review panels and oversight committees must be established to review the data centers.

We are responding to the Request for Information to the Advisory Committee to the NIH Director Working Group on Data and Informatics (NOT-OD-12-032). Our response is based on experience over many years in running shared data resources including the RCSB Protein Data Bank (PDB), the Protein Structure Initiative Structural Biology Knowledgebase (PSI SBKB), the Nucleic Acid Database (NDB), and the Immune Epitope Database and Analysis Resource (IEDB).

Data Scope and Representation: Scope challenges come in the form of increased data complexity (e.g., protein structures determined by multiple methods and emergent new technologies), the size of the datasets (both for individual components and the corpus as a whole), and what supporting experimental data to retain (e.g., protein production data, or raw diffraction images versus processed data). While capturing primary supporting data is a laudable objective, if these data are not semantically well described and standard in format, the impact may be quite limited relative to the resource requirements. We suggest the benefits of archiving primary data be carefully weighed against realistic estimates of user demand and reusability on current and future hardware and software platforms.

Standards Development: Fully describing the data requires metadata standards that are defined by and accepted by the community of data providers and users. These standards need to be extensible over time as the science advances, and provenance and versioning must be provided. In the field of structural biology these metadata standards have been developed over a period of two decades with participation of data producers, users, and data archiving and publication organizations. Wherever possible, stakeholders need to be incentivized to develop and support these standards, as without them archived data will be effectively useless in years to come. Incentive schemes such as prerequisites to funding and publication and increasing awareness as to the value of data standards should be actively pursued.

Data Accessibility: Data sets need an appropriate digital signature that is broadly recognized now and into the future. Digital Object Identifiers (DOIs) seem to be the best scheme today. Provenance requires that disambiguated authors be assigned to these datasets, and as of today no widely accepted scheme exists to provide this identification. The development of a digital author identifier should be addressed through funding support and mandates.
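To illustrate how persistent dataset identifiers and disambiguated author identifiers fit together, here is a minimal sketch that assembles a data citation from structured metadata. The record fields, the DOI, and the ORCID-style identifiers are invented placeholders, not real records.

```python
# Minimal sketch: formatting a dataset citation from structured metadata.
dataset = {
    "authors": [
        {"name": "A. Researcher", "orcid": "0000-0000-0000-0000"},
        {"name": "B. Researcher", "orcid": "0000-0000-0000-0001"},
    ],
    "year": 2012,
    "title": "Example structural biology dataset",
    "repository": "Example Data Bank",
    "doi": "10.0000/example.12345",
}

def format_citation(ds):
    """Render an author-year style data citation with a resolvable DOI link."""
    authors = "; ".join(a["name"] for a in ds["authors"])
    return (f"{authors} ({ds['year']}). {ds['title']}. "
            f"{ds['repository']}. https://doi.org/{ds['doi']}")

print(format_citation(dataset))
```

Because the author entries carry identifiers rather than bare name strings, two researchers with the same name remain distinguishable when the citation is indexed.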
Incentives for Data Sharing: Data sharing mandates associated with funding are not enough. The infrastructure to support such sharing must also be available. This is particularly important for the "long tail," the many small datasets that presently do not fit well within any repository yet are necessary for scientific advancement and reproducibility. Data sharing also needs to be incentivized by current and anticipated reward systems. Associating data with the respective publications based on these data is an important first step. Data journals, as well as data-journal article interoperability, need to be supported wherever possible. As scientific reporting continues to expand beyond the traditional journal into hubs and social media, data association needs to be promoted here also.

Support Needs: Data curation as a profession is undervalued, and steps need to be taken to improve that recognition. Likewise, professionals supporting data and the infrastructure to make those data available need to be recognized and suitably supported. Scientific advancement is increasingly dependent on high-quality shared data, and hence support should be a priority rather than a secondary consideration. New mechanisms for reviewing data resources must be established. The criteria are necessarily not the same as for traditional research grants, and the reviewers must be drawn from the various stakeholder groups so as to ensure that their needs are met. Data are international, while most funding models to support data are national, and this needs to change. The worldwide Protein Data Bank (wwPDB) is an example of what can be achieved to benefit the international scientific community working around the current models; however, efforts like it should have access to international funding models.
03/08/2012 at 10:36:20 AM Organization Federation of American Societies for Experimental Biology Bethesda, MD Please see attachment. Please see attachment. Please see attachment. The Federation of American Societies for Experimental Biology (FASEB) appreciates the opportunity to respond to the Advisory Committee to the National Institutes of Health (NIH) Director Working Group on Data and Informatics (WGDI) Request for Information (RFI) (NOT-OD-12-032). FASEB is composed of 26 scientific societies collectively representing over 100,000 biomedical researchers. We support the engagement of NIH in this critical effort to develop recommendations regarding the management, integration, and analysis of research and administrative data, and we hope that the current RFI will be the start of an ongoing dialogue. We urge the working group to provide the community with ample opportunity to review and provide feedback on the WGDI draft recommendations before they become policy. Below, we offer comments on issues that are of interest to the scientists and engineers FASEB represents.

One area of critical interest to the FASEB community is the development of data standards. Because data differ substantially in their production, value, use, and replaceability, a variety of locally optimized standards have been developed to meet the needs of individual research communities. While "universal" standards are theoretically appealing, in practice they have proven difficult, if not impossible, to implement. The WGDI must, therefore, avoid a one-size-fits-all approach and should consider a variety of data sharing models and standards to accommodate the diversity of data types. As individual communities begin to address these issues they could develop duplicative and perhaps incompatible standards, or they could work together to develop consistent, joint, merged standards. The WGDI should consider methods for encouraging and supporting the development of integrated and merged standards when the needs and opportunities arise.

In addition, FASEB endorses the concept of a broad/general consent for future research. While recognizing and protecting patient rights and privacy, NIH must also take steps to maximize the accessibility of patient data for responsible use by scientific researchers. One important step would be to develop a model that allows patients to permit the use of their data in unspecified future research, and enables them to allow or disallow certain categories of research. To maximize patients' ability to control access to and use of their data, the model should be based upon a graded series of choices that provide patients with clear explanations of what access and use of their data will be allowed. The current system, requiring active consent mediated by the patient's health-care provider, is inefficient and overly restrictive. We recognize that the Department of Health and Human Services is currently reviewing policies related to patient consent and that enhancing researcher access to patient data will require broad support from both research and patient communities. The WGDI should reach out to the patient community and consider their diverse concerns with regard to clinical data use/re-use. Because of broader developments in the United States healthcare system, such as the growing use of clinical imaging techniques, personalized genomics, and electronic health records, there are both significant challenges and enormous research opportunities associated with access to patient data.
The patient community should be engaged sooner rather than later in this advisory process.

Finally, funding agencies are reluctant to support infrastructure unless its use is primarily by investigators funded by the agency. This has resulted in fragmented and inefficient institutional mid-level information technology (IT) infrastructures (e.g., general-purpose computers, mass storage, and networking) that are more costly to develop and maintain than comparable, comprehensive, institution-wide frameworks. NIH spending on mid-level IT infrastructure must be systematic and coordinated. The Research Business Models Working Group of the National Science and Technology Council and the Federal Demonstration Partnership are both already working to streamline and improve the federal grant-making process. Working in conjunction with these bodies, NIH and its sister funding agencies should engage with one another and with grantee institutions to develop approaches for supporting mid-level IT infrastructure in a way that both meets agency needs and avoids inflicting operating inefficiencies on grantee institutions. We appreciate your consideration of our comments and look forward to working with you on these issues. Please let us know if we can be of further assistance.
03/08/2012 at 11:23:15 PM Organization div Informatics, Dept Pathology, UAB Birmingham AL NIH has fallen behind other fields and areas of human activity in enabling data science. The writing has been on the wall at least since 2006, from the web inventor's own http://www.w3.org/DesignIssues/LinkedData.html to the federal government's http://www.data.gov/. Initiatives like The Cancer Genome Atlas have had to wait for the demise of caBIG in 2011 to make their data available on the web - i.e., programmatic access via HTTP, as in https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/. It would be fantastic if NIH could move to the cutting edge of informatics, but right now the most important goal would be to catch up with the developments in the field.
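A minimal sketch of the programmatic HTTP access the comment describes: a plain GET request, with no special client software, is enough to retrieve open data. The URL is the TCGA directory cited in the comment and may no longer resolve; the error handling and output are illustrative.

```python
import urllib.request

# TCGA directory cited in the comment above; it may no longer be available.
URL = ("https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/"
       "distro_ftpusers/anonymous/tumor/")

try:
    with urllib.request.urlopen(URL, timeout=30) as resp:
        listing = resp.read().decode("utf-8", errors="replace")
        print(listing[:500])  # first part of the directory listing
except OSError as err:
    print("request failed:", err)
```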
03/09/2012 at 10:32:34 AM Self UTHSCSA San Antonio, Texas There is a dangerous lack of recognition of the importance of maintaining relevant computing capacity and infrastructure that has been developed by individuals and small groups using NIH funding: projects that address a large community of users who depend on their continued development and operation. My group just lost funding for an NCRR R01 grant (now NIGMS) that scored at the 19th percentile and was outside the funding range. No alternative/appropriate funding was available from either NSF or NIH to address this particular situation. This forced me to lay off exquisite expertise that would take years of unproductive work to replace and rebuild. A programmer who is intimately familiar with a complex program that extends over many sub-components (database, MPI/HPC, GUI, web, grid middleware) with hundreds of thousands of lines of code is much harder to replace and retrain than a lab technician familiar with biochemistry techniques. Losing a system administrator who has intricate knowledge of the interplay of multiple parallel compute clusters and redundant, replicating relational databases, who can maintain many different platforms, and who can assure continued operation of a computational infrastructure that serves very high-end computational demands for hundreds of NIH investigators is a death sentence not just for my project, but for many investigations performed by the community of NIH investigators, who will lose their experimental data and analysis results housed in these databases and analyzed with these supercomputers. What is needed is guaranteed continuity in the operation of such vital infrastructure so it can be at least maintained for the investigators using it, if not further developed. Our grant provided many support letters from NIH-funded investigators that proved the importance of our work. I propose a specific program available to those of us who need to maintain computing infrastructure for a group of NIH investigators, students, and laboratories.

Please see the attached document, which illustrates our situation and the specific impact NIH's policies had on important research.

I would like to see a less dogmatic emphasis on translational goals and more emphasis on basic science research. Translational science cannot succeed without the benefit of solid basic research. It has long been the responsibility of the NIH and the NSF to support basic research. Recently, with the termination of NCRR, this emphasis has shifted towards other goals whose priority is, in my opinion, misplaced.

Text of attachment repeats comments 1 and 3 above, then continues: Overview: Our laboratory has been developing a comprehensive, open-source, multi-platform software package for the computational analysis and modeling of hydrodynamic data (UltraScan, http://www.ultrascan.uthscsa.edu). This software also contains a web-based component, called the UltraScan Laboratory Information Management System (US-LIMS, http://www.uslims.uthscsa.edu).
The US-LIMS component integrates data management, web submission of high-performance computing (HPC) analysis jobs to remote supercomputing clusters offered by NSF XSEDE (formerly TeraGrid) and UT systems, and acquisition of related information for experiments and experimental design; importantly, it facilitates collaboration and exchange of experimental data and analysis results among investigators from many institutions. The overarching goal of the UltraScan software suite and all of its components is to make arcane biophysical analysis approachable for technicians and non-expert graduate students so they can make efficient use of these powerful technologies. The purpose is to hide unnecessary complexity and details of the computer implementation, and to present a simple and intuitive interface for complex analysis without sacrificing flexibility.

Funding: The development has been funded by one NIH R01 grant (through the NCRR), a Ruth Kirschstein postdoctoral training award, an associated ARRA award, the NSF computational biology program, and the NSF Office of Cyberinfrastructure. All but the Kirschstein award expired and failed to renew despite being very close to the funding level.

Software Layout: The software is organized into several modules. An extensive C++ GUI interface for Windows, Mac, and Linux offers integrated data management with a MySQL database backend, pre-processing, analysis and visualization (2D and 3D), molecular modeling, and report generation. Each institution with data generation capability, i.e., laboratories with analytical ultracentrifuges, light scattering, and small-angle X-ray/neutron scattering data, which may have multiple users, has a MySQL database instance. This database stores the data according to the efficient OpenAUC standard data formats, and provides custom support for managing diverse datasets in a comprehensive data federation. A message passing interface (MPI) HPC component provides optimization and data modeling capable of running on parallel clusters. The software is currently implemented on the XSEDE/TeraGrid clusters Ranger and Lonestar, on the UTHSCSA clusters Alamo and BCF, on clusters in Europe and Australia, as well as at several private companies that prefer to process their data internally. A web component allows analysts to access and manage data through a browser interface and submit compute jobs to the HPC infrastructure. Collaborators and users can access the results and prepare reports of their data from data and analyses stored in the database through the web component. Data can be shared selectively according to user-selected permission levels. The GUI module, the HPC component, and the web component all access the MySQL database for elements needed in their tasks. Communication between the web component and the HPC instance occurs using the Globus GRAM-5 grid middleware infrastructure, and over UDP channels. This provides a queue viewer where analysts can conveniently follow the progress of remote supercomputing jobs.
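The layout described above follows a common pattern: loosely coupled components (GUI, web, HPC) share state through a central database that also acts as a job queue. The sketch below illustrates that general pattern only; it is not UltraScan code, and it uses SQLite in place of the MySQL backend and Globus middleware the comment describes, just to keep the example self-contained.

```python
import sqlite3

# Illustrative only: a shared database as the coordination point between a
# web front end (which enqueues jobs) and an HPC worker (which claims them).
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    dataset TEXT,
    status TEXT DEFAULT 'queued')""")

def submit_job(dataset):
    """Web component: enqueue an analysis job for a dataset."""
    cur = db.execute("INSERT INTO jobs (dataset) VALUES (?)", (dataset,))
    db.commit()
    return cur.lastrowid

def claim_next_job():
    """HPC component: claim the oldest queued job (a real system would lock)."""
    row = db.execute(
        "SELECT id, dataset FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row:
        db.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
        db.commit()
    return row

submit_job("experiment-001")
print(claim_next_job())  # (1, 'experiment-001')
```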
Impact: By conservative counts, over 300 self-reported, peer-reviewed publications have already resulted from this software development effort (http://ultrascan.uthscsa.edu/references.php). This likely represents a significant under-estimate of the true count, since many investigators forget to report their publications in our database. Searching Google Scholar, over 16,000 publications reference UltraScan or specific methods implemented in the UltraScan software. UltraScan licenses are renewed yearly and currently number between 700 and 1,000 subscribed users per year worldwide. The Center for Analytical Ultracentrifugation of Macromolecular Assemblies (CAUMA), a service core facility for hydrodynamic experiments, which uses the UltraScan analysis software and has its own database, reports 438 investigators who have performed 2,795 experiments containing a total of 5,325 samples over the past 8 years of its operation. Currently, nearly 70 institutions in addition to CAUMA rely on this infrastructure and will be affected by its loss if NIH does not agree to fund it now.
03/09/2012 at 12:29:37 PM Organization University of Pennsylvania Philadelphia, PA 1. Data standards, reference sets 2. Support needs

The data explosion fueled by advances in sequencing technologies has greatly influenced our view of biology and our methods for studying it. Large-scale data analyses using computational approaches have emerged as a useful tool, not only providing a genome-wide view of the data but also suggesting new hypotheses that can then be tested by experimental validation. It is expected that data-driven approaches will become even more popular in the near future. Large-scale efforts to generate reference data, such as the Roadmap Epigenomics Mapping Consortium and the ENCODE (Encyclopedia of DNA Elements) project, have contributed to the accumulation of cell-type-specific datasets as well as the development of computational algorithms. We can expect more cell-type-specific data; for example, the International Human Epigenome Consortium (IHEC) aims to map 1,000 reference epigenomes within a decade.

In the near future, more epigenomic data will be generated to show epigenomic variation following treatment or disease. This data accumulation needs to be well coordinated to make the generated data referenceable and to avoid wasting data generation efforts. To generate well-curated reference data for translational research, a specialized consortium for each subject may be required. Each consortium could decide the cell type (including treatment, time, condition, etc.) and antibody, carefully considering the impact of the data generation on the field. A good model is the Beta Cell Biology Consortium, where the selection of cell type, treatment, and antibody is well discussed. Also, we need to foster computational biologists specialized in each field. Computational approaches have become a powerful way to study biology. As the number of datasets increases, computational models need to incorporate sophisticated expert knowledge in their design and selectively integrate useful data from the deluge of information. To be competent and successful in each field of biology, it is often required that algorithms be designed for the specific questions at hand rather than for general purposes. Specialized computational biologists are required in this regard. Often, many organizations within NIH do not see computational approaches as a driving force. This is in part because there have not been many successful cases based on computational approaches. On the other hand, it may also be because there has not been much effort to support new ways of studying biology. More and more researchers in computer science have accumulated experience in biology, and nationwide efforts have produced researchers trained in both biology and computer science. Among them will be researchers who will play a leading role in the study of biology. Supporting computational approaches in specific fields will facilitate algorithmic development in each area of biology. Therefore, I think NIH sub-organizations need to support computational approaches to foster computational specialists in each field. As a result, each organization within NIH will benefit from the accumulation of high-quality data as well as well-designed analytical methods.

I think analytical and computational workforce growth is more important. Generating reference data has been discussed in many ways. However, fostering a computational workforce has not been discussed, and its support from each NIH sub-organization is very limited. There should be a way to systematically support algorithmic development and foster experts who can contribute to generating data-driven hypotheses in each field of biology. NIH can support algorithmic development by funding independent computational approaches or collaborative efforts in each field of biology where analyzing large-scale data is required. These efforts can result in success stories and foster independent PIs who can lead projects with data-driven hypotheses.
03/09/2012 at 12:42:02 PM Self     A critical issue in the use of genomic data to drive personalized interventions/medicine is the interpretation; i.e., going from the $1,000 genome to the $1,000,000 interpretation. In addition, there is a lot of non-genomic data associated with patients, such as imaging, EEG, biochemical workup, and clinical situation and/or reaction to therapeutic interventions. I believe Quantitative Systems Pharmacology (which is different from Systems Biology) is a possible way to encompass many of these different readout modalities. Systems Biology is not appropriate for combining different modalities, because doing so necessitates constraint by neurobiological understanding. By simulating and modeling actual physiological processes in a biophysically realistic way, we might start to include all that information in an actionable framework. Besides providing a real translational tool, such an approach, because it is based upon both preclinical physiology and clinical data, could also be helpful in identifying new targets and testing new hypotheses in the clinic.     PDF copy of article "Mechanistic Disease Modeling as a Useful Tool for Improving CNS Drug Research and Development" by Hugo Geerts, published by Drug Development Research in 2011
03/09/2012 at 03:25:53 PM Organization Group Health Research Institute Seattle, Washington An area that we believe is worthy of consideration is that of secondary use of data. Legal and ethical considerations complicate the sharing of data for purposes outside the original aims. For example, subjects consent to the use of their data for a specific study; current institutional review board regulations prevent use of those data in other studies. Thus, guidelines and standards around comprehensive patient consent procedures that allow secondary use of data and data sharing are needed. Further, the extent to which data could be shared is constrained by questions of ownership of the data. Funders may feel that taxpayers supported the creation of study-specific data, so that NIH would own the data on behalf of taxpayers. However, in cases where researchers work at health care organizations and build datasets based on the organizations' data, the parent company may reasonably argue that it owns the data and that NIH's contribution was a modest value-add. Health care organizations have both a need to shelter their data to protect their business from competition and reputational risk and a duty to safeguard the confidentiality of their patients. Scientific investigators also have a stake in the ownership of the research data; since they invested their knowledge, including knowledge acquired outside of the study-specific work, and since their careers are driven by successful competition with other knowledge workers, they will be reluctant to share valuable research data without some incentive or benefit for doing so.

Another area worthy of consideration is that of unrealized research benefits. A specific example of this is the wealth of new clinical data, both structured data fields and unstructured text, being captured by electronic medical records (EMRs). Before the advent of EMRs, detailed information from medical records could be extracted only by manual chart abstraction. EMRs' rich offerings make possible research on larger cohorts and on topics not previously amenable to computerized research. However, exploration of EMRs takes time and requires the development of new skills. Until researchers know what is accessible in the new EMRs, and until they have experience in judging how much time it takes to use the new data, they are not able to include in a grant proposal a commitment to work with these new data. And without grant support, researchers do not have the funding to support exploration of these new areas.

Also worthy of consideration is the subject of models and technical solutions for distributed querying. Several new tools to support distributed querying are currently being developed. If those tools could be set up to share data across sites, this could streamline feasibility explorations and help researchers home in quickly on workable research questions. NIH could support the use of distributed querying tools by helping to put in place appropriate protections for shared data and appropriate incentives for sharing.
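A minimal sketch of the distributed-querying idea: each site runs the query locally and returns only an aggregate count, so no row-level data ever leaves the site, and a coordinator sums the answers. The site names and records are invented for this sketch; a production system would add access controls and small-cell suppression.

```python
# Invented site data for illustration.
SITES = {
    "site_a": [{"age": 67, "dx": "diabetes"}, {"age": 54, "dx": "asthma"}],
    "site_b": [{"age": 71, "dx": "diabetes"}, {"age": 69, "dx": "diabetes"}],
}

def local_count(records, predicate):
    """Run the query inside a site; only the count is returned."""
    return sum(1 for r in records if predicate(r))

def federated_count(predicate):
    """Ask every site for its local count and sum the answers centrally."""
    return sum(local_count(recs, predicate) for recs in SITES.values())

print(federated_count(lambda r: r["dx"] == "diabetes" and r["age"] >= 65))  # 3
```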

A fourth area that we believe is worthy of consideration is improved efficiency of data access requests. The process of getting IRB approval takes valuable time from investigators, project managers, and IRB staff and causes delays in starting work on projects. For data-only studies that pose very low risk to subjects, a more efficient approach seems warranted. One such approach might be to designate de-identified data repositories as resources that could be used by qualified local staff without IRB approval, or with the approval of IRB staff without the full committee and without the full complement of applications and project descriptions necessary for traditional IRB approval. It would be helpful if NIH could offer funding to support the development and streamlined use of such resources.

We identified two issues that are critical for NIH to address. One is that of winning over scientific investigators and institutions to the practice of data sharing. Without their support, data sharing will not move forward, because they control the use of data. Developing incentives and providing support and streamlined processes could go a long way toward gaining the cooperation of investigators and institutions.

The other critical issue for NIH to address is that of support, especially funding for tool development, implementation, maintenance and support, and algorithm development. Research could become much more efficient and much more standardized if research institutions had sufficient and dependable support for data infrastructure. Ideally, an institution could have a data environment, including a data warehouse, query tools, documentation, and a set of standard procedures and algorithms, that could be used by all research projects within the institution. If such an environment were created and maintained as a cross-project resource within the local institution, the efficiencies realized would accelerate the development of new data areas, tools, and methods, allowing project teams to expand their scope of research without increasing cost. Support for a centrally maintained cross-project data environment would also make possible regular quality assurance checks. In contrast, the current funding mechanism supports project work but is actually hostile to development of the data infrastructure on which projects are based. Projects are able to develop only the specific data needed for the project at hand, and are not able to extend their work to develop infrastructure that could easily be used on other projects. If it were possible to devote more time to development of shared tools, any research project that used such a tool would reap an immediate benefit, and future project work could be carried out more quickly and at lower cost. Similarly, if it were possible to devote more time to quality assurance of shared data resources, any particular project could be more confident of the data it was using, any errors discovered could be more quickly addressed, and the fixes would benefit all subsequent projects. As the variety and complexity of data sources grows, and as more and more researchers attempt to use these complex sources, the need for more standardized approaches becomes more pressing. If NIH were to find a mechanism for supporting development of shared tools, shared processes, development of new data sources, and regular maintenance and quality checking, this would make reproducible, high-quality, quick-turnaround research more achievable.
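As one small illustration of the shared-tools idea, a centrally maintained, reusable function with a built-in quality check can replace per-project copies of the same logic; every project that calls it benefits from a single fix. The field names and the age plausibility rule below are invented for this sketch.

```python
# Illustrative shared library function: one centrally maintained cohort
# definition with a built-in data-quality check, instead of each project
# re-implementing its own copy.
def diabetes_cohort(patients):
    """Return patients meeting a shared, centrally maintained cohort definition."""
    for p in patients:
        if not 0 <= p["age"] <= 120:  # shared QA check, fixed once for everyone
            raise ValueError(f"implausible age in record {p['id']}")
    return [p for p in patients if p["dx_code"].startswith("250")]  # ICD-9 250.x: diabetes

patients = [
    {"id": 1, "age": 64, "dx_code": "250.00"},
    {"id": 2, "age": 41, "dx_code": "493.90"},
]
print(diabetes_cohort(patients))  # only record 1 qualifies
```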

One way that NIH could address the issue of funding for tool development, implementation, maintenance and support, and algorithm development would be to include in every funded project a fraction of funding to be devoted to infrastructure development, with an understanding that the development would go beyond the needs of the specific project. Although this would appear to raise the cost of any particular project, in fact, over time, the development of more robust systems would make research more efficient.

A significant disincentive to sharing data is that making data available to others increases the competition among investigators using a given data source; that is, the one who shares the data then competes with those with whom s/he has shared. We suggest that sharing could be incentivized by offering an adjustment to scores on future grant proposals. This would seem fair: the person who helps his/her competitors by providing data would himself/herself be given some help in future competitions.

A problem that NIH may be able to help address is the question of how to provide funding to support data sharing. On occasion we have been asked to share data after a study is completed and the budget closed. Even if we are then given a modest amount of funding from a different source to cover the cost of creating the datasets and documentation of sharing, we do not have a good mechanism for handling small amounts of funding not tied to an existing project. The time it takes to execute a contract for such a data-sharing arrangement is very costly compared to the amount of funding we are trying to collect. It would be helpful to find a better mechanism to support the funding of data sharing.

Another area of concern is that even when organizations want to share data with each other, the Institutional Review Board restrictions and Data Use Agreements (DUAs) on multi-site projects become a significant barrier. Sometimes the differences in verbiage in DUAs at various sites require much negotiation to resolve, so the actual sharing of data is significantly delayed. Some DUAs prohibit sharing data with for-profit entities, or in some other way restrict sharing. It might be helpful for NIH to provide or officially endorse a set of Uniform DUAs or other form of standard language for DUAs to increase the likelihood that the various organizations on a multi-site project would all adopt the same language in their agreements, and would all be starting from a position favoring data sharing.

Text of attachments same as that of comment boxes.
03/09/2012 at 07:01:04 PM Self     1. Data sharing--I start with the presumption that sharing is desirable, so the issue is what to mandate for NIH funded research and how to enforce those mandates. I realize that there is an NIH policy on data sharing, but the committee should revisit this policy to see if it should be strengthened. I believe it probably should be. 2. Sharing human genomes--The economics of sequencing relative to the costs of health care, combined with the potential of personalized medicine, lead to the conclusion that millions of human genomes in the US can and should be sequenced and connected with phenotypic markers in a way that preserves individual privacy, so that they can be mined for research related to personalized medicine. Issue #2 above is the most important single data issue that I can think of, in terms of the future progress of biomedicine. A major reason for the drying up of the pharmaceutical pipeline, and a major limitation in effective use of existing pharmaceutical therapies, is the inability to know which individuals will be harmed or beneficially affected by particular drugs. Side effects for a fraction of people keep drugs that would benefit many people off the market. Inability to predict effects of particular drug choices limits physicians' ability to choose properly among alternative therapies. The issue is not limited to pharmaceuticals. Genetic differences influence individual variation in response to diet, physical activity, exposure to environmental factors, and even surgical procedures. When we do mouse experiments, we use mice that are genetically practically identical with each other, because the genetic differences would confound our results. Yet in many ways we practice medicine and evaluate therapies in clinical trials as though human genetic variation is not an issue. The economically feasible availability of individual human genomic data implies that we now can, and for the sake of optimizing future progress must, face this issue. The highest possible priority should go to meeting the need to utilize individual human genomes in basic and translational research and clinical practice.
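One widely used technique for the kind of privacy-preserving linkage this comment envisions is keyed pseudonymization, sketched below in Python; the broker key and identifier format are illustrative assumptions, not part of any NIH system:

    # Minimal sketch: a keyed hash (HMAC-SHA256) turns a direct identifier
    # into a stable pseudonym, so genomic and phenotypic records can be
    # joined without exposing the identifier itself.
    import hmac
    import hashlib

    BROKER_KEY = b"replace-with-a-secret-held-by-an-honest-broker"  # illustrative

    def pseudonym(medical_record_number: str) -> str:
        """Derive a stable, non-reversible research ID from a direct identifier."""
        digest = hmac.new(BROKER_KEY, medical_record_number.encode("utf-8"),
                          hashlib.sha256)
        return digest.hexdigest()[:16]

    # The same identifier always maps to the same research ID, so genomes and
    # phenotypes collected at different times can be linked by pseudonym alone.
    assert pseudonym("MRN-0012345") == pseudonym("MRN-0012345")

Because the key never leaves the honest broker, holders of the pseudonymized genomes and phenotypes cannot reverse the mapping to individual identities.
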
03/10/2012 at 05:25:32 PM Self     Please see the attached memo for more detail. In this memo, I will suggest two issues that I believe to be of critical importance in your work: the creation of a well-supported data framework with open source software, capable of handling high data volumes and high data rates efficiently, and changing the focus of the discussion from a funding-agency-by-funding-agency discussion to a discussion that crosses agency boundaries. My background is in mathematics, computer science, bioinformatics and structural biology. Of particular relevance for this discussion, I have been involved with imgCIF and HDF5. I have been handling issues of scientific data collection and supporting software for more than four decades. As you have so well articulated, there are many important issues arising from your charge to do a "thorough and comprehensive evaluation of the issues surrounding the management, integration, and analysis of research data and administrative data." The needs of the NIH in these areas clearly have much in common with the needs of other funding agencies, such as NSF and DOE, particularly when one looks at the data for structural biology. This provides an opportunity for highly leveraged efforts but also the risk that certain essential elements will "fall between the cracks," being unfunded and undone. The area about which I have the greatest concern is the handling of massive flows of raw data collected at major facilities for use elsewhere, and perhaps to be reduced and combined with data collected elsewhere. This is best exemplified right now by x-ray crystallography for structural biology done at synchrotrons, where pixel array detectors are producing 250 megabytes per beamline per second of data now and are headed towards 5-10 gigabytes per beamline per second in a very few years. As computers and scientific instrumentation continue to improve at a rapid pace, there is every reason to expect similar and larger data rates from many science-related disciplines. 
Unfortunately, it is precisely the success of the science and engineering efforts that allow this massive flow of data to be generated that has given our community a false sense of security that the science and engineering that will be needed to manage and digest this flow will be there when it is needed and need not be explicitly planned for and funded. There is a common and mistaken belief that all one needs to do is to choose or invent some suitable format and, with disks and computers becoming cheaper, just wait for sufficient hardware to support any given data rate to become available and affordable. This has resulted in unreadable data sets or, worse, misread data sets and many terabytes of lost data that could have been retained with better planning. We need a community-wide effort to define an agreed data framework within which multiple data formats, many of which will not be defined until many years from now, can be properly supported. We need fully funded software engineering efforts to create and maintain open software and to maintain central and distributed archives of metadata ontologies that follow (or, better, lead) the changes in formats and metadata terms within the framework. We also need the hardware infrastructure to move and store actual data. Of these needs, only the last, the hardware infrastructure, has an immediately available feasible solution - make more use of the excellent commercial data storage and networking facilities that have grown over the past decade. Adequate funding for the storage of data is an essential aspect of any solution. Costs now range from $300 to $3000 per terabyte per year. The rest - the framework for formats and metadata and open, robust software support - are crying needs we would do well to address, and address promptly. Real solutions will take years of intensive effort to perfect. You have identified issues related to these questions, but the reality is that, at present, no funding agency has the responsibility and resources to do the very real, detailed work needed to create an agreed common physical and software infrastructure for practical long-term management and archiving of the data flows we are now seeing, much less the data flows that are coming soon. The solution I have most often heard discussed is to give up on preserving the massive raw data flow, reduce the raw data at each beamline, produce structure factors or even solved structures immediately and simply discard the "unmanageable" raw data. This is, unfortunately, the best we are likely to be able to do in some cases, but making it the default approach is a mistake. One need only look to the recent effort to inject fraudulent structures into the PDB to realize that we need to at least try to preserve the raw data. But if we are to preserve the raw data in a meaningful way, we need to reliably associate it with its metadata and put it into a well-documented format supported by open, documented software that achieves the high performance needed for these data rates. There is no quick, easy, permanent solution to this problem. Certainly there are available software-supported formats. My own experience is with imgCIF and HDF5. There are others. Without hardening and support, no existing format will survive the inevitable changes in requirements. Without a long-term funding model, whatever we put in place will suffer from bit-rot. Without periodic community outreach, any raw data format effort risks drifting into irrelevance. 
Right now different countries and different scientific organizations are struggling to find the right way forward. We have had multiple workshops. We have multiple committees. Perhaps the result will be a solution. I suspect not. Until long-term funding is provided to address this issue, we will lose valuable data, both by discarding it and by "preserving" it in forms we discover we cannot read later. I urge NIH to work together with NSF and DOE to set aside some reasonable amount of funding for the creation and support of an agreed common data format with robust, high-performance, open software. I recognize that in the current economic and political environment, it is not feasible to do more than start on such an effort, but, if it starts well, it may well be possible to interest large commercial players, such as Google, pharmaceutical companies, and detector companies, in contributing to the long-term costs of such an effort and perhaps towards the much larger expenditures needed for a real world-wide archive of raw scientific data.
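As a concrete illustration of the commenter's point that raw frames are only durable when stored together with their metadata in an open, software-supported format, here is a minimal sketch using HDF5 via h5py (one of the formats the commenter names); the beamline parameters below are illustrative values, not a proposed standard:

    # Minimal sketch: detector frames and their acquisition metadata travel
    # together in one self-describing HDF5 file.
    import numpy as np
    import h5py

    frames = np.zeros((10, 2048, 2048), dtype=np.uint16)  # stand-in detector data

    with h5py.File("scan_0001.h5", "w") as f:
        dset = f.create_dataset(
            "entry/data",
            data=frames,
            chunks=(1, 2048, 2048),   # one frame per chunk for streaming reads
            compression="gzip",
        )
        # Metadata lives with the data, not in a separate notebook or README.
        dset.attrs["wavelength_angstrom"] = 0.9795
        dset.attrs["detector_distance_mm"] = 180.0
        dset.attrs["exposure_time_s"] = 0.02
        f["entry"].attrs["facility"] = "example synchrotron beamline"
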
03/10/2012 at 10:37:25 PM Organization OBI Consortium Not applicable A looming critical issue for the management, integration, and analysis of large biomedical datasets is that a new data silo is generated with each data-generating project or consortium. Standards are recognized as essential to avoid these silos and enable efficient and effective data sharing. Much discussion has rightly been held regarding meta-data (such as minimal information checklists) and formats. Shared semantics must also be employed so that the meta-data associated with datasets can be understood, searched, and used for learning. Much attention has been given to the semantics of particular types of meta-data, such as describing the anatomical source of a sample used to generate a dataset or the disease state of that source. However, equal attention needs to be paid to the semantics of the experimental investigation that generated the dataset, including the objectives, the protocols, the collection of specimens, the assays used, the data processing, and even the people involved and the roles they played. As the number of datasets grows, text-based searches to find the instruments, reagents, cell lines, biospecimens and services utilized will be limited and incomplete without a common semantic framework. Understanding the experimental evidence for conclusions drawn on datasets is also not scalable without a common semantic framework, as each would need to be examined one at a time. These issues impact everyone trying to make use of datasets and associated specimens, such as clinical researchers struggling to relate specimens from different biobanks using different terminologies, scientists in consortia sharing data internally and trying to consume data from other consortia, or institutions trying to simply inventory what datasets have been generated by their funded researchers. The NIH should address how to avoid the continued creation of data silos by supporting and promoting the use of a common semantic framework where possible. Importantly, rather than supporting the development of novel 'standards' which struggle to find adopters, the NIH should support community-initiated efforts for standardized data representation. The Gene Ontology was developed by multiple model organism database developers who saw the benefits of collaborating on a common standard. Its wide adoption demonstrates the success of data standards developed collaboratively by researchers trying to solve practical problems. Under the umbrella of the OBO Foundry, several such community-driven ontology development efforts exist. A central component of the OBO Foundry is the Ontology for Biomedical Investigations (OBI, Brinkman et al. J. Biomed. Sem. 2010), which provides terms with precisely-defined meaning to describe all aspects of how biomedical investigations are conducted. OBI was established to address the needs of its over 20 member communities (these can be found at the OBI web site: http://obi-ontology.org/ and on the NCBO BioPortal: http://bioportal.bioontology.org/ontologies/1123). These communities realized that they needed a standard for representing experimental data that works across all types of experiments, rather than standards for individual techniques such as microarrays or flow cytometry. There was also recognition of the need to cover more than just 'omics research and to include translational research. 
Through a self-organized multi-year effort, OBI is now sufficiently stable and broad to support multiple applications of contributing community members. Using a semantic framework such as OBI would remove what is currently a nearly insurmountable barrier to scientists from different communities and consortia sharing data and resources. At present, a barrier to widespread adoption of OBI (and other similar community-developed ontologies), and in particular to adoption in a software production environment, is the lack of dedicated support for OBI users. Current OBI users are primarily also OBI developers. With dedicated funds available, community support could be provided in terms of training, outreach, error fixing, and enhancing the ontology based on user requests. NIH currently requires a data sharing plan to be described for certain grants, usually those associated with generating large datasets. As part of that plan, the semantics associated with the meta-data should be indicated. Of course, for such policies to achieve data sharing and integration, there must be an available relevant ontology (or ontologies such as OBI) covering the meta-data to be put to this use. Therefore, support for the development and maintenance of such an ontology (or ontologies) should be considered an NIH priority.
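As an illustration of the kind of annotation OBI enables, the following minimal sketch (assuming the rdflib package) links a dataset to the assay that produced it using OBO-style IRIs; the dataset namespace and the specific OBI/IAO identifiers are illustrative and would need to be verified against the released ontologies:

    # Minimal sketch: machine-queryable experimental provenance for a dataset.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS

    OBO = Namespace("http://purl.obolibrary.org/obo/")
    EX = Namespace("http://example.org/datasets/")   # hypothetical namespace

    g = Graph()
    dataset = EX["liver-rnaseq-0001"]
    assay = EX["assay-0001"]

    # Type the assay with the OBI 'assay' class (identifier shown is illustrative).
    g.add((assay, RDF.type, OBO["OBI_0000070"]))
    g.add((assay, RDFS.label, Literal("transcription profiling of liver biopsy")))
    # Type the dataset and link it to the assay that produced it
    # ('data set' and 'has specified output'-style relations, again illustrative).
    g.add((dataset, RDF.type, OBO["IAO_0000100"]))
    g.add((assay, OBO["OBI_0000299"], dataset))

    print(g.serialize(format="turtle"))

With annotations of this form, finding "all datasets produced by transcription profiling assays on liver specimens" becomes a structured query rather than a text search.
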
03/12/2012 at 09:26:05 AM Self     see attached document

Scope of the challenges/issues
(Research information lifecycle; challenges/issues faced by the extramural community; tractability with current technology; unrealized research benefits; feasibility of concrete recommendations for NIH action)

Major current challenges include diversity of source data, incomplete provenance, insufficient data coordination capabilities, sociological/cultural resistance, and insufficient funding to accommodate the necessary scope and depth of individual and interlinked studies. There are considerable differences across research groups in technological sophistication of hardware and software infrastructure, resulting in major incongruities in scale, format, and type of data collected, presenting a formidable challenge to data standardization and sharing. These technological limitations can in turn magnify at the level of centralized archiving, where the demands for operations standardization and informatics support capacity intensify. Specifically, complex computational infrastructures are needed that not only serve as repositories of or access portals to large-scale databases integrating multiple scales and types of data, but also provide critical functionalities such as flexible access control, logging, security, privacy, flexible interrogation, and, in the best of all possible worlds, sophisticated analysis pipelines, multidimensional visualization, and statistical toolkits.

To be effective, data management resources should achieve two central goals, both vital to the reiterative data analysis and reuse that underpin scientific discovery and advancement: (1) link raw, preprocessed, postprocessed, and analyzed data with complete descriptions of what was done to the data once uploaded, including software, parameters, etc.; and (2) tightly couple tools to the data, including data management, analyses, and dissemination formats and capacity for customization of data sets. The databasing, querying, scrutinizing, and processing of data sets from multiple subjects necessitate efficient and intuitive interfaces, open architecture to facilitate data mining, and comprehensive data descriptions, dictionaries, and provenance, with a view toward promoting independent re-analysis and study replication. Minimal requisites, the absence of which defines barriers to sharing, include site-to-site networking requirements; user authentication protocols; unique, confidential, and HIPAA-compliant subject identifiers linking individual subjects to both new and existing corresponding data; and data ontological definition, description, and management. Systems also need to provide flexible integration of data from multiple sources, with an interoperable, user-friendly, expandable database and web-based data entry system to facilitate the design, implementation, and harmonization of projects, and large-scale data storage capabilities. Moreover, 24/7 data uploading and downloading must be possible across the network, including secure password-protected web access. Provision of help-desk support via e-mail and/or web site is also a critical system component. Finally, data analytic support must be provided as needed to facilitate future discovery and validation projects.

Standards development
(Data standards, reference sets, and algorithms to reduce the storage of redundant data; data sharing standards according to data type, e.g., phenotypic, molecular profiling, imaging, raw versus derived, etc.)
In the context of computationally intensive science conducted in highly distributed network environments across a diverse range of population-based investigations, effective coordination of data from multiple studies and sites, as well as interfacing with other centralized data hubs, begins with comprehensive development of common rules for specifying what data are encompassed in the archive, the names of the data and scheme of partition, and the online access pathways for system users. At a minimum, the efficacy and efficiencies of e-infrastructures depend on organizational prowess, comprehensive data description and provenance, and capacity to execute singular queries across multiple databases. Coordination of acquisition sites for data uploading is a key factor, as is coordination of databases (or synchronization mechanisms if a federated archive is deployed) by data type, e.g., image data vs. genetic data. Biospecimen banking may be optimally conducted elsewhere or separately from the data coordinating center, with the biomaterials enterprise interlinked for data access and integration as needed by project or user. All of these key factors ultimately converge on the single but multi-faceted issue of trust - specifically, confidence in the data archive and its supporting systems. That "consumer" confidence results from applied expertise on several levels: First is data flow, as the acquisition sites collect study data and enter/upload them into the coordinating center repository through an archive interface, along with quality control and any preprocessing of the data. Next is data integration, for coordinating centers that handle multiple types of data. For example, a subset of data from a clinical database can be integrated with an imaging data archive to support richer queries across the combined set. Other models may create heterogeneous but integrated archives housing all data types. Infrastructure requirements include a robust, flexible, and reliable infrastructure for supporting a resource intended to serve a global scientific community. The infrastructure must provide high performance, security, and safety (e.g., backup) at each level, as well as precision accounting of data provenance, feedback (e.g., system status), and comprehensive logic and range checks at upload for data quality assurance (e.g., subject age changes in longitudinal studies, range limits in clinical variables). Simplicity in ease of use is a virtue, for all data uploaders and downloaders, enabling both batch- and single-mode processing. Fault-tolerant infrastructure cannot have any single points of failure, catastrophic or otherwise.

Secondary/future use of data
(Ways to improve efficiency of data access requests, e.g., guidelines for Institutional Review Boards; legal and ethical considerations; comprehensive patient consent procedures)

The database access committee, a standard oversight body requisite for every e-infrastructure, is optimally multidisciplinary in composition for proper and informed vetting of all access requests across the disciplinary spectrum of potential users. While access to de-identified data should be as free and open as possible to all, responsible administrative governance includes review and record-keeping on proposed use of the data by each access applicant. 
Restrictions on independent redistribution of data by users are also advisable as a measure to ensure that (1) the data remain clean, free from the corruption that frequently occurs in independent retransmissions by parties ill-equipped in hardware and software capacity to handle such activities; (2) there is proper acknowledgment of the originating centralized database/source; and (3) there is accurate record keeping/accounting on data usage, which is important for documenting and benchmarking access activity levels for current and future funding support of the e-infrastructure. Overall, several factors contribute to database utility, including whether it actually contains viable data along with detailed descriptions of acquisition (e.g., meta-data); whether the database is well organized and the user interface is easy to navigate; whether the data are derived versions of raw data or the raw data themselves; the manner in which the database addresses the sociological and bureaucratic issues that can be associated with data sharing; whether it has a policy in place to ensure that requesting authors give proper attribution to the original collectors of the data; and the efficiency of secure data transactions. These systems must provide flexible methods for data description and relationships among various meta-data characteristics. Moreover, those that have been specifically designed to serve a large and diverse audience with a variety of needs, and that possess the qualities described above, represent the types of databases that can have the greatest benefit to scientists looking to study a disease, assess new methods, examine previously published data, or explore novel ideas using the data. Newly received data should be placed into quarantine status and queued for quality assessment. Metadata elements can then be extracted and inserted into the database to support optimal storage and access, and customized database mappings can be created for various file formats. Once raw data undergo quality assessment and are released from quarantine, they can become immediately available to authorized users. The same process can be used for post-processed data, enabling analysts to share processing protocols via descriptive information contained in the XML metadata files. To facilitate the discovery process through secondary analyses and data repurposing, database access is optimally free of charge to authorized investigators, regardless of location or primary discipline, with costs of data management and curation underwritten by each e-infrastructure's funding source(s) (mostly NIH), at realistically sufficient levels of funding support. Fee-for-access, even by a sliding scale arrangement, encumbers discovery science by limiting it to the financially privileged. Establishing and maintaining a level playing field in access, scientific community-wide, is thus vital to the data informatics or e-structure enterprise.

Data accessibility
(Central repository of research data appendices linked to PubMed publications and RePORTER project record; models and technical solutions for distributed querying; comprehensive investigator authentication procedures)

Common roadblocks to access and use of data in a distributed network are the failure to secure a consensus distribution policy prior to study start-up and inadequate infrastructure capacity in data service. For example, there may be significant differences across prospective sites and investigators with respect to preferred format or interface. 
In other cases, there may be consensus on policy but the projects have only primitive systems for dissemination, hampering access and use. The latter scenario may also reflect system scalability, manageability, and stability performance issues, as well as absent or unreliable real-time failover and load balancing. Beyond access, inadequate or unreliable data search and interrogation tools block or limit the usability of a database. Tools must be specifically developed to provide support for designing data queries to facilitate database access, utilization, and data mining; assist with use and understanding of common data elements; and promote open architecture to enable software development for data mining. Essential outreach is needed to encourage data standardization, data sharing, and usefulness of the resource; and a help desk is needed to provide electronic and live assistance to all users of the resource. To ensure that clinical data are maintained securely and that subject identification is protected, the data archive must be designed to achieve (1) HIPAA-compliant protection of patient privacy through integrated data de-identification components; (2) strict and comprehensive access controls to ensure data are accessible only to authorized individuals; (3) tracking of all data accesses to provide an audit trail so that project managers may understand who accessed their data and in what way; and (4) an informed and efficient governing body such as a Data Sharing and Publication Committee. Optimally, different levels of user access control the system features available to an individual investigator. All data uploads, changes, and deletions should be logged. An online application and review feature should be integrated into the data archive so that applicant information and committee decisions are recorded in the database. In addition to controlled, tiered access to de-identified clinical data, to augment the network-based security practices and to ensure compliance with privacy requirements, the hardware infrastructure servers should utilize SSL encryption for all data transfers. Moreover, post-transfer redundancy checking on the files should be performed to guarantee the integrity of the data.

Incentives for data sharing
(Standards and practices for acknowledging the use of data in publications; "academic royalties" for data sharing, e.g., special consideration during grant review)

Reward systems for sharing must be infused into the community. Hoarding and brokering data must take a backseat to open and source-acknowledged sharing. Mechanisms to identify and discourage counter-productive behavior warrant consideration, such as highly constrained justification criteria for minimal to no data sharing (territorial reduction). Overall, the requisite data sharing plans in grant applications do not accomplish much with respect to their intended results. There are many approaches to stalling, sharing incomplete and worthless data, or ignoring the plans altogether. The most often heard refrain is that the requirement is an unfunded mandate. To a large degree this is true. Data sharing can be expensive, and budgets are now always cut, either with surgical precision by the study section or with the blunt force of total percentage reductions, in this persistently lean fiscal climate. 
In the absence of an NIH commitment to funding the informatics infrastructure needed for comprehensive and sustainable data sharing, alternate and potentially less-effective incentives such as "academic royalties" are the untested default.

Support needs
(Analytical and computational workforce growth; funding for tool development, maintenance and support, and algorithm development)

The need for increased workforce capacity follows directly from the critical administrative duties of the data management resource. Foremost, it is vital to help investigators manage portions of each study. These duties include providing a set of Data User Management tools for reviewing data use applications, and managing and tracking manuscript submissions and investigator progress reports. Also needed are Project Summary tools that support interactive views of upload and download activities by site, user, and time period, and provide exports of the same. Other information, documents, and resources - such as data analysis and summary reports for steering and other oversight committees and funding agencies, and the status of the study and of the data available in the archive - are best provided through the project web site. Ultimately, an effective database will serve to blur the boundary between the data and the analyses, providing data exploration, statistics, and graphical presentation as embedded toolsets for interrogating data. Interactive, iterative interrogation and visualization of scientific data leads to enhanced analysis and understanding and lowers the usability threshold.
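The quarantine-and-release workflow and the post-transfer integrity checking described above can be sketched compactly; in this minimal Python illustration, the directory layout and the QA callback are assumptions, not a reference implementation:

    # Minimal sketch: verify a transfer checksum, quarantine the upload, and
    # release it to authorized users only after quality assessment; every step
    # is logged so the archive retains an audit trail.
    import hashlib
    import logging
    import shutil
    from pathlib import Path

    logging.basicConfig(level=logging.INFO)
    QUARANTINE, RELEASED = Path("quarantine"), Path("released")

    def sha256(path: Path) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest()

    def ingest(upload: Path, sender_checksum: str, passes_qa) -> bool:
        """Post-transfer integrity check, then quarantine, then QA-gated release."""
        if sha256(upload) != sender_checksum:
            logging.error("integrity check failed for %s", upload.name)
            return False
        QUARANTINE.mkdir(exist_ok=True)
        staged = QUARANTINE / upload.name
        shutil.move(str(upload), str(staged))
        logging.info("quarantined %s", staged)
        if passes_qa(staged):                     # site-defined QA callback
            RELEASED.mkdir(exist_ok=True)
            shutil.move(str(staged), str(RELEASED / staged.name))
            logging.info("released %s to authorized users", staged.name)
            return True
        return False
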
03/12/2012 at 10:35:38 AM Organization The FaceBase Management and Coordination Hub Pittsburgh, PA and Iowa City, Iowa Scope: The challenges that we face are both technological and behavioral. Although distributed querying and data federation issues present very real difficulties, the concerns we face in FaceBase are more technologically tractable but behaviorally more daunting.

Perhaps our key challenge is the lack of the metadata needed to support search and integration of data sets. Detailed descriptions of the organisms, genetic backgrounds, experimental platforms, protocols, and other relevant aspects are vital for accurate cataloging and retrieval. These descriptions should use accepted terminology (ideally from biomedical ontologies) and accepted formats whenever possible. Unfortunately, many data sets lack well-formed metadata. These shortcomings are a reflection of the data management tools available to biomedical researchers. Biomedical researchers far too often rely almost exclusively on spreadsheet software and ad-hoc methods to store, annotate, and organize data. These approaches do not encourage the development of high-quality, comprehensive metadata.

Although well-designed tools might help promote better annotation, successful deployment and use of these tools will require overcoming the almost insurmountable obstacles of convenience and low cost. Every researcher has spreadsheets at their fingertips. NIH should consider both technical and social approaches to encourage the generation of high-quality, machine-readable metadata. Support for tool development will lead to software that will help scientists create high-quality, well-annotated scientific metadata for their data, increasing the likelihood that it will be reusable with minimal curation. For these tools to succeed, scientists will also need motivation to put in the (hopefully minimal) extra effort required to effectively describe their data. Behavioral and economic consideration of the factors that may encourage such effort is a necessary complement to tool development.
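A minimal sketch of the kind of lightweight annotation tool this comment calls for is a checklist validator run before deposit; the required fields below are illustrative, not an endorsed minimal-information standard:

    # Minimal sketch: validate a metadata record against a checklist of
    # required fields before the data set is deposited.
    REQUIRED_FIELDS = {
        "organism", "genetic_background", "experimental_platform",
        "protocol", "contact",
    }

    def validate_metadata(record: dict) -> list[str]:
        """Return a list of problems; an empty list means the record passes."""
        problems = [f"missing field: {f}"
                    for f in sorted(REQUIRED_FIELDS - record.keys())]
        problems += [f"empty field: {k}" for k, v in record.items()
                     if k in REQUIRED_FIELDS and not str(v).strip()]
        return problems

    # Example record with gaps (the protocol identifier is a placeholder).
    record = {"organism": "Mus musculus", "protocol": "doi:10.xxxx/example"}
    for problem in validate_metadata(record):
        print(problem)   # flags the gaps while the experimenter can still fix them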

Standards: Community standards for data formats, terminology, and expected metadata (sometimes termed "minimal information") provide substantial, but limited, structure and guidance in support of data integration and re-analysis. Standards are also inevitably limited, as constraints of pre-defined data structures and terms run up against the dynamic and evolving nature of innovative science. The NIH should encourage the use of flexible formats and extensible ontologies that can easily be adapted to meet the needs of specific contexts. End-user tools and clear documentation will encourage adoption. The promotion of evolving online publishing conventions that encourage semantic connections between online resources ("linked open data") will leverage the power of ontological annotations in support of more effective data integration.

Secondary/future use of data: Greater clarity regarding rules, regulations, and procedures for managing and protecting data - particularly potentially sensitive human data - will decrease the cost of sharing data and promote future use. A lack of clear guidelines for IRB procedures that would enable future use of data, empower data access committees, and promote best practices for data security leads to substantial confusion and cost, both for data dissemination portals such as FaceBase and for collaborative exchange between individual investigators. The NIH should establish shared policies and guidelines for data sharing, data access committees, and human subject protections, with the goal of reducing the cost of data sharing without compromising the protection of human subjects. These efforts should be coordinated with ongoing engagement in related community efforts, such as the Department of Health and Human Services' reconsideration of Common Rule human subject protections.

Data accessibility: Ongoing technical efforts provide encouraging visions of infrastructure support for data sharing, reuse, and integration. Semantically-annotated linked open data, citable data repositories such as Dryad (http://datadryad.org/), and improved cataloging of publication components from raw data to figures and appendices all encourage the discovery and reuse of biomedical data. NIH should support and promote these efforts, through funding priorities, data sharing policies, and the development of tools that will simplify adoption and therefore encourage participation.

Incentives for data sharing: The NIH should promote policies and incentives that will encourage data sharing. Without such incentives, researchers may see data sharing as an overhead activity, requiring time and effort with little reward. This perception will not encourage development of high-quality metadata. Clearer practices for measuring the impact of contributed data, measurable attribution through emerging "nanopublication" efforts, and consideration of data sharing efforts during grant review are potentially valuable approaches. To maximize impact, NIH should work with the academic and corporate communities to ensure that these incentives are broadly reflected in tenure and promotion evaluation procedures that impact career trajectories of biomedical researchers.

Support needs: The development of tools and infrastructure for data sharing will require the availability of a workforce well-trained in computer science and informatics. Unfortunately, current academic employment models do not support the hiring and retention of these crucial workers. Salaries for both faculty and research staff with computational backgrounds often lag behind not only industry, but also other academic fields, such as computer science. The NIH should develop models that will help biomedical researchers hire and retain technically trained personnel at all levels.

Particularly given these staffing difficulties, tool development and dissemination should also be a high priority. Open-source distribution of data management and annotation tools is a necessary, but not sufficient, first step. To be truly useful, tool development must pay explicit attention to technology transfer, including the documentation, training, and support needed to encourage scientists to adopt new tools and technologies. The NIH should fund both the development of advanced tools and the support needed to effectively encourage community adoption. Funding mechanisms that explicitly encourage targeted efforts focusing on tool development and deployment may be particularly helpful.

Better incentives for sharing data, standards for describing data, and clarity of policies for secondary/future use of data are all vitally important to making contribution and reuse of high-quality data a more achievable goal. Attention paid to each of these areas will help encourage more researchers to spend more time and effort on data sharing, thus increasing the quality and quantity of results.

Although potentially indirect, infrastructure investments in tools and personnel will play an important role here. Specifically, better tools for data annotation and retrieval are needed. Highly-qualified research personnel with strong backgrounds in both computational sciences and biomedical sciences will be needed to build these tools.

The NIH should consider specific funding aimed at promoting more effective data reuse and management, along with development of appropriate policies for clarification of concerns regarding human subject protection and related issues that might complicate data sharing and integration.  
03/12/2012 at 10:43:02 AM Organization Medical Library Association // Association of Academic Health Sciences Libraries Chicago, IL // Seattle, WA Standards Development

AAHSL and MLA believe that work is definitely needed in developing and sharing data standards, definitions, and ontologies. Researchers are struggling to create their own approaches or trying to use definitions and structures intended for fields outside their own, for example, using CaBIG resources for surgical research. Additional standards must be developed for those datasets that are not digital. Biological specimens, MRI images, ultrasound videos and other formats need the same approach in terms of commonly used data definitions, standards for identifying methodologies, and controlled vocabularies of terms.

Secondary/future use of data sets

In order to share data in the future, researchers must have commonly defined data fields with specified structures for that data, and standard definitions for methodologies that can be linked to that data. These approaches will ensure not only that it can be shared, but that it will be meaningful and relevant for use in the future and by others working on the same project.

Data accessibility

A central repository of research data would increase sharing and leverage other research that builds off existing datasets. Currently, identifying existing datasets is nearly impossible, leading to the duplication of effort among researchers and their institutions.

Incentives for data sharing

Standards for reporting data citations and data publications are needed so that attribution is given to the original creators of data and to enable the tracking of the impact or usefulness of the data to other research endeavors. This would enable the inclusion of shared data activities as part of annual faculty evaluation and tenure and promotion review, similar to the current practice of citations for peer-reviewed journal articles. If the use of datasets is clearly acknowledged and cited, researchers could include as part of their faculty portfolio documentation on the extent to which their research data have been actually used (cited) or potentially used (published) in producing other research.

Data peer-review mechanisms could also create an incentive for producing high quality and shareable datasets. There would also need to be mechanisms to distinguish and differentially weight peer-reviewed data citations in relation to datasets that are simply made publicly available, but have less impact or usability within the research community.

Support needs

Support and development of training programs is needed to create a workforce that can assist researchers with complex data management issues. While advanced degree work needs to be supported, there are staff in place within institutions, such as library/information specialists, who could be further trained in data management techniques and who could then assist investigators and train them in basic concepts and best practices.

Standards development

Data standards and definitions will also allow other experts to review and evaluate the data to ensure that it is valid and replicable. Other disciplines less familiar with a specific research area would be able to repurpose these datasets, leveraging the data collected and the funding used to support the original research. Once these common operational standards are in place, it should be possible to more easily extract data and present it to the general public, supporting research findings and outcomes.

These standardized approaches to large datasets would also enable future IRB (Institutional Review Board) review of older datasets when researchers want to repurpose the data or conduct retrospective studies. When the content of datasets or fields is not known, the risk of releasing or inappropriately using Personal Health Identifiers (PHI) exists, and it will be difficult for IRBs to judge the risks of using shared data for research.

Data Accessibility

A central repository of research data would ensure that federally funded data is fully used by other investigators, leveraging research dollars while ensuring that research data is easily available for further discovery. If feasible, this should be a long-term goal for NIH. However, this is a tremendous undertaking and many datasets that are not federally funded may be excluded from such an approach. Another approach is to create a central indexing repository where information and links to other data repositories reside.

Support needs

AAHSL and MLA believe that more training needs to be provided on data curation and management, but at several levels. Certainly complex research needs more individuals trained in computational and analytical methods and there should be more funding for fellowships.

Additionally, there are other staff members within institutions who would benefit from training programs for the curation and management of data.

Standards Development

NIH, through the National Library of Medicine (NLM), has played an important role in identifying standards for clinical care and knowledge. It would be beneficial to have that expertise applied to the research world. This would require additional funding support for NLM to take on this role, but it is one where NLM has the experience and expertise that would benefit the entire research community.

Librarians can be essential team players, not only in helping to develop standards and ontologies, but also in making their research communities aware of the resources available through NIH and other research groups and agencies.

Since research is diverse, it is unlikely that standards will address every possible data collection need. Therefore, guidelines on best practices for creating and documenting data points should also be developed so that biomedical libraries and research teams can work together on the specific needs of a particular project or lab, without jeopardizing the long-term usefulness of the dataset to the researchers or others.

Secondary/future use of data

At the very least, it would be useful to create a repository of data structures, definitions, ontologies, etc. that have been developed by government agencies, organizations and research institutions so that researchers could begin to use structures that have proven useful in a research setting and avoid reinventing the wheel. This could be an "open source" collection of "standards" generated, revised, and used by the research community.

Again, AAHSL and MLA maintain that librarians have the skills and expertise to assist researchers in understanding the necessity for, and in applying the criteria for, data definitions so that data can be shared in the future. Librarians can play an important role from the early planning of research proposals to the implementation of data management once a project is funded, and should be part of the research team.

Another related issue is accessing paper records with older datasets containing PHI. More guidance is needed as to when older data sets can be accessed and shared - for example, what are the requirements or review standards for research using records containing patient data from before 1950, 1900, etc.? Some historical medical records are of interest to researchers who study populations, trends in diseases and conditions, and public health issues. Many of these records are currently unavailable because institutions are concerned about the presence of PHI and about how IRBs can review paper records that may not be standardized in terms of data or its presentation. Additional guidelines on older data sets would make these resources more easily accessible to investigators.

Data accessibility

Create a central indexing repository where information and links to other data repositories reside. Basic information about data definitions, methodologies, and ontologies, in abstract form, should be submitted by researchers along with other key information. Such a central index or clearinghouse would enable researchers to locate other datasets and make their own work more visible and accessible. It should be open for research data beyond federally funded projects.
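A minimal sketch of such a central index, holding only descriptive abstracts and pointers rather than the data itself, might look like the following; the record fields and the example entry are illustrative assumptions:

    # Minimal sketch: an index record stores the abstract-level description
    # and a link to the repository that actually holds the data.
    from dataclasses import dataclass

    @dataclass
    class IndexEntry:
        title: str
        abstract: str          # data definitions, methodologies, ontologies used
        repository_url: str    # repository that actually holds the data
        funding: str           # open to non-federally funded projects as well

    INDEX: list[IndexEntry] = [
        IndexEntry("Pediatric asthma cohort",
                   "phenotypic survey data; common data element definitions",
                   "https://repository.example.edu/asthma", "non-federal"),
    ]

    def search(term: str) -> list[IndexEntry]:
        """Case-insensitive search over titles and abstracts."""
        t = term.lower()
        return [e for e in INDEX
                if t in e.title.lower() or t in e.abstract.lower()]

    for hit in search("asthma"):
        print(hit.title, "->", hit.repository_url)

Because the index holds only descriptions and links, it sidesteps the storage, consent, and governance burdens of a full central data repository while still making existing datasets discoverable.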

Support needs

Librarians already are working with students and faculty on related research issues, and further training would enable them to train the research community in best practices, along with helping them to understand the importance of managing and sharing their datasets. In partnership with computational bio-informaticists and statisticians, librarians undertaking additional training opportunities can address data stewardship principles and practices including: data archival methods; metadata creation and usage; and awareness of storage, statistical analysis, archives and other available resources as part of a data stewardship training curriculum. Librarians, in partnership with other disciplines and experts, can support training of future research investigators and the workforce. It is recommended that the NIH develop training programs and fellowships that further develop the skills of this workforce that already exists in most institutions.

These comments are submitted on behalf of the Medical Library Association (MLA) and Association of Academic Health Sciences Libraries (AAHSL) and address the following issues: standards development, secondary/future use of data, data accessibility, incentives for data sharing, and support needs. The Association of Academic Health Sciences Libraries (AAHSL) (http://www.aahsl.org) is composed of the directors of the libraries of 116 accredited U.S. and Canadian schools as well as 28 associate members. AAHSL's goals are to promote excellence in academic health sciences libraries and to ensure that the next generation of health practitioners is trained in information seeking skills that enhance the quality of healthcare delivery. The Medical Library Association (MLA) (http://www.mlanet.org) is a nonprofit educational organization with approximately 4,000 health sciences information professional members worldwide. Founded in 1898, MLA provides lifelong educational opportunities, supports a knowledgebase of health information research, and works with a global network of partners to promote the importance of quality information for improved health to the health care community and the public. Remainder of attachment text same as that of comment boxes.
03/12/2012 at 10:56:00 AM Organization Nature Publishing Group United Kingdom, with office in New York City Please see attachment Please see attachment Please see attachment The Advisory Committee to the NIH Director (ACD) has established a working group to investigate the management, integration, and analysis of large biomedical datasets (DWIG). This group is charged with providing recommendations to the ACD by June 2012. To this end, DWIG is soliciting feedback from stakeholders and the general public. Nature Publishing Group (NPG) is a publisher of high-impact scientific and medical information in print and online. NPG publishes journals, online databases, and services across the life, physical, chemical and applied sciences and clinical medicine. For more information about us, please go to: http://www.nature.com/npg_/company_info/index.html

Introduction

Data creation per se does not advance scientific knowledge, but it is the foundation on which scientific understanding and interpretation are built. As research funding underwrites the generation of increasingly large amounts of data, it becomes imperative that the data can be widely accessed, understood, and analyzed by the largest possible community of researchers. NPG routinely publishes peer-reviewed research based on new datasets, but we are keenly aware that the value of these articles is diminished if readers do not have access to the raw data or do not have sufficient information to understand how the data were generated. We (Nature Publishing Group) are holding active discussions with scientists, funders and editors to decide how we might assist the research community in promoting and facilitating deposition, access, and shared use of data. Here are our current thoughts.

Challenges to management, integration, and analysis of large biomedical datasets

Funder policy development

Over the past few years, funding organizations have individually begun to create and implement policies for sharing data; however, the lack of consensus between funder policies undermines the effort to promote data sharing. To address this, initiatives have been taken to arrive at joint positions. For an example, see http://www.wellcome.ac.uk/About-us/Policy/Spotlight-issues/Data-sharing/Public-health-and-epidemiology/, which focuses on epidemiological studies, was initiated in 2010 at a workshop organized by the Wellcome Trust and Hewlett Foundation, and has the NIH as a key signatory. In addition to acknowledging the need to create data standards, this public statement sets out the signatories' immediate goals:

Data sharing is recognized as a professional achievement: funders and employers of researchers recognize data management and sharing of well-managed datasets as an important professional indicator of success in research.

Secondary data users respect the rights of producers and add value to the data they use: researchers creating datasets for secondary analysis from shared primary data are expected to share those datasets and act with integrity and in line with good practice, giving due acknowledgement to the generators of the original data.

NPG fully supports this view. From NPG's policy on data sharing (http://www.nature.com/authors/policies/availability.html): "An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims.
Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications in material transfer agreements. Any restrictions on the availability of materials or information must be disclosed to the editors at the time of submission. Any restrictions must also be disclosed in the submitted manuscript, including details of how readers can obtain materials and information. If materials are to be distributed by a for-profit company, this must be stated in the paper."

While the NIH's goal should always be to promote the widest possible sharing of research data, there are examples of data for which sharing necessarily involves restrictions. Sensitive human data can be protected using tiered access governed by Institutional Review Boards: for example, unrestricted information (metadata) about the data could be made widely available, while access to the data itself is restricted. In situations concerning access to proprietary data produced in commercial settings, or data that might compromise national security, the NIH could help to set standard embargo periods for access. Since large research projects often have multiple funders and researchers spread across the globe, we urge the NIH to take the lead in working with non-US biomedical funders such as the BBSRC or the Wellcome Trust to establish consistent data policies where regional laws permit. The ultimate goal should be to make data interoperable regardless of geographic origin or funding source. NPG stands prepared to assist with this process by encouraging compliance with funder policies from the authors of research articles. Furthermore, we have been discussing with funders how we may help to define mechanisms for achieving interoperability, and we are keen to support community efforts to reach this goal.

Data Standards

In parallel with efforts towards refinement of, and consensus on, data policies, standards should be developed for the data itself. Community compliance with consensus-built data standards is essential to guarantee consistency of data quality. The genomics community has led the way with standards development and first described MIAME (Minimum Information About a Microarray Experiment) more than 10 years ago (Brazma et al. in Nature Genetics, http://www.nature.com/index.html?file=/ng/journal/v29/n4/abs/ng1201-365.html). We see the first challenge as arriving at consensus standards for a wide array of disciplines beyond genomics. Commonly, standards are the polished output of community discussions and workshops; NPG has supported this process by hosting pre-prints of standards proposals for open comment (see http://precedings.nature.com/documents/5252/version/2). Meanwhile, polished standards with clear community buy-in have been published in our peer-reviewed titles (e.g. http://www.nature.com/nbt/journal/v25/n8/full/nbt1329.html). Since May 2011, NPG's policy has been to make standards papers openly available under a Creative Commons Attribution-NonCommercial-ShareAlike license to promote the widest possible use. Different standards may be adopted for different data types according to the needs of the community; however, where possible, the NIH should lead the way by insisting on the most generic data storage with the fullest and most standardized metadata.
For example, unique citable identifiers (UUID, DOI, URI, ORCID) should be used routinely for individuals, roles, grants, datasets, samples, departments, institutes, funders and projects, and standard templates for data sharing plans should be established. While we hope to see the NIH encourage the creation of community standards, we see an important role for publishers in supporting standards compliance. For example, once a community has developed and published standards for a dataset, publishers should request that authors of subsequent research articles state whether they are in compliance with the standards, which they should also cite. This will encourage explicit compliance, improve the quality of the data, and increase awareness of the standards themselves. Furthermore, publishers should be collaborating with organizations such as BioSharing (http://biosharing.org/), whose mission is to provide a forum for development of data sharing policies and to curate existing bioscience data policies (primarily for genomics): http://www.nature.com/ng/journal/v44/n2/full/ng.1054.html. Going a step further, BioSharing and Sage Bionetworks are looking at ways to connect biomedical databases (the data commons initiative), and through Nature Genetics we are examining ways in which we can support data integration as it becomes possible.

Data Sharing Plans and Descriptors

There are cultural differences in willingness to share data. Some research fields, particularly those that are multidisciplinary, encourage and support sharing; others do not. For some, mandating data sharing is sufficient (for example, the astrophysics community, which shares public telescopes), but for others, fear of being scooped can make data sharing less effective regardless of funder and publisher mandates. Creating clear incentives will help those researchers to recognize the benefits of prompt data sharing. In 2010, we conducted a small experiment with members of the Human Microbiome Project to create standard templates for data sharing plans, which we subsequently hosted as a funder collection (see http://precedings.nature.com/collections/human-microbiome-project). This gave us the opportunity to think about how best to capture metadata about the data itself, and to test whether providing credit for creators of sharing plans was an incentive (the plans that we hosted each have citable DOIs). We found that creating citable records of the data plans was appealing to researchers, but, because significant time can elapse between creation of the plans that typically accompany grant applications and deposition of the data created with the funding, we have been unable to determine whether merely posting the plans has improved data deposition or researcher access to the data itself. In light of this, we are now testing the idea of asking researchers to submit, in a templated format, extended metadata about the actual data deposited (a Data Descriptor). In this concept, the publisher would arrange for submission and peer review of the Data Descriptor and would publish the accepted document under a Creative Commons license with a DOI, ensuring that it was linked to the data wherever it was deposited and to research articles that were published.

Data deposit, duplication, and retention

Since 2008, NPG has had a manuscript deposition service for authors whose funders have agreements with PubMed Central or UK PubMed Central, through which authors can opt to provide all necessary information for manuscript deposition as part of their online submission.
For these authors, NPG uploads the authors' final version of the accepted manuscript upon acceptance, which is then made publicly accessible 6 months after publication, with a link back to the journal's website (see http://www.nature.com/authors/author_resources/deposition.html). Where there are established repositories, such as those supported by the NCBI, NPG already mandates data deposition as a prerequisite for publication of research articles. We then ensure that we link to the data accession code from the published research article. Where there are no dominant repositories, there are alternatives. For example, publishers could work with communities to identify and increase awareness of candidate databases, and there may be opportunities for publishers to assist authors with data deposition as a result. An example of a publisher-supported repository is Dryad (http://datadryad.org/), which primarily hosts evolutionary data. Alternatively, each publisher could commit to storing data itself. In practice, data is often captured and displayed as Supplementary Information (SI) to research articles, and while this does provide more information for readers of our published research articles, the SI comprises a variety of file formats with little standardization of SI between our own titles. The NIH could lead in creating a registry of repositories with periodic review to encourage consolidation and interoperability. Duplication and retention of data are related challenges. In some disciplines, where data standards and policies have been lacking, reproducibility of experimental results from existing data is an issue, which encourages duplication of effort and the creation and deposition of new datasets. Encouraging the creation of, and compliance with, policy and standards will do much to reduce this. Where policies and standards are in place, but creation of datasets becomes cheaper than storage, it may be more cost-effective not to keep data. The NIH can help to resolve the questions of whether ALL data should be retained in perpetuity, whose responsibility data preservation should be, and how to manage linking to, and use of, defunded or archived repositories. NPG is eager to work with funders to arrive at better solutions for housing and retaining data that will improve its value, and to strike a cost-effective balance between retention and the ease of creating new data.

Summary

Research funding increasingly results in the creation of data, a percentage of which is never used by the researcher who created it or never leads to research papers (which may or may not feature the researcher as author). Reasons for this wastage can include poor planning of grant resources, negative results that cannot be published (even though the data itself is well organized and accessible), poorly documented provenance of the data such that it cannot be reused by others, lack of compliance with data sharing policies, or simple lack of data retention. Currently, such data is largely invisible, and its value is unknown. Citing data in research papers may increase visibility, but if data structure and provenance are not clear, reproducing existing results or reusing the data will be difficult, and the full value of the data will not be realized.
To be able to understand the return on investment in data creation, and to make recommendations for advancing scientific knowledge in the most cost-effective way, there are several essential first steps, and we see the NIH as a key driver for each:

§ Funders will have to reach consensus on data management policies, including how to plan for data creation; how to track data creation, storage, and sharing; and how to relate this to the output of useful research articles and white papers that can be used to refine policy.

§ Research communities will have to look at datasets and arrive at a consensus on minimum standards for each type of data, such that, whatever the data type, it can be understood by appropriate users and is structured well enough to be machine readable.

§ The NIH can play a part in encouraging investigator compliance by rewarding those who share their data and conform to data standards and NIH policies. This might be achieved by explicitly recognizing datasets as research output and thereby ensuring that they are included in applications for new and renewed funding.

§ Organizing, managing, making data discoverable and appropriately accessible, and preserving it in perpetuity is costly. How will this be funded? Whose responsibility will this be? The NIH can help to resolve the questions of whether ALL data should be retained in perpetuity, whose responsibility data preservation should be, and how to manage linking to, and use of, defunded or archived repositories. The NIH could also lead in creating a registry of repositories with periodic review to encourage consolidation and interoperability.

We thank the ACD-DWIG for the opportunity to contribute to this important discussion.
03/12/2012 at 11:32:03 AM Organization University of Pittsburgh Graduate School of Public Health Pittsburgh, PA Large biomedical datasets that are currently being collected, routinely or for research purposes, are of great value for scientific discovery and better disease control programs, but lack of access has undermined the use of these data for improved public health. Open access should be provided to all disaggregated public health and biomedical data - while respecting privacy concerns - to enable scientific progress and advances in disease control. Great progress has been made in data sharing in many disciplines, such as genomics, astronomy and the earth sciences, but not in public health. Developments such as the Open Government Initiative by the US Federal Government and the International Household Survey Network supported by the World Bank provide a promising start but will require a wider support base to achieve a paradigm shift in data sharing for public health.

Critical issues that must be addressed to improve access to and sharing of biomedical data can be framed as barriers to data access. We identified 15 barriers, grouped into six categories:

1. Data-related barriers
2. Communication & incentive-related barriers
3. Economic barriers
4. Political barriers
5. Legal barriers
6. Ethical barriers

1. Data related

Availability

Public health data is collected through various routine and ad-hoc systems such as disease surveillance, household surveys, service registration (e.g. insurance claims and medical records), and research studies. The types and amount of public health data vary between countries but are often insufficient for monitoring and evaluation of health and development programs, in particular in low- and middle-income countries. The greatest barrier to data sharing is data never being collected in the first place. It is often difficult to find out what data are available for certain countries or topics, and an international catalog system of public health data should be developed to better assess data availability and gaps.
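As a rough illustration only, a single entry in such a catalog might look like the sketch below (a minimal Python record; every field name and value is hypothetical, loosely inspired by household-survey metadata practice, and is not a proposal for a specific standard):

    # A minimal, hypothetical catalog entry for one public health dataset.
    # Field names are illustrative only.
    catalog_entry = {
        "title": "National Household Health Survey 2010",  # invented dataset
        "country": "XX",
        "years_covered": [2009, 2010],
        "custodian": "Ministry of Health",
        "topics": ["immunization coverage", "child mortality"],
        "access_level": "restricted",  # e.g. open / restricted / closed
        "formats": ["CSV", "paper records (partially digitized)"],
        "contact": "data-office@example.org",
    }

Even this much structure, applied consistently, would let researchers discover what exists and where, independently of whether the data themselves can yet be shared.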

Data availability is also determined by preservation, which prevents historical public health data from perishing. Public health data is often collected for immediate purposes such as disease control and planning, and explicit preservation will be required to prevent degradation after its immediate use. Due to a lack of data management and archival capacity, public health records are often destroyed after a certain amount of time. This is most obviously the case for paper-based records, which remain widely used in low- as well as high-income settings, but it also applies to electronically stored data, which can easily be lost during software and hardware updates or staff changes. Major investments will be required to digitize and preserve disaggregated public health data according to international standards.

Physical access

One of the main challenges for research groups and public health agencies alike is finding public health data. Paper-based records and an increasing amount of digital public health and biomedical data need to be catalogued and archived for future use. Funding and attention in the health sciences are directed towards data collection and analysis, not towards preservation for additional, future use. Paper-based and digital data are frequently stored using decentralized ad-hoc systems that lack appropriate cataloging and metadata systems. Access to digital data can be impaired by additional challenges, such as a lack of interoperable file formats and languages, version control, and documentation. These challenges are perpetuated by insufficient technical capacity and a lack of staff trained in data management. Clear international guidelines and training for public health data management and archiving will be required to ensure better access to public health data.

In addition, technological solutions for data tracking, computer networks, and data access systems will be required. These solutions have been developed in other fields such as genomics, astronomy, and the social sciences and will need to be translated to public health. Access systems have been developed for public health data in the past but are too often based on data storage in a central repository rather than on a central query system that links data archives in multiple places. The latter would provide enhanced flexibility, stability, and sustainability of a data access system.
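In the simplest case, a central query system of this kind would fan a search out to independently hosted archives and merge the results. The Python sketch below assumes hypothetical archive endpoints exposing a simple JSON search API; no such shared API currently exists for public health data:

    # Sketch of a central query layer over decentralized archives.
    # The endpoints and their ?q= search API are hypothetical.
    import json
    from urllib.parse import quote
    from urllib.request import urlopen

    ARCHIVES = [
        "https://archive-a.example.org/search",
        "https://archive-b.example.org/search",
    ]

    def federated_search(term):
        """Query each archive in turn and merge the result lists."""
        results = []
        for base in ARCHIVES:
            with urlopen(f"{base}?q={quote(term)}") as resp:  # hypothetical API
                results.extend(json.load(resp))
        return results

    # e.g. federated_search("measles surveillance") would return catalog
    # entries from every participating archive, while the data themselves
    # remain wherever they are hosted.

The design point is that only the index is centralized; custody of, and responsibility for, the data stays with each archive.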

2. Communication & incentive

Arguably the largest barrier to sharing of research data in the health sciences is the reward system, which is based on publications and grant acquisition. Not sharing data enables investigators to generate more publications and grants ahead of others who lack access to the same data. Although scientific journals and research funding agencies increasingly require open access to research data, the track record of data collection and sharing is not typically considered during manuscript or grant review processes or by promotion committees. As long as data collection and sharing are not officially valued by the academic reward system in the health sciences, sharing of research data will be limited. For individual investigators and their institutions, the incentive to share data for the public good will be outweighed by the incentive for personal (and institutional) gain. An international registration system for collected data in the health sciences, or publication of datasets after peer review, would provide opportunities to consider data collection and sharing practices during manuscript or grant reviews and could form an additional basis for promotion and tenure.

Incentives differ for public health or biomedical data routinely collected by agencies with a different reward system. Staff from such agencies may lack any incentive to spend time and effort on data sharing, and even co-authorship on scientific publications may be insufficient to convince agencies with limited human capacity to allocate time for data sharing. In low- and middle-income countries in particular, public health or medical agencies frequently lack the capacity to analyze routinely collected data, and these agencies may be reluctant to share data with academic institutions from high-income countries that would benefit the most from data exchanges. In these situations, an incentive can be provided by offering degree training opportunities to partners in low- and middle-income countries. A strong disincentive for sharing of routinely collected data can be fear that errors will be discovered or that disease control efforts will appear to be failing. In many countries, routine disease notifications are used to evaluate control programs, providing a strong incentive for underreporting of disease or overreporting of program coverage rates. Clear communication, a basis of trust, and data use agreements will be required to reduce such disincentives. International guidelines or a code of conduct could greatly facilitate this process.

3. Economic

Lack of resources

Data sharing is an elaborate process that requires trained staff for data annotation, data management, the transfer process, and continued communication with data recipients. In addition, technological resources are required for data sharing, such as computer networks, file servers, data transfer protocols and a security infrastructure. This capacity is often lacking in research institutions and health agencies, where data collection and analysis are much more valued and better funded than data management and sharing. This is particularly true in low- and middle-income countries that struggle to keep disease control programs running. Ironically, higher upfront investments in data preservation and sharing could result in more efficient and effective control programs in the future. Resources will be required to enable countries to make such upfront investments.

Economic damage

Global cooperation for disease control requires sharing of surveillance data between countries, as mandated by the International Health Regulations. Sharing public health data can lead to economic losses due to a decrease in tourism or trade. Many countries where infectious diseases are endemic are also popular tourist destinations or exporters of agricultural products. The economic damage of recent outbreaks of SARS and foot-and-mouth disease has been estimated at 50 and 30 billion USD, respectively. Such figures are a great disincentive for countries to share disease notification data.

Missing financial opportunities

Agencies or investigators may anticipate financial gain from intellectual property that would be lost by sharing data. This was illustrated by Indonesia's refusal to share genetic sequences of influenza after a commercial vaccine had been developed based on these data. At the individual level, investigators may anticipate commercializing products based on health data.

4. Political

Loss of control

Data sharing leads to loss of control over the use and interpretation of data. This can be a concern for investigators or research institutes, since misuse of data could lead to biased conclusions that discredit the original study and the related investigators. The same applies to routinely collected biomedical or public health data, but misuse of these data could affect not only individual agencies or staff but also public trust in the medical establishment or government. Misuse could be intentional but would be unintentional in most cases. Most data have specific characteristics, related to collection methods and definitions, that should be taken into account during secondary data analysis but may be overlooked in the absence of appropriate metadata. International norms and increased training for creating metadata in the health sciences would reduce opportunities for misuse of data. Additionally, data use agreements and clear communication could reduce data misuse.

Lack of guidelines

Health agencies may have complex guidelines for data sharing, or guidelines may be absent or unknown. This can lead to reluctance to share data or to inconsistent ad-hoc procedures. These procedures can become very lengthy, especially if committees meet infrequently or if high-level authorization is required. Clear and consistent guidelines for data sharing, as well as training in the use of these guidelines, would facilitate the data sharing process. An international agency could coordinate the development of such guidelines and support countries or agencies that lack adequate administrative structures.

5. Legal

Data collection in the health sciences and routinely collected biomedical and public health data are mostly supported by public funding and are ultimately owned by the general public. However, staff and investigators who have invested time and effort in data collection often claim personal or institutional ownership as well. Medical and public health agencies are also mandated to protect the privacy of personal clinical information. Public investments in data collection should be used to the maximum benefit of the public, and the data should be shared as widely as possible. Modern techniques for data masking and the use of synthetic populations are widely available and can adequately address privacy concerns. Informed consent is required for the use of human subject data in scientific studies and can limit the use of data for secondary analysis. Broad consent statements that include valuable but unforeseen applications of data are increasingly being used to address this issue. Clear and reasonable legal frameworks are required to advance data sharing while addressing privacy and consent issues in proportion to realistic threats. Clear consensus will be needed on public ownership of publicly funded data. Officially rewarding data collection and sharing will reduce the need for individual investigators or institutions to claim data ownership.
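To make the reference to data masking concrete, the Python sketch below shows two elementary masking steps (suppression of a direct identifier and generalization of a quasi-identifier) applied to a toy record; real disclosure control involves formal re-identification risk assessment well beyond this:

    # Two elementary masking steps on a toy patient record: suppress the
    # direct identifier and generalize a quasi-identifier (exact age to a
    # five-year band). Real disclosure control requires formal risk review.
    record = {"name": "A. Example", "age": 37, "district": "Central", "diagnosis": "TB"}

    def mask(rec):
        masked = dict(rec)
        del masked["name"]                # suppress the direct identifier
        lo = (masked["age"] // 5) * 5
        masked["age"] = f"{lo}-{lo + 4}"  # 37 becomes the band "35-39"
        return masked

    print(mask(record))  # {'age': '35-39', 'district': 'Central', 'diagnosis': 'TB'}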

6. Ethical

Various ethical considerations apply to the sharing of public health and biomedical data, such as serving the common good, equity issues, causing no harm, and reciprocity. Multiple international funding agencies and global health agencies have developed principles for public health data sharing. The Bill & Melinda Gates Foundation identified principles including promotion of the common good, respect, accountability, stewardship, proportionality, and reciprocity. In a joint statement, global health agencies proposed that data sharing should be equitable, ethical and efficient. Most of these principles call for: 1) a recognition or reward structure for data collection efforts, 2) responsibility in data use that safeguards the privacy of individuals and the dignity of communities, and 3) the use of data to advance the public good. In addition to these principles, issues related to the North-South divide in research capacity should be addressed. In particular, efforts should be made to increase research capacity in low- and middle-income countries to enable these countries to analyze their own data and to improve the collection of new data.

The National Institutes of Health of the United States, as one of the leading health research funding agencies worldwide, is uniquely positioned to advance access to and sharing of health data. The NIH should take the lead in changing the culture of data sharing in the health sciences, as it has done for genomics, establishing new norms for the acceleration of scientific progress. Priorities that could be addressed by the NIH in the near future are: 1) changing the reward structure for data collection and sharing in the health sciences, 2) improving the availability of and physical access to health data, and 3) developing international guidelines and support for data sharing processes.

1. Changing the reward structure for data collection and sharing

The collection and sharing of health data should be rewarded equally with the analysis of data. The NIH probably has more leverage to advance this culture change than almost any other agency worldwide. All routine and research data collected in the biomedical sciences and public health should be registered and annotated using international metadata standards. Investigators and health staff should be encouraged to publish fully annotated datasets through a peer-review process and should receive academic credit for such publications. Lastly, the track record of data collection and sharing of investigators and institutions should be taken into account during the grant review process. The NIH should use its political power to advocate for the use of data sharing track records by academic tenure and promotion committees.

2. Improving the availability and physical access to health data

The NIH should maximize the impact of publicly funded health data by making these data as widely available as possible to research groups and the general public. Disaggregated routinely collected data in paper and electronic formats should be digitized, standardized, annotated, and made available to anybody for analysis. Privacy and consent concerns should be adequately addressed by modern masking techniques or by requirements for broad consent statements in future studies. Routinely collected data in particular will be of great value for large-scale data-driven studies in the health sciences. Support and training in data management and annotation in the health sciences will need to be greatly enhanced through grant requirements and newly created education programs. A data sharing infrastructure will need to be created and maintained. Specialized interdisciplinary centers could be established to facilitate this.

3. The development of international guidelines and support for data sharing

International guidelines and codes of conduct have enabled data sharing in genomics and other sciences. Principles for public health research data sharing have been proposed by various funding agencies but need to be expanded and operationalized to support data sharing processes. A political process would be required to develop a broad consensus for open access and data sharing guidelines that would be supported by most countries. The NIH could advance a code of conduct within its network of grantees and funding agencies.

The value of health data for primary and secondary analysis should be reflected in NIH policies and procedures. These policies and procedures should ensure that publicly funded data collection leads to datasets that do not end their life cycle unused in individual or institutional archives but are maximally available, distributed, multiplied, modified, and used for scientific advancement and better disease control. The NIH should modify its grant-making policies and procedures to create incentives for data sharing. Collection and sharing of health data should be recorded, tracked, and used as a criterion during manuscript and grant review and during academic promotion and tenure decisions. Training programs in data management, annotation, and analysis could be created as an incentive for data sharing by health agencies, particularly in low- and middle-income countries. NIH policies and procedures should include requirements for adherence to the ethical principles of data sharing by grantees and should enforce data sharing plans in current and future grants.

Current support for data management, annotation, and sharing is dwarfed by support for data collection and analysis. This sends a message contrary to the advocated value of data sharing in the health sciences. Increased support in research grants for the technological and human resources needed for data management (including preservation), metadata creation, and data sharing will be required to advance data access and sharing. Without this support, investments in data collection will yield limited returns. In addition, new research efforts will be required to develop technological solutions for better access to health data. New training programs will be needed to develop human capacity for data management and sharing in health agencies and to equip the next generation of investigators and public health staff with the skills and mindset for data sharing.

Bibliography

Open access principles and guidelines from international health and funding agencies and journals
"Budapest Open Access Initiative." Retrieved Feb 29, 2012, from http://www.soros.org/openaccess.
Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities; 2003 Oct 22.
Prepublication data sharing. Nature. 2009; 461(7261): 168-70.
Data's shameful neglect. Nature. 2009; 461(7261): 145.
Open sesame. Nature. 2010; 464(7290): 813.
Sharing public health data: necessary and now. Lancet. 2010; 375(9730): 1940.
Dealing with data. Challenges and opportunities. Introduction. Science. 2011; 331(6018): 692-3.
MRC policy on sharing of research data from population and patient studies; 2011.
Wellcome Trust. Sharing Data from Large-scale Biological Research Projects: A System of Tripartite Responsibility; 2003.
Bill & Melinda Gates Foundation. Global Health Data Access Principles. [Online document] 2011 [cited 2012 Jan 24]; Available from: www.gatesfoundation.org/global-health/.../data-access-principles.pdf
Brest P. President's statement 2007: the importance of data. The William and Flora Hewlett Foundation; 2007.
Carlson D. A lesson in sharing. Nature. 2011; 469(7330): 293.
CDC/ATSDR. CDC-CSTE Intergovernmental Data Release Guidelines Working Group (DRGWG) Report: CDC-ATSDR Data Release Guidelines and Procedures for Re-release of State-Provided Data; 2005.
Chan L, Arunachalam S, Kirsop B. Open access: a giant leap towards bridging health inequities. Bull World Health Organ. 2009; 87: 631-5.
Chan M, Kazatchkine M, Lob-Levyt J, Obaid T, Schweizer J, Sidibe M, et al. Meeting the Demand for Results and Accountability: A Call for Action on Health Data from Eight Global Health Agencies. PLoS Med. 2010; 7(1): e1000223.
National Science Foundation. Changing the Conduct of Science in the Information Age; 2011.
Hanson B, Sugden A, Alberts B. Making data maximally available. Science. 2011; 331(6018): 649.
Nelson B. Data sharing: Empty archives. Nature. 2009; 461(7261): 160-3.
Organization for Economic Co-operation and Development. OECD Principles and Guidelines for Access to Research Data from Public Funding. Paris: OECD; 2007.
World Health Organization. WHO's role and responsibilities in health research; 2010.
Piwowar HA, Day RS, Fridsma DB. Sharing detailed research data is associated with increased citation rate. (1932-6203 (Electronic)).
Schofield PN, Bubela T, Weaver T, Portilla L, Brown SD, Hancock JM, et al. Post-publication sharing of data and tools. Nature. 2009; 461(7261): 171-3.
Wellcome Trust. Policy on data management and sharing. 2010. Available from: http://www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTX035043.htm
Walport M, Brest P. Sharing research data to improve public health. Lancet. 2011; 377(9765): 537-9.
Willinsky J. The Access Principle: The Case for Open Access to Research and Scholarship. Cambridge: MIT Press; 2006.

Ethical and legal considerations for open access and data sharing
Chadwick R. International data-sharing: standardisation, harmonisation and ethics. 2009.
Duncan G, Stokes L. Data masking for disclosure limitation. Wiley Interdisciplinary Reviews: Computational Statistics. 2009; 1(1): 83-92.
Evans BJ. Much Ado About Data Ownership. Harvard Journal of Law and Technology. Vol 25, Fall 2011.
Fienberg SE. Statistical perspectives on confidentiality and data access in public health. (0277-6715 (Print)).
Fienberg SE, Martin ME, Straf ML, editors. Sharing Research Data. Washington DC: National Academy Press; 1985.
Freymann JB, Kirby JS, Perry JH, Clunie DA, Jaffe CC. Image Data Sharing for Biomedical Research: Meeting HIPAA Requirements for De-identification. J Digit Imaging. 2011.
Fullerton SM. Identifiability, Data Sharing, and the Public. Center for Genomics & Healthcare Equality, University of Washington.
Lane J, Schur C. Balancing Access to Health Data and Privacy: A Review of the Issues and Approaches for the Future. Health Services Research. 2010; 45(5p2): 1456-67.
Lawlor DA, Stone T. Public health and data protection: an inevitable collision or potential for a meeting of minds? International Journal of Epidemiology. 2001; 30(6): 1221-5.
Lee LM, Gostin LO. Ethical collection, storage, and use of public health data: a proposal for a national privacy protection. (1538-3598 (Electronic)).
Mariner WK. Mission Creep: Public Health Surveillance and Medical Privacy. Boston University Law Review. Vol 87, No 2, p 347, April 2007.
McSherry C. Who Owns Academic Work? Battling for Control of Intellectual Property. Harvard University Press; 2009.
Centers for Disease Control and Prevention. Data Security and Confidentiality Guidelines for HIV, Viral Hepatitis, Sexually Transmitted Disease, and Tuberculosis Programs: Standards to Facilitate Sharing and Use of Surveillance Data for Public Health Action. Atlanta (GA): U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 2011.
Stansfield S. Who owns the information? Who has the power? Bull World Health Organ. 2008; 86(3): 170-1.
US Department of Health and Human Services. Summary of the HIPAA Privacy Rule. [Online document] 2012 [cited 2012 Jan 25]; Available from: http://www.hhs.gov/ocr/privacy/hipaa/understanding/summary/index.html

Data management and sharing in public health
Pisani E, AbouZahr C. Sharing health data: good intentions are not enough. Bull World Health Organ. 2010; 88(6): 462-6.
Pisani E, Whitworth J, Zaba B, Abou-Zahr C. Time for fair trade in research data. Lancet. 2010; 375(9716): 703-5.
Butler D. Disease surveillance needs a revolution. Nature. 2006; 440(7080): 6-7.
Byass P. The Unequal World of Health Data. PLoS Med. 2009; 6(11): e1000155.
Chandramohan D, Shibuya K, Setel P, Cairncross S, Lopez AD, Murray CJ, et al. Should data from demographic surveillance systems be made more widely available to researchers? PLoS Med. 2008; 5(2): e57.
Coloma J, Harris E. Open-access science: a necessity for global public health. (1553-7374 (Electronic)).
Dawson A, Verweij M. Could do Better: Research Data Sharing and Public Health. Public Health Ethics. 2011; 4(1): 1-3.
Department of Health and Human Services. Health.Data.Gov. [Website] [cited 2012 Jan 25]; Available from: http://www.data.gov/health
Dupriez O, Boyko E. Dissemination of microdata files: principles, procedures and practices. International Household Survey Network; 2010 Aug. Report No.: 5.
Fine AM, Goldmann DA, Forbes PW, Harris SK, Mandl KD. Incorporating Vaccine-Preventable Disease Surveillance Into the National Health Information Network: Leveraging Children's Hospitals. Pediatrics. 2006; 118(4): 1431-8.
Health Metrics Network. Framework and standards for country health information systems. Geneva: World Health Organization; 2008.
Hürlimann E, Schur N, Boutsika K, Stensgaard A-S, Laserna de Himpsl M, Ziegelbauer K, et al. Toward an Open-Access Global Database for Mapping, Control, and Surveillance of Neglected Tropical Diseases. PLoS Negl Trop Dis. 2011; 5(12): e1404.
Kephart G. Barriers to Accessing & Analyzing Health Information in Canada. 2002 Nov.
Lang T. Advancing global health research through digital technology and sharing data. Science. 2011; 331(6018): 714-7.
Langat P, Pisartchik D, Silva D, Bernard C, Olsen K, Smith M, et al. Is There a Duty to Share? Ethics of Sharing Research Data in the Context of Public Health Emergencies. Public Health Ethics. 2011; 4(1): 4-11.
McNabb SJ. Comprehensive effective and efficient global public health surveillance. BMC Public Health. 2010; 10 Suppl 1: S3.
World Health Organization. Framework and Standards for Country Health Information Systems. Geneva; 2008 Jun.

Examples of open access and data sharing of health data
Boussard E, Flahault A, Vibert JF, Valleron AJ, Noah N, Williams J, et al. Sentiweb: French Communicable Disease Surveillance On The World Wide Web. BMJ. 1996; 313(7069): 1381-4.
Teller C, Hailemariam A, Teklu N. Barriers to Access and Effective Use of Data and Research for Development Policy in Ethiopia. In: The Demographic Transition and Development in Africa: The Unique Case of Ethiopia. Springer; 2011. p. 323-7.
East African Community Portal. East African Integrated Disease Surveillance Network. 2011 [cited 2011 Feb 18]; Available from: http://www.eac.int/eaidsnet
Malaria Atlas Project. 2011 Feb 11 [cited 2011 Feb 21]; Available from: http://www.map.ox.ac.uk/
Measure DHS. Demographic and Health Surveys. 2011 [cited 2011 Feb 21]; Available from: http://www.measuredhs.com/
Oo MK. Mekong Basin Disease Surveillance Network. 2007 [cited 2011 Feb 18]; Available from: http://www.mbdsoffice.com/index_2008.php
UNICEF. Multiple Indicator Cluster Surveys. 2010 Jun [cited 2011 Feb 21]; Available from: http://www.childinfo.org/mics.html
World Health Organization. Global Health Atlas. 2007 [cited 2011 Feb 18]; Available from: http://apps.who.int/globalatlas/
World Health Organization. Global Health Observatory. 2011 [cited 2011 Feb 21]; Available from: http://www.who.int/gho/en/

Open access, data management and data sharing in other disciplines
Data, data everywhere: A special report on managing information. The Economist. 2010 Feb 27.
GEOSS Data Sharing Action Plan. Group on Earth Observations VII Plenary; 2010.
Anderson WL. Some challenges and issues in managing, and preserving access to, long-lived collections of digital scientific and technical data. Data Science Journal. 2004; 3: 191-201.
Baldwin W, Diers J. Demographic Data for Development in Sub-Saharan Africa. New York: Population Council; 2009.
Bohannon J. Digital data. Google opens books to new cultural studies. Science. 2010; 330(6011): 1600.
Bohannon J. Digital data. Google books, Wikipedia, and the future of culturomics. Science. 2011; 331(6014): 135.
Callaway E. No rest for the bio-wikis. (1476-4687 (Electronic)).
Center for American Progress. YES WE SCAN: John Podesta Calls on President Obama to Create a Digital Library of Congress. [Website] 2012 Jan 10 [cited 2012 Jan 25]; Available from: http://www.americanprogress.org/pressroom/releases/2012/01/Yes_Scan
Chokshi DA, Parker M, Kwiatkowski DP. Data sharing and intellectual property in a genomic epidemiology network: policies for large-scale research collaboration. (0042-9686 (Print)).
Corrado EM. The Importance of Open Access, Open Source, and Open Standards for Libraries. Issues in Science and Technology Librarianship. 2005.
Curry A. Rescue of old data offers lesson for particle physicists. Science. 2011; 331(6018): 694-5.
Edwards P, Mayernik MS, Batcheller A, Bowker G, Borgman C. Science Friction: Data, Metadata, and Collaboration. Social Studies of Science. 2011.
HM Government. Data.gov.uk: Opening up government. 2011 [cited 2011 Feb 18]; Available from: http://data.gov.uk/
Kahn SD. On the future of genomic data. Science. 2011; 331(6018): 728-9.
King G. Ensuring the data-rich future of the social sciences. Science. 2011; 331(6018): 719-21.
Krishnamurthy M. Open access, open source and digital libraries: A current trend in university libraries around the world. Program: electronic library and information systems. 2008; 42(1): 48-55.
Letovsky SI, Cottingham RW, Porter CJ, Li PWD. GDB: The Human Genome Database. Nucleic Acids Research. 1998; 26(1): 94-9.
Michel JB, Shen YK, Aiden AP, Veres A, Gray MK, Pickett JP, et al. Quantitative analysis of culture using millions of digitized books. Science. 2011; 331(6014): 176-82.
Office of E-Government and IT, Office of Management and Budget. Data.gov Concept of Operations draft. Washington DC: Executive Office of the President of the United States; 2009.
Reichman OJ, Jones MB, Schildhauer MP. Challenges and opportunities of open data in ecology. Science. 2011; 331(6018): 703-5.
Robinson F. EU Calls for Digitization of Cultural Heritage. The Wall Street Journal. 2011 Oct 28.
Sloan Digital Sky Survey. The Sloan Digital Sky Survey, mapping the universe. [Website] [cited 2012 Jan 25]; Available from: http://www.sdss.org/
Van den Eynden V, Corti L, Woollard M, Bishop L. Managing and Sharing Data. 2nd ed. Essex, UK: UK Data Archive, University of Essex; 2009.
Wright G, Prakash P, Abraham S, Shah N. Open government data study: India. The Centre for Internet and Society.
03/12/2012 at 01:05:06 PM Organization ISA Commons: www.isacommons.org open, international collaboration To tackle complex scientific questions, experimental datasets from different sources often need to be harmonized in regard to structure, formatting and annotation so as to open their content to (integrative) analysis. Vast swathes of bioscience data remain locked in esoteric formats, are described using non-standard terminology, lack sufficient contextual information, or simply are never shared due to the perceived cost or futility of the exercise. This loss of value continues to engender standardization initiatives and drives the ongoing conversation about encouraging data sharing through appropriate reward mechanisms.

Minimum reporting guidelines, terminologies and formats (hereafter referred to generally as reporting standards) are increasingly developed and (partly) used in the structuring and curation of datasets, enabling data sharing to varying degrees. But the mountain of frameworks needed to implement the current wealth of domain-specific and fragmented reporting standards inhibits the development of tools for data management and sharing.

Descriptions of investigations of biological systems in which source material has been subject to several kinds of analyses (for example, genomic sequencing, protein-protein interaction assays and the measurement of metabolite concentrations) are particularly challenging to share as coherent units of research because of the diversity of reporting standards with which the parts must be formally represented. Equally, most repositories are designed for specific assay types, necessitating the fragmentation of complex datasets. One way forward is to establish reciprocal data exchange between major repositories, but budgetary constraints limit such activities, and a crop of differing methodologies still imposes barriers. Researchers acting as data consumers also face challenges when the component parts of an investigation are scattered across databases. Fragmented datasets can only be reassembled by those equipped to navigate the various reporting guidelines, terminologies and formats involved. The research community requires solutions that accommodate the current 'wealth' of data sharing standards and resources but hide it from users, thereby simplifying researchers' efforts to meet (or ideally, exceed) applicable reporting requirements.

The ISA Commons (isacommons.org; Sansone et al., Nature Genetics, 2012) is an international group of service providers and researchers working to build an exemplar ecosystem of standards-compliant data curation and sharing solutions that (i) side-steps many of the problems caused by the dysfunctionally diverse standards landscape, and (ii) frees researchers to build an open data culture for the future. This broad and growing range of collaborative groups serves an increasingly diverse set of domains, including environmental health, environmental genomics, metabolomics, (meta)genomics, proteomics, stem cell discovery, systems biology, transcriptomics and toxicogenomics, as well as communities working to characterize nucleic acid structures and to build a library of cellular signatures. This emerging commons depends on its participants using a common metadata tracking framework to aggregate investigations in community 'staging posts' following community standards, merge them in various combinations, perform meta-analyses, and more straightforwardly submit to public repositories.

Related efforts, like BioSharing (biosharing.org), are also set to play a pivotal role. Building on the MIBBI effort (mibbi.org), BioSharing works to strengthen collaborations between researchers, funders, industry and journals, to discourage redundant (if unintentional) competition between standards-generating groups, and to map the landscape of standards and the systems implementing them.

The NIH, other funding agencies, journals and community initiatives encourage good data stewardship and sharing through the use of community reporting standards. Researchers need accessible, working tools now to share data, and new solutions are required that deliver economies of scale in data capture and standards compliance and inherently support data integration, rendering the process of data curation scalable in the face of the current 'data bonanza'.

The ISA Commons and BioSharing illustrate the importance of synergy and the potential of a horizontal approach that transcends individual life science domains and assay- or technology-focused communities; this approach should be supported and used as an example.

PDF copy of article "Toward interoperable bioscience data" by Susanna-Assunta Sansone et al, published by Nature Genetics in 2012 (vol. 44 no. 2)
03/12/2012 at 01:21:10 PM Organization NIH LINCS Program members from several different US research institutions Response to NOT-OD-12-032 "Input into the Deliberations of the Advisory Committee to the NIH Director Working Group on Data and Informatics"
The text of our response (below) is also attached as a pdf in this online response form:

We are writing to share some of our thoughts on management, integration and analysis of large datasets. They are based on our experience working with data from many different types of cell-based and biochemical assays (monitoring gene expression, protein phosphorylation, and cell phenotypes via imaging). In the NIH LINCS program, we are collecting, integrating, and analyzing large datasets produced using these assays, and establishing public repositories to share the data.

SCOPE OF THE CHALLENGES/ISSUES
We note that there are sub-types of data that get more attention from NIH in terms of having well-maintained central databases (e.g. genomics and transcriptomics), while databases for other types of data (e.g. cell biological and biochemical) get less consistent support. These other types of data are just as valuable as genomic/transcriptomic data and, via high-throughput assay technologies, are now being produced on almost the same scale. NIH must initiate new projects that support data repositories for these other data types.

Some aspects of this RFI involve the application of established technologies to biological data, but solving the general problem of storing and analyzing large datasets in biomedicine is a "grand challenge" in computer engineering that will require fundamentally new approaches and tools. Funding should be provided for innovation in this area, perhaps in combination with NSF or organizations such as NASA (e.g. with the HDF5 group tackling remote earth imaging).

As we analyze the different datasets that are coming out of LINCS, it is clear that a very significant challenge lies in integrating data generated by the various assay types. For example, we are working to determine whether biochemical and imaging data on immediate-early cellular responses to cytokine and small molecule perturbagens collected by the Harvard Medical School LINCS Center (typically involving changes in the levels and activities of receptors, kinases and transcription factors) can be meaningfully linked to transcript-level data being collected by the Broad LINCS Center. To date, very few attempts have been made to link together such disparate sources of data. Semantic technologies that may facilitate such integration are now reasonably robust, and several standards (RDF, RDFS, OWL, SPARQL) have been established by the W3C. However, most biologists do not know these technologies, and novel end-user tools that biologists can apply are required. Another requirement is the implementation of descriptive ontologies (see STANDARDS DEVELOPMENT below).
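As a small illustration of the kind of linkage these W3C standards enable, the sketch below records two measurements from different assay types as RDF statements about the same perturbagen and retrieves both with one SPARQL query, using the Python rdflib library; the example.org vocabulary is invented purely for this example:

    # Linking results from two assay types via a shared identifier, using
    # RDF and SPARQL (rdflib). The example.org vocabulary is invented;
    # real integration would use shared, community-agreed ontologies.
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/lincs/")
    g = Graph()

    # A phosphoprotein measurement and a transcript-level measurement,
    # both recorded against the same (hypothetical) perturbagen URI.
    g.add((EX.compound42, EX.foldChangePhosphoEGFR, Literal(2.1)))
    g.add((EX.compound42, EX.foldChangeTranscriptEGFR, Literal(0.6)))

    q = """
    SELECT ?property ?value WHERE {
        <http://example.org/lincs/compound42> ?property ?value .
    }"""
    for prop, value in g.query(q):
        print(prop, value)  # both assays' values, retrieved together

Nothing in this toy example addresses the hard part, which is agreeing on shared identifiers and ontologies, but it shows why end-user tools that hide the triple store and query syntax from bench biologists are the missing layer.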

STANDARDS DEVELOPMENT
STANDARDS FOR DATA DEPOSITION AND FORMATTING
We feel that it is important that database/computation experts and experimentalists collaborate to develop reasonable standards for data reporting and that NIH should encourage this. Such collaborations are necessary to ensure that data deposition standards are sufficiently complete to allow repositories to be mined productively and are also "minimal" enough that formatting data for deposition is not unnecessarily onerous.

Beyond metadata and formats, shared semantics are also critical to facilitate the integration of data from diverse assay types (described above). While ontologies to describe particular types of metadata exist, they are underdeveloped in the area of high throughput cell-based and biochemical assays such as those used, for example, in small molecule and RNAi screening.

We are piloting the use of the Investigation-Study-Assay (ISA) framework (Sansone et al. 2012 Nature Genetics 44:121-126) to facilitate tracking and formatting metadata (with ontologies) for LINCS experiments and it seems very promising.
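For readers unfamiliar with the framework: at its core, ISA describes experiments as tab-delimited tables that trace each sample from source material through protocols to assays, with characteristics that can be bound to ontology terms. The fragment below is a simplified, hypothetical illustration of a study table, not a valid ISA-Tab file:

    Source Name    Characteristics[cell line]    Protocol REF       Sample Name
    culture-1      MCF7                          drug treatment     sample-001
    culture-1      MCF7                          vehicle control    sample-002

Because the column semantics are fixed by the specification, tables like this can be validated, merged, and converted programmatically, which is what makes the framework attractive for tracking LINCS metadata.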

ANNOTATING REAGENTS AND EXPERIMENTAL PROTOCOLS
Annotation of reagents (e.g. cell lines, targets of small molecule inhibitors, protein ligands, antibodies) used in the large-scale LINCS experiments has been unexpectedly time-consuming, as automated curation of reagent information is currently impossible. Standards development efforts supported by NIH should address this challenge. We believe that assay reagent vendors/providers could also help by providing catalogue information about their products in electronic formats that are standardized and easily uploaded into public databases. Tools that enable biologists to effectively use standards and ontologies to annotate experiments are critical.
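As a hedged illustration of what such a vendor export might look like (every field name and value below is invented), a machine-readable catalogue record could be as simple as:

    {
      "vendor": "Example Reagents Inc.",
      "catalog_number": "ER-1234",
      "reagent_type": "small molecule inhibitor",
      "reagent_name": "examplinib",
      "nominal_target": "EGFR (UniProt P00533)",
      "lot_number": "2012-03-A",
      "structure_smiles": "(canonical SMILES string here)"
    }

Records in an agreed format of this kind could be harvested directly into public databases, replacing today's manual curation of vendor catalogues and datasheets.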

OPEN STANDARDS AND DATA FILE FORMATS
NIH should push for open standards for data file formats, experimental metadata, and semantics, and for enabling tools developed with NIH support. It is important that instrument vendors, too, be encouraged to facilitate export of data generated on their instruments into recognized common file formats that are publicly declared (e.g. the DICOM format for medical imaging or the OME standard for microscope image files).


DATA ACCESSIBILITY
CENTRAL REPOSITORY: We support the idea of having data appendices linked to PubMed publications. However, we do not believe that one central repository should be created to manage all large biomedical datasets. Many (coordinated) repositories will probably be required. Such coordination requires data exchange standards, and shared standards and semantics describing the experiments and the results.

One of the challenges regarding data accessibility faced by the extramural community is the difficulty of finding useful data and databases relevant to particular projects. It is also often not clear how to find reliable, publicly available tools for data analysis. It would be extremely helpful if NIH maintained a master, up-to-date directory describing the databases and analysis tool development projects that it supports.

DISTRIBUTED QUERYING: An infrastructure to query across several repositories that hold diverse data would probably potentiate the value of each individual repository, because scientific insight and new discoveries rely on putting different pieces of information together. A distributed data management infrastructure is probably more realistic than one massive central repository. Because research questions and use cases from various investigators are hard (or impossible) to foresee, a system that does not rely on a fixed data model or data structure will likely be an advantage. A truly distributed data management infrastructure requires rigorous data exchange standards, reasonable minimum information, appropriate metadata standards, standardized data formats, and semantic formalization of the data types. While the vision of the Semantic Web is not new, several technological challenges have to be overcome, and enabling tools that can be used by chemists and biologists are a critical requirement. Such an infrastructure should be a long-term goal of the NIH, to maximize its investment in the numerous data production projects it supports.
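One concrete (and purely illustrative) pattern for such querying is SPARQL 1.1 federation, in which a single query spans endpoints hosted by different repositories via the SERVICE keyword. The endpoints below are hypothetical, and the sketch assumes the repositories expose SPARQL endpoints at all:

    # A federated SPARQL 1.1 query spanning two (hypothetical) repository
    # endpoints, issued from Python with the SPARQLWrapper library.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://repo-a.example.org/sparql")  # hypothetical
    sparql.setQuery("""
    SELECT ?dataset ?assay WHERE {
        ?dataset <http://example.org/usesCellLine> <http://example.org/MCF7> .
        SERVICE <https://repo-b.example.org/sparql> {   # also hypothetical
            ?dataset <http://example.org/assayType> ?assay .
        }
    }""")
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()  # datasets described across both stores

Whether federation of this kind performs acceptably at scale is exactly the sort of technological challenge noted above; the point of the sketch is only that no central copy of the data is needed to answer a cross-repository question.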


INCENTIVES FOR DATA SHARING
Incentives will be particularly important at the outset as data repositories are implemented because the costs of data annotation and formatting will be higher than the payback in the beginning. The cost-benefit ratio for individuals will improve over time, so direct "subsidies" will be less necessary. Incentives could include consideration on grants, etc. (as proposed in the RFI) and, at the outset, supplemental funding (administrative and competitive revisions to existing grants) for implementation or use of public repositories.

Implementation of unique identifiers to track deposited data (e.g. DOIs, digital object identifiers, assigned to datasets) could provide incentives for data sharing if they could be considered as part of one's publication record for grant review and promotion purposes, in the same way that patents are considered. In addition they would provide an easy way for others to reference data used in research projects.
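
For example, where the registration agency supports content negotiation (as DataCite and Crossref do), a dataset DOI can be resolved to citable metadata programmatically; the DOI below is a placeholder, not a registered dataset:

import json
import urllib.request

def dataset_metadata(doi):
    """Resolve a dataset DOI to structured metadata via content negotiation."""
    req = urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

meta = dataset_metadata("10.1234/example-dataset")  # placeholder DOI
print(meta.get("title"))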

SUPPORT NEEDS
ANALYTICAL AND COMPUTATIONAL WORKFORCE GROWTH, FUNDING FOR TOOL DEVELOPMENT ETC.
Workforce growth and more funding in these areas will be essential to support production of and public access to an increasing number of large biomedical datasets. More investigators with expertise in data collection, data curation/formatting, database and algorithm design, programming, semantic integration / Semantic Web technologies, and data validation will be required.

Long-term storage of data will also have to be addressed. One of the problems with proposing to develop and support new repositories is that it looks to be a 30-year commitment. The Committee should address whether that is the right scale. Will there be certain kinds of data that can be retired over time?

We believe that NCBI might not be the optimal solution for maintaining all NIH-supported repositories for large datasets. Alternatives should be considered, for example NIH support of external research consortia that manage data repositories, such as the consortium that manages the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank.

EDUCATION
Education at all levels will be essential for implementing the policies outlined in this RFI. Curricular materials will be needed to familiarize students, postdocs and group leaders/faculty with approaches to data annotation and storage. The fundamental principles of informatics are not well known by most experimental biologists.
    Text of attachment same as that of comment boxes.
03/12/2012 at 01:49:01 PM Self     The text of our response (below) is also provided as a pdf attachment to this online response form.

We are writing to suggest that the NIH strongly consider a new initiative that will establish a robust public database for RNAi data. We wrote about this topic recently in a Comment piece for Nature Cell Biology that was published in February 2012 (Nat. Cell Biol 14:115), and here we summarize key points of our arguments.

No database of record for RNAi data has been established and widely accepted. The ideal public RNAi data repository would include:
- Raw and annotated RNAi screening results
- Well-documented experimental and data analysis protocols
- Sequences of RNAi reagents

Such a repository would provide useful information to annotate individual gene function and to study on-target and off-target RNAi mechanisms, thus expediting improved design of RNAi reagents and making research more efficient.

To establish this repository, a collaboration between database/computation experts and experimentalists will be necessary to develop reasonable standards for data reporting. This will ensure that data deposition standards are sufficiently complete to allow the repository to be mined productively, but are also “minimal” enough that formatting data for deposition is not unnecessarily burdensome. Help from assay reagent vendors/providers will be important too, to facilitate documentation of RNAi reagents, cell lines and other reagents used in RNAi screening experiments.
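
One sketch of how "minimal but sufficient" deposition checking might look; the required-field list below is illustrative, not an adopted standard such as MIARE:

# A sketch of minimum-information checking for an RNAi deposition: the
# required fields here are illustrative placeholders, not a real standard.
REQUIRED_FIELDS = {
    "reagent_sequence",   # sequence of each RNAi reagent
    "cell_line",          # biological context of the screen
    "assay_protocol",     # experimental protocol reference
    "analysis_method",    # how raw readouts were scored
    "raw_results",        # unprocessed screening data
}

def validate(submission):
    """Reject a deposition that omits required minimum information."""
    missing = REQUIRED_FIELDS - submission.keys()
    if missing:
        raise ValueError("deposition incomplete, missing: %s" % sorted(missing))
    return True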

Although many investigators see the benefits of making RNAi data public, it will also be important that funding agencies and journals provide incentives to deposit RNAi data and make them broadly accessible.

Fortunately, there are multiple efforts already underway that provide a solid foundation of experience on which to develop an RNAi data repository and on which the NIH should build. Many groups are working out relevant data standards, including MIARE (Minimum Information About an RNAi Experiment, http://miare.sourceforge.net/), MIACA (Minimum Information About a Cellular Assay, http://miaca.sourceforge.net/), the Investigation-Study-Assay framework (ISA, Sansone et al. 2012 Nat. Genetics 44:121-126), and the NIH LINCS program (http://lincsproject.org). The efforts to create community standards for large-scale perturbation experiments are converging, establishing the framework for public data repositories.

Also, public databases exist for RNAi data from model organisms, such as WormBase (C. elegans), and the GenomeRNAi (http://www.genomernai.org) and DRSC (http://www.flyrnai.org) databases (Drosophila). Mammalian RNAi data are beginning to be deposited in the NCBI PubChem BioAssay database (with RNAi reagents compiled in the NCBI Probe database). However, we understand that long-term support for RNAi data in PubChem BioAssay is uncertain.

With an increasing number of initiatives aiming at systematic target discovery supported by public funding (e.g. CTD2/NCI, EU-funded networks), we believe there is an urgent need to establish a public database to make large-scale perturbation data sets from RNAi screens available for subsequent bioinformatic re-analysis and integration.

Establishment of one or a few public repositories for RNAi data will be essential to make the growing number of RNAi datasets available for public access and thus to guarantee their long-term utility. Consideration of this issue by the Advisory Committee would have a major impact by fostering awareness in the scientific community and among publishers and funding agencies.

    Text of attachment same as that of comment boxes.
03/12/2012 at 02:16:21 PM Organization University of Illinois Urbana, IL To understand the issues involved in managing and analyzing large data sets, one must first understand the limitations posed by the available computing hardware. These limitations fall into three classes:

Data sets that are too big to fit in the memory of the largest systems in use today. A single node in a cluster may host 1-2 TB of RAM, and the total memory available in the largest HPC facilities may be in the low petabyte range. Any dataset larger than this cannot be fully loaded into memory at once; this puts significant constraints on the analytical methods that can be deployed. While some datasets can be split into independent subsets that can be processed in parallel, and thus do not need to fit into main memory, any analysis that requires an all-to-all comparison is limited by the amount of physical RAM available in the system. Such problems in distance computation occur in bioinformatics (e.g. extracting gene functions from biological literature), medical informatics (e.g. comparative effectiveness of drug regimens for different demographics), and health informatics (e.g. determining population cohorts from health vectors derived from mobile sensors).

Data sets too big to fit in a typical disk array for a large-scale computing system. For the largest HPC systems, the online disk systems are on the order of 20-50 petabytes. Datasets exceeding this size will need to be handled in chunks loaded serially from much slower, tertiary storage. This shuffling of data can add substantially to the length of time needed for the analysis.

Data sets too big to be stored. If the data set exceeds the size of the largest-scale tape systems, which is currently 300-500 petabytes for a high-end tape library, there is no practical way it can be analyzed or presented to users, except by pre-processing it and keeping summary data or statistics.

Since storage density continues to grow exponentially, NIH will be able to store massive data sets. However, the analytical techniques that work for terabyte-size data sets are unlikely to work for petabyte-size data sets and will certainly not work for exabyte-size data sets. So NIH will not only need to invest in the hardware to store and manage the data, but will also need to invest in the development of new analysis algorithms and techniques.

I/O performance issues: Data cannot be separated from the way in which they will be used. This impacts the way in which the data are structured and the storage formats (second major bullet, Standards Development, in the RFI Background list). When dealing with small amounts of data, the focus can be on, for example, ease of use; when dealing with large amounts of data, the additional requirements of fast and reliable access must be fully taken into account. The level of effort involved in developing effective data storage and management techniques and practices can be substantial, as we have learned firsthand from managing astronomical and physics data produced by major experimental facilities. NIH will undoubtedly be dealing with even larger amounts of data, produced more rapidly and generated at multiple sites. More pointedly, as NIH moves to larger and larger data sets, and federations of data sets, it will discover that the I/O performance of most systems is inadequate to handle the volume of data in a timely fashion.
Solving this problem requires getting many things right, from organizing the data so that it can be accessed efficiently, to picking representations that allow it to be manipulated efficiently in the available memory of the computer systems, to developing algorithms and data management interfaces that work well with peta- to exabytes of data, and, last but not least, to designing the storage and I/O systems to maximize the transfer rate between disks and memory.

Security: Security for data management is a difficult problem with many subtle issues; biomedical and health data are particularly challenging. In some cases, including the well-publicized aggregate GWAS study results, restricted information can be obtained by asking the right questions about the data. Progress has already been made in this area, particularly in cryptographic techniques that restrict the type of knowledge that can be extracted. The main lesson so far has been that security must be part of the original design. It makes sense to further involve experts in security and trust research, such as the Information Trust Institute at Illinois (already the lead partner in the HHS-sponsored SHARPS project).

In summary, the data problem must be considered in the context of how the data are used, considering both the ability to make efficient use of them and the need to maintain their security, particularly when there are ethical and legal requirements.

We welcome the opportunity to comment on the NIH request for information regarding the management, integration and analysis of large biomedical datasets. Our comments are made from the perspective of a center that has extensive experience in the management of large datasets produced by other scientific fields, in particular astronomy and physics. These fields present a number of the technical challenges that will also be encountered when transitioning from "traditional" biomedical datasets whose sizes are measured in giga- or terabytes to those that are being generated today and in the coming years, where peta- or exabyte datasets will be the norm.

It is often argued that biomedical research is moving so rapidly that it is difficult to develop a coherent approach to managing and analyzing the data being produced in ever larger quantities. And this is, to some extent, true. However, many other research fields, including those with which we are associated, are undergoing similar, if not quite so rapid, changes. For example, the observational astronomy community reprocesses all of the data from their telescopes periodically to take advantage of improvements in the algorithms and techniques for analyzing the image data. The Large Synoptic Survey Telescope project, with its 3-gigapixel camera, will eventually gather over 100 petabytes of data, which will be re-analyzed at least annually.

Background: The University of Illinois faculty, staff, and students conduct millions of dollars in biological research every year. Indeed, scientists on our Urbana campus are pursuing some of the country's most important biological research in genomics and medicine, looking for critical solutions to drive energy independence, developing cutting-edge materials to enable advanced manufacturing, and pursuing novel solutions to information technology challenges. Our Chicago campus, with its medical school and healthcare system, receives millions of dollars in federal funding to address health disparities by developing novel drugs, innovative treatments, and social programs.
Since 2000, the University has increased its research budget by more than 50%, to over $900 million today, making it the top research university in Illinois. The University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications (NCSA) have been at the forefront of computing technology since the 1950s. NCSA has served the science and engineering community for more than 25 years, providing these communities with computing, data and visualization resources and working closely with them to take full advantage of these resources for advancing scientific discovery and the state of practice in engineering. NCSA is currently building the largest computing system ever to be deployed in an academic center for the National Science Foundation. Blue Waters will have a peak performance of nearly 12 petaflops, 1.5 petabytes of memory, 26 petabytes of on-line disk storage and 380 petabytes of archival (tape) storage.

Remainder of attachment text same as that of comment boxes.
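
To make the all-to-all constraint discussed in this comment concrete, the following sketch computes a pairwise-distance quantity over an on-disk dataset one block-pair at a time, so that only two modest blocks are ever resident in RAM; the file name and dimensions are hypothetical:

import numpy as np

# The profiles live on disk in a memory-mapped file (name and shape are
# hypothetical); only two blocks are ever resident in RAM at once.
data = np.memmap("profiles.dat", dtype="float32", mode="r",
                 shape=(1_000_000, 128))
BLOCK = 4096  # block size chosen so each pairwise tile fits in memory

def blocked_min_distance():
    """Closest pair of profiles, computed one block-pair tile at a time."""
    best = np.inf
    n = data.shape[0]
    for i in range(0, n, BLOCK):
        a = np.asarray(data[i:i + BLOCK], dtype="float64")  # load block i
        for j in range(i, n, BLOCK):
            b = np.asarray(data[j:j + BLOCK], dtype="float64")  # load block j
            # squared Euclidean distances for this tile only
            d = ((a * a).sum(1)[:, None] + (b * b).sum(1)[None, :]
                 - 2.0 * (a @ b.T))
            if i == j:
                np.fill_diagonal(d, np.inf)  # skip self-comparisons
            best = min(best, float(d.min()))
    return best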
03/12/2012 at 02:25:00 PM Organization The SBML project (at the California Institute of Technology) Pasadena, California, USA For all branches of science to make progress, our published results must be reproducible. Many scientists think that reproducibility is adequately achieved by publishing a paper in which they describe their methods and results. This is far from the truth, and anyone who has tried to replicate published work will surely agree. The sad reality is that it is the exception, rather than the rule, when results published in a paper can be reproduced by another group.

Yet in many areas of science today, we can do better, because we rely more and more on computational methods, software tools, and databases. Computational artifacts, by their nature, lend themselves to sharing that is more direct and effective than simple narrative descriptions published as papers. When it comes to science that has a computational component, as Donoho wrote in 1998, an article "is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures" (...) In other words, we should be publishing our code and our data, not just descriptions of what we did in a particular scientific study.

It is sometimes not obvious just how crucial it is to be able to see the actual software code, numerical data, and/or models used in a particular study. I will use an example from an area with which I am familiar, but analogous problems exist in other areas and applications. I am involved with a project to provide a public database of published computational models in biology. Every model in the database (and there are now several hundred) is curated by humans who, for each paper, try to re-create the computational model such that it faithfully reproduces at least one figure in the publication. Mind you, these are published models - every one of them has appeared in a peer-reviewed publication. The curators' failure rate has been unbelievably high: until recently, they were unable to reproduce the results of somewhere between 95-99% of the publications they tried. Their success rate has significantly improved now that authors more often provide them directly with their models in SBML format, a community standard; the curators no longer have to recreate the model solely from the published papers. Their success rate is still not perfect, but it is far better.
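
A minimal sketch, using the open-source python-libsbml bindings, of why a shared model in SBML changes the curators' task: the model can be loaded and sanity-checked directly rather than re-derived from the paper's narrative (the file name is hypothetical):

import libsbml

document = libsbml.readSBML("published_model.xml")  # hypothetical file name
if document.getNumErrors() > 0:
    document.printErrors()  # report problems found while parsing
else:
    model = document.getModel()
    print(model.getNumSpecies(), "species,",
          model.getNumReactions(), "reactions")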

The current policies in NIH RFAs and PAs only ask that applications describe plans to share data and software products created by their publicly funded research. They do not (at least in the cases I have seen) actually require funded projects to share their data. It would not seem unreasonable for NIH to require that projects share data in standard (or at least commonly accepted) formats, especially if those formats were developed thanks to NIH funding in the first place. (Case in point from my field of work: the SBML format for computational models in biology, described at http://sbml.org and supported by hundreds of software tools worldwide today.) Please see the last paragraph in my answer to Comment 1.
03/12/2012 at 03:10:11 PM Self     "Incentives for data sharing - Standards and practices for acknowledging the use of data in publications"

An increasing number of large-scale data sharing projects (e.g., ADNI, PPMI) require data users to agree to data use conditions that are in direct conflict with academic authorship standards.

These projects require inclusion of the respective initiative on the author list of published journal papers that use shared data, even when the data use is completely coincidental and not related to the design of the data sharing initiative. See the attached NeuroImage Comment article for details and background.

As a result, the data produced and shared by these initiatives can only be accessed by scientists who are either unaware of authorship standards, willing to ignore these standards, or willing to agree to a data use agreement with no intention to abide by it.

Shared data created with public funding should be usable without requiring ethical compromises, especially when the funding is provided specifically for the creation and release of such data.

It cannot be in the interest of the scientific community if a significant number of scientists are unable to use shared data. More significantly perhaps, it is not in the interest of the community if publicly-funded shared data favors researchers with loose ethical standards by granting them exclusive access to a valuable resource.

NIH should establish and enforce guidelines to ensure that incentives for data sharing do not compromise existing standards in the scientific community, such as for example standards of academic authorship.

Current guidelines are not sufficient or not enforced. For example, the original RFA under which specifically ADNI was established (https://grants.nih.gov/grants/guide/rfa-files/RFA-AG-04-005.html), stated the clear requirement to put the created data into the public domain. Instead, these data are tightly controlled and used to re-define long-standing principles of academic authorship without any proper authority to do so.

As more large-scale initiatives adopt substantially identical Data Use Agreements, there is an urgent need for the funding bodies, specifically NIH, to place a stronger emphasis on the conditions under which data are shared. Purpose-specific funding of shared data creation should be conditional on the widest and most liberal distribution possible. Under no circumstances should access to publicly funded, shared data require ethical compromises.

PDF copy of article "Why shared data should not be acknowledged on author byline" by T. Rohlfing and J.-B. Poline, Published by NeuroImage in 2011
03/12/2012 at 03:45:57 PM Self     Discussions of technology, data standards, and data accessibility are important, but there is a pressing need to bring the current, outdated model for academic advancement into line with the sea change in how science is done in the 21st century. This will require a concerted and committed effort from academic institutions and NIH alike. The more advantageous the sharing of quality, fully-annotated datasets is to a scientist's career advancement, the more likely it is to happen.

NIH must implement a system for uniquely identifying NIH-funded scientists and a means of uniquely and accurately associating these individuals with publications and deposited datasets. This system must become as widely adopted as PubMed has been. There is little excuse, in this computationally driven era, for ambiguity of authorship to persist as it does.

In addition to supporting the analytics and technical workforce, NIH must commit funding to the database curation workforce and make this an attractive alternative career path for scientists. Data repositories that do not have good quality data and metadata do science a disservice - it is incumbent upon us to ensure that data repositories are staffed by well-compensated, highly-motivated PhD-level curators, whose job it is to annotate datasets to a high degree of accuracy and to ensure data integrity.

At the academic level there must be recognition for deposition of datasets. NIH must lead the way on this and recognize these activities; recognition will then eventually trickle down to the level of individual academic institutions. One solution is to add dataset deposition information as a separate section in the required NIH biosketch, and as a separate section in a scientist's CV. Evidence of deposition reflects well on a scientist's appreciation of the fact that sharing data is vital to its full leverage by the research community, and this is a trait that should be taken into account by promotion committees at the academic level and by study sections at the national review level.

The bottom line for NIH and grantees alike is maximizing the impact of federal research dollars. Currently there is no culture of data sharing in the US biological research community. Practical measures that encompass recognition in academic advancement and NIH funding must be implemented to encourage such a culture to the point that it becomes habitual. Carrot and stick policies must operate side by side at the institutional level and national level to bring this culture about. The NIH must become less passive with regard to enforcing data sharing by its grantees. If grantees are spending federal research dollars, it is incumbent upon them to preserve the research that these dollars purchase. Requiring deposition of full manuscripts with PubMed Central within one year was a very welcome development, but Congress must empower NIH to require deposition of the full, annotated datasets relevant to these papers in appropriate repositories. Whether it is a carrot (special consideration for PIs who deposit datasets) or a stick (negative impact on future funding for PIs who do not deposit datasets), something must be done to prevent the loss of millions of data points every month. Our own research shows that only 50% of microarray datasets from NIH-funded publications are deposited in public repositories (see Attachment 1). An appropriate time to check this is when PIs apply for renewal of grants - any paper that involved an 'omics-scale dataset must include an accession number for that dataset.

PDF Copy of "Much room for improvement in deposition rates of expression microarray datasets" by Scott A Ochsner et al, published by Nature Methods in 2008 (vol. 5 no. 12)
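
One simple form such a renewal-time audit could take is a scan of each manuscript for dataset accessions; the sketch below looks for GEO Series accessions (the well-known GSE<digits> pattern), and the input text is illustrative:

import re

# GEO Series accessions have the form GSE<digits>; scanning a paper's
# text for them is one simple form a deposition audit could take.
GEO_SERIES = re.compile(r"\bGSE\d+\b")

def audit(paper_text):
    """Return the dataset accession numbers cited in a manuscript."""
    accessions = sorted(set(GEO_SERIES.findall(paper_text)))
    if not accessions:
        print("No GEO accession found: dataset may not have been deposited.")
    return accessions

print(audit("Microarray data are available under accession GSE11111."))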
03/12/2012 at 04:13:41 PM Organization Association of Independent Research Institutes (AIRI) Westminster, MD AIRI recognizes the transformative nature of large-scale datasets on biological research, that information is at the core of our scientific activities, and that our ability to collect, manage and understand data is vital to the success of our institutions. Research efforts of particular concern (genomics, molecular and phenotypic profiling, and imaging) are increasingly adopting high-throughput techniques that place unprecedented demands on data storage, management, and analysis. The rapid growth of large-scale biomedical datasets places exceptional demands on information technology infrastructure. Petabyte-scale storage systems, along with high-performance compute clusters and data networks, are tremendously expensive to procure, maintain and operate; finding financially sustainable models to achieve the necessary infrastructure is essential for the viability of our institutions. AIRI believes that NIH has a key role in structuring this process. With data generated in the petabytes, requirements for data preservation and sharing must be reasonable. Individual research institutes cannot be called upon to store and share data beyond the reasonable depreciation schedule of the funded storage systems. Central, federally funded data repositories, such as those maintained by the NCBI, are vital to preserve and make available research data in the long term.

The Association of Independent Research Institutes (AIRI) welcomes the opportunity to provide input into the deliberations of the Advisory Committee to the NIH Director (ACD) Working Group on Data and Informatics (DIWG). The Request for Information recognizes that the growth of large biomedical datasets presents tremendous opportunities for discovery along with significant challenges, and we applaud the ACD for seeking feedback from the research community. AIRI is a national association of 86 independent, not-for-profit research institutes whose primary mission is research. Our relatively small size and greater flexibility provide an environment that is particularly conducive to scientific creativity and innovation. Independent research institutes receive 10 percent of the National Institutes of Health (NIH) peer-reviewed, competitively awarded extramural grants. On average, AIRI members receive over 10 percent of their funding from the National Science Foundation (NSF), and nearly half of the AIRI institutes receive Department of Defense (DOD) funding.
In 2010, AIRI organized the Petabyte Challenge, a special meeting with the goal of assessing and measuring the explosive growth in scientific data; discussing and evaluating case studies of recently architected systems at peer AIRI institutes; exploring the most promising technologies; forecasting the current and ongoing cost of alternative solutions; and evaluating cost-recovery models that could be used to finance acquisition and operation of large-scale systems. At that time, the majority of attendee institutions were still hosting less than a petabyte of data; however, they predicted that data storage demands would double over the next three years. The 2010 Petabyte Challenge Meeting focused largely on the challenges associated with the effective design and operation of large storage systems. Two years hence, these technical challenges are much better understood, and the topics of the upcoming AIRI 2012 Petabyte Challenge are accordingly wider in scope. The theme of the AIRI 2012 Petabyte Challenge Meeting is "From Data Storage to Data Management." At this stage, several AIRI institutions manage well over a petabyte of data, some multiple petabytes. We would happily welcome any members of the ACD DIWG to our upcoming Petabyte Challenge Meeting. More information on the AIRI Petabyte Challenge can be found at http://www.airi.org/membership/about/detail.aspx?id=447.

We look forward to providing the ACD DIWG with additional information following the AIRI 2012 Petabyte Challenge, and we thank the working group for examining the challenges faced by the research community in managing large biomedical data sets.
03/12/2012 at 04:42:18 PM Organization Association of American Medical Colleges Washington, DC The key areas, critical issues and impacts, primarily on institutions, are discussed in the attached letter, on behalf of the Association and its constituents. The key areas are training and workforce, sustainability, community-led standards and methodologies, and patient-centered outcomes. Please see attached. Please see attached.

The Association of American Medical Colleges (AAMC) is grateful to the Working Group on Data and Informatics for the opportunity to comment on issues relating to management, integration, and analysis of large biomedical and public health datasets. We commend the Advisory Committee to the Director (ACD) for undertaking a timely examination of this topic. The AAMC represents all 137 U.S. allopathic medical schools, nearly 400 teaching hospitals and health systems, 68 VA medical centers, and almost 90 academic societies. Our comments here reflect input from many constituents, primarily collected through AAMC's Group on Information Resources (GIR); Graduate Research, Education, and Training (GREAT) Group; and Group on Research Advancement and Development (GRAND).

Our overarching view, and one that we believe is shared by the working group, is that technological change is reshaping both science and society, and that data resources in academic medical centers have grown not only in volume and complexity, but also have changed qualitatively. Where datasets were once largely assemblages created by and associated with particular research projects, programs, or departments, increasingly data resources have more heterogeneous origins - including those generated by delivery systems and communities, and even from portable data devices and "crowd sourcing." Further, many of these resources are seen as a fundamental part of institutions' infrastructure, personnel, or administrative capacity. Data without structure, tools, and quality has limited value.

Growth and Sustainability: The AAMC, in partnership with the National Library of Medicine, recently hosted a Summit on the "Digital Research Enterprise," convening representatives from across the academic medical community with responsibility for maintaining and developing these resources. The Summit envisioned an emerging environment in which multitudes of investigators, in collaborations or individually, contribute to and draw from "big data" both to explore the molecular basis of human health and disease and to innovate health care. Institutions face many challenges in managing and nurturing these resources and protecting their integrity. It is vital to bring front-line communities and leadership to this effort to identify ways to leverage current investments for long-term value and to drive future strategies. We also know that in addressing these challenges, the community should not expect substantial increases in public funding, at least not at levels used to establish new infrastructure in past eras. The focus of the working group recommendations should include helping institutions direct or redirect resources in ways that offer long-term sustainability and growth of systems, thereby providing institutions a return on investment. Also, given the complexity of the issues, recommendations with clear and relatively short-term impact will be especially valued.

Workforce development: The training and development of a research workforce that is skilled in the use and development of these resources is a critical concern.
Just as the proportion of costs allocable to data storage at sequencing centers has grown significantly and has impacted center budgets, so too is the human resources budget shaped by these demands. We see this as a growing trend in the years and decades to come: an increasing proportion of the talent required to conduct modern biomedical research (either molecular or patient-based) will be focused on the data lifecycle.

Recommendations: Given the near decade-long training process for biomedical PhDs and the urgency of ensuring a highly skilled workforce, NIH should advance a coherent strategy for quickly implementing training for informatics and analysis of large data sets. NIH should help increase awareness and incentivize development of such programs, which could be within existing training and career development programs. For example, planning for informatics and data training could be encouraged within training program applications and progress reports. AAMC earlier recommended to the NIH ACD Working Group on the Biomedical Research Workforce that success of training programs should be defined broadly, and that the evaluation of training programs be revised accordingly. Training on research project grants (R01s) should also include training plans, allowing for broad scope and diversity in research careers, including in informatics and data analyses. These efforts in training for informatics and data management will contribute to the development and innovation of new health systems and will also create new job types in biomedical research, not only for NIH-supported academic research but also for the private sector.

Fostering community-led standards and methodologies: NIH has had great success in developing and supporting large data resources in genomics, proteomics, and publications, primarily by fostering standards and consistent policies that have broad support across the research community, and by ensuring research community access to these databases.

Recommendations: NIH should continue to help foster new data standards so that datasets will be well organized, protected, accessible and reusable, enabling research and allowing for future interoperability. This should include expansion of NCBI text databases to permit further "mining" of data and information within the literature. Data standards should reflect learning from other communities and take into consideration the increased value of data when developed with an eye toward future (re)use and potential integration with other data types drawn from other areas within biomedical and health research. Given that health and biomedical research is increasingly international, standards and methodologies should be able to integrate on a global scale where appropriate.

Patient-centered outcomes data: The AAMC strongly supports new federal research initiatives that focus on improving the quality and effectiveness of health care, and that seek to ensure that clinical findings are implemented within medical practice in a timely way. Such initiatives include community-based participatory research, and research to address health disparities and improve health equity. Much emerging research depends upon development of electronic medical records, such as guided by the evolving "meaningful use" standards of the Centers for Medicare and Medicaid Services (CMS), or as currently used in the Department of Veterans Affairs.
Recommendation: The Working Group should specifically note in its recommendations the increasing importance of patient-centered health outcomes data, such as that supporting the research of the Patient-Centered Outcomes Research Institute (PCORI), and the vital role of such data in improving the quality, reliability, and effectiveness of health care delivery.

Given the accelerating pace at which information technologies and data resources develop, the AAMC has focused on relatively short-term, specific objectives, such as the training and workforce development recommendations above. The Working Group hopefully will be positioned to specify more detailed and actionable recommendations that NIH, in partnership with the research community, can implement. The Working Group should also include milestones and measures to ensure that objectives are met. AAMC is willing to provide additional input that would be helpful to the working group in developing its recommendations. For assistance or more information, please contact [contact information redacted].
03/12/2012 at 04:44:31 PM Organization Booz Allen Hamilton Rockville, MD Identifying complex patterns in biomedical research and health care delivery data using modern analytical and data mining techniques. Aggregating the data (both current and future) in a form that is compatible with large-data-set processing techniques, such as key/value pairs instead of conventional databases, and adding the necessary security data to the keys so that appropriate privacy protections can be provided. The issues identified above should cause NIH to convene stakeholders and those conversant in analytics to identify areas of focus for investigation using the techniques and data mentioned in comments 1 and 2.
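
A minimal sketch of the key/value idea described above, in which security labels travel with each key so that privacy protections can be enforced at retrieval time; the label vocabulary and records are hypothetical:

# Each record's key carries the security label needed to enforce privacy,
# so access control travels with the data. Labels and records are
# hypothetical placeholders.
STORE = {
    ("patient:0001", "phi"):     {"dob": "1950-01-01"},   # protected field
    ("patient:0001", "deident"): {"age_band": "60-69"},   # shareable field
}

def get(key_id, label, clearances):
    """Return a value only when the caller's clearances cover its label."""
    if label not in clearances:
        raise PermissionError("clearance %r required" % label)
    return STORE[(key_id, label)]

print(get("patient:0001", "deident", clearances={"deident"}))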

03/12/2012 at 04:54:50 PM Organization University of North Carolina at Chapel Hill Chapel Hill, NC With large-scale data creation throughout the biomedical clinical research community increasing rapidly, the need for end-to-end life-cycle data management and the fostering of effective data sharing becomes paramount. In this white paper, we propose a paradigm for a national data infrastructure for large-scale biomedical data collections, through development of "shared contexts" at multiple life-cycle stages that are governed by well-defined machine-executable policies. We also propose a layered architecture for implementing such a national data infrastructure and provide a use case that illustrates where such a system is being attempted.

The central concept is the realization that collaboration requires the formation of a shared collection. Researchers rely upon a common name space for identifying files, common semantic terms to understand the context associated with each file, shared procedures for manipulating and analyzing the data, and a shared consensus on the policies for managing the data. The shared context enables a designated community to understand and use the shared collection.

In a national data infrastructure, each stage of the community-driven collection life cycle is governed by an explicit set of policies that represent the community consensus on data sharing. The policies enforce assertions about collection properties, such as internal consistency, completeness, authoritativeness (data source), authenticity, coverage, and reliability. The policies appropriate for a local project may allow the local collection to contain intermediate results, output from trial analyses, and un-vetted results. The policies for a shared collection (a digital library) may require that all data be calibrated, evaluated for internal consistency, and explicitly approved for publication. Thus the policies for each stage of the collection life cycle are different. The procedures used to apply the policy will also differ, along with the state information that records the outcome of policy enforcement. A community-driven collection life cycle requires a highly extensible architecture in which policies, procedures, and state information - a shared context - evolve as the user community broadens.
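
A minimal sketch of one such machine-executable assessment policy, which periodically verifies that every file in a shared collection still matches its registered checksum; the paths and digests are placeholders:

import hashlib
import pathlib

# The catalogue maps each file in the shared collection to the checksum
# registered at ingest; paths and digests here are placeholders.
CATALOG = {
    "collection/sample_001.dat": "<registered-sha256-digest>",
}

def current_digest(path):
    """Compute the file's checksum as it exists on storage right now."""
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def assess_integrity():
    """Periodic assessment: flag files that no longer match their
    registered state so a repair action can be triggered."""
    return [path for path, registered in CATALOG.items()
            if current_digest(path) != registered]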

Research information lifecycle:
- data access related to access from distributed repositories of research data
- models and technical solutions for distributed querying of data and metadata
- comprehensive authentication and authorization procedures
- long-term data sharing with relationship to data standards, reference sets, and capturing provenance
- auditing and accountability leading to ownership and citation for data usage

A policy-based data management environment has the ability to validate assessment criteria and verify that an intended use of a collection is consistent with the published context. An assessment policy can be periodically applied to verify that a desired collection property has been conserved. An example is the verification of an integrity policy that requires that each file be associated with access control policies and validated for authenticity. The assessment policy periodically checks each file for fidelity with the policies, compares with contextual requirements, and repairs any discrepancies. There is a very strong link between automation of administrative functions and periodic execution of assessment criteria. It is possible to design policies that enable properties of a collection to be verified with automatic correction of detected errors. As collections grow to multi-petabyte size and are distributed across multiple locations, such automation is essential to minimize administrative labor costs.

National Data Infrastructure for Integration of Large Biomedical Data Collections

Opening of attachment text same as that of comment boxes.

A significant challenge is that the shared context depends upon the knowledge available to the designated community. The original creators of a data set have a highly sophisticated understanding of the methods that generated the data, and need only a simple context based on the name of the application and, for simulation data, the input parameters. When the data are shared with a broader community, the context needed to describe the data should also include information about the generating application, the coordinate system in which the data are embedded, the units used to define physical quantities, and the relationships between the physical quantities.
When the data are published in a digital library, the context associated with the data must meet the semantic standards of the discipline, including a standard format (data structure), ontologies that map relationships between semantic terms, and access policies for allowed use of the data. When the data are preserved, the required context must meet the needs of future researchers for correct interpretation and processing of the data. The shared context evolves to provide a more detailed description of the data as the user community broadens. The evolution of the shared context constitutes a community-driven collection life cycle. An example life cycle is the migration of data from a project collection, to a collection shared with other researchers, to a digital library for formal publication of vetted results, to a reference collection for use by future researchers.

Each user community is accustomed to a preferred user interface. As the user community broadens, the set of user interfaces to the collection should also broaden. Thus the original project might prefer a file-system interface and directly manipulate data within the shared collection through Unix tools. The preferred access mechanism for the shared collection might be a data processing pipeline based on a standard workflow system, while the preferred access to the digital library may be a web browser.
In common practice, there are roughly eleven types of access methods used within research disciplines: file systems, Unix tools, scripting-language load libraries, workflow systems, web services, portals, grid tools, digital libraries, Dropbox, web browsers, and I/O libraries. The national biomedical data infrastructure should provide multiple interfaces for the client front-end to make it easier to access and manipulate data.

A national data infrastructure for biomedical data is probably not situated at a single site or administered by a single entity. Hence the infrastructure must also manage interactions between communities that have chosen different management policies. The concept of federation is based on the idea that a new virtual shared collection can be created that represents the data-sharing consensus between the separate communities. Each community can choose to enforce local management policies within their data grid, and then establish the additional policies that govern the shared collection. Each community can retain local control while enabling researchers to collaborate through federated collections.

Layered Architecture for Integration of Existing Capabilities for Biomedical Datasets: Multiple communities are developing cyber-infrastructure that supports biomedical initiatives. In addition, large-scale biomedical research applies methods that are continually evolving for analysis, arrangement, and description of research data products. The development of separate, project-specific data management solutions can be either a barrier or a major resource that needs to be leveraged to achieve this integration. This proposal addresses critical milestones for integrating the separately evolving infrastructure silos to enable collaborative research.

The traditional approach is to build a community resource that provides standard mechanisms for discovering data sets, accessing data, and manipulating data. Given the diversity of the types of data and information resources, multiple standards have been developed. From the perspective of a researcher who needs to work with community resources, a collaboration environment is needed that tracks the provenance of the data sources, manages the research analyses, and preserves the research products. The collaboration environment provides interoperability mechanisms that enable reproducible science, and the ability to register, execute, publish, and share workflows. The goal is to enable a collaborator to re-execute an analysis, test input parameters and model extensions, and compare the new research results with prior analyses. Data grids provide a layered architecture for integrating existing services and data repositories into a collaboration environment. Note that the focus is on the capabilities that a researcher will be able to apply, not on the development of a single community standard for accessing information or data.

To verify that a collaboration environment can be implemented that enables access to existing capabilities, an extensible and sustainable layered architecture needs to be developed. A prototypical layered architecture is shown in Figure 2. Each layer provides mechanisms to support interoperability across multiple protocols (or APIs) used by existing biomedical infrastructure. A collaboration environment supports the registration of models, workflows, data sets, and metadata that are used in research.
The collaboration environment uses computer-actionable policies to automate administrative tasks (such as acquisition of data sets from a remote resource), enforce management policies (access controls, sharing, quotas), and validate assessment criteria (integrity, authenticity). Workflows may be used to define the set of analysis tasks used in the research, may be shared, and may be published. The remote resources are accessed through mechanisms that range from web services to drivers that issue a specific storage protocol. Each access requires the choice of authentication, authorization, and transport mechanisms. The external resources are typically assembled by a community and may provide community models, or data repositories, or union catalogs for a specific sub-discipline.

The layered architecture can be thought of as systems that support individual research collections, registration of objects, policy enforcement, access mechanisms, network protocols, resources, and federation. Each layer may require use of multiple protocols or interaction mechanisms. To enable shared collections that span multiple protocols, data grids support soft links that register remote data objects or information queries. Data grids also support registration of re-executable workflows. An interoperability layered architecture system will identify the existing infrastructure components, the types of interaction mechanisms currently being used, the mechanisms that need to be implemented to improve interoperability, and the opportunities for new technology integration. The exploration of existing interaction methods can answer questions related to:
- Identification of major community data resources
- Feasibility of a unified data discovery catalog
- Cross-community data exchange protocols
- Federation of repositories across multiple agencies and collaborations
- Brokering of publishing, discovery, and access across disparate communities
- Development of common data access mechanisms
- Development of data assessment capabilities
- Integration of workflows with data repositories for remote data analysis operations
- Framework for analysis of experimental data
- Formation of shared environments for collaborative research
- Integration of analysis frameworks
- Decision support
- Virtual display environments

The integration efforts for each level of the architecture will require development of a community consensus. For example, the data discovery effort will identify query languages for interoperability between repositories. The data assessment effort will identify gaps or irregularities in time series, user interventions, and inoperable sensors. The workflow model orchestration effort will identify architecture and guidelines for interfaces to perform integrated analysis, and approaches for semantic mediation for matching analysis methods.

An initial survey needs to be developed to identify existing capabilities and opportunities for integration of technologies. A near-term project (6-12 months) can be funded for defining project architecture, prototyping, and demonstrating end-to-end capabilities across existing systems. The project will document expectations, architecture, lessons learned, and gaps that require further development. A longer-term initiative (two years) is needed for providing a framework for integration with other communities, identifying emerging capabilities, and fully integrating capabilities.
The interoperability layered architecture (LA) system will be driven by an end-to-end scenario for data discovery, data access, data retrieval, data analysis, and data publication. An example is the improvement of our ability to predict and respond to variability in genomic sequences while tracking the efficacy of treatment procedures. The LA system is intended to enable inter-disciplinary collaborations that require interoperable data and model workflows, support for provenance, data quality propagation, and data assessment with the least burden on scientists. The LA system is also intended to enable inclusion of new technologies, cyber-architectures, storage, and distribution mechanisms.

Use Case: The technology exists today to implement federation of data management systems across NIH research projects, across federal agencies, and across international collaborations. In particular, the NSF DataNet Federation Consortium cooperative agreement is building a federation hub at RENCI to integrate federal repositories (NOAA NCDC, NASA NCCS [12]), national data grids (TeraGrid / XSEDE), regional data grids (the RENCI data grid), and institutional repositories (Carolina Digital Repository, LifeTime Library). The approach relies on iRODS, the integrated Rule-Oriented Data System (www.irods.org), to federate independent data management systems into a common data name space. The architecture is shown in Figure 3. Peer-to-peer servers implement the major components of the system: storage servers, a message server, a catalog-enabled server, a scheduling server, and a workflow server. Each storage server manages conversion to the protocol required by the local storage system, and uses a local rule engine and rule base to apply policies at that server location. The message server is used to track progress and support distributed debugging of the distributed rule engine. The catalog-enabled server manages all state information and audit trails that track use of the environment. The scheduling server manages execution of deferred and periodic policies, and the workflow server manages interactions with external workflow systems. The levels of virtualization needed to manage data distributed across multiple administrative domains and across multiple types of storage systems are shown in Figure 4. The actions requested by each type of access client are trapped by policy enforcement points within the data grid middleware. For example, an action such as ingestion of a file into the data grid traverses ten policy enforcement points. These include policies for allowed access by a remote host, access permissions for public files, access permissions for private files, the server to use for storage, whether the storage quota is exceeded, the physical path name to use when writing the file, whether metadata should be modified before or after storage, and whether post-processing is required after file creation or after the file write. The policies are applied by the rule engine and read from the local rule base. Each policy controls the execution of a procedure that is composed by chaining functional units called micro-services. The micro-services are operating-system independent, and issue I/O operations based on an extension to POSIX I/O. This implements infrastructure independence, with the same micro-services used on Unix and Windows operating systems. The standard I/O protocol is then mapped to the protocol required by the choice of storage system.
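To make the policy-enforcement-point idea concrete, here is a hedged Python sketch. This is not iRODS rule code; the enforcement-point names, micro-services, and context fields are all invented for illustration. It shows an ingest operation traversing named enforcement points, each running a configured chain of micro-service-like functions:

    # Hypothetical sketch of policy enforcement points (PEPs). On ingest,
    # each PEP runs its configured chain of micro-services in order.
    def check_quota(ctx):
        if ctx["bytes_used"] + ctx["size"] > ctx["quota"]:
            raise RuntimeError("storage quota exceeded")

    def map_physical_path(ctx):
        ctx["physical_path"] = "/vault/%s/%s" % (ctx["user"], ctx["logical_name"])

    def replicate(ctx):
        ctx.setdefault("replicas", []).append("backup_resource")

    # Each enforcement point names the micro-services chained under it;
    # swapping this table swaps the policy without touching other code.
    POLICY = {
        "pre_put":  [check_quota, map_physical_path],
        "post_put": [replicate],
    }

    def enforce(point, ctx):
        for micro_service in POLICY.get(point, []):
            micro_service(ctx)

    def ingest(ctx):
        enforce("pre_put", ctx)
        # ... write ctx["size"] bytes to ctx["physical_path"] here ...
        enforce("post_put", ctx)

    ingest({"user": "alice", "logical_name": "scan42.nii",
            "size": 10, "bytes_used": 0, "quota": 100})

Because the policies live in a table read at each enforcement point rather than in the ingest code itself, new policies, procedures, or storage back ends can be added without modifying other components, which is the property claimed for the data grid middleware above.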
This approach makes it possible to add new clients, new policy enforcement points, new policies, new procedures, and new storage systems without modifying other components of the system. The procedures can be thought of as storage-server-side workflows that are intended to implement low-complexity operations (a small number of operations compared to the size of the file in bytes). These workflows can be used to implement administrative tasks (file migration). They can also enforce data management policies (retention, disposition, distribution, time-dependent access controls) and validate assessment criteria (integrity, authenticity, chain of custody). Each community chooses a preferred client and selects the set of policies and procedures that are appropriate for its data management application. The forms of federation that can be supported include collections that contain soft links to remote information resources, the mounting of remote directories into the data grid, or true federation in which two independent data grids establish a trust relationship and users are cross-registered between the data grids. A user in Data Grid A can be given access to data within Data Grid B, while the policies of Data Grid B control what can be done. Each user is authenticated by their home data grid, while data management policies are enforced by the data grid in which the files are located. This makes it possible to apply multiple types of federation: central archives into which data grids deposit data for common management; master-slave data grids in which the slave data grids receive data from the master data grid; deep archives into which data are pulled from a staging data grid; and chained data grids through which data are replicated to ensure multiple copies. The integration of multiple, heterogeneous data management environments is now possible. The creation of national data infrastructure that links academic, institute, and federal repositories will accelerate research on societally important questions in biomedical science. Specific instances of institutional use of policy-based data grids are the Wellcome Trust Sanger Institute (http://www.biomedcentral.com/1471-2105/12/361) and the Broad Institute (http://www.distributedbio.com/media/iRODS%20Bio-IT%20World%202011.pdf), both of which have constructed genomics data grids based on the iRODS technology.
03/12/2012 at 06:41:21 PM Organization DELSA (Data-Enabled Life Sciences Alliance International) Seattle Washington Please see the attached document. Please see the attached document. Please see the attached document. I. Introduction

This response to the NIH Request for Information (RFI) is being submitted by the Data-Enabled Life Sciences Alliance International (DELSA). DELSA's purpose is to build and promote a sustainable ecosystem for life science researchers committed to collaboration across disciplines, alongside innovators in computing, infrastructure, and analysis, with the express goal of translating new discoveries into tools, resources, and products (http://www.DELSALL.org/). By taking a transdisciplinary approach - integrating experts across both academia and industry, life sciences and computing, cyberinfrastructure and analysis, policies and media - DELSA facilitates shared access to data and catalyzes the development of new insights and innovations that can more rapidly address global societal needs. The alliance builds on the collective abilities of its members, including seasoned Executive in Residence members (EIRs) who apply real-life expertise to all facets of project management, and also big-data computing, management, and infrastructure experts who wrestle with the challenges of the data deluge every day. One of the most critical areas for NIH to address is large biomedical datasets. Specifically, the first and most critical step is achieving 'data democratization' through universal access to data and computing resources. Discovery is a matter of opportunity, and it is imperative that opportunity be extended to all. Just as the world has become a global economy, so too can and should science become a truly global endeavor if the resources needed to ask and answer questions are available to all, regardless of affiliation, location, or degree. All of the issues that NIH has identified are indeed critical, and DELSA members have given thought to sustainable solutions for the life science community. By developing policy, awareness, and infrastructure, NIH has the opportunity to implement many of these solutions effectively.

II. Scope of the Challenges/Issues

1. Challenges and Opportunities of the Life Sciences Community
DELSA has identified four key challenges and opportunities that must be addressed for effective 21st-century life sciences (Kolker et al., 2012).
1. Life sciences research necessitates work across diverse domains. This is especially true amongst computer, cyberinfrastructure, and data experts in their effort to leverage opportunities in data-enabled science (DES). Straightforward, equal, and sustainable access to data, computing, and analysis resources will enable true democratization of research competitions. In this environment, investigators will compete based on the merits and broader impact of their ideas and approaches rather than on the scale of their institutional resources. Consequently, the progression of data to knowledge to action will be vastly accelerated, impacting every scientist, student, and citizen.
2. Scientific progress and the accelerated rate of data production in the life sciences are resulting in a pressing need for validation and reproducibility of results. This need can only be met through new standards and data sharing capabilities.
3. Current funding structures and merit assessments appear unsuited for appropriate support of collective innovation and synergistic science. An examination of current funding initiatives is well-timed.
4. Data-enabled sciences will only succeed through the inter-agency collaborative efforts of those agencies that have held and continue to hold American scientific progress as one of their foremost mandates, in particular NIH, NSF, DOD, and DOE. Collaborative approaches must also be emphasized within each agency, along with international and national efforts. Altogether these efforts will reduce waste and accelerate the process of research -> discovery -> translation -> implementation/delivery.

2. Unrealized Research Benefits
1. Data Democratization: The potential for unrealized research benefits is immense. Straightforward, equal, and sustainable access to data, computing, and analysis resources will enable true democratization of research competitions. In this environment, investigators will compete based on the merits and broader impact of their ideas and approaches rather than on the scale of their institutional resources. New ideas, instead of being thwarted by lack of resources, will be easily expanded or quickly abandoned to make room for other approaches.
2. Effective Use of Resources: A collective innovation approach to the life sciences will lead to more effective use of resources. Trusted data would be widely available without the need for duplicative experimentation, thus freeing resources for a further cycle of inquiry rather than a repeat cycle. A critical resource, experience in innovation, would be identified and shared to support projects and to give advice about approaches and solutions.
3. Strategic Implications: This topic has very serious strategic implications. A recent analysis of US science, with comparisons to the EU and China, found that the US has, by most metrics, maintained its position of relative preeminence in the sciences (Hather et al., 2010). However, the inability to realize full potential must be addressed if the US wishes to stay at the top and continue enabling infrastructure science, sustainable knowledge-based advancement, and innovative collaboration (Kolker, 2011; Ozdemir et al., 2011). Successfully meeting the challenges that face us will greatly support strong US science and innovation.

3. Tractability with Current Technology
Data analysis is the final, most complex, and most compute-intensive step in the translation of large-scale data into knowledge-based innovations. The cost of computational analyses is projected to far exceed that of data generation, threatening current data mining infrastructures. Currently, research progress is severely impeded by heterogeneity of acquisition formats, lack of integration among commonly used tools and, most importantly, by the scale and computational challenges of mining and analyzing these vast data sources. In-lab software development typically focuses on relatively specialized problems, which makes it difficult to scale up or to transport the analyses to different environments. In-lab solutions are rarely shared across the community due to differences in data formats, lack of incentives, high development costs, and an inability to provide adequate support. Hence, there is a pressing need for adequate cyberinfrastructure that could consolidate computing and analytic resources, provide tools for exploration and analysis of large, heterogeneous data and, ultimately, allow the building of complex models of biological systems.
Updating and upgrading the current cyberinfrastructure does not necessitate new technologies in the first steps of the project, but rather a realization that organization must be applied to the current chaos of data and computing resources. The answer may be a distributed computing paradigm, or in lay terms, a computing 'cloud.' For the general research community, and bioinformatics in particular, a distributed computing paradigm (such as computation center clusters or clouds) can be the quantum leap that meets this crucial need, thereby improving research efficiency and enabling breakthroughs in data analysis and modeling. The transition from local computing environments to other technologies is a multifaceted technological and organizational challenge that demands thorough planning and oversight as well as long-term investments. The establishment and maintenance of the shared resources require a centralized effort by the community. Budgeting for the compute centers (and the sustainability costs) can be shared by all stakeholders and realized via subscription services for academic institutions, governed access rights for industry, and designated budgets in biomedical grants issued by federal and private funding agencies.

4. Feasibility of Concrete Recommendations for NIH Action
In May 2011 the research necessity of a transdisciplinary approach to data-enabled life sciences led to the proposal of the Data-Enabled Life Sciences Alliance (Kolker et al., 2012). Through the cooperative efforts of diverse domain members and generous funding support from the Moore Foundation, the alliance members have had the opportunity to catalyze grant proposals that came about through the connections that DELSA provides. These grant proposals were just the start of the synergies that are forming through DELSA, and show that action can lead to results. DES challenges cannot be met with a one-step solution. However, as a starting step for data access improvements, it is feasible and necessary to survey scientists and to develop multiple distributed data and metadata repositories based on the determined needs. It is feasible to develop a community-wide effort to catalog and monitor core data resources; a wiki-style approach may be effective. An initial step for improved analysis access and deployment is to develop an Analysis Tool Shop for simplified, standardized, and documented access to analysis tools (starting with alignment, clustering, and R tools). We can leverage and curate existing collections to make effective use of resources and eliminate duplication of effort. A second yet equally important step is to provide a support team to maintain and troubleshoot these tools. An active, community-driven Shop will be the best approach.

III. Standards Development

1. Data Standards, Reference Sets and Algorithms to Reduce the Storage of Redundant Data
Research activity in bioinformatics is often characterized by huge amounts of highly heterogeneous data, dispersed across a large number of sources. This information should be explorable by queries and analyses spanning a variety of biological data sources, allowing a biologist to gather relevant information, to formulate new hypotheses and, possibly, to validate them. All too often it is simply not possible to cross-reference the datasets, whether through mismatched identifiers, incompatible formats, or lack of technology capable of handling the immense datasets of interest.
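As a toy illustration of the identifier-mismatch problem just described (the identifiers, values, and mapping table below are entirely invented for the example):

    # Two datasets describe the same genes under different identifier
    # schemes; without a mapping table they cannot be cross-referenced.
    expression = {"ENSG000001": 4.2, "ENSG000002": 0.7}      # accession-style IDs
    annotation = {"GeneA": "kinase", "GeneB": "receptor"}    # gene symbols

    id_map = {"ENSG000001": "GeneA", "ENSG000002": "GeneB"}  # curated mapping

    # A "complex join across databases" reduces to a lookup once the
    # mapping exists; building and maintaining id_map is the hard part.
    joined = {sym: (expression[eid], annotation[sym])
              for eid, sym in id_map.items() if sym in annotation}
    print(joined)   # {'GeneA': (4.2, 'kinase'), 'GeneB': (0.7, 'receptor')}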
It is ironic and lamentable that a few internet queries will find a long-lost high school friend or an out-of-print book, yet the data that may contain crucial pieces of a cure are buried in an unusable format on a dusty hard drive in a basement. Yet it is where one approach meets another, where one experiment sheds a glimmer of insight on some previously unremarkable result, that progress can be made. There is a need for better: i) schema integration, ii) schema mappings to navigate from one data source to another, iii) complex joins across databases, iv) support for provenance data, and v) flexible resource discovery facilitated by a richer metadata registry. The first four requirements map directly to data integration functionalities. The remaining item reflects implicit needs for better metadata that will facilitate the selection and location of distributed data resources. Current information system technology may provide some methods and tools to cope with this challenge. However, new and more effective techniques to explore a set of loosely integrated data sources have to be devised.

IV. Data Accessibility

1. Central Repository of Research Data Appendices Linked to PubMed Publications and RePORTER Project Record
Highly accessible publication and project data would streamline the process of research. When thinking of ways to approach a research problem, it is extremely important that there be easy access to what has already been done, in order to avoid duplication, learn from others' successes or failures, and have the broadest picture possible of the field of interest. This suggestion would allow that, and would save effort and money by helping researchers stay connected with efforts related to the area in which they work. A flexible approach to proactive management is a federated network of partnerships that pulls together expertise and resources regardless of physical location. A successful example might be the Library of Congress, which has built a distributed network of partnerships to overcome challenges and take advantage of new opportunities and emerging technology (NDIIPP, 2010).

V. Incentives for Data Sharing

1. Standards and Practices for Acknowledging the Use of Data in Publications
Two broad standards and practices are recommended: incentivize data sharing, and revise data deposition policy to facilitate the sharing process and its tracking. At the current time there is no clear benefit (tenure, merit metrics) to incentivize a researcher to share data. Policies that support new indicators (e.g., bibliometric measures other than first- or senior-authored publications) of individual contributions to collective work need to be developed. Further, the federal funding data deposition policy, although requiring data deposition as part of publication, does not yet have a method to track the use of the dataset, nor a dedicated resource for sustaining access to the dataset after deposition. A system for dataset tracking and acknowledgement, along with inclusion of metadata and provenance, is needed. Such a system would give researchers a rich resource to evaluate for extant datasets BEFORE starting experiments of their own, therefore avoiding duplication of efforts and wasted research resources (money and time).

2. "Academic Royalties" for Data Sharing (e.g., special consideration during grant review)
Developing an infrastructure that embraces sharing will enable new discovery through collective innovation.
Yet credit toward tenure or funding must be given for development of tools and data sets that have value to the community, and resources must be in place to support sharing of those data sets and tools. In the presence of scarce resources, researchers by default compete with one another for those resources. Without appropriate incentives, researchers will, of necessity, limit sharing due to fear of being "scooped." Even if they do not intentionally limit data sharing, they may lack the time and resources to clean, format, and host the dataset in a useful manner. Taken together, the reward systems in scientific practice should be appropriately modified for data-enabled science to move forward with genuine contributions both from individuals and from transdisciplinary networks. The service component that is so essential to amassing large data sets should be recognized with appropriate reward or incentive mechanisms - e.g., why not establish postdoc, graduate student, or technician awards to recognize such service work that furthers creative analysis? Funding needs to be awarded, at least in part, based on the merit of proposed joint efforts rather than the number of citations a principal investigator may have. Transdisciplinary research needs to have a clearly established home, rather than exist as an oddity born out of the need for knowledge and expertise from many avenues.

VI. Support Needs

1. Analytical and Computational Workforce Growth
As the need for multidisciplinary teams grows, it has become obvious that the education, funding, and career development environment of science must adapt in order to attract and retain the best researchers in the data-enabled approaches. Young researchers need more training in the possibilities and potential of open-source collaboration and collective innovation approaches (Ozdemir et al., 2011; Kolker et al., 2012). These new approaches hold great promise in enabling scientists to work together, but they require a shift in mindset from the one-scientist, one-project approach so frequently taught. In addition, young researchers must be shown that there are strong career trajectories that can involve large-scale data projects and collaborative teams. Future cyberinfrastructure and training must enable the groups both to educate and to be educated about the high-performance computing resources available. Scientist training must be updated to include expanded instruction in computer science, statistics, and collaborative research. Educational portals should be developed to teach the lay community about the resources, and to allow them to use the resources in a secure and understandable environment. Researchers from all fields would be able to take full advantage of data repositories, applications, and high-speed computing resources without geographic restriction, and in turn they would add back to the current data repositories without difficulty. Bioinformaticians will be able to develop standardized tools and make them widely accessible, thus adding immensely to the rate of scientific progress. Software and hardware engineers will be able to develop, implement, and improve a system that is standardized and thus easily understood and manipulated. A nationwide integrated system with a common base user interface (UI) structure would allow people to learn new applications by recognizing interface mechanisms used in previous applications.
This future system must: 1) allow access regardless of geographic location; 2) be flexible to allow innovation, change, and growth as needed; 3) foster integration and collaboration across agencies, offices, and resources; and 4) encompass not only software but user/developer communities.

2. Funding for Tool Development, Maintenance and Support, and Algorithm Development
Tools and support for large datasets and general cyberinfrastructure need government impetus. A hallmark of data-enabled science is that it amasses data that are a public good (i.e., it creates a "commons") and that can be further creatively mined for various applications in different sectors. In these times of severe budget cuts, a data access solution would provide added value for every funding dollar, as data collected in one lab can be used by many others (Hather et al., 2010; Kolker, 2010). Increased efficiencies in data usage would lead to more effective use of funding resources and to cost savings that could in turn be applied to the cost of sustainable tools and dataset support. Most research funding agencies maintain a short-term approach that may or may not sustain the translation of data-enabled science into tangible value-added products. It is imperative that support be envisioned to allow access to datasets and tools after the initial grant has ended; what is needed is a long-term science policy vision.

VII. New Topics

1. Science Communication and Representation in Society
Representation in society is particularly important with regard to society's view of us and our work. Scientists as a whole face a continuing challenge to make science accessible, understandable and, therefore, non-threatening. This is not a new issue, nor is it particular to data-enabled science; rather, all the sciences need positive representation so that society in turn will know their value and support them. Certain efforts within the life sciences that involve laypersons, such as volunteer clinical trials, can contribute to data-enabled science. As data security is an increasingly important topic, we need to make sure that laypersons feel safe contributing to data-gathering efforts and feel that it is valuable to contribute. Many want to track their data and get results back from studies, and we need to do our best to enable that involvement without compromising privacy. A new and intriguing initiative, "That's My Data!", run by Dr. Stephen Friend and Ms. Sharon Terry, is working to establish a process in which interested volunteers can donate samples and allow the data from the samples to be used in exchange for open access to the data (Marcus, 2011). Citizen science, along with social network venues, may also be an avenue for societal involvement. Tools such as Facebook and Wikipedia, and social media such as podcasts, may be methods to put a face on science. Wikipedia is perhaps one of the most astounding examples of societal involvement, and is a tool that is heavily used by the public. There are five main constituencies that must communicate, both within and between themselves: 1. Scientists & Researchers; 2. Funders & Policymakers; 3. Students; 4. General Public; 5. Industry. Education is the key to communication. The more the individual groups learn about each other, the larger the shared vocabulary. This shared language will in turn lead to clearer exchanges of ideas and a higher comfort level as the unfathomable becomes the understandable.
2. Sustainable Systems
The data issues we are grappling with are only the start of the challenges we face. The life sciences continue to evolve in ways we cannot predict, yet we can be sure they will involve ever more intricate and sizeable datasets, and ever more fascinating partnerships for analysis. For this, we need to think beyond our current challenges to make sure we build for the future. It is necessary to plan with an open mind and to realize that new data types, new interconnections, and new technologies will continue to surface. The past 20 years have seen an immense change from a lab book on every desk to a computer on every desk; the next 20 years may be equally revolutionary. Although we are envisioning a sustainable network of communities and expertise, with a global access portal to data and analysis tools for every level of scientist from large center to citizen, we may find that the reality of what we envision is no longer the best fit for the challenges of the future. For that eventuality, DELSA has the experienced entrepreneur and the seasoned businessman-scientist to advise on flexibility and adaptability within a changing landscape.

VIII. References
Hather G, Haynes W, Higdon R, et al. The United States of America and scientific research. PLoS One. 5(8): e12203, 2010.
Kolker E. A vision for 21st century U.S. policy to support sustainable advancement of scientific discovery and technological innovation. OMICS, 14(4): 333-5, 2010.
Kolker E. Special issue on data-intensive science. OMICS, 15(4): 197-8, 2011.
Kolker E, Stewart E, Ozdemir V. Opportunities and challenges for the life sciences community. OMICS, 16(3): 138-147, 2012.
Marcus AD. Citizen Scientists (2011). The Wall Street Journal: The Saturday Essay. December 3.
Ozdemir V, Rosenblatt DS, Warnich L, et al. Towards an ecology of collective innovation: Human Variome Project (HVP), Rare Disease Consortium for Autosomal Loci (RaDiCAL) and Data-Enabled Life Sciences Alliance (DELSA). Curr Pharmacogenomics Person Med. 9(4): 243-251, 2011.
Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program [NDIIPP] 2010 Report.
03/12/2012 at 08:21:13 PM Organization Neuroscience Information Framework La Jolla, CA While many responses and recent colorful editorials bemoan the deluge of data, the tidal wave of data, and looming data problems, we would like to take a less apocalyptic approach to our data challenges in this response. The NIF (Neuroscience Information Framework; http://neuinfo.org) and the group that runs it have been successfully dealing with large data, distributed query, and distributed storage architectures for over 15 years, and we can tell the committee that this challenge is tractable and should not require an investment equal to the collective output of the world's economy. With a few notable exceptions, scientists have not had the privilege of looking at and comparing primary data directly and in perpetuity until very recently, and yet many very interesting and useful things have been created over the last 200 years. The question of whether open and shared data will improve the intellectual output of science is a good one, and we appreciate the opportunity to weigh in on the issue.

In considering the scope of data challenges, we would like to consider the types of data first, to clarify subsequent points of this response. Data and information can be broken down into a few major categories; each should be considered data, yet each has a different data storage requirement. The so-called raw data are comprised largely of codes, numbers, images, movies, and/or raw instrument output. This layer can be storage-intensive because a single image stack from one of the newer microscopes can be as large as a terabyte. The next layer of data is the transformed, normalized, analyzed data that goes into the figures of the paper. This sort of data is typically less storage-intensive because it tends to be governed more or less by what journals will accept in terms of file size, and it may be a good place to start thinking about data storage. The last layer of data tends to be what is said about the data, which is all that science has ever really known.

The institutional libraries have taken the long view and have saved the intellectual output of research (the published paper) in very neatly bound journals in various library basements. However, as these basements fill with the intellectual output of institutions, it should be noted that digitization of this output is continuing, not only by institutional libraries but also by for-profit companies, including Google, which see that significant value may be derived from old data. Indeed, digitized data can be searched with unprecedented efficiency, and one of the first and perhaps most important policy recommendations is to continue to digitize published work and to grant search systems the same access to the data as is enjoyed by humans. PubMed Central is becoming a vast repository of open access publications, which means that humans can ask for things that would be stated in the methods sections of papers, not just the abstracts, such as whether male or female subjects were used in an experiment (http://blog.neuinfo.org/wp-content/uploads/2012/02/The-Representation-of-Gender-in-NIFs-Data-Holdings3.pdf). But the subset of publications that is open access for general text mining (deploying automated search agents on the contents) remains quite limited. Most publishers, even non-profit society publishers, have strong legal language that prevents automated agents from reading the papers, thereby restricting the flow and mining of publicly funded information. Preventing the flow of this information reduces its liquidity and stops work that could potentially be useful. Changing this practice of locking away public statements about data should be the first, and possibly most critical, issue to address.

It is wonderful that NIH, as an outgrowth of the US taxpayer, is considering the value of data and metadata, and not just the published statements about the data, as important artifacts worthy of preservation. The value of data is not likely to be calculable at this point, for several reasons. First, the liquidity and availability of data are still quite low. When data are locked away in supplementary figures or on researchers' web sites, they are not optimally discoverable and therefore not optimally reusable. Second, we simply do not know how much of the data can be reused and what benefits we may draw from them; however, we do know how much it costs to recreate a study (Turner et al., 2011, NIDDK paper). In clinical trials, an individual study can cost tens of millions of dollars, and it is also possible that the clinical trial will exhaust a cohort of subjects, especially in rare genetic diseases such as spinal muscular atrophy, making replication of a study impossible. Animal work tends to be much cheaper, but there is a significant cost in reagents, researchers' time, and the lives of animals. Furthermore, extinction, whether real or merely of the species from the laboratory, makes some data exceedingly valuable even when the techniques are no longer the latest and highest resolution available. These considerations should augment all discussion of cost associated with the storage of research data.

It is very true that not all data are equally valuable, but we find ourselves in an unprecedented time where saving not just the most valuable data but most of the raw data is possible and relatively inexpensive. Therefore, it should be feasible to create some policies for data sharing, data retirement, and data archiving. The computer scientists may argue that data can be placed into a phased retirement program that would keep the newest data in the most immediately accessible storage, while older data would move slowly to remote storage. This solution would allow retirement schedules to be adjusted based on how often the data are accessed by the community.
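A minimal sketch of such a phased retirement policy, with invented tier names and thresholds (the cutoffs would in practice be tuned to real access patterns and storage costs):

    # Hypothetical tiering rule: new or frequently accessed data stays on
    # fast storage; old, rarely accessed data migrates toward the archive.
    def storage_tier(age_days, accesses_per_year):
        if age_days < 365 or accesses_per_year > 50:
            return "online (fast disk)"
        if age_days < 5 * 365 or accesses_per_year > 5:
            return "nearline (slower disk)"
        return "archive (tape or remote storage)"

    print(storage_tier(age_days=30,   accesses_per_year=100))  # online
    print(storage_tier(age_days=2000, accesses_per_year=1))    # archive

The access-frequency term is what implements the "augmentation" mentioned above: a heavily reused dataset stays on fast storage regardless of its age.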

The practical approach that NIF has taken to address this problem of storing data has been data triage: as data come in, they are assigned to one of a few categories and treated appropriately. The data that are important to understanding a particular paper should be described with all critical methods and assigned an identifier that can uniquely identify the data, so that people reading the paper some years later can be appropriately directed to the place on the web where the data reside, even when the researcher moves from one university to another. The NIF maintains a registry of web-accessible output of government-funded activity, and this currently holds catalog entries for over 4700 items, including over 2000 databases (http://neuinfo.org/nif/nifgwt.html?query=*&tab=registry). These entries are automatically checked for changes, and those that have been significantly changed are flagged for review. Indeed, the process of adding a dataset to the NIF or another such project takes up to about 30 minutes initially and requires little curator time. NIF and other projects such as Dryad also provide a shared space where these data can be stored and downloaded, but that is the extent of the treatment of data unless the data are of general use to the scientific community. Data that are seen as being of general use, such as a large microarray data set or a confocal stack of microscopic images, are directed for upload to a database for that type of data, where they can be accessed using custom tools such as viewers or easily re-analyzed. These community databases usually assign identifiers to the data set that can be used in the paper; they structure the data and allow for annotations specific to the type of data (for example, Gene Ontology annotations for genetic data). The additional benefit of supplying these data to a domain database is that they can be compared to other experimental results of the same type and can be re-analyzed more easily. There are currently 312 "data storage repositories" listed in the NIF registry, which are databases that accept data of various sorts. We feel that any system that is an outgrowth of the NIH should take advantage of the vast resources that various NIH and NSF institutes have already invested in the creation of these community databases.

It should be noted that even though there are over 2000 databases, searching across these may not be as time-consuming as we may imagine. Certainly, going to 2000 websites and asking all of them whether they have information about a specific gene is not practical; however, asking all community databases to make their data available in an open format will allow the development of tools on top of these data that do the equivalent of a simultaneous search. Although we are nowhere near 2000, over the last four years, with two curators and several developers, we have created a search system that currently accesses over 150 scientific databases simultaneously at sub-second time scales (http://neuinfo.org/nif/nifgwt.html?query=*).
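A sketch of how such a simultaneous search can work: query every registered source concurrently and merge whatever arrives before a deadline. The endpoints and the query function below are placeholders, not the actual NIF implementation:

    from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError

    # Placeholder for a real per-source query (an HTTP request, SQL, etc.).
    def query_source(source, term):
        # ... a network call to the source's search API would go here ...
        return {"source": source, "term": term, "hits": []}

    def federated_search(sources, term, deadline=1.0):
        """Query all sources in parallel and merge whatever arrives in time."""
        results = []
        with ThreadPoolExecutor(max_workers=32) as pool:
            futures = [pool.submit(query_source, s, term) for s in sources]
            try:
                for future in as_completed(futures, timeout=deadline):
                    try:
                        results.append(future.result())
                    except Exception:
                        pass   # one failing source should not block the search
            except TimeoutError:
                pass           # return the partial result set at the deadline
        return results

    hits = federated_search(["db%d" % i for i in range(150)], "Shank3")

The design point is that slow or broken sources degrade the result set rather than the response time, which is what makes sub-second search across a hundred-plus databases plausible.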

The development of standards has been incredibly useful in many areas of human interaction because it creates a playing field with set rules, but where the goal of the activity is innovation, standards should not be taken too far. It is not the case that all things about research can be standardized, because at that point the activity would no longer be research. Therefore, adherence to standards should be rational: data that are of use to a paper should be described in sufficient detail to understand the claims in the paper, whereas data that are for general use should be standardized to a greater degree. For the first case, the NIF, participating in the LAMHDI (linking animal models to human disease) working group, has reached a consensus that a small set of parameters about all biological experiments can be expected to be added to all publications and should be generally acceptable to the scientific community. These standards for all published work should not be burdensome to researchers, and they should be immediately useful. For example, in each study the researchers should report the catalog numbers for all reagents and tools (note that in the Journal of Neuroscience fewer than 10% of papers include the catalog numbers of the antibodies used), and the subject of the study should be clearly specified. We believe that asking researchers to do these simple things should dramatically improve the ability to reproduce data. It may be possible to add these requirements to all PubMed Central submissions and thereby drive the addition of this critical information to all publications. This would allow questions such as "who else used this antibody in a mouse brain preparation?" to be asked.
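The small parameter set described above could be as simple as a structured record attached to each paper. The field list below is illustrative only; it is not the LAMHDI consensus itself, and the vendor and catalog number are invented:

    # Illustrative minimal methods metadata for one published experiment.
    minimal_methods = {
        "subject": {"species": "Mus musculus", "strain": "C57BL/6J",
                    "sex": "male", "age": "P60"},
        "reagents": [
            {"type": "antibody", "target": "GFAP",
             "vendor": "ExampleCo", "catalog_number": "AB-0000"},
        ],
        "instruments": [
            {"type": "confocal microscope", "model": "ExampleScope 9000"},
        ],
    }
    # With records like this attached to papers, the query "who else used
    # this antibody in a mouse brain preparation?" becomes a simple filter
    # over (catalog_number, species) pairs.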

While the data analyzed in a paper may not always be citable in that paper, depending on the policy of the journal, analyses of the data performed subsequent to publication should be linked back to the publication. The NCBI has a mechanism called LinkOut for linking data to PubMed publications, and the NIF has thus far contributed over 170,000 links to publications via a service known as the LinkOut Broker. These links between data and publications can be made from any element in a NIF database, such as the Brede Database, to any PubMed paper. For an example, please see the following article's LinkOut section, which displays links to individual activation foci mapped to a brain map in the Brede database: http://www.ncbi.nlm.nih.gov/pubmed/15812326.
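In spirit, such a contribution is a batch of (database record, PubMed ID, URL) triples. The sketch below is schematic only: actual LinkOut submissions are XML files whose exact schema we do not reproduce here, and the record identifier and base URL are invented:

    # Schematic LinkOut-broker-style record: links a PubMed ID to a data
    # record in a member database. Field names are illustrative.
    def make_link(pmid, database, record_id, base_url):
        return {
            "pubmed_id": pmid,
            "provider": "NIF LinkOut Broker",
            "database": database,
            "target": "%s?record=%s" % (base_url, record_id),
        }

    link = make_link(15812326, "Brede Database", "focus_0001",
                     "http://example.org/brede")
    # Generating 170,000 such records is then a loop over the curated
    # database-to-publication mappings.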

The concrete example above, with an explicit link between data and a publication, is one way that data can be linked to publications, and publications can in turn be linked to RePORTER data. Indeed, there has been a significant effort to have researchers specify grant numbers in all publications. Because each journal has a different set of submission policies and all communities are a little different, these efforts of asking researchers to specify grant numbers will likely always be somewhat incomplete; however, there are statistical methods to link data and publications to the grants from which they were likely derived. We will not get into the details of the probabilistic plausible retrieval model here, but resources such as VIVO and researchers' web pages at their universities can indeed be used to predict links between data, articles, and grants.
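A toy version of such a statistical linking approach, scoring each candidate grant-publication pair on several weak signals and keeping high-scoring pairs; the signals, weights, and example records are invented for illustration and are far simpler than a real retrieval model:

    # Toy scorer for linking a publication to the grant it likely came from.
    def link_score(pub, grant):
        score = 0.0
        pub_authors = {a.lower() for a in pub["authors"]}
        if grant["pi"].lower() in pub_authors:
            score += 0.5                      # PI appears as an author
        if grant["institution"] == pub["affiliation"]:
            score += 0.2                      # same institution
        shared = set(pub["keywords"]) & set(grant["keywords"])
        score += 0.3 * len(shared) / max(len(grant["keywords"]), 1)
        return score                          # in [0, 1]

    pub = {"authors": ["Smith J", "Lee K"], "affiliation": "UCSD",
           "keywords": ["connectivity", "fMRI"]}
    grant = {"pi": "Smith J", "institution": "UCSD",
             "keywords": ["fMRI", "cortex"]}
    print(link_score(pub, grant))   # 0.85

A real system would calibrate such scores against known grant-publication pairs and report only links above a validated threshold.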

The questions of incentives for data sharing and of appropriately linking to data have been discussed in great depth at various venues, including the Beyond the PDF meeting in February of 2011. However, as complicated as all the solutions could be, in almost all cases the researchers seem to agree that when the NIH requires this, they will absolutely do it. We understand that the NIH is not a regulatory body, but data sharing statements are being required for all grants over $500,000. These statements from previous grants are not available to reviewers of subsequent grants, so there is no way to evaluate whether or not they were adhered to. It would potentially be a relatively simple change to allow study sections to have access to this information in evaluating funding.

The costs of data sharing are not simply those of storing data in a cloud infrastructure. Maintenance of data, a system for searching the data, and curators for linking data to publications are all important costs associated with data. However, the NIH and NSF have already invested significant amounts of money in existing systems, and it would be a mistake to fund the creation of an entirely new system or systems. Community databases that are already funded should be able to absorb a great deal of data. However, some community databases lose funding, and any data that have been gathered can therefore be at risk. It is important, then, to consider a 'back stop' system that can take and preserve the data at a tiny fraction of the cost of maintaining the community database, web portal, and associated system. We would like the committee to consider the substantial investment in NIF as a mechanism that is capable of taking data from any database, creating a local copy, and maintaining the data in perpetuity. We cannot save all the software applications, but we can make the data available through our general interface, so at least they remain accessible.

We believe that in this time of rapidly changing technologies, advances in search and data mining will eventually be able to extract meaningful signals from raw data. In the meantime, we should make sure that all data are potentially available for reuse as reasonably as possible. We believe that through projects like NIF, eagle-I, and myriad others, the basic strategies and infrastructure are within reach.

The issues touched upon in the above comment, in order of importance, are as follows: 1. Open access publishing: specifically, the fact that access through institutional libraries and PubMed Central is guaranteed only to humans and not to automated agents prevents us from reaping rewards from older research. 2. Authors are very inconsistent in describing methods, and asking them to add catalog numbers to all papers will make replicating studies a significantly simpler task. 3. There has been a substantial investment in infrastructure, both for research communities and for search across these. Please consider this investment in shared infrastructure as an asset that should be built onto and not replicated. 4. Compliance with data sharing statements will not be achieved without review committees being able to read and evaluate previous data sharing statements. -see comment 2-
03/12/2012 at 09:04:04 PM Self     While there are many challenges in leveraging the true value of research data, we think the greatest risk is inaction. We currently tolerate a high level of wasted investment in data from Federally Funded Scientific Research that cannot be verified or reused, and are paying tangible opportunity costs as a result.

Leaving behind us what has been called the "digital dark age" (Kuny 1998), as it applies to research outputs, should be one of the top priorities for US science policy.

Please see the attachment for our full response.

Please see the attachment. Please see the attachment. Response based heavily on: Vision TJ, Piwowar HA (2011) Next steps for digital data from federally funded research. http://bit.ly/sPyZrz (available for redistribution and reuse under the terms of a CC-BY 3.0 License)

While there are many challenges in leveraging the true value of research data, we think the greatest risk is inaction. We currently tolerate a high level of wasted investment in data from Federally Funded Scientific Research (FFSR) that cannot be verified or reused, and are paying tangible opportunity costs as a result. Leaving behind us what has been called the "digital dark age" (Kuny 1998), as it applies to research outputs, should be one of the top priorities for US science policy. We are responding as individual scientists, though we are affiliated with data archiving initiatives in the biosciences, namely the Dryad Digital Repository (http://datadryad.org) and the Data Observation Network for Earth (DataONE, http://dataone.org), and have been active for a number of years in both research and implementation aspects of data archiving. Our response is primarily concerned with basic research data, which is where we have relevant experience.

Recommended Policy

Data archiving mandate. Our first recommendation is based on the experience of Dryad, which grew out of a grassroots initiative among a number of biology journal editors to craft a Joint Data Archiving Policy (JDAP) and to ensure that a suitable repository infrastructure existed to support the specifics of the policy before it came into effect (Moore et al. 2010). Thus, a policy mandate acceptable to the community came first, and the repository infrastructure to support that specific policy was developed as a result. We recommend a similar approach at the federal agency level: first applying a strong and common data archiving policy for data arising from basic research funding investments across all agencies, and promoting the development of technical solutions to support the policy as a second step. We offer a template for such a policy as it pertains to data associated with publications: "This agency requires, as a condition for funding, that data supporting the results in research publications must be archived in an approved public repository. Grantees may elect to have data publicly available either prior to or at the time of publication, or may opt to embargo access to the data for a period up to a year after publication. Exceptions to this policy may be granted if justified in the Data Management Plan for data that meet certain exemption criteria [to be enumerated for each agency or program]." We feel that a policy of this nature is broadly applicable across agencies and disciplines, which are free to make specific guidelines regarding suitable repositories, what qualifies as an exemption criterion, and the presence/length of the embargo period.

Timely Archiving. It is important that data be archived in a trusted repository at the time the research concludes (in the case of JDAP, at the time of publication), rather than shared upon request after the fact. Multiple studies have shown that disseminating data upon request does not work: researchers or data cannot be found; investigators share data selectively with certain colleagues, impose unreasonable conditions on reuse, and are more likely to decline requests when there are quality issues with the data or analysis in question (Campbell 2000, Wicherts et al. 2011).
Repositories can offer limited-term embargoes on data release (as discussed above) in order to protect researchers from competitive pressure where this is deemed appropriate. It is important that embargoes not be longer than necessary. We have observed that investigators publish almost all associated papers within two years of archiving their data; in contrast, published reuses of data by third-party investigators continue to accumulate for years beyond that timeframe (Piwowar 2011c).

Repository Oversight. To ensure the responsible stewardship of public assets, federal agencies should coordinate policy regarding certification of trusted repositories. This would help ensure that repositories meet agency expectations for preservation processes, metadata standards, governance, financial sustainability, and so on. One lightweight model for such certification is the Data Seal of Approval (http://www.datasealofapproval.org/).

Peer Review for Data. For data associated with publications, an increasing number of journals require that data be made available at the time of peer review (for instance, those published by the Public Library of Science). This is a useful model for funders to promote, in that it enlists expert reviewers and editors in ensuring data availability and re-usability. Funders (and research institutions) pay considerable sums for the service of peer review provided by publishers, and so have the right to expect a high level of service from it. The capacity to support secure, anonymous access of peer reviewers to data may be included among the expectations for trusted repositories of publication-related data.

Recognizing the Scientific Impact of Data. The research community needs to be confident that publicly archived datasets are valued as first-class scholarly objects by funders and grant reviewers. Specifically, producing a highly valued dataset should contribute more to success in obtaining future funding than producing an insignificant article. We recommend the following: 1. Federal agencies should explicitly encourage the inclusion of publicly archived datasets in the credentials of grant applicants. As an example of current practice, instructions for NSF biosketches mention only that "patents, copyrights, and software systems developed may be substituted for publications" and that Synergistic Activities may include the "development of databases to support research and education" [GPG Chapter II]. These guidelines inadvertently imply that datasets are not scholarly products of value. 2. Agencies should systematically collect information on the datasets that have been produced by each grant through the annual and final report mechanisms. 3. Funding agencies should work to promote the infrastructure needed to support impact tracking of datasets (see Question 7). For instance, funders may require the assignment of DataCite IDs (http://datacite.org) as part of the certification criteria for a trusted data repository.

Take Data Management Plans Seriously. We recommend that all federal agencies ensure that data management plans are rigorously reviewed during the evaluation of grant proposals, and that grant budgets include funds for the execution of the plan. Following the lead of funders such as the Wellcome Trust and the UK Research Councils, US federal agencies should issue a common statement that the costs of curation, preservation, and access for research data are integral to the costs of doing research, and thus must be explicitly budgeted.

Filling Repository Gaps.
Many disciplines lack appropriate repositories for data, code, mathematical models, and other digital research outputs. Research funders should provide seed funding for such infrastructure. Funders should ensure that new infrastructure efforts are not chosen on the basis of technical innovation alone, but have the capacity to be trustworthy stewards of public assets.

Research For More Effective Research. The effectiveness of data policy and infrastructure must be systematically monitored so that future decisions may be informed by evidence. Federal agencies should issue specific solicitations to researchers to collect the relevant, actionable evidence they need to make such decisions.

IP. We agree with a recent response from Cameron Neylon: intellectual property is to be treated as a means of incentivizing investments in research rather than as an end in itself (Neylon 2012). FFSR may at times require access to data that is confidential for legitimate commercial reasons (being trade secrets or of relevance to an undisclosed patent), but agencies supporting basic science should not fund the original acquisition of such data, except - under current law - as it relates to an invention under the terms of the Bayh-Dole Act. Other data are kept confidential for reasons of national security, protection of personal privacy, or protection of sensitive assets (endangered species, cultural artifacts), and may legitimately be produced with federal funds. Protecting the confidentiality of data for commercial exploitation should require significant value-added investment. This is consistent with the position of the International Association of Scientific, Technical and Medical Publishers in the so-called Brussels Declaration, which states that "raw research data should be made freely available to all researchers" (STM 2007). In the absence of the above reasons for confidentiality, intellectual property policy should protect the driving incentive for ongoing research, which is the availability of public funds for the conduct of science. Researchers and universities do not require further IP as an incentive to conduct FFSR, as it is already the nature of what they do. Rather, the continuation of generous public support for FFSR is endangered by policies that allow researchers, universities, publishers, or others to place unnecessary restrictions on the exploitation of outputs from public investments in research. Thus, it is in the interest of maintaining a healthy FFSR enterprise, and the corresponding commercial innovation sector that it spawns, that federal agencies ensure that restrictions are not imposed where they serve no legitimate public purpose. Furthermore, since most scientific data, being facts and not creative works, are generally not subject to copyright, a Creative Commons Zero waiver is the most suitable instrument for providing clear and nonrestrictive terms of reuse for data. (See Question 9 for a discussion of rewarding credit to data authors.) Funders should not permit restrictions on commercial use or derivative works for the outputs of FFSR, as such restrictions stifle innovation without providing incentive for research investment.

Disciplinary differences. Every discipline perceives itself to be unique.
However, it is appropriate for federal agencies to articulate strong general principles and policies with regard to the management and dissemination of research data, while allowing for discipline-specific implementations that are sensitive to inherent differences such as data volume, machine format, complexity of human curation, long-term value, the applicability of particular metadata standards, etc. In truth, many of the sociotechnical challenges in data management, standardization, and dissemination are shared across disciplines, particularly for the high-value portion comprising the "long tail" (Heidorn 2008) of "small science" (Onsrud and Campbell 2007) data associated with publications. A strong interdisciplinary 'information community' (in the parlance of the IDWGG) of data librarians, data scientists, and educators should be cultivated. Development of such a workforce should be modeled on exemplar efforts such as the NSF DataNets, the Digital Curation Centre in the UK, and the Australian National Data Service. This community is needed to help shape and support general policy and infrastructure within and among agencies, and to help spread data expertise into the educational and research communities. At the same time, grass-roots 'communities of practice', sensu IDWGG, must engage disciplinary scientists in order to determine how to implement general agency policies. Such communities would be in the best position to develop the discipline-specific standards that govern the reporting of data, as well as other research products (e.g., software code). Individual disciplines and communities may wish to opt out of general policies (e.g., data archiving). This should be permitted only where the community makes a strong public case that the principles and goals are not applicable to their area, or that the same goals may be effectively achieved in a different way. Funding agencies are the only stakeholder that can be relied upon to speak for the public interest in the dissemination of data from FFSR when it is in conflict with the short-term competitive interests of other stakeholders in the research enterprise, and taxpayers expect their government to exercise that responsibility.

Costs and curation. It is frequently impossible to accurately determine the reuse value of a dataset at the time of initial reporting. Many reuses -- indeed, perhaps the most valuable ones -- are for unanticipated applications. Furthermore, we have seen in biomedical data archives that data reuses are not confined to just a few "hot" datasets but are spread broadly among them (Piwowar et al. 2011). Best-practice data archiving is less expensive than many assume. On the basis of our projections for Dryad, the marginal cost of data publication is only a small fraction (< 2%) of the cost of scientific article publication (Beagrie et al. 2010a, Vision 2010). For Dryad, it turns out to be much less expensive to accept all the data deposited, and to hold it indefinitely, than to make decisions regarding what to ingest or remove. By comparing the number of published articles generated by a typical grant with that enabled by typical patterns of data reuse, we have found that the modest amount of funding needed to maintain a repository like Dryad is almost certain to generate a comparatively large scientific return on investment (Piwowar et al. 2011b). Curation at the time of ingest is a much more significant expense for many repositories than long-term storage (Beagrie et al. 2010a),
2010a) and much of the most valuable data (e.g., that associated with publications) is relatively small. For example, the average dataset in Dryad is less than 5MB in size. Furthermore, cost-effective models for the publication of very large datasets are emerging, such as the recently launched BMC journal GigaScience (http://www.gigasciencejournal.com). Finally, the burden of archiving on individual investigators should not be overestimated. Although new practices invariably generate anxiety, Whitlock (2010) and others have demonstrated that basic guidelines for good data archiving and reuse can be made simple and intuitive. Data management plans While many stakeholders must play a role, we wish to emphasize the crucial role of funders in monitoring the effectiveness of agency policy and individual adherence to data management policy.. Funders should recognize the depth of the need to raise awareness about expectations regarding data management. As part of an ongoing study (Piwowar 2010b), we asked corresponding authors of biology articles about their funders' policies on data archiving: 27% of the investigators responded that they didn't know if their funder had a data archiving policy (n=1500; 39% said their funder had no policy, 10% said their funder required online public archiving). At the same time, consistent with other studies, respondents overwhelmingly believed that mandatory public online data archiving is the "right thing to do." It appears that funders are missing an opportunity to reinforce the best instincts of their funded researchers. Funding A variety of funding mechanisms will be needed to provision support for data services (curation, dissemination, migration, replication, etc.), given the heterogeneity of all FFSR data. The desired model specifically for long-tail small-science data will (a) provide for some direct investments in repository research and development, (b) scale with the volume of service provided, (c) facilitate the operation of an efficient market for data services, and (d) enable investments in shared international infrastructure. Investment in repository infrastructure. There will be an ongoing need for direct investment both to support research and development needs of existing repositories and to fuel the development of new resources for datatypes or disciplines lacking existing solutions. When it is necessary for the funding model of data services to be dependent on grants, these should be evaluated based on criteria relevant to infrastructure, rather than solely innovation. Scalability. Scalability of finances for data services can be achieved by including the costs for data management within research budgets, and allowing individual awardees to direct those costs as needed for their project. Market for data services. Similarly, if funds are allocated to services on a project-by-project basis, that establishes a competitive market for data services within which those of greatest value receive the most support. International coordination. Insofar as direct funding from agencies is required for certain datatypes, the greatest challenge will be to develop mechanisms for multinationational investment in shared resources (e.g., such as that used by ELIXIR, http://www.elixireurope. org/). 
While the costs of supporting data infrastructure are tangible, funding agencies should also attempt to understand the hidden economic costs of not having infrastructure to support investments in FFSR data, so that the costs and benefits of investment can be fairly compared.

Compliance

Evaluating plans and tracking compliance are important. Evidence suggests the NIH requirement for data management plans is generally considered toothless (Tucker 2009) and has made little difference to data availability (Piwowar et al. 2010, Piwowar 2011f). Disseminating research results is the responsibility of the funded researcher. In the short term, better mechanisms for tying outputs to funding are required. As mentioned in response to Question 1, we recommend that annual reporting require researchers to list publicly available datasets derived from FFSR. Research results that have not been disseminated in accordance with policy should not be acknowledged as output of the grant for the purposes of evaluation. Federal agencies should enthusiastically collaborate with publishers, libraries, universities, and other stakeholders in promoting technological solutions that support the trackability of research data products and the reuse of those products, such as DataCite (http://datacite.org), ORCID (http://orcid.org) and VIVO (http://www.vivoweb.org). In the longer term, there is great potential in moving beyond compliance monitoring to fostering enthusiastic reporting through incentives. The impact of both traditional and non-traditional research products (articles, datasets, code, blogs, preprints, slide decks, etc.) can be collected for investigators, research groups, institutions, grants, and even whole grant programs using traditional and non-traditional metrics (citations, views, downloads, bookmarks, tweets, etc.). These statistics can then be used to demonstrate the impact of individuals and organizations during evaluations, providing an incentive for products other than only publications to be reported. We have been working, with others, on a prototype project to demonstrate this potential (http://total-impact.org); an example showing the "impact report" for one of us (HP), including download metrics for archived datasets, is shown here: http://total-impact.org/report.php?id=SIIysw

Achieving compliance through incentives is currently hampered by our closed scholarly communication infrastructure. Existing citation indexing systems do not index datasets (even when they are cited in a paper's reference list; Piwowar 2011d), do not make citation data available for innovative impact mashups, and cannot be improved through open source contributions. Barriers to text mining the scientific literature are also significant, because the context of a citation contains important information about the nature of the attribution. Future funder initiatives could help address these barriers.

Data Licenses

As we mentioned in Question 7, opening up our scholarly communication infrastructure would make the output of funding more generative - more able to produce innovation (Zittrain 2008). We recommend license terms for data or other research outputs that do not exclude commercial and derivative products; this will ensure that outputs from FFSR are available for innovative scientific applications and the creation of new business opportunities.
Specifically, nonrestrictive access to all research outputs (papers, data, code, etc.) would permit machine access, text- and data-mining, data integration, third-party curation, and other value-added services.

Complementary research outputs

We recommend that access to and preservation of software from FFSR be given the same policy attention as data. Almost all digital data is collected, and statistics are computed, through the execution of software code. Access to the code associated with a dataset increases the comprehension, reusability, and replicability of that dataset and its analysis. The accessibility of the scientific literature is also key to fully leveraging associated datasets. The most valuable piece of metadata about a dataset is the publication that describes its original collection and analysis. When this metadata is not available without restrictions on copying and reuse, it limits the reusability of that dataset.

Data Citation

Citation formatting flavors notwithstanding, the scholarly community has fairly efficient and effective norms for citing published papers. Because community norms for citing datasets have been lacking, investigators have adopted a variety of conventions for providing attribution to the authors of datasets (Weber et al. 2010, Enriquez et al. 2010, Piwowar et al. 2011e). Few stakeholders provide guidance on data citation (Weber et al. 2010); journals, unsurprisingly, are leading the way, whereas funders have provided very little guidance thus far. This diversity of citation practice makes it difficult to track data reuse. Nonetheless, even in the current chaotic environment, investigators receive benefit for archiving data. Several analyses, in diverse disciplines, have found that studies which make their data publicly available receive more citations than similar studies which keep their data private (e.g. Piwowar et al. 2007). In a survey of 1500 corresponding authors in biology, 45% of authors reported that their datasets have been used and formally cited; only 21% said their datasets had been used without citation (Piwowar 2010b).

Data standards

To help guide public investment in standards efforts, we recommend federal agencies encourage research into the economic tradeoffs inherent in standards development. Standards have benefits in ease of data reuse, but also incur costs in development, maintenance, and compliance. We need to understand better how to balance these costs and benefits.

References
- Beagrie N, Eakin-Richards L, Vision TJ (2010a) Business Models and Cost Estimation: Dryad Repository Case Study. Proceedings of the 7th International Conference on Preservation of Digital Objects (iPres), 365-370.
- Beagrie N, Lavoie B, Woollard M (2010b) Keeping Research Data Safe 2, Final Report.
- Campbell E (2000) Data withholding in academic medicine: characteristics of faculty denied access to research results and biomaterials. Research Policy 29, 303-312.
- Enriquez V, Judson S, Weber N, Allard S, Cook R, Piwowar H, Sandusky R, Vision TJ, Wilson B (2010) Data citation in the wild. Nature Precedings.
- Heidorn PB (2008) Shedding Light on the Dark Data in the Long Tail of Science. Library Trends 57(2).
- IWGDD (2009) Harnessing the Power of Digital Data for Science and Society.
- Kuny T (1998) A Digital Dark Ages? Challenges in the Preservation of Electronic Information. International Preservation News, No. 17.
- Michener W, Vieglais D, Vision TJ, Kunze J, Cruse P, Janeé G (2011) DataONE: Data Observation Network for Earth - Preserving Data and Enabling Innovation in the Biological and Environmental Sciences. D-Lib, doi:10.1045/january2011-michener.
- Moore AJ, McPeek MA, Rausher MD, Rieseberg L, Whitlock MC (2010) The need for archiving data in evolutionary biology. J Evol Biol 23, 659-60.
- National Science Foundation, Grant Proposal Guide, Chapter II.
- Neylon C (2012) Response to Request for Information - FR Doc. 2011-2862.
- Onsrud HJ, Campbell J (2007) Big Opportunities in Access to "Small Science" Data. CODATA Data Science Journal 6, OD58-OD66.
- Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308.
- Piwowar H, Chapman W (2010) Public sharing of research datasets: A pilot study of associations. Journal of Informetrics 4, 148-156.
- Piwowar H (2010b) Study on Impact of Journal Data Policies: Towards understanding the impact of journal data archiving policies on attitudes, experiences, and practices of authors. Recruitment ongoing.
- Piwowar H, Vision TJ, Whitlock M (2011b) Data archiving is a good investment. Nature 473, 285.
- Piwowar H (2011c) A New Task for NSF Reviewers: Recognizing the Value of Data Reuse. Research Remix blog.
- Piwowar H (2011d) Tracking Dataset Citations Using Common Citation Tracking Tools Doesn't Work. Research Remix blog.
- Piwowar H, Carlson J, Vision TJ (2011e) Beginning to Track 1000 Datasets from Public Repositories into the Published Literature. ASIS&T 2011.
- Piwowar H (2011f) Who shares? Who doesn't? Bibliometric factors associated with open data archiving. PLoS ONE 6(7): e18657.
- STM (2007) Brussels Declaration on STM Publishing. http://www.stm-assoc.org/public_affairs_brussels_declaration.php
- Tucker J (2009) Motivating Subjects: Data Sharing in Cancer Research. PhD Dissertation, Science and Technology Studies, Virginia Tech.
- Vision TJ (2010) Open Data and the Social Contract of Scientific Publishing. BioScience 60(5): 330-330.
- Weber N, Piwowar H, Vision TJ (2010) Evaluating Data Citation and Sharing Policies in the Earth Sciences. ASIS&T 2010.
- Whitlock M (2011) Data archiving in ecology and evolution: best practices. Trends in Ecology & Evolution 26(2): 61-65.
- Wicherts JM, Bakker M, Molenaar D (2011) Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results. PLoS ONE 6(11): e26828.
- Zittrain JL (2008) The Future of the Internet and How to Stop It. Yale University Press.
03/12/2012 at 09:14:11 PM Organization Data Preservation Alliance for the Social Sciences (Data-PASS) Ann Arbor, MI See attached comments. See attached comments. See attached comments. Submission being formatted for web posting - comments available soon
03/12/2012 at 09:46:26 PM Self University of Miami Miami FL Several limitations of current data repositories prevent researchers from leveraging these resources optimally. For example, it is difficult and time-consuming to query individual repositories across many datasets and analyze the results to support, test or develop scientific hypotheses. While it is challenging to query and analyze results from individual repositories, it is close to impossible - in the absence of significant bioinformatics capabilities - to query across different repositories and integrate and analyze those results. Shortcomings include structural, syntactic and semantic inhomogeneity, and often the lack of sufficient and standardized annotations of the data and the experiments that generated the results. Although minimum information checklists (http://mibbi.sourceforge.net) and controlled metadata terminologies/ontologies (http://bioportal.bioontology.org/) have been developed for several data types, their implementation is limited by the lack of appropriate tools that make these resources applicable to biologists. Beyond standardized data formats, minimum information, and controlled metadata terminologies, shared semantics are required to link diverse data, to interpret and search the data, and to use the results for learning. While many domain-specific biological and biomedical ontologies already exist or are under development, they are lacking in the area of chemical biology screening. In many cases existing ontologies have not yet adopted the W3C-recommended OWL representation or are insufficiently descriptive. Perhaps the most critical limitation is the usability of these ontologies by typical biologists. Tools that enable biologists to apply minimum information, metadata terminologies and ontologies to annotate their experiments and data are required. During the last two years (via an NIH-funded project) we have been developing the BioAssay Ontology (BAO, http://bioassayontology.org/) to describe and formalize chemical biology screening experiments and results. We have applied this ontology to annotate data from a public repository (PubChem) and developed software, BAOSearch, to query and explore these data leveraging the ontology (http://baosearch.ccs.miami.edu). For this project we have successfully brought together screening biologists, bioinformatics experts, computer scientists, and software engineers, and we found this approach very successful. Our group is now working with the MLPCN, PubChem, and also the EU OpenScreen project to incorporate and evolve BAO as a standard to describe these bioassays and screening results. However, establishing a community standard relies on the availability of funding to maintain and evolve the ontology, map it to other domain ontologies (for example in the NCBO BioPortal), and align it with mid-level and upper-level ontologies to enable broad semantic integration across diverse data types. A long-term goal of NIH should be a distributed data management infrastructure that enables researchers to query across several repositories that hold diverse data types. Such a capability would likely potentiate the value of each individual repository and enable data-driven science.
A truly distributed data management infrastructure requires the development and implementation of rigorous data exchange standards, reasonable minimum information, appropriate metadata standards, standardized data formats, and, importantly, a common semantic framework to formally describe the various data types and how they are related. While the vision of the Semantic Web is not new, several technological challenges have to be overcome (such as distributed reasoning and triple stores that can handle billions of records), and enabling tools that can be used by chemists and biologists are a critical requirement. Such an infrastructure has the potential to maximize NIH's investment in the numerous data production projects. NIH should support the implementation and maintenance of standards and descriptive ontologies that prove valuable in overcoming current limitations of data repositories and that find adoption in the community, and should work towards the long-term goal of a distributed information management infrastructure. Importantly, in order to be usable by biologists and chemists, appropriate tools need to be developed, and ontologies need to be modular and focused, with a clear scope, application, and use cases. Their applicability should be demonstrated by software that can be used by scientists without extensive informatics training. Ontologies and derived tools must make use of W3C recommendations (OWL, RDF, RDFS, SPARQL) to be compatible with existing and increasingly robust Semantic Web technologies. Domain ontologies should be integrated via mapping to other domain-level ontologies and alignment to mid-level and upper-level ontologies to provide an integrated semantic framework. Care must be taken to assure the intended (logical) interpretation and satisfiability of the resulting framework, to enable data integration and reasoning to infer new (implicit) knowledge using Semantic Web technologies. In our opinion, a multidisciplinary team is required for such projects to be successful. Another important consideration should be the coordinated support of these developments between the United States and Europe, and also Asia, where the same standards are needed. For example, BAO is also the leading candidate for the EU OpenScreen project (http://www.bioontology.ch/bioassay-ontology-for-eu-openscreen). The ISA metadata tracking tools project (http://isatab.sourceforge.net/index.html) is an example where coordinated funding for tools development could make sense. NIH should also support outreach and training activities to firmly establish standards. In data sharing policies, NIH should consider requiring rigorous implementation of standards in data repositories and requiring investigators to follow established standards when publishing their data to a repository. NIH should also consider rigorous implementation of semantics to formalize metadata and experimental results, with the goal of a distributed information management infrastructure similar to the vision of the Semantic Web. However, rigorous annotation of data should not be an extra burden on researchers. User-friendly tools that leverage ontologies and implement minimum information and standard formats are required, and their development and implementation should be supported.
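
To make the ontology-driven querying described above concrete, here is a minimal sketch, in Python with the rdflib library, of annotating two hypothetical assay records with BAO-style terms and retrieving them via SPARQL. The namespace form, the property name hasDetectionMethod, and the records themselves are illustrative assumptions, not actual BAO terms or BAOSearch code; a production system would query a shared triple store or SPARQL endpoint rather than an in-memory graph.

# A minimal sketch (illustrative, not BAOSearch): two hypothetical assays are
# annotated with ontology-style terms, then retrieved with a SPARQL query.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/assay/")               # hypothetical data namespace
BAO = Namespace("http://www.bioassayontology.org/bao#")   # BAO-style namespace (assumed form)

g = Graph()
g.add((EX.assay1, RDF.type, BAO.BioAssay))
g.add((EX.assay1, BAO.hasDetectionMethod, Literal("luminescence")))
g.add((EX.assay2, RDF.type, BAO.BioAssay))
g.add((EX.assay2, BAO.hasDetectionMethod, Literal("fluorescence")))

# Because the annotations share one vocabulary, a single structured query can
# retrieve matching assays; the same pattern could be issued to a remote
# SPARQL endpoint for cross-repository queries.
results = g.query("""
    PREFIX bao: <http://www.bioassayontology.org/bao#>
    SELECT ?assay WHERE {
        ?assay a bao:BioAssay ;
               bao:hasDetectionMethod "luminescence" .
    }
""")
for row in results:
    print(row.assay)   # prints http://example.org/assay/assay1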
03/12/2012 at 10:51:14 PM Organization CTRC, University at Buffalo Buffalo, NY All comments are in Attachment 1; a document cited therein is attached for ease of reference.

Attachment #1: Comments by an ad-hoc Working Group under the auspices of the Buffalo Clinical and Translational Research Center, Buffalo Translational Consortium

Summary recommendations

Address issues of research data and patient data in an integrated framework. Consider support for:
- further development of standardized reference ontologies and promotion of their adoption;
- development of efficient and effective tools for automated or computer-assisted annotation of data based on textual descriptions and images;
- development and adoption of model consent forms for patients and study participants to maximize data reusability;
- integrating awareness of data issues into biomedical education.
Promote data sharing and interoperability through guidelines for NIH-funded projects and enforcement of these guidelines, and through appropriate requirements for patient data systems that receive public support. Promote a culture where contributions to research through data receive proper recognition. Promote the development of a Science of Data.

1 Scope of the challenges/issues

The question of research data cannot be separated from the question of patient data (especially EHR data), because patient data become or have the potential to become research data and, as importantly, research results should be presented in a way that facilitates rapid translation into patient care. The problems of interoperability and accessibility discussed below should be addressed in the framework of a comprehensive plan that deals with both patient data and research data; there are shared solutions across both. Interoperability, comparability, reusability, and applicability of data depend on capturing and coding all the variables needed to understand, interpret, and combine data, including, but by no means limited to, the following:
- Phenotypes
- Diseases, disorders, conditions
- Anatomical parts at all levels, including cell types and genes
- Biological functions
- Chemical substances
- Clinical laboratory procedures and protocols in biology as a unified space
- Instruments
Solving the issues of data creation, data quality, use, application, re-use, wider applicability, comparability, and interoperability, some of which are discussed in this submission, will require a science of data.

1.1 Research information lifecycle

Interoperability and reusability of data start at the source. Original data, such as physicians' patient notes in any format (spoken, handwritten or hand-drawn, typed text), should be archived. At the same time, bio-specimen descriptions, patient data, and other primary observations should be annotated with standardized coding schemes. Obtaining quality coding that is usable beyond the original purpose is difficult in both research and clinical environments. It should be supported by software that (a) interprets text and images and suggests applicable codes (the state of the art has reached a point where the development of such software is possible; NIH should consider funding a survey of all techniques suggested, including handwriting and speech recognition of medical text, as the basis for developing effective coding software); and (b) guides the user (physician, medical technician) to finding the right code(s) for findings and diagnoses.
Codes used in an annotation should be tagged by the persons and/or computer programs involved in the codes' assignment, giving for each of these entities their role (original assigner, reviewer, etc.).

1.2 Challenges/issues faced by the extramural community

Challenges can be divided into technical, legal, and social/cultural (changing attitudes, providing incentives). Challenges are addressed in detail in the later parts of this submission. We mention just two here. Biospecimen banks: ideally there would be one search to find suitable specimens in many biospecimen banks across the country. Understandability and reproducibility of computer code: since computer programs play a large role in biomedical research ‒ for data pre-processing, such as base calling in gene sequencing, data analysis (including data mining from large data sets), and modeling/simulation of organisms and biological/biochemical processes ‒ it is important that readers of papers that present results based on computer programs be able to understand and reproduce the programs used, in order to verify the results and to be able to use the technique in their own work.

1.3 Tractability with current technology

Ontologies constructed and applied correctly are at the base of solving many data problems, as is automated or computer-assisted annotation using linguistic, semantic, and machine learning techniques.

1.4 Unrealized research benefits

1.5 Feasibility of concrete recommendations for NIH action

2 Standards development

Standards development is closely linked with secondary/future use. Without standards, secondary/future use may be prohibitively difficult. While individual data sets can be used for further analyses, there is much promise for new discoveries in the combination of different data sets. There are two ways of combining data sets:
1. Combine similar data sets
1.1 to reach a larger sample size. This requires that all the data of interest in a study are coded the same way (or at least in ways that can be mapped). Even more strongly, it requires that the same protocols and measurement techniques were used in collecting data for the different data sets, or that the methods used for collecting the data in different data sets are known to give the same results.
1.2 to compare studies of the same phenomenon, particularly if they reached different results. This may lead to the discovery of variables that account for the difference in results.
2. Combine data sets to study the relationships between variables across data sets, for example the physico-chemical properties of chemical substances and the physiological functions of these substances. This is possible only if the units under study are identified the same way (or at least in ways that can be mapped) in both data sets. In the example, chemical substances need to be identified the same way (for example, by ChEBI URIs) or in a way that can be mapped to ChEBI URIs. In database terminology, it must be possible to define joins across data sets (a minimal sketch of such a join follows below).
Particularly useful for biomedical research is the combination of data sets that allow for merging data and locating biospecimens for the same patient (for an example see www.acceleratedcure.org/repository/). This approach clearly relies on data sets with identified patient data. There is a need for developing protocols for making such data sets available for research while protecting patient privacy. (See also Section 3.)
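
As a minimal sketch of the join idea in item 2 above, the following Python fragment merges two hypothetical tables that identify chemical substances by the same ChEBI URIs; all column names and values are invented for illustration.

# A minimal sketch of a "join across data sets": two independently collected
# tables can be merged only because both identify chemical substances with
# the same ChEBI URIs. Column names and values are hypothetical.
import pandas as pd

# Data set A: physico-chemical properties keyed by ChEBI URI.
properties = pd.DataFrame({
    "chebi_uri": ["http://purl.obolibrary.org/obo/CHEBI_15377",   # water
                  "http://purl.obolibrary.org/obo/CHEBI_17234"],  # glucose
    "molecular_weight": [18.02, 180.16],
})

# Data set B: physiological observations keyed by the same identifiers.
functions = pd.DataFrame({
    "chebi_uri": ["http://purl.obolibrary.org/obo/CHEBI_17234"],
    "physiological_role": ["energy source"],
})

# The shared identifier makes a database-style join possible.
combined = properties.merge(functions, on="chebi_uri", how="inner")
print(combined)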
A standard set of reference ontologies for all aspects of describing and encoding data (see 1. above) needs to be developed. Such a standard set of reference ontologies is key to both finding and using data sets. It is desirable that the standard ontologies be used for annotation ‒ possibly extended in a principled way that results in an extension of the standard ontology ‒ but at the very least, ontologies used for annotation should be mapped to standard ontologies in a way that accurately relates terms with the same meaning. While ontologies were devised as a strategy for creating interoperability of data ‒ as illustrated most successfully in the case of the Gene Ontology ‒ the current situation is one in which every new group of researchers feels obliged to create their own local ontologies, thereby re-creating the very problems of silo formation which ontologies were designed to solve. Searching almost any term in the NCBO BioPortal yields a multitude of responses in these different local ontologies, thereby providing insufficient guidance for those who may wish to follow good practice and reuse ontologies which already exist. The OBO Foundry (http://obofoundry.org) has attempted to rectify this situation in a collaborative effort with many communities of researchers; activities to develop standard ontologies in this framework and to achieve their adoption should be promoted to the maximum possible degree.

2.1 Data standards, reference sets, and algorithms to reduce the storage of redundant data

2.2 Data sharing standards according to data type (e.g., phenotypic, molecular profiling, imaging, raw versus derived, etc.)

We need standards for describing data sets as a whole. Standards need to support using interrelationships across data types, and to support documenting the derivation of one data type from another. For example, phenotypic data can be derived from image data through interpretation by people or by software.

3 Secondary/future use of data

We list here a number of issues and make a number of comments on the general legal/regulatory and social/cultural "landscape" of data sharing. These issues need to be addressed by NIH and HHS in consultation with researchers, clinicians, and patients (or their organizations), with sufficient weight given to the common good. Who owns a given data set? Is it in the public interest that there be ownership of personal health data by other than the patient? If produced with public funds, data ownership should be public, possibly with certain time-limited usage rights for the researcher who collected the data. Should anonymized patient data from hospitals (at least public hospitals, or of patients whose care is paid for by the public) be available for research to any qualified researcher? What new legislation is needed? Who is responsible for assuring responsible use of a given data set: a custodian of the data set (a role different from owner) or a researcher re-using a data set? What kinds of agreements are needed between data stewards and researchers using the data? How should researchers account for the responsible use of data? What is the right balance between preventing misuse of data (a "policy based on fear") and supporting use of data for scientific advances that in the end produce benefits for patients?

3.1 Ways to improve efficiency of data access requests (e.g., guidelines for Institutional Review Boards)

IRB protocols, assuming the proper consent from study participants, could incorporate something analogous to a Creative Commons license for the reuse of data. Aligning this with patient consent procedures would be very useful (see 3.3 below).
3.2 Legal and ethical considerations

3.3 Comprehensive patient consent procedures

This concerns not just patients, but study participants of any sort. Consent forms for patients and consent forms for study participants should follow the same principles. This is in keeping with the general principle that access to research data and patient data should be addressed in an integrated way. Create a model patient and study participant data sharing consent form with the intent of having it adopted as a standard by most health care providers and IRBs. This should include:
1. Patient or study participant consent to use anonymized data in research and to integrate anonymized data into larger data sets. Patients may have the option to agree to research use of their data with their name attached, with a restriction that their name not appear in publications based on the data, or waiving even that restriction. This is important for research that needs to pull together data for the same patient from multiple sources (see also Section 2).
2. Patient consent to include identified data in secure comprehensive patient data systems that integrate patient data from multiple health care providers for improved patient care; specific permission allowing the patient's health care providers to obtain all the patient's data from the comprehensive data system; and permission to compare data against new research results that might have an influence on the patient's treatment and to notify health care providers of any results. From such integrated patient data systems, anonymized patient data sets could be produced that are much richer and more useful for research than anonymized patient data from a single source.
3. Patient consent to use data to determine eligibility for clinical trials.
While 2. and 3. do not pertain directly to data sets, they should be included in the consent form so as to integrate all data use permissions and encourage patient consent. Similar model language should be developed for use with IRB protocols and clinical trials. One might also consider two mechanisms whereby patients who are so inclined can contribute data to research: (1) there is an increasing trend for patients to maintain a copy of all their medical data from whatever source; a repository could be established where patients could contribute these data, and such a repository could assign a patient identifier that cannot be easily traced back to the person; (2) a "data donor card" through which patients could agree to the use of their data after death, with the patient's specification of restrictions (possibly none). In this context, the problems of de-identification and anonymization (http://julesberman.blogspot.com/2007/05/difference-between-de-identification.html) need to be addressed. With de-identified data (and even more so with anonymized data) it is not possible to pull together data for the same patient from multiple sources, thus preventing some kinds of research.

4 Data accessibility

Data repositories and other data sharing mechanisms should support communication between data producers and researchers who want to re-use the data. A data producer should be notified when someone shows interest in using his or her data set. This might then lead to collaboration. Collaboration might shorten the embargo time for a data set. One problem of data access is the proper authorization of anyone who requests access to data and establishing their need for the data.
This can be facilitated by systems like ORCID (Open Researcher & Contributor ID, http://about.orcid.org/).

4.1 Central repository of research data appendices linked to PubMed publications and RePORTER project records

This is a good idea. Depositing data from NIH projects in this repository would be an enforceable requirement in the data sharing plan. Data sets should be independently identified and citable.

4.2 Models and technical solutions for distributed querying

There are competing standards for data set description (as opposed to just data set citation). These should be unified to the extent possible; at the very least, all descriptions should cover a standard set of information elements that are important for retrieving and using a data set. This will require some work. We need better ways to describe data models and the structure of data. This should be considered a research problem and funded as such. The emerging Semantic Web, along with its evolving standards, should be encouraged, with instructions to publish data in this medium when possible and appropriate.

4.3 Comprehensive investigator authentication procedures

5 Incentives for data sharing

Consider data sharing as part of the scholarly record for promotion and tenure (already in place at the University at Buffalo). Making a data set available (generally through a repository) would secure some credit. Impact would be measured by the number of times the data set has been used, and possibly by the impact of the resulting publications. This requires a change of attitudes and procedures (perhaps prodded by NIH), but does not incur costs. Depending on the importance of the data used in a publication, the creator/collector of the data should be listed as an author (a practice that already exists in some domains but should be extended) or mentioned in the acknowledgments. Data sets used should be properly cited, and mechanisms for enabling such citation (such as assignment of DOIs for data sets) should be encouraged. Making a data set available may lead to collaboration and jointly authored publications (see 5.1). NIH should consider more emphasis on research that reuses existing data, especially combinations of data sets, in innovative ways. Such research is not necessarily hypothesis-driven; the value of data mining projects should be recognized by appropriate funding mechanisms. Along with these carrots there should also be a stick. NIH should consider more emphasis on enforcing the data sharing provisions in its grants and contracts. These provisions should furthermore require that data sets be deposited in an NIH-approved repository that is known for efficient data sharing. Allowable embargo times on data sets should be approved by the NIH project officer as part of the data sharing plan. NIH should consider funding projects devoted to the creation of data that would be useful to many researchers. The resulting data would have to be available without restrictions, except restrictions mandated by law or by the consent given by study participants. In some cases good data available to many researchers are more valuable to the progress of science than study results. To summarize: NIH should consider developing guidelines for data sharing and then negotiating data sharing plans for funded projects using these guidelines. Such guidelines should include standards for annotating data, because otherwise the shared data are not useful. Researchers would need to provide reasons for restrictions on data access in the data sharing plan.
Once a data sharing plan is established, NIH should enforce it.

5.1 Standards and practices for acknowledging the use of data in publications

This may seem almost trivial but is important in terms of incentives. If a publication is based primarily on reanalysis of one data set, then the producer(s) of that data set should be listed as co-authors, with their role specified. Otherwise, data sets should be listed in a "data sources" part of the bibliography. Any of various formats can be used as long as the creator of the data set is clearly identified so that proper credit can be given (for example, in citation counts). Indexes such as the h-index should include citations to data sets. Many bibliographic citation managers already include formats for data sets; others should be encouraged to follow suit.

5.2 "Academic royalties" for data sharing (e.g., special consideration during grant review)

6 Support needs

See summary of recommendations.

6.1 Analytical and computational workforce growth

There are more aspects to workforce development:
- Biomedical researchers should be made aware of ontologies and their importance (and usefulness to the researchers' work) early on, through inclusion of appropriate modules in the curriculum.
- Good data curation is needed from the conception of a research plan, through depositing a data set in a repository, to maintaining and enhancing the data set (for example, by tagging it with possible uses). Researchers are often neither willing nor able to do this, so there needs to be a cadre of research data assistants / data librarians who assist researchers with curating their own data and with finding data sets for re-use.
- There should be stronger relationships between researchers and bioinformaticians and biostatisticians from the outset of a research project. Education that specifically emphasizes such collaboration of specialists should be encouraged. This will lead both to better studies and to greater reusability of data.

6.2 Funding for tool development, maintenance and support, and algorithm development

Especially tools for automated or computer-assisted annotation of data based on textual descriptions and images, bringing research and algorithms in this area together to develop systems that can be deployed widely. (See 1.1.)

7 Observations on related issues in patient data

The use of proprietary coding schemes in patient data systems is a major obstacle to patient data sharing in health care and to re-use of patient data (usually anonymized) in research. Therefore, the appropriate federal and state agencies should encourage the use of standardized coding schemes in patient data (electronic health record) systems. Possible incentives include providing public support only to installations of patient data systems that use standardized coding schemes. Patient data could be much improved by having a more developed data infrastructure, including "data nurses": personnel at the level of nurses who support the proper capture, coding, and curation of patient data. Technologies that enable transparency, such as easy availability of term definitions without restriction, should be considered mandatory, given that interpretability of patient data is of such high value.

Related documents and sources

(1) The following response to the related RFI https://www.federalregister.gov/articles/2011/11/04/2011-28621/request-for-information-public-access-to-digital-data-resulting-from-federally-funded-scientific makes recommendations that are also useful to NIH: Haendel, Melissa et al.
Preservation, discoverability and access. This document is attached to our submission within the same document.

(2) The following chapter provides a good overview of the need for semantic interoperability to improve patient care, the present state of affairs, and recommendations for improvement: Dipak Kalra, Mark Musen, Barry Smith, Werner Ceusters, Georges De Moor. ARGOS Policy Brief on Semantic Interoperability. In: G. De Moor (ed.), Transatlantic Cooperation Surrounding Health Related Information and Communication Technology (Studies in Health Technology and Informatics 170), 2011, 1-15. http://ontology.buffalo.edu/smith/articles/Argos_Semantic_Interoperability.pdf The following report provides an in-depth analysis of semantic interoperability, prepared as background for the shorter chapter: Ceusters, Werner; Smith, Barry. Semantic Interoperability in Healthcare: State of the Art in the US. A position paper with background materials prepared for the project Argos - Transatlantic Observatory for Meeting Global Health Policy Challenges through ICT-Enabled Solutions. March 3rd, 2010. New York State Center of Excellence in Bioinformatics and Life Sciences, Ontology Research Group, 701 Ellicott Street, Buffalo NY 14203, USA. This document is attached to our submission as a separate upload.

(3) A good resource for research on data reuse is found at www.researchremix.org/wordpress/ (website of Heather Piwowar).

On behalf of the Resource Discovery Group, a consortium of researchers from eagle-i (https://www.eagle-i.net/), VIVO (http://www.vivoweb.org/), the Neuroscience Information Framework (NIF; http://neuinfo.org/), Biositemaps, and the CTSAs, who are interested in promoting research resource representation and discovery in the scientific enterprise.

Preservation, Discoverability, and Access

(1) What specific Federal policies would encourage public access to and the preservation of broadly valuable digital data resulting from federally funded scientific research, to grow the U.S. economy and improve the productivity of the American scientific enterprise?

Federal agencies should create a technical standard that enables discovery, usability, attribution, and long-term preservation of digital data. These specifications need at a minimum to include the archiving of data in publicly accessible repositories, using standard record and metadata formats, and promoting best practices for interoperability and reuse, such as Semantic Web standards and Linked Open Data. Once the technical standards are in place, policy can be established that requires data to be made available in a compliant manner as a deliverable of all federally funded grants and contracts, not only those over $500,000. Grants with a data-sharing component should have a required budget line item for data sharing and archiving. A critical aspect of this policy will be to define "digital data" in the context of the policy. Funding agencies can support researcher efforts to meet the policy requirements by integrating semantic reference to these digital data into grant application and reporting structures. With appropriate tactical issues worked out, funding agencies could partner with publishers to require (and verify) data sharing before research results can be published. Finally, award and incentive systems (including institutional APT committees) must recognize the value of quality data management and sharing to the scientific enterprise.
It is estimated that it costs $24,100 and from 1.5 to 3 years to develop a transgenic mouse from scratch (eagle-i, unpublished economic analysis). What if that mouse were available to the research community at or during its development? This could expedite both public and private research endeavors. One of the issues is that "this" mouse is neither shared nor represented in a standardized manner such that it can be found for general reuse. It is not until a curator at a specialized database sees it in a publication that it becomes part of the public record of available resources - sometimes years after it was developed. The point here is that the metadata about research resources themselves is digital data, and standardized representation and sharing of research resources should be included in any digital data policy. It should be noted that a lack of data annotation and sharing may not be for lack of desire to do so. Funding agencies, libraries, and research offices should offer training and helpdesk facilities to educate researchers in best practices for data annotation and sharing.

(2) What specific steps can be taken to protect the intellectual property interests of publishers, scientists, Federal agencies, and other stakeholders, with respect to any existing or proposed policies for encouraging public access to and preservation of digital data resulting from federally funded scientific research?

One issue is that currently it is largely only publications and patents that are attributed. The scope of attribution needs to expand. Researchers need the ability to access the components within the publication (e.g. a knockout mouse, viral vector, database, datasets, etc.). This will protect the interests of individual stakeholders, and they will feel more inclined to share these important and relevant outcomes of the scientific enterprise. Specifically, data sets can be citable, authored sets of information that can be referenced in the context of publications, grant reports, etc. While mechanisms are underway to support such efforts (Bioresource Research Impact Factor, Beyond-the-pdf, nanopublication), it will not be until funding agencies, employers, and publishers consider such citations in the context of evaluating a candidate, proposal, or manuscript that they will be adopted. However, federally funded research produces data that is generated using taxpayer money, and it belongs to the people. The person who generated it has no intellectual property interest in the data. What they do with it is a different matter, and they should be given some amount of time to do something with it (publish, patent, market, etc.) - 9 months or a year, perhaps.

(3) How could Federal agencies take into account inherent differences between scientific disciplines and different types of digital data when developing policies on the management of data?

In developing policy that accommodates differences between scientific disciplines, libraries and information science researchers who are accustomed to providing guidance and resources for disparate kinds of data should be consulted. While data differ across disciplines, there are qualities common to all data types, and these should inform inter-disciplinary requirements. For instance, there exist upper ontologies that represent the types of things that exist. Classification of data elements can be tied to such upper ontologies via reuse of these upper ontologies.
One example is the Basic Formal Ontology, the upper-level ontology for all Open Biomedical Ontologies (OBO; http://www.obofoundry.org), which enables representation of a catheter, a zebrafish liver, diabetes, and regulation of cell adhesion. These entities may not on the surface appear to have anything in common, but use of a common upper ontology can facilitate data integration about all of them (for example, in the context of designing an experiment). However, it is equally important to consult the end-user who is attempting to query across disciplines, to ensure consistency of data representation. To this end, existing discipline- or data-specific repositories should also be consulted to ensure applicability. Furthermore, to support innovative reuse of digital data, it is important to recognize that these uses are not usually the original creator's intent. Data from disparate disciplines, projects and sources can be combined for synthetic and synergistic scientific inquiry; this in itself will also support new markets. Interoperability standards will benefit these new applications. Therefore, each discipline may require specialized data formats, queries and applications, but federal agencies can promote open and extensible standards to meet cross-disciplinary needs. Another facet that must be considered is the extent to which different data sensitivity issues exist in different fields. For example, publication of uranium enrichment metadata may require different consideration than data on the Arabidopsis genome.

(4) How could agency policies consider differences in the relative costs and benefits of long-term stewardship and dissemination of different types of data resulting from federally funded research?

The most important thing federal agencies can do with respect to garnering an understanding of the cost-benefit analysis of stewardship and dissemination is to promote scientific inquiry that depends on public digital data. It is currently difficult to obtain funding for such projects, and as such, it remains a somewhat idealistic rationale that publication and availability of data will be good for the research enterprise. In fact, we know that it is difficult to reuse others' data without standards, and it is often more cost-effective and time-saving to create one's own data. If we are to tip the scales and actually save time and money, it will be because there exist standards and requirements to promote data reuse. Such requirements can be met via interagency collaboration, standardization, and cost sharing. In doing so, there is the potential to control costs and maximize benefits by limiting duplicate efforts, distributing responsibility, and educating researchers. Furthermore, with respect to research resources, there is a clear indication that reuse of such entities saves time and money. If standards were promoted to enable their identification and relevance, and researchers were incentivized via funding streams to leverage preexisting resources, this could lead to a very solid understanding of the cost-benefit of sharing digital data for these particular data types.

(5) How can stakeholders (e.g., research communities, universities, research institutions, libraries, scientific publishers) best contribute to the implementation of data management plans?

Participation by the many stakeholders must be regulated by technical and legal standards to ensure and promote free public access, discovery, re-use, and preservation.
The expertise and methodologies of these stakeholders should be leveraged collaboratively, both in the development of policy and in its execution. Such collaboration is required for success, and can drive best practices, innovation, market creation, and compliance. The present repositories of research communities, publishers, and institutions can be utilized and developed (e.g. Pangaea, TreeBASE, eagle-i, NIF, Biositemaps). Existing partnerships between pu
03/12/2012 at 10:53:38 PM Organization Oregon Health & Science University Library Portland, OR Scope of the challenges/issues The need for information and data management literacy extends beyond a national mandate for sharing and public access - the scientific community must embrace a culture in which every scientist understands how to manage, navigate, and curate huge amounts of data. Currently, much data is not uniquely identified in publications, and data and resources that are unpublished are largely unavailable. Even when researchers want to share such things, they lack both the training to do so in a meaningful way and a visible and efficient means to do so. Projects such as NIF and eagle-i can facilitate the latter, but the former is still lacking. For example, eagle-i found that only 12% of labs surveyed had any kind of resource inventory system. Gone are the days of brown lab notebooks - we are more disorganized as scientists than ever before. Not only does this lack of sharing of data and resources result in lost opportunities for efficiency, cost savings, and collaboration in research, it means that experiments aren't reproducible due to lack of specificity.

To improve this situation, we recommend that the NIH: a) provide more training in information management as part of graduate training programs and other outreach opportunities; b) support software development to be used within the course of research to track data and resources such that they can easily be shared in the context of publications or other resource and data sharing venues; and c) provide recognition for sharing data and resources.

Data and other reusable research resources are scholarly outputs which, if made available to others, can significantly reduce the cost of future research and promote innovative science.

Standards development

Management systems for biological data: Since the late 1980s, the field of data management has been dominated by relational database management systems (RDBMS), which supplanted homegrown file-system data management approaches. Normalized RDBMS and the Structured Query Language (SQL) solved the critical problems of data integrity, consistency, and redundancy, but we are now running up against their limits: RDBMS simply do not scale to petabytes of data, and SQL was not designed with the Internet in mind. Today, cheap servers are being used to build both public and private data clouds that scale horizontally across multiple distributed systems. However, while these NoSQL databases are both flexible and scalable, they are a setback in terms of data integrity, consistency, and non-redundancy. According to Brewer (2012), the CAP theorem states that it is impossible for any distributed system to simultaneously guarantee Consistency (all endpoints see the same data at the same time), Availability (every request receives a timely response no matter where on the network it comes from), and Partition tolerance (the system continues to operate during any delays in making the data consistent across locations). The technical challenge is finding a solution optimized for the needs of biological research, where the accuracy and precision of the results are the primary requirements.
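
As a toy illustration of the trade-off the CAP theorem describes, the following Python sketch simulates two replicas that keep accepting writes and reads during a network partition at the price of temporarily serving inconsistent values. It is a didactic simplification, not a model of any particular NoSQL system.

# Two replicas of the same record, kept available during a network partition.
replica_a = {"gene_expression": 1.0}
replica_b = {"gene_expression": 1.0}

# During a partition, a client updates replica A; replica B cannot be reached.
replica_a["gene_expression"] = 2.5

# Both replicas stay available for reads, but they now disagree (no Consistency):
print(replica_a["gene_expression"])  # 2.5
print(replica_b["gene_expression"])  # 1.0 -- stale value served to another client

# When the partition heals, an anti-entropy step reconciles the replicas;
# until then, analyses that demand exact results must wait or reject the read.
replica_b.update(replica_a)
print(replica_b["gene_expression"])  # 2.5 -- eventually consistent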

Standards for discipline-specific data: It would be burdensome and unproductive to identify, set, and enforce discipline-specific content standards for every discipline. Rather, what should be pursued are standards that optimize discovery and use by human and machine readers, enabling data to be stored in one place or in distributed repositories while remaining discoverable in one place and by similar mechanisms. Standards required include format standards, minimum metadata requirements, transport and access standards, Semantic Web standards, and Linked Open Data. The Semantic Web is a collection of representation and transport standards built on HTTP. Access standards (who can access what, and how this information can be presented) are well developed in commercial areas and poorly developed in science. The W3C standards development process has successfully produced HTML, XML, RDF and other languages. In domains including web technologies, geospatial data (FGDC), and ecology (EML), standards have been a success because of openness and community participation. Standards development relies on the contributions of a diverse population of experts, including scientists, information professionals, and technologists.

Requiring standards on metadata and on repositories is "the key component to enabling effective, scalable data transfer between independently developed systems" (Goddard et al., 2011). Such metadata must include disambiguated references to people, such as the ORCID effort (Fenner, 2011), and unique identification of data sets and resources to ensure data integrity. Repository standards are necessary to enhance the ability to create tools that can systematically perform searches across or within repositories.
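
As a minimal sketch of what such a machine-readable metadata record might look like, the following Python fragment builds a record loosely modeled on the DataCite metadata schema; the DOI, ORCID iD, and field selection are illustrative placeholders rather than a normative format.

# A minimal sketch of a machine-readable dataset record, loosely modeled on
# the DataCite metadata schema. All identifiers and values are placeholders.
import json

record = {
    "identifier": {"identifierType": "DOI", "identifier": "10.1234/example-dataset"},
    "creators": [{
        "creatorName": "Doe, Jane",
        # Disambiguated reference to a person, per the ORCID effort.
        "nameIdentifier": "https://orcid.org/0000-0000-0000-0000",
        "nameIdentifierScheme": "ORCID",
    }],
    "title": "Example assay results",
    "publisher": "Example Repository",
    "publicationYear": 2012,
    "resourceType": "Dataset",
}

print(json.dumps(record, indent=2))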

Unique identifiers. To facilitate linking between publications and data, the use of persistent, unique identifiers for data, research resources, publications, authors, and institutions is required. These identifiers should be unique Internationalized Resource Identifiers (IRIs), the standard for identifiers on the World Wide Web, so that the data and/or metadata about the resource can be made directly available. Unique identifiers enable visible links between entities, as well as re-use and the development of new services. For example, browsing a publication could include integrated data displays. In support of this functionality, we need standards for citing datasets and models, along the lines of what SageCite (http://www.ukoln.ac.uk/projects/sagecite/) is working towards. Further, linkouts between publications, datasets and resources are needed, similar to the way they work today for genes in NCBI. Clicking on links to research resources could point the user to the resource or to information on how to obtain the resource.
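
A minimal sketch of why dereferenceable IRIs matter: with HTTP content negotiation, the same identifier can serve either a human-readable landing page or machine-readable metadata. The IRI below is a placeholder, and the widely used Python requests library stands in for whatever client a service would actually employ.

# A minimal sketch of dereferencing a persistent identifier for machine use.
import requests

iri = "http://example.org/dataset/123"  # hypothetical persistent identifier

# Ask for RDF metadata about the resource rather than an HTML landing page.
response = requests.get(iri, headers={"Accept": "application/rdf+xml"},
                        allow_redirects=True, timeout=10)
print(response.status_code, response.headers.get("Content-Type"))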

Ontologies. With regard to a standardized classification scheme (i.e., ontology-based metadata), we suggest that NIH actively support the development of a very small number of critical ontologies and actively discourage the proliferation of project-specific ontologies. Our recommendation is that NIH fund the OBO Foundry (currently an entirely volunteer, community-based group) to coordinate community-driven standards development. Supporting the OBO Foundry will go a long way towards ensuring that quality, shared ontologies that can directly address NIH research priorities are available. String-matching methods (or variants thereof) will not resolve all mapping issues across the hundreds of existing 'ontologies'. Raw mappings, while they offer a starting point, contain too many false positives and false negatives to be used in biomedical research. Ontologies, like textbooks, must be written and edited by knowledgeable domain experts. While this may seem an open-ended task, in fact the number of core ontologies (from which other ontologies may be built by combining classes from the cores) is quite small: function and processes (GO), environment (EnvO), cell types (CL), anatomies (xxAO), chemicals (ChEBI), the possible relations between things (RO), the qualities of the biological things we deal with (PATO), and a few others. We suggest that for each NIH programmatic goal there be an early assessment of the interoperability and other standards requirements needed to achieve that goal, followed by a "gap analysis" to determine which standards are available and which must be developed by the NIH-supported standards-setting organization, the OBO Foundry.
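For a sense of what consuming a shared OBO Foundry ontology looks like in practice, the following sketch (assuming rdflib and network access) loads the OWL rendering of the Cell Ontology (CL) from its OBO PURL and prints a few class labels; it is illustrative only, since the full file is large.

```python
# A minimal sketch of consuming a community-standard ontology: load the
# OWL rendering of the Cell Ontology and list some class labels.
from rdflib import Graph
from rdflib.namespace import OWL, RDF, RDFS

g = Graph()
g.parse("http://purl.obolibrary.org/obo/cl.owl")  # OBO Foundry PURL for CL

for cls in list(g.subjects(RDF.type, OWL.Class))[:5]:
    for label in g.objects(cls, RDFS.label):
        print(cls, label)  # e.g., a CL term IRI and its human-readable name
```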

Support innovation by supporting standards: The adoption of stable, well-documented, community-accepted standards for both the syntax (e.g., OWL, GFF3) and the classification scheme used to describe the data being published, together with training in information management as part of graduate training programs, will make it possible for individual researchers to readily publish their data and, equally important, to consume and use published data in their own research. In addition, we recommend NIH offer support to researchers evaluating and developing applications that can leverage the new NoSQL data management technologies (e.g., Google's Bigtable, Yahoo's PNUTS, and others). Very few researchers think about standards; however, standards are essential for innovation because they provide a stable foundation on which new technologies can be created. By clearly specifying the acceptable syntaxes and mechanisms by which data generated through NIH funding is to be published on the Web (to create nodes/URIs for Linked Open Data), NIH can ensure that the data is conserved for downstream researchers to discover and leverage to explore hypotheses. Furthermore, stable data publication standards will not only provide a means of salvaging the data assets generated through NIH funding, but also make it possible for software developers to design and develop innovative new applications for extracting further information from the data.

Secondary/future use of data and data accessibility

While it is difficult to anticipate the future use of data, there is evidence that markets will emerge to support the management and usability of the data. For example, the Human Genome Project (HGP) has resulted in groundbreaking discoveries and therapies, such as Dr. Brian Druker's and Oregon Health & Science University's development of the cancer drug Gleevec, which resulted from the research sharing and advances that the HGP fostered. Initially, nearly four billion dollars was invested in the HGP. In 2010, the resulting industry produced $67 billion in U.S. economic output, $20 billion in personal income for U.S. citizens, and 310,000 jobs (Battelle Technology Partnership Practice, 2011). Clearly, data sharing has societal and financial value in addition to research value.

Federal agencies should support the development of technical standards that enable discovery, usability, attribution, and long-term preservation of digital data. These specifications need to include the archiving of data in publicly accessible repositories, the use of standard record and metadata formats, and the promotion of best practices for interoperability and reuse, such as Semantic Web standards and Linked Open Data. Once technical standards are developed, policy can be established that requires data to be made available in a compliant manner as a deliverable of all federally funded grants and contracts, not only those over $500,000. Grants with a data-sharing component should have a required budget line item for data documentation, sharing, and archiving. A critical aspect of this policy will be to define "digital data" in the context of the policy. Funding agencies can support researcher efforts to meet the policy requirements by integrating semantic references to these digital data into grant applications and reporting structures. Some work toward specifying such methodology has been done in the cross-agency SciENCV effort (http://rbm.nih.gov/profile_project.htm).

Disambiguation services such as the Virtual International Authority File (VIAF, http://www.oclc.org/research/activities/viaf/) focus on organizations and people and offer a promising path forward for improving data quality. Other efforts could be established using open tools such as Google Refine (http://code.google.com/p/google-refine/) to support structuring data according to common standards and to enrich and validate data using available services (Google Maps, Freebase, and so on). Web services from data repositories could be integrated into desktop systems, websites, and publication submission tools. Services that enable linking to data, and linking both data and publications to known identifiers or terminology at the time of submission of a new publication, could push much of the linking upstream to where the incentives for documenting work are highest.
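A minimal sketch of such a disambiguation call follows, querying VIAF's AutoSuggest service for one of the authors cited in this comment; the endpoint and JSON field names are assumptions based on OCLC's public documentation, so treat the interface details as illustrative rather than guaranteed.

```python
# A minimal sketch of name disambiguation against the VIAF authority
# file mentioned above (endpoint and response shape assumed).
import requests

resp = requests.get(
    "http://viaf.org/viaf/AutoSuggest",
    params={"query": "Tenopir, Carol"},  # an author cited in this comment
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json().get("result") or []:
    # Each hit pairs a candidate name form with its VIAF identifier,
    # which can then be stored as a disambiguated reference to the person.
    print(hit.get("viafid"), hit.get("term"))
```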

Clinical data is particularly difficult to reuse. There is no "human model organism database" that collates the kinds of information that the model organism databases collate. Data in electronic medical records (EMRs) is structured differently from the data collected for clinical and non-clinical research purposes, and the lack of data representation standards, together with coding oriented toward billing, makes it difficult to perform clinical translation. A significant amount of effort is expended to protect patients' privacy, but protecting their right to share their information is rarely considered. If one had access to a cloud-based, all-inclusive EMR for oneself, and opted to share all or portions of it with research institutions, this would not only provide a more extensive clinical representation of the patient but also make the data more accessible to research for those who want to share it. Some preliminary work towards this end is being done by Obeid, Gabriel, and Sanderson (2010); it is hoped that such a system will be implemented in the context of all CTSAs and pilot a better access-standards strategy.

Incentives for data sharing

Attribution: The enormous volume of data available to scientists provides incredible opportunities for innovative research, but maintaining and navigating such datasets poses major obstacles. A recent survey reported that 85% of scientists surveyed are interested in using other researchers' data, but only 36% report that their own data is easily accessible (Tenopir et al., 2011). One issue is that shared data and reusable resources are generally not attributed; typically only publications and patents are. All current means of rating researchers, such as the H-index, Scholarometer, Google Scholar citations, and ReaderMeter, rely solely on citation data (the basic computation is sketched after this passage). Citation count is a widely used metric for tenure and promotion decisions, but counts are only one measure of scholarly research. These researcher assessment measures disregard other metrics such as the number of downloads, the number of blog posts, accessions, etc. Recognizing the production of shared data sets and the development of reusable resources as citable, authored sets of information will provide incentives for data sharing while protecting the interests of individual stakeholders. Data citation provides a framework for recognition that is based on the currently accepted model of citation within publications.

Improved citation and data linking tools: Because discrete components of published research (e.g., a knockout mouse, viral vector, assay, database, or data set) are not cited, existing impact metric tools do not take them into consideration when calculating impact factors. However, new tools are now being designed that incorporate other measures. Examples include Total Impact, ScienceCard, and FigShare, which can additionally offer a means to indicate the funding source. These metrics are expandable and can incorporate any research artifact that a scientist produces in the course of their work, provided it has a stable identifier. While mechanisms are underway to support such efforts (Bioresource Research Impact Factor, Beyond-the-PDF, nanopublication), they will not be adopted until funding agencies, employers, and publishers recognize the value of such citations. Researchers can then list these scholarly outputs on their CVs, biosketches, and grant applications, and tools can be built to leverage them to show the breadth of a scientist's expertise and productivity. Funding agencies could partner with publishers to require and verify data sharing before research results can be published. Award and incentive systems, including institutional promotion and tenure committees and funding agencies, must recognize the value of this expanded set of scientific outputs to the scientific enterprise. However, for the new tools to offer the functionality needed, NIH must become involved and contribute funding and guidelines to their development. Support from NIH and other research institutions for these award and incentive systems would greatly encourage scientists to comply.

Support distributed collaborative research: Co-authored biological research papers have been rising over the decades as computers have made collaboration easier. A 2002 National Science Foundation report found that the proportion of co-authored papers rose from 7 to 17 percent between 1986 and 1999. Biologically meaningful and clinically consequential breakthroughs are more commonly produced by teams of scientists than by single scientists. The Web itself (HTML, built atop TCP/IP) originally arose from the needs of physicists to collaborate. As financial resources become scarcer and biological problems more complex, biologists are compelled to collaborate more and to be more technologically innovative to succeed. Collaboration increases research quality, fosters the rapid growth of knowledge, and plays an important role in the training of new scientists. The virtual spaces created to allow researchers from different locations to work together are called "collaboratories."
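Since several of the metrics above build on the h-index, here is a minimal sketch of that computation: the largest h such that a researcher has h outputs each cited at least h times. Extending the same counting to downloads or accessions, as proposed above, would only change the input list.

```python
# A minimal sketch of the h-index computation referenced above.
def h_index(citation_counts: list[int]) -> int:
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:   # the rank-th best output has at least `rank` citations
            h = rank
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4: four outputs with >= 4 citations each
```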

We recommend that NIH consider funding software developers in much the same way that HHMI or the NIH Pioneer Awards fund people, not projects. The goal here is to have these developers work in the context of these "collaboratories." The history of the field is that its best software is often an unplanned labor of love from a single individual, and the disparity between the best developers and average ones is enormous (i.e., >10-fold). Business studies recommend models that enable highly skilled developers to focus on what they do best, and the best developers are often not grant writers. We urge NIH to fund research that will develop technologies addressing the challenges of collaborative projects, to create better mechanisms for including highly skilled developers, and to preferentially support collaboratories rather than vertically organized centers.

Support needs

Sharing begins during research: "Increasingly large-scale data generation methods - sequencing, mass spec, imaging, and more - are becoming available to individual investigators, not just large NHGRI-funded genome centers." (Eddy, 2010). Ideally, the fate of any data produced by a researcher is that it will be conserved, including the original results generated for archival, the processed data, and the conclusions that are drawn. To accomplish this, the publishing requirements for the data must be simple and straightforward, so that any researcher, whether PI or graduate student, can quickly and easily figure out the process. If the barrier is too high, researchers will not comply with data sharing, and the data will be lost to the wider community. This is a waste of research funds. The benefit of encouraging data sharing during research is the creation of a new age of semantic awareness among researchers, reviewers, and the publishers of manuscripts, data, and resources. Currently, virtually all scientific metadata is captured by professional biocurators who read primary journal articles or health records and translate them into computable terms for manual data entry, well after the research is completed. Enhancing current research training to include modern information management strategies will be necessary to build this awareness, and funding agencies should support the integration of information management into the research workflow. Essentially this is what was done for the ENCODE projects and for the high-throughput phenotyping projects, but it could be considered more generally across numerous types of research projects, crowd-sourcing (or researcher-sourcing), and the capturing of metadata; for example, smartphone apps that would enable a researcher to capture the environmental conditions on the spot as they collect samples. Further, this could be an opportunity to expand the incentives for researchers by providing recognition or compensation to researchers who have shared their data on previous grants, who volunteer to be a "peer reviewer" for data that is cited in articles, and who plan their data needs ahead of time through data management or data sharing documents included in the grant application.
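As a sketch of the point-of-collection capture proposed above, the following records who, when, and where a sample was taken in a structure a repository could ingest directly; the field names and values are hypothetical, not an existing schema.

```python
# A minimal sketch of point-of-collection metadata capture: a field app
# records collector, time, location, and conditions alongside the sample.
import json
from datetime import datetime, timezone

record = {
    "sample_id": "S-0001",                         # hypothetical local ID
    "collector_orcid": "0000-0000-0000-0000",      # ORCID-style placeholder
    "collected_at": datetime.now(timezone.utc).isoformat(),
    "location": {"lat": 38.9967, "lon": -77.1006}, # e.g., from the phone's GPS
    "conditions": {"air_temp_c": 21.5, "humidity_pct": 40},
}
print(json.dumps(record, indent=2))  # ready to submit with the sample
```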

The most important step federal agencies can take to promote data stewardship and dissemination is to promote scientific inquiry that depends on public digital data. It is currently difficult to obtain funding for such projects, and as such, it remains idealistic to expend effort on careful representation, curation, and provision of data for others. Currently, it is often more cost- and time-effective to create one's own data. In order to change the cost-benefit calculus of reuse, there must be standards, rewards, and requirements for data sharing, along with funding to support it. With respect to research resources, there are clear indications that reuse saves time and money. Promoting standards that enable resource and data identification and relevance assessment, and providing researchers with incentives via funding streams to leverage preexisting resources or data, could lead to a solid understanding of the cost-benefit of sharing for these particular data and resources.

There are emerging, standardized, reusable repositories available for different types of data, such as Dataverse (http://thedata.org/), which houses social science and medical/health data as they relate to human behavior; BRENDA (http://www.brenda-enzymes.org/) for enzyme data; and numerous model organism databases (for example, http://www.zfin.org). The researcher must know about all the different repositories in order to search for what they need, and the number of such repositories is only growing. Systems such as VIVO (http://www.vivoweb.org), eagle-i (http://www.eagle-i.net), NIF (http://www.neuinfo.org), and LAMHDI (http://www.lamhdi.org) link data and resources to people and publications, and act as information aggregators using Semantic Web technologies. Making public data more visible, navigable, and useful can be accomplished by financing repository aggregators. The VIVO application and ontology are currently being extended, with funding from the Institute of Museum and Library Services, to support basic dataset metadata and registry functions. Funding more projects and tools that help domain-specific databases push and pull their data to the aggregators and to the Semantic Web will support data sharing. The current funding model to support tool development and maintenance surrounding the use and/or creation of shared data is to apply for grants. This may not be the best long-term solution, as it is competitive and does not necessarily allow for continuity, leading to a loss of data access and poor public support. We could look to the UK for possible solutions: "The UK Government announced its intention to create a Public Data Corporation (PDC). This would bring together data-rich organizations with the aims of: Providing a more consistent approach towards access to and accessibility of Public Sector Information, balancing the desire for more data free at the point of use, whilst ensuring affordability and value for taxpayers; Creating a centre of excellence driving further efficiencies in the public sector; and Creating a vehicle that can attract private investment" (Data.gov.uk).

Support development of methods for both data publication (above) and data discovery: While data differ across disciplines, there are qualities common to all data types, and these should inform inter-disciplinary requirements. Looking back on ten years of web publication, the classification schemes built to provide metadata (i.e., ontologies defining classes of data) are orders of magnitude smaller than the accumulated instance data they describe, by ratios of one to several hundred. We contend that only a small number of ontologies are actually needed, which means this ratio will grow dramatically. This leads to a significant problem: the democratization of scientific data publishing will necessarily lead to an exponentially increasing volume of data. Assuming that all of this data is described using a relatively small number of community-standard ontologies, how can this data universe be searched and compared?

Upper ontologies are available that represent the types of things that exist, and the classification of data elements can reuse them. One example is the Basic Formal Ontology (BFO; http://www.ifomis.org/bfo), the upper-level ontology for the Open Biomedical Ontologies (OBO; http://www.obofoundry.org), which enables representation of a catheter, a zebrafish liver, diabetes, and the regulation of cell adhesion. These entities may not initially appear to have anything in common, but the use of a common upper ontology can facilitate the integration of data about all of them (for example, in the context of designing an experiment) and can facilitate data publication and consumption on the Web. Further, leveraging Linked Open Data (Heath & Bizer, 2011) and standardized ontologies together offers the potential for scientific inference, from linking gene expression data in one organism to phenotypes in another, to identifying experimental bias or even false or incorrect data.
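To illustrate the kind of cross-domain query a shared upper ontology enables, here is a sketch (assuming rdflib) of one SPARQL query over a harvested Linked Open Data snapshot; the file name and class IRI are hypothetical stand-ins for real BFO-aligned data.

```python
# A minimal sketch of querying across domains once annotations share a
# common upper-level class (the class IRI below is a hypothetical
# BFO-style placeholder, not a real OBO term).
from rdflib import Graph

g = Graph()
g.parse("merged_annotations.ttl")  # hypothetical harvested LOD snapshot

results = g.query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?entity ?label WHERE {
        ?entity a <http://example.org/classes/MaterialEntity> ;
                rdfs:label ?label .
    }
""")
for entity, label in results:
    # Catheters, zebrafish livers, and clinical samples alike would match,
    # because they are classified against the same upper-level class.
    print(entity, label)
```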

Data analysis. Current compute times for even relatively small knowledge bases can run on the order of hours or days. If the problem were simply the volume of the data, it could be handled with sophisticated indexing methods. However, the complexity of the numerous rules stating what is true about the world in general, and of the relationships manifested in the ontologies, makes it expensive to run analyses over the data. While the Web is a good platform for data publication, it is a poor platform for data consumption. Since compute speed decreases with distribution, centralization is necessary to make the computations efficient. NIH needs to support a) data centers that harvest all the data from the growing number of LOD sites relevant to a particular research question, and b) the development of the highly efficient statistical correlation methods needed to discover biologically informative signals.

Libraries are an under-recognized resource in the field of data and information literacy. Librarians and scientific curators have increasingly become experts in data management because of their combined knowledge of new data sharing standards, information science, and the Semantic Web (Khan et al., 2010). Numerous libraries are now working to improve support of their local research communities with respect to data access and discovery. Spending time and money on data management and curation, valuing the specialists (largely librarians and curators) who perform this work, and using science to prove the value of organized and shared data are all required to change prevailing attitudes toward this work (Lesk, 2011). New grants are required to have a data sharing plan, but they rarely include funding for people to do this work. Better funding for this aspect of research will be a prerequisite to successful data sharing, and NIH could recommend it in their PAs and RFPs.

Incentives for data sharing - The NIH must have access to researcher data in order to accomplish its goal of managing, integrating, and analyzing large biomedical datasets. However, if researchers are not rewarded and/or recognized for sharing their data, then there is no incentive for them to take the time and incur the costs associated with organizing and classifying their data and finding an appropriate place to store it permanently, thus making it accessible to the NIH and others.

Support Needs - In addition to funding support, researchers must have tools in place to help streamline the process of sharing their data and making it discoverable. If the process for sharing data is time-consuming and difficult, the researcher will be disinclined to share his/her data. In addition, if the data is difficult to find, then the benefit of sharing data is reduced. The NIH must address the procedures and tools available to researchers and ensure they make sense, are easy to use, and provide tangible benefits to the researcher.

The NIH would need to have policies and processes in place either to reward and recognize researchers for sharing their data or to penalize researchers who do not. When the benefits of data sharing equal or outweigh the costs, researchers will be more inclined to comply with data sharing regulations.

The NIH would want to have policies and procedures for format standards, minimum metadata requirements, Semantic Web standards, and Linked Open Data in order to ensure that the supplied data can be reused and can be transferred easily between researchers.


References

Battelle Technology Partnership Practice. (2011). Economic impact of the Human Genome Project. Retrieved March 11, 2012 from http://www.battelle.org/spotlight/5-11-11_genome.aspx
Brewer, E. (2012). CAP twelve years later: How the "rules" have changed. Computer, 45(2), 23-29. doi:10.1109/MC.2012.37
Data.gov.uk. Opening up government. Retrieved March 9, 2012 from http://data.gov.uk/opendataconsultation/annex-1/economic-growth#_ftn9
Eddy, S. (2010). The next five years of computational genomics at NHGRI. Cryptogenomicon. Retrieved March 11, 2012 from http://selab.janelia.org/people/eddys/blog/?p=313
Fenner, M. (2011). ORCID: Unique identifiers for authors and contributors. Information Standards Quarterly, 23(3). http://doi.org/gx4
Goddard, A., Wilson, N., Cryer, P., & Yamashita, G. (2011). Data hosting infrastructure for primary biodiversity data. BMC Bioinformatics, 12(Suppl 15), S5. doi:10.1186/1471-2105-12-S15-S5
Heath, T., & Bizer, C. (2011). Linked Data: Evolving the Web into a global data space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1), 1-136.
Khan, H., Caruso, B., Corson-Rikert, J., Dietrich, D., Lowe, B., & Steinhart, G. (2010). Using the Semantic Web approach for data curation. Proceedings of the 6th International Conference on Digital Curation, December 6-8, 2010, Chicago, IL. http://hdl.handle.net/1813/22945
Lesk, M. (2011). Encouraging scientific data use. The Fourth Paradigm, a Nature Network blog. Retrieved March 11, 2012 from http://blogs.nature.com/fourthparadigm/2011/02/07/encouraging-scientific-data-use-michael
Obeid, J., Gabriel, D., & Sanderson, I. (2010). A biomedical research permissions ontology: Cognitive and knowledge representation considerations. Governance of Technology, Information and Policies, 9-13.
Tenopir, C., Allard, S., Douglass, K., Aydinoglu, A. U., et al. (2011). Data sharing by scientists: Practices and perceptions. PLoS ONE, 6, e21101.
03/12/2012 at 10:55:15 PM Organization Sharp Informatics ALBUQUERQUE, NM Information rules must be understood by the Subject Matter Experts (SMEs) and the ultimate users responsible for the collection, processing, and use of the collected data. If rules are also expressed in natural language, and not only as part of a complex graphical model and the technical terms used to express it, the results of the research will be improved.     Attachment #1: Issue: Subject Matter Experts should be accountable for the quality of the data being collected, stored, analyzed, and eventually distributed. How are the requirements for managing research data being validated by Subject Matter Experts? Are these requirements only being validated by the few medical experts who are also skilled in information technology? Shouldn't the best medical experts in their particular fields be required to review and validate the requirements that will hold their research data? The problem is that these experts may not know enough about information technology models to tell when system requirements are incorrect. The Natural Language Modeling (NLM) algorithm solves this problem by communicating with subject matter experts through an easily understood and repeatable process. Example: NIH is going to host a seminar on Pediatric Drug Delivery Systems at the NIH campus in Bethesda, MD. Fifty-four doctors and industry experts are on an initial list of invitees who will be directed to a registration form on the web. NICHD application developers have proposed a sign-up form to enroll the potential attendees in the seminar. The developers request that the proposed form design be validated by the NICHD hosts (i.e., the Subject Matter Experts) before it is coded, and the NLM algorithm is used for the validation. The algorithm first asks the SMEs to use the form to create a true sentence based on some of the example data; the SME then establishes the form's requirements by answering a series of questions about this sentence. Sentence: Person with the email address of JSmith@jhh.edu has the first name of Jane and the last name of Smith. The algorithm places the highlighted variables in a matrix, and NLM then asks whether a single placeholder in the sentence can be changed while the sentence remains true. The Yes answer in the bottom row was obtained by asking: given that the sentence 'Person with the email address of JSmith@jhh.edu has the first name of Jane and the last name of Smith.' is true, can another last name exist (such as Jones) so that the sentence 'Person with the email address of JSmith@jhh.edu has the first name of Jane and the last name of Jones.' is also true? The SME thought of an example in which Dr. Smith could also have the last name of Jones. As the NLM algorithm processes this information, it determines that two last names are significant for the attendees: Smith is the Professional Last Name of the attendee at the seminar, but Jones is the Government Issued ID Last Name the attendee uses to enter the NIH campus.
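To make the underlying check concrete, here is a minimal sketch (not Sharp Informatics' actual algorithm) of the question the NLM step poses, treated as a functional-dependency test over example rows: does the email address determine a single last name? The rows and column names are hypothetical, echoing the Jane Smith/Jones example above.

```python
# A minimal sketch of the validation question as a functional-dependency
# check: each distinct `key` value should map to exactly one `value`.
from collections import defaultdict

rows = [
    {"email": "JSmith@jhh.edu", "first": "Jane", "last": "Smith"},
    {"email": "JSmith@jhh.edu", "first": "Jane", "last": "Jones"},  # same person, ID name differs
]

def determines(rows, key, value):
    """Return True if each distinct `key` maps to exactly one `value`."""
    seen = defaultdict(set)
    for r in rows:
        seen[r[key]].add(r[value])
    return all(len(v) == 1 for v in seen.values())

# Email does NOT determine last name here, so "last name" must be split
# into two attributes (professional vs. government-issued ID name).
print(determines(rows, "email", "last"))  # False
```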
The NLM algorithm also creates opportunities for SMEs to disagree on answers to these questions. The ensuing discussion then determines the correct answer by finding examples that support one of the answers. By having this discussion before the design is finalized, development and/or maintenance costs are significantly curtailed because rework is reduced. The form would be modified to reflect the newly identified requirements. This simple example shows a primary feature of the NLM algorithm, namely that an SME without any experience in information technology can validate and correct the results of IT experts. Conclusion: The NLM algorithm is equally effective at creating requirements from forms and reports, comparing the rules stated in a data model with the SMEs' knowledge, and extracting structure from large data sets. NLM allows SMEs to validate and/or correct data system designs without any specific prior IT knowledge. The answers to the simple questions come nearly automatically, and the discovery of the defined rules provides an educational experience for all of the other SMEs. This analysis can be used for any data problem with similar results, and the analysts do not need expertise in the subject area being reviewed. The SMEs are able to have a focused discussion about the applied rules, and the results benefit both the resulting design and the level of understanding of the rules being implemented. SMEs who have no IT experience can be held accountable for the design of the resulting system or for the design of standards. How many rules are involved in defining a life event of a child in a long-term study? Again, why should these rules be defined by information technology experts rather than validated by the medical experts who are directly involved in managing the health of children? There is no other competing algorithm that allows any Subject Matter Expert to validate a data model created using any design approach (e.g., ER, UML, OWL). An unsolicited paper is attached which states that the initial validation step in the NLM algorithm is the only way to enable a Subject Matter Expert to test a model. The complete NLM algorithm creates the correct rules in precise natural language, which can then be expressed in whichever design approach is preferred. A direct benefit of using Natural Language Modeling for validation is that all participants directly learn the rules involved in their subject area. In many cases, this education is more valuable to the program than the creation of precise requirements! Action: Please consider testing the NLM algorithm against some NIH IT requirements. SBIR projects of interest could target improving the ability of Subject Matter Experts to validate information designs or information standards, and identifying data rules that exist in large sets of medical test data. Attachment #2: PDF copy of article "Validation and Verification without Normalization" by Derek Ratkliff et al