Notice Number: NOT-OD-17-110
National Institutes of Health (NIH)
The National Institutes of Health (NIH) is seeking public comments regarding a proposed update to the access procedures for genomic summary results under the Genomic Data Sharing (GDS) Policy.1 Genomic summary results, also known as ‘aggregate genomic data’2 or ‘genomic summary statistics’3, are results from primary analyses of genomic research that convey information relevant to understanding genomic associations with traits or diseases across datasets rather than data specific to any one individual research participant. The goal of this proposed data management update is to provide access to genomic summary results through methods proportional to the risks and benefits posed by this type of information.
NIH’s mission is to seek fundamental knowledge about the nature and behavior of living systems and the application of that knowledge to enhance health, lengthen life, and reduce illness and disability.4 Broad and responsible sharing of genomic summary results generated through the analysis of NIH-supported research promotes maximum public benefit from the federal research investment by providing information crucial to the interpretation and application of genomic data in research and clinical practice.
Genomic summary results are an analytic output derived from a study’s primary genomic data and are currently defined by the agency to include calculated summary statistics, such as genotype counts or frequencies, allele counts or frequencies, effect size estimates and standard errors, likelihoods, and p-values. Genomic summary results facilitate the interpretation of genomic variants that may (or may not) contribute to a disease or disorder of interest.
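As an illustration of how such statistics relate to one another, the following minimal Python sketch derives allele counts and frequencies from genotype counts for a single variant. The `GenotypeCounts` structure, the function name, and the example numbers are hypothetical and are not drawn from any NIH-designated data repository; the sketch assumes a biallelic variant in diploid participants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenotypeCounts:
    """Genotype counts for one biallelic variant (hypothetical structure)."""
    hom_ref: int  # participants carrying two reference alleles
    het: int      # participants carrying one reference and one alternate allele
    hom_alt: int  # participants carrying two alternate alleles


def allele_summary(counts: GenotypeCounts) -> dict:
    """Derive allele counts and frequencies from genotype counts."""
    n = counts.hom_ref + counts.het + counts.hom_alt
    if n == 0:
        raise ValueError("no genotyped participants")
    total_alleles = 2 * n  # assumption: every participant is diploid at this site
    ref_count = 2 * counts.hom_ref + counts.het
    alt_count = 2 * counts.hom_alt + counts.het
    return {
        "n_participants": n,
        "ref_allele_count": ref_count,
        "alt_allele_count": alt_count,
        "ref_allele_freq": ref_count / total_alleles,
        "alt_allele_freq": alt_count / total_alleles,
    }


# Made-up counts for a hypothetical case group of 100 participants.
cases = allele_summary(GenotypeCounts(hom_ref=60, het=30, hom_alt=10))
```

The output describes the group as a whole; no individual participant's genotype appears in it, which is the sense in which these results are summary-level.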
Public sharing of genomic summary results from large-scale genomic research has become crucial for advancing scientific and clinical discovery.5 However, for studies available through an NIH-designated data repository6, this information is currently available only through controlled access (see below for an explanation of ‘controlled access’). Based on input from the research community over the past few years (see below), NIH is proposing to allow broader access to genomic summary results from most studies subject to the NIH GDS Policy. Institutions submitting genomic data to NIH-designated data repositories will be expected to notify NIH of any studies for which there are particular sensitivities (i.e., studies including potentially stigmatizing traits, or with identifiable or isolated study populations). Access to genomic summary results from such datasets will remain under controlled access.
NIH is committed to safeguarding the interests of study participants and maintaining public trust in biomedical research. Therefore, NIH is seeking public feedback on this proposed data management update.
History of Access to Genomic Summary Results
In 2007, NIH issued a policy for sharing data generated through NIH-supported genome-wide association studies (GWAS),7 and launched the database of Genotypes and Phenotypes (dbGaP).8 In 2014, the GWAS Policy was subsumed under the NIH GDS Policy, which applies to all large-scale genomic data generated from NIH-funded research. Under both policies, dbGaP and other NIH-designated data repositories have provided data through a two-tiered system: 1) unrestricted access, which includes descriptions of available data, the research protocols or instruments used to collect it, and summary-level information about the data; and 2) controlled access, which provides individual-level genomic and phenotypic data for appropriate research purposes under terms that are consistent with the informed consent under which those data were collected.
Under the NIH GWAS Policy, genomic summary results were originally included among the summary-level information available through unrestricted access. However, a 2008 paper by Homer et al. demonstrated a statistical method with the potential to resolve an individual’s inclusion as a member of a research group (e.g., within a disease group) using genomic summary results.9 Notably, this method, as well as others that have followed using other genomic information types,10,11 requires independent access to a known individual’s whole genome data in order to predict statistical ‘matches’ to information within genomic summary results. NIH responded to this development by moving genomic summary results into controlled access portions of NIH-designated data repositories. The agency also stated its intent to further assess the risks and benefits associated with unrestricted access to this type of information in light of the new methodology.12 The basis for the 2008 data management change was that the matching of a known individual to a disease (or ‘case’) group within a research study might reveal unknown health or phenotype information not obvious from the independently acquired whole genome data.13
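The general shape of such a membership test can be sketched as follows. This is a simplified illustration in the spirit of the distance measure described by Homer et al., not a reproduction of the published method: the function name, the inputs, and the toy numbers are assumptions, and the published analysis involves far more variants and additional statistical care. The sketch makes the key requirement visible: the test cannot be run without independently obtained genotype data for the known individual.

```python
import math
import statistics


def membership_statistic(individual, study_freqs, reference_freqs):
    """Return a t-like statistic for one known individual.

    individual:      per-variant allele dosage scaled to 0.0, 0.5, or 1.0
                     (requires independent access to that person's genotype)
    study_freqs:     per-variant allele frequencies from the summary results
    reference_freqs: per-variant allele frequencies from a reference population

    Values well above zero suggest the individual's genotype is closer to the
    study group than to the reference; values near zero are uninformative.
    """
    if not (len(individual) == len(study_freqs) == len(reference_freqs)):
        raise ValueError("inputs must cover the same variants")
    if len(individual) < 2:
        raise ValueError("at least two variants are needed")
    # Per-variant distance: how much closer the person is to the study group.
    d = [
        abs(y - ref) - abs(y - study)
        for y, study, ref in zip(individual, study_freqs, reference_freqs)
    ]
    sd = statistics.stdev(d)
    if sd == 0:
        raise ValueError("no variation across variants")
    return statistics.mean(d) / (sd / math.sqrt(len(d)))


# Toy values for four variants; real analyses use many thousands.
person = [0.0, 0.5, 1.0, 1.0]
study = [0.1, 0.5, 0.9, 0.8]
reference = [0.3, 0.4, 0.6, 0.5]
t_stat = membership_statistic(person, study, reference)
```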
Although NIH maintains genomic summary results in dbGaP under the controlled-access model, the research community has since developed several highly utilized and valuable public data resources to share genomic summary results.14 In addition, this type of information continues to be publicly available as an element of published studies in the scientific literature. Despite the broad availability of genomic summary results, NIH is not aware of any reported examples to date of individuals being matched to participation in a research study beyond the research analyses designed to demonstrate the hypothetical possibility of such an unintended use.15
NIH Discussions Related to Genomic Summary Results Access
In 2012, NIH held a workshop entitled “Establishing a Central Resource of Data from Genome Sequencing Projects”16 to consider a wide scope of issues related to aggregating genomic data. During the discussions, workshop participants noted the value of genomic summary results to scientific and clinical discovery, and recommended that they be publicly available when appropriate. Also in 2012, to enable access to genomic summary results from General Research Use17 studies through a single data access request, a compilation of genomic summary results from appropriate studies was made available through dbGaP.18,19 In addition to more efficient access to this type of information, the creation of the compilation ‘study’ also made it possible to reduce unnecessary access to individual-level genomic data since, at that time, the only way to obtain access to genomic summary results under the GDS Policy was in conjunction with the full data set.
In 2016, NHGRI convened a Workshop entitled “Sharing Aggregate Genomic Data” to explicitly re-consider the risks and benefits associated with access to and use of genomic summary results.20 Workshop participants highlighted the minimal risk associated with public access to genomic summary results and supported an open access model for most studies.21 Participants did note that alternate access models should be considered for sensitive studies where there may be heightened concerns.
To solicit broad input on the risks and benefits of different access models for genomic summary results, NIH included this topic in the February 2017 Request for Information (RFI) on “Processes for dbGaP Data Submission, Access, and Management.”22 Public comments received suggest support for broader access to genomic summary results, especially under scenarios that include additional risk mitigation strategies for genomic summary results from sensitive studies.23
Proposed Update to Genomic Summary Results Access
To maximize public benefit from genomic information generated through NIH-supported research in a manner consistent with current scientific and ethical considerations,24,25,26,27 NIH will promote broad sharing of genomic summary results from most research studies with data held in an NIH-designated data repository through a new “rapid access” tier. Rapid access will enable access to appropriate genomic summary results after interested users affirm agreement with a statement regarding responsible use of the information (see below).
This proposed update to the GDS Policy’s data management practices will support NIH’s goals to promote scientific advances and protect research participants’ privacy interests by reducing the need for users to request controlled access to individual-level genomic data, unless it is necessary to address specific research questions. In addition, the proposed data management change establishes an access model for genomic summary results that is proportional to the distinct risks associated with access to this type of information relative to the risks associated with access to individual-level genomic data.28
Genomic summary results to be made available will include those provided by a study’s investigator, if any, as well as summary statistics computed by the relevant NIH-designated data repository across all non-sensitive studies with data included in that repository (see below). Genomic summary results provided will include systematically computed statistics such as, but not limited to: genotype counts and frequencies; allele counts and frequencies; effect size estimates and standard errors; likelihoods; and p-values. These values may be defined and calculated using scientifically relevant subsets of research participants included within study populations (e.g., disease, trait-based, or control populations). Information on methods for computing any summary statistics provided by an NIH-designated data repository will be available through the repository’s website.
It is possible that privacy risks related to broad access to genomic summary results may be heightened for study populations from isolated geographic regions or with rare traits. It is also possible that certain study populations may be more vulnerable to group harm due to potential for stigma related to traits being studied or other participant protection concerns. In addition, for studies that include data on potentially stigmatizing traits, the outcomes of any privacy breach could conceivably cause greater harm to research participants than is likely under most circumstances. To address these types of sensitivities, institutions submitting datasets to NIH-designated data repositories may indicate in the data sharing plan and the Institutional Certification29 if genomic summary results from such studies should be provided only through controlled access.
For the purposes of this proposed data management update, examples of potentially stigmatizing traits are expected to include, but not be limited to: illicit drug or substance abuse; HIV/AIDS diagnosis; or sexual attitudes, preferences, or practices. Increased privacy risk or heightened risk of group harm is anticipated to stem from study populations that draw from, but are not limited to: rare disease communities; studies with small sample sizes; or isolated or identifiable study populations, such as indigenous populations or underrepresented ethnic groups.
To support awareness of the ethical responsibilities associated with responsible use of genomic information (including genomic summary results), NIH will develop informational resources to be made publicly available through relevant NIH-designated data repositories. Before genomic summary results are made accessible through the new ‘rapid access’ tier, users will affirm that they have reviewed the informational resources provided (see below).
Affirming Responsible Use of Genomic Summary Results
To promote responsible use of genomic summary results available through the rapid access mechanism, users will affirm their agreement to advance science or health through their use of the information. This affirmation will take the form of a ‘click-through agreement’ in which users confirm their intent to use the genomic summary results responsibly by indicating that they:
1. Have reviewed the informational resources available on NIH-designated data repositories describing appropriate uses of genomic data, including genomic summary results;
2. Will not attempt to re-identify or contact any individual or group within a study population, or generate information that could allow participants’ identities to be readily ascertained; and,
3. Will promote scientific research or health through any use of the genomic summary results.
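The three affirmations above could be represented in a repository's access layer roughly as follows. This is a hypothetical sketch only: the notice does not specify an implementation, and the class, field, and function names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickThroughAffirmation:
    """One user's responses to the three statements (hypothetical structure)."""
    reviewed_informational_resources: bool  # statement 1
    will_not_reidentify_or_contact: bool    # statement 2
    will_promote_research_or_health: bool   # statement 3


def affirmation_complete(a: ClickThroughAffirmation) -> bool:
    """Treat the agreement as complete only if all three statements are affirmed."""
    return (
        a.reviewed_informational_resources
        and a.will_not_reidentify_or_contact
        and a.will_promote_research_or_health
    )


complete = affirmation_complete(ClickThroughAffirmation(True, True, True))
partial = affirmation_complete(ClickThroughAffirmation(True, False, True))
```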
Consistent with the expectations under the NIH GDS Policy, NIH expects that consent forms and the informed consent process for human genomic studies will clearly articulate the access plans for data and information generated through the study, including genomic summary results.
NIH expects consent processes and other information available to potential research participants to be transparent that participation in an NIH-supported study implies an acknowledgement that investigators may aggregate and analyze the data generated through the study. NIH expects that consent processes and other information explain that such analyses or other summaries of study information (including genomic summary results) will be shared in the scientific literature or through other public scientific resources, such as data sharing resources that provide broad or unrestricted access to the information.
NIH expects the proposed data management update for access to genomic summary results to be effective upon final publication of the update.
After the effective date of this data management update, NIH-funded investigators performing research that falls under the scope of the GDS Policy will be expected to indicate, in their Genomic Data Sharing Plan, if a study should be designated as sensitive for the purposes of access to genomic summary results. This determination should be confirmed in the Institutional Certification provided to the NIH.
For datasets submitted to, or for which data are already accessible through, NIH-designated data repositories prior to the effective date of this data management update, submitting institutions will have six months to indicate if genomic summary results from any study submitted by one of their investigators should be maintained in controlled access due to concerns about sensitivity of study information. It will be possible to request additional time to complete the assessment for a particular dataset. In such cases, the genomic summary results for that dataset will remain in controlled access until a final determination is received by the funding Institute or Center.
If a submitting institution confirms the appropriateness of broader access to genomic summary results from a particular study prior to the end of the six-month period, the information can be made available through the rapid access tier immediately.
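The transition rules for previously submitted datasets can be summarized in a short decision sketch. The function, the enum, and the 183-day approximation of "six months" are assumptions made for illustration. The notice also does not state what happens when the review period ends without any determination; the sketch assumes the results would then move to rapid access, and that assumption is marked in the code.

```python
from datetime import date
from enum import Enum
from typing import Optional


class Tier(Enum):
    RAPID = "rapid access"
    CONTROLLED = "controlled access"


REVIEW_WINDOW_DAYS = 183  # assumption: "six months" approximated in days


def gsr_tier(today: date, effective_date: date,
             sensitive: Optional[bool],
             extension_granted: bool = False) -> Tier:
    """Pick the access tier for one dataset's genomic summary results.

    sensitive: True or False is the institution's determination;
               None means no determination has been received yet.
    """
    if sensitive is True:
        return Tier.CONTROLLED
    if sensitive is False:
        # Confirmed appropriate for broader access: available immediately.
        return Tier.RAPID
    window_open = (today - effective_date).days <= REVIEW_WINDOW_DAYS
    if window_open or extension_granted:
        # Results stay in controlled access while the assessment is pending.
        return Tier.CONTROLLED
    # ASSUMPTION: not stated in the notice; default after the window closes.
    return Tier.RAPID


effective = date(2018, 1, 1)  # placeholder date, not the actual effective date
pending = gsr_tier(date(2018, 2, 1), effective, sensitive=None)
```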
NIH is seeking public feedback regarding the proposed data management update to accessing genomic summary results from NIH-funded studies. NIH encourages comments from all stakeholders, and is especially interested in hearing from members of the general public, research participants, and/or the broader patient community. NIH is seeking overall, general comments on any aspect of the proposed access model. NIH is also seeking feedback on the following specific issues:
1. Risks and benefits of providing broad access to genomic summary results from most studies in NIH-designated data repositories utilizing the proposed rapid access mechanism and associated click-through agreement. Risks and benefits may relate to participant protection issues and/or scientific opportunity.
2. Risks and benefits of maintaining genomic summary results from studies designated by the submitting institution to include sensitive information in controlled access. Risks and benefits may relate to participant protection issues and/or scientific opportunity.
3. Appropriateness of the proposal for institutions submitting study data under the NIH GDS Policy to indicate which datasets should be designated as sensitive.
4. General comments on any other topic relevant to unrestricted, rapid, or controlled-access to genomic summary results from NIH-funded studies.
NIH intends to hold at least one public webinar on the proposed data management update to the GDS Policy, and may also utilize other opportunities to answer questions and receive feedback from stakeholders as they are identified.
Comments on the topic areas of interest should be submitted electronically to the following webpage: https://osp.od.nih.gov/gsr-rfi/ or mailed to: Office of Science Policy (OSP), National Institutes of Health, 6705 Rockledge Drive, Suite 750, Bethesda, MD 20892, or by fax to: 301-496-9839 by [October 20], 2017.
Comments received, including any personal information, will be posted without change after the close of the comment period to the NIH GDS website (https://osp.od.nih.gov/scientific-sharing/genomic-data-sharing/). Please do not include any proprietary, classified, confidential, or sensitive information in your response. Please note that the United States Government will not pay for the preparation of any information submitted or for its use of that information.
NIH looks forward to your input and hopes that you will share this RFI document with your colleagues. Updates to this document, if any, will be noted.
The Government reserves the right to use any non-proprietary technical information in summaries of the state of the science, and any resultant solicitation(s). The NIH may use information gathered by this RFI to inform development or modification of data sharing databases, websites, policies and practices, processes and procedures, and supporting documentation (e.g., guidance, FAQs).
1 NIH Genomic Data Sharing Policy. https://grants.nih.gov/grants/guide/notice-files/NOT-OD-14-124.html.
2 Aggregate data are defined in the NIH GDS Policy as summary statistics compiled from multiple sources of individual-level data.
3 Summary statistics have been defined as calculated summary statistics, including genotype counts, allele frequencies, effect size estimates and standard errors, and p-values calculated from a study sample. https://www.genome.gov/pages/policyethics/genomicdata/aggdatareport.pdf.
4 NIH Mission and Goals. https://www.nih.gov/about-nih/what-we-do/mission-goals.
5 Lek, Monkol, et al. "Analysis of protein-coding genetic variation in 60,706 humans." Nature 536, 285–291 (18 August 2016). doi:10.1038/nature19057.
6 An NIH-designated data repository is any data repository maintained or supported by NIH either directly or through collaboration.
7 Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS). https://grants.nih.gov/grants/guide/notice-files/NOT-OD-07-088.html.
8 NIH Launches dbGaP, a Database of Genome Wide Association Studies. https://www.nih.gov/news-events/news-releases/nih-launches-dbgap-database-genome-wide-association-studies.
9 Homer N, Szelinger S, Redman M, et al. Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays. Visscher PM, ed. PLoS Genetics. 2008;4(8):e1000167. doi:10.1371/journal.pgen.1000167.
10 Schadt, Eric E., Sangsoon Woo, and Ke Hao. "Bayesian method to predict individual SNP genotypes from gene expression data." Nature genetics 44.5 (2012): 603-608.
11 Im, Hae Kyung, et al. "On sharing quantitative trait GWAS results in an era of multiple-omics data and the limits of genomic privacy." The American Journal of Human Genetics 90.4 (2012): 591-598.
13 Zerhouni, E.A. and Nabel, E.G. Protecting aggregate genomic data. Science. 2008 October 3; 322(5898): 44. Published online 2008 September 4. doi: 10.1126/science.1165490.
14 The Genome Aggregation Database (gnomAD); Exome Variant Server; The Exome Aggregation Consortium (ExAC); Type 2 Diabetes Knowledge Portal (T2DKP); Michigan Imputation Server; AmbryShare; ClinVar; BRAVO (Browse All Variants Online, TOPMed’s WGS variant server).
15 Wendler DS, Rid A. Genetic Research on Biospecimens Poses Minimal Risk. Trends in Genetics: TIG. 2015;31(1):11-15. doi:10.1016/j.tig.2014.10.003.
16 Workshop on Establishing a Central Resource of Data from Genome Sequencing Projects. 2012. National Institutes of Health. https://www.genome.gov/27552142/workshop-on-establishing-a-central-resource-of-datafrom-genome-sequencing-projects/.
17 NIH describes data available for general research use (GRU) as data with no use limitations or restrictions beyond limitations outlined in the Data Use Certification Agreement. See: https://osp.od.nih.gov/wp-content/uploads/NIH_PTC_in_Developing_DUL_Statements.pdf And see: https://osp.od.nih.gov/wp-content/uploads/standard_data_use_limitations.pdf.
18 Notice of New Process for Requesting dbGaP Access to Aggregate Genomic Data for General Research Use Purposes. https://grants.nih.gov/grants/guide/notice-files/NOT-OD-12-136.html.
20 NHGRI Workshop on Sharing Aggregate Genomic Data. https://www.genome.gov/27566089/workshop-on-sharing-aggregate-genomic-data/workshop-on-sharing-aggregate-genomic-data/
21 Workshop Report on Sharing Aggregate Genomic Data. https://www.genome.gov/pages/policyethics/genomicdata/aggdatareport.pdf.
22 Request for Information on Processes for dbGaP Data Submission, Access, and Management. https://grants.nih.gov/grants/guide/notice-files/NOT-od-17-044.html.
24 Erlich Y, Narayanan A. Routes for breaching and protecting genetic privacy. Nature Reviews Genetics. 2014;15(6):409-421. doi:10.1038/nrg3723.
25 Wendler DS, Rid A. Genetic Research on Biospecimens Poses Minimal Risk. Trends in Genetics: TIG. 2015;31(1):11-15. doi:10.1016/j.tig.2014.10.003.
26 Sanderson, Saskia C., et al. "Public Attitudes toward Consent and Data Sharing in Biobank Research: A Large Multi-site Experimental Survey in the US." The American Journal of Human Genetics 100.3 (2017): 414-427.
27 Gutmann, A. W., et al. "Privacy and progress in whole genome sequencing." Presidential Commission for the Study of Bioethical Issues (2012).
28 Craig, D. W., Goor, R., Wang, Z., Paschall, J., Ostell, J., Feolo, M., & Manolio, T. A. ‘Privacy for Summary Level Data’; in Assessing and Managing Risk when Sharing Aggregate Genetic Variant Data. Nature Reviews Genetics. 2011;12(10):730-736. doi:10.1038/nrg3067.
Please direct all inquiries to:
NIH Office of Science Policy