S5: The NIH Data Commons
The NIH Data Commons: A novel infrastructure to support data sharing
Senior Program Manager of Bioinformatics
The opportunities for broader scientific data sharing have been widely discussed, yet many challenges have prevented broad progress. Issues in data registration, transparency, reusability, and trust are cited as lingering barriers (van Panhuis et al. 2014). The NIH Data Commons Pilot Projects is a consortium working towards a shared community space that will enable Findable, Accessible, Interoperable, and Reusable (FAIR) digital objects, such as data and analytical tools, in a controlled-access, multi-cloud environment. Twelve teams are working together, and with the NIH and data stewards, to implement this vision as pilot projects over a four-year phase and to address previously intractable obstacles to a research commons. The Helium Team is working to build the multi-cloud infrastructure; to lead the development of principles and policies for data governance, security, and ethics; and to enhance existing reporting standards across the data pipeline in support of analyses of the Model Organism Databases and GTEx data that assess the impact of sexual dimorphism in human disease. The core of the data pipeline will register data, provide and resolve unique identifiers, support metadata ingestion and search as well as FAIR assessment, and provide API access. We strive to ease the burden on users by leveraging the existing Jupyter notebook environment with access to Galaxy and CWL pipelines, versioned and provided through container orchestration. Containerizing the computational environment with Docker makes the environment easy to recreate and builds in transparency. Critically, identification, provenance, and metadata will percolate automatically through the system. Helium is piloting a data-sharing functionality that will allow users to contribute their own data to the Helium commons and thereby take advantage of this FAIR-compliant architecture.
Helium’s work, together with the Data Commons, represents an advance in overcoming the traditional obstacles that have hindered data sharing.
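
The core pipeline functions named above (registering data, providing and resolving unique identifiers, and searching metadata) can be illustrated with a minimal in-memory sketch. This is not the Helium implementation: the `Registry` and `DigitalObject` names, the `example` identifier prefix, the content-hash identifier scheme, and the storage location shown are all assumptions made for illustration only.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """A registered data set or tool (hypothetical record layout)."""
    identifier: str
    location: str
    metadata: dict = field(default_factory=dict)


class Registry:
    """Hypothetical in-memory registry; a real commons would persist records."""

    def __init__(self, prefix="example"):
        self.prefix = prefix
        self._objects = {}

    def register(self, content: bytes, location: str, metadata: dict) -> str:
        # Assumption: the identifier is derived from a hash of the content,
        # so registering the same bytes twice yields the same identifier.
        digest = hashlib.sha256(content).hexdigest()[:16]
        identifier = f"{self.prefix}:{digest}"
        self._objects[identifier] = DigitalObject(identifier, location, dict(metadata))
        return identifier

    def resolve(self, identifier: str) -> DigitalObject:
        # Raises KeyError if the identifier was never registered.
        return self._objects[identifier]

    def search(self, **criteria) -> list:
        # Return every object whose metadata matches all given key/value pairs.
        return [
            obj for obj in self._objects.values()
            if all(obj.metadata.get(key) == value for key, value in criteria.items())
        ]


if __name__ == "__main__":
    registry = Registry()
    pid = registry.register(
        b"gene\tTPM\n",
        "s3://example-bucket/expression.tsv",  # hypothetical location
        {"source": "GTEx", "tissue": "liver"},
    )
    print(pid, registry.resolve(pid).location)
    print([obj.identifier for obj in registry.search(source="GTEx")])
```

Deriving the identifier from the content keeps the sketch deterministic; a production system would more plausibly delegate identifier minting and resolution to an external service.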