Efforts for Data Sharing and DB Integration by DBCLS/NBDC

Hideki Hatanaka




The Database Center for Life Science (DBCLS, since 2007) and the National Bioscience Database Center (NBDC, since 2011) have worked for a decade on data sharing and integration of life science DBs in Japan. As steps toward DB integration, they provide a catalog of 1,600 DBs, a cross-search covering 590 DBs, an archive of 120 DBs, and a repository that includes 40 datasets derived from human specimens. The two organizations have also developed many services and tools to improve the usability of DBs, and have held series of international and domestic meetings for technology exchange, such as the annual BioHackathon (since 2008) and the monthly SPARQLthon (since 2012).

DBs in the archive are released under Creative Commons licenses, and most of them allow commercial use. Among the most popular are Open TG-GATEs, a DB of gene expression and toxicity data obtained after exposure to 170 chemical compounds, and its relative, the Open TG-GATEs Pathological Image Database. They include 24,000 CEL files (66 GB) and 53,000 virtual slides (25 TB), respectively. The latter DB is equipped with an OpenSeadragon-based virtual slide viewer, in which users can magnify images smoothly without downloading the full image files. Toxicologists may also be interested in other DBs in the archive, such as KEGG MEDICUS and the KNApSAcK Family DBs.

As the next step toward DB integration, last year NBDC and DBCLS launched the NBDC RDF Portal, which provides RDF datasets produced by various institutions in accordance with the RDFizing database guideline. The Portal and the guideline owe a great deal to SPARQLthon. The Portal also maintains a SPARQL endpoint for the RDFized DBs, which include four billion triples from wwPDB. Applications that query through a GUI or in natural language are being tested for this endpoint and for the endpoint of the DB archive.
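
As an illustration of what querying such an endpoint involves, the following minimal Python sketch builds a request according to the standard SPARQL 1.1 Protocol and flattens a result in the standard SPARQL JSON results format. It is not the Portal's own client: the endpoint URL is a hypothetical placeholder, the function names are invented for this example, and the query is deliberately generic because no dataset-specific vocabulary is assumed here.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholder: substitute the endpoint URL published by the Portal.
ENDPOINT = "https://example.org/sparql"

# Generic query that fetches a few arbitrary triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def parse_bindings(payload: str) -> list[dict[str, str]]:
    """Flatten a SPARQL JSON results document into variable-to-value rows."""
    doc = json.loads(payload)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in doc["results"]["bindings"]
    ]


def run_query(endpoint: str, query: str) -> list[dict[str, str]]:
    """Send the query over the network and return the parsed rows."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

Calling `run_query(ENDPOINT, QUERY)` against a real endpoint would return one dictionary per result row, keyed by the variable names `s`, `p`, and `o`. Building the request and parsing the response are kept separate so that each part can be checked without network access.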