Researchers develop new tool to help advance open science

2017-06-19
SAN FRANCISCO, June 18 (Xinhua) -- University of Washington (UW) and Microsoft Corp. researchers Maxim Grechkin, Bill Howe and Hoifung Poon have developed a new tool, known as Wide-Open, to help advance open science by automatically detecting datasets that are overdue for publication.

The Wide-Open system is available under an open source license on GitHub, a web-based hosting service for version-controlled software projects headquartered in San Francisco. It is designed to address a common problem: researchers may forget to tell a repository to release their data once a paper is published.

Advances in genetic sequencing and other technologies have led to an explosion of biological data, and researchers routinely deposit data in online repositories.

As open data is deemed a vital pillar of open science, enabling other researchers to reproduce results and use the same datasets to make new discoveries, many scientific journals now require published authors to make the data underlying their findings publicly available. However, these policies often go unenforced.

The challenge, according to a news release from UW, is substantial. The U.S. National Center for Biotechnology Information (NCBI) Gene Expression Omnibus repository (GEO) alone contains 80,985 public datasets, spanning hundreds of tissue types in thousands of organisms, and the rapid growth in data makes it difficult for journals or data repositories to "police" whether datasets that should be made publicly available actually are.

In a recent article published in PLOS Biology, Grechkin and his team reported testing their tool on two popular data repositories maintained by the NCBI, namely GEO and the Sequence Read Archive (SRA).

Wide-Open identified a large number of overdue datasets, prompting repository administrators to release 400 datasets in one week.

"We developed a simple yet effective system that has already helped make hundreds of datasets public," lead author Grechkin, a doctoral student in the UW's Allen School of Computer Science & Engineering, was quoted in the news release as explaining. "Having an impartial and automated system enforce open data policies can help level the playing field among scientists and generate new opportunities for discovery."