Truth in the World: Freeing the Dark Data of Failed Scientific Experiments

By Thomas Goetz

09.25.07

It's Time to Free the Dark Data of Failed Scientific Experiments

Photo: Mauricio Alejo

In 1981, the New England Journal of Medicine published a Harvard study that showed an unexpected link between drinking coffee and pancreatic cancer. As it happened, researchers were anticipating a connection between alcohol or tobacco and cancer. But according to the survey of several hundred patients, booze and cigarettes didn't seem to increase your risk. Then came a surprise: An incidental survey question suggested that coffee did increase the chances of pancreatic cancer. So that's what got published.

Those positive results, alas, were entirely anomalous; 20 years of follow-up research showed the coffee-cancer connection to be bunk. Nonetheless, it's a textbook example of so-called publication bias, where science gets skewed because only positive correlations see the light of day. After all, the surprising findings are what make the news (and careers).
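To see how that skew arises, here is a minimal simulation sketch. It is not a model of the 1981 study: every number in it (2,000 studies, 50 subjects per group, a 1.96 cutoff) and every function name is an assumption chosen for illustration. The simulated exposure has no real effect at all, and only studies that happen to show a significant positive link get "published."

```python
import random
import statistics


def simulate_studies(n_studies=2000, n_per_group=50, true_effect=0.0, seed=0):
    """Simulate two-group studies; return a list of (effect_estimate, z_score)."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        exposed = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        diff = statistics.fmean(exposed) - statistics.fmean(control)
        # Standard error of the difference between two group means.
        se = (statistics.variance(control) / n_per_group
              + statistics.variance(exposed) / n_per_group) ** 0.5
        results.append((diff, diff / se))
    return results


def split_by_publication(results, z_threshold=1.96):
    """Hypothetical rule: only significant positive findings get published."""
    published = [d for d, z in results if z > z_threshold]
    drawer = [d for d, z in results if z <= z_threshold]
    return published, drawer


results = simulate_studies()
published, drawer = split_by_publication(results)
all_mean = statistics.fmean([d for d, _ in results])
published_mean = statistics.fmean(published)

print(f"studies run: {len(results)}, published: {len(published)}, in the drawer: {len(drawer)}")
print(f"mean effect across all studies: {all_mean:.3f}")
print(f"mean effect in published studies: {published_mean:.3f}")
```

Across all the simulated studies the average effect sits near zero, which is the truth in this toy world. The published subset alone suggests a sizable effect, because the results that would correct it are sitting in the drawer.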

So what happens to all the research that doesn't yield a dramatic outcome — or, worse, the opposite of what researchers had hoped? It ends up stuffed in some lab drawer. The result is a vast body of squandered knowledge that represents a waste of resources and a drag on scientific progress. This information — call it dark data — must be set free.

For the past couple of years, there's been much talk about open access, the idea that more scientific publications should be freely available, not locked behind firewalls and subscriptions. Thanks to the Public Library of Science (PLoS) and other organizations, that notion is making headway. Liberating dark data takes this ethos one step further. It also makes many scientists deeply uncomfortable, because it calls for them to reveal their "failures." But in this data-intensive age, those apparent dead ends could be more important than the breakthroughs. After all, some of today's most compelling research efforts aren't one-off studies that eke out statistically significant results; they're meta-studies (studies of studies) that crunch data from dozens of sources, producing results that are much more likely to be true. What's more, your dead end may be another scientist's missing link, the elusive chunk of data they needed. Freeing up dark data could represent one of the biggest boons to research in decades, fueling advances in genetics, neuroscience, and biotech.
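Why are pooled results more likely to be true? One common approach, sketched below, is fixed-effect inverse-variance pooling: each study's estimate is weighted by how precise it is, and the combined estimate ends up more precise than any single study. This is only one of several pooling methods, the five studies are made-up numbers, and the function name is an assumption for illustration.

```python
import math


def pool_fixed_effect(studies):
    """Pool (estimate, standard_error) pairs with inverse-variance weights.

    Returns the pooled estimate and its standard error.
    """
    weights = [1.0 / se ** 2 for _, se in studies]
    total_weight = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, studies)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical effect estimates and standard errors from five small studies,
# including the null and negative ones that often go unpublished.
studies = [(0.30, 0.20), (-0.05, 0.15), (0.10, 0.25), (-0.10, 0.18), (0.02, 0.12)]

pooled, pooled_se = pool_fixed_effect(studies)
print(f"pooled estimate: {pooled:.3f}, standard error: {pooled_se:.3f}")
```

The pooled standard error is smaller than that of even the most precise individual study. The catch is that the pooled estimate is only as honest as its inputs: leave out the studies that found nothing, and the combined answer drifts toward the lone striking result.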

So why doesn't it happen? In part, it's a logistics problem: Advocating the release of dark data is one thing, but it's quite another to actually collect it, juggling different formats and standards. And, of course, there's the issue of storage. These days, an astronomical study of quasars or an ambitious bioinformatics project can generate several terabytes of data. Few have the capacity to store that, let alone analyze it.

Google, among others, is lending a hand with its Palimpsest project, offering to store and share monster-size data sets (making the data searchable isn't a part of the effort). As storage costs drop, similar data banks will emerge, along with format standards, and it should become ever easier to share results, good or bad.

Technology is actually the simple part. The tougher problem lies in the culture of science. More and more, research is funded by commercial entities, which deem any results proprietary. And even among fair-minded academics, the pressures of time, tender, and tenure can make openness an afterthought. If their research is successful, many academics guard their data like Gollum, wringing all the publication opportunities they can out of it over years. If the research doesn't pan out, there's a strong incentive to move on, ASAP, and a disincentive to linger in eddies that may not advance one's job prospects.

There are some islands of innovation. Since 2002, the Journal of Negative Results in Biomedicine has offered a peer-reviewed home to results that are negative or go against the grain. Earlier this year, the journal Nature started Nature Precedings, a Web-based forum for prepublication research and unpublished manuscripts in biomedicine, chemistry, and the earth sciences. At Drexel University, chemist Jean-Claude Bradley practices "open notebook" science, chronicling his lab's work and sharing data via blog and wiki. And PLoS is planning an open repository for research and data that is otherwise abandoned.

These are great first steps. But freeing dark data should be the norm, not the exception. Once the storage and format problems are solved, scientists will need easy ways to search and retrieve each other's data. (And there should be a simple way, perhaps via metadata, to make sure that work is always duly cited and acknowledged by others.) Congress should mandate that all federally funded research be disseminated, whatever the results.

Getting science comfortable with exposing its dark data is really just the beginning. Once you start looking for it, dark data is everywhere: It's locked away in out-of-print books and orphaned art, the stuff that Creative Commons and Google Book Search have been bringing to light. Speaking of which: Hey, Google! Know all those research projects your employees do that the company will never green-light? How about letting the rest of the world take a crack at them?

Deputy editor Thomas Goetz (thomas@wired.com) wrote about DNA diagnostics in issue 15.08.

