Comments on "Data Intensive Scientific Discovery"

An interesting book on the future of the scientific method and the role of computing, software, and information systems has been published free online: Data Intensive Scientific Discovery.
The ideas tie in with concerns that I and others have about reproducible research, data sharing, data analysis, and open publishing.

COMMENTS ON SELECTED ARTICLES

Carole Goble and David De Roure - The Impact of Workflow Tools on Data-centric Research

Goble and De Roure discuss the proliferation of databases that pool scientific data. They assert that "preparation, management, and analysis of data are bottlenecks and also beyond the skills of many scientists" (p. 138). They suggest that workflows are the answer to many of the challenges of data aggregation and analysis. Specifically, they define a workflow as "a precise description of a scientific procedure—a multi-step process to coordinate multiple tasks." In particular, each task in a workflow produces output that serves as input to subsequent tasks. They describe workflows as specific software systems. They note that designing workflows requires expertise, and that facilities for sharing workflows provide a means of transferring best practice from experts to novices.
Many of the ideas discussed by Goble and De Roure overlap with my hopes for R. R is a statistical computing system that makes it relatively easy to construct a workflow: output from one analysis becomes input to another. I hope that over the coming years more researchers in my field of psychology will complete all analyses in R and share their code in some systematic way. The aim would be to encourage sharing of both the data and the code used to process data, analyse data, and generate output. Shared code could then be used as a tool to train others in how to analyse a study of a particular type.
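As a minimal sketch of the workflow idea, where each task's output becomes the next task's input, the example below chains three small steps. It is written in Python rather than R purely for illustration. The function names, the hard-coded data, and the `run_workflow` helper are all hypothetical and are not taken from Goble and De Roure or from any particular workflow system.

```python
from statistics import mean


def load_scores():
    """Step 1: return raw records (hard-coded, hypothetical data)."""
    return [
        {"participant": 1, "score": 12},
        {"participant": 2, "score": None},  # a missing response
        {"participant": 3, "score": 18},
    ]


def clean(records):
    """Step 2: drop records with a missing score."""
    return [r for r in records if r["score"] is not None]


def summarise(records):
    """Step 3: reduce the cleaned records to summary statistics."""
    scores = [r["score"] for r in records]
    return {"n": len(scores), "mean": mean(scores)}


def run_workflow(source, steps):
    """Run `source`, then pass its output through each step in order."""
    data = source()
    for step in steps:
        data = step(data)
    return data


result = run_workflow(load_scores, [clean, summarise])
print(result)
```

Because the whole analysis is a list of named steps, sharing the script shares the procedure as well as the result, which is the kind of transfer from experts to novices described above.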

Herbert Van de Sompel and Carl Lagoze - All Aboard: Toward a Machine-Friendly Scholarly Communication System

Van de Sompel and Lagoze discuss ways that the principles of good scholarly publication of research articles (i.e., "registration, certification, awareness, archiving, and rewarding") are being extended to the publication of datasets.
They mention ways that citation information could be improved. In particular, they note work being done to disambiguate authors and efforts to determine the nature of the link between the citing and cited work. It would be particularly interesting to have such information when searching for research. Is the citation used to make a point? Does it support a method used? Is the work cited in an unclear way? Is it a self-citation? Is the cited work critiqued by the citing article?
They also discuss efforts to provide measures of usage. A great benefit of blog posting is that services like Google Analytics can provide quite detailed information about the number of people accessing posted material. When a journal article is published, by contrast, it is very difficult to know how many people have read it. If a researcher or other body is trying to assess the value provided by a publication, how can this be done without some idea of the number of people reading and being influenced by the article? Usage data is also a source of feedback for authors.

Jeromy Anglim's Blog: Psychology and Statistics

Tuesday, October 20, 2009
