December 14, 2010

My first Reproducible Research Compendium

I have just completed my first Reproducible Research Compendium
“Analysis of the combined survey datasets from the American Red Cross
Tsunami Recovery Program Psycho-Social Project (adult community
It is basically all the reports and data from all the work I did on
evaluating psychosocial projects for the American Red Cross, bundled
But one of the subfolders also contains the scripts and everything
necessary to generate the final pdf report from the original datasets
from scratch, in the spirit of transparency and reproducible research.

So there is no copying-and-pasting of graphics from one program into
another. It is easy to make small but significant changes to the
analysis - for instance, to exclude one of the constituent surveys by
changing a line near the start of the script - and rerun the whole
thing and produce a new corresponding version of the report. No more
hunting about to find how you produced some particular graphic or

“An article about computational science in a scientific publication is
not the scholarship itself, it is merely advertising of the
scholarship. The actual scholarship is the complete software
development environment and the complete set of instructions which
generated the figures.”
—D. Donoho

This approach has the following advantages:

• making it easier for me to return to the data and analyses in the
future and repeat or extend them

• making it easier for ARC to do the same without having to contact me

• enabling other researchers to repeat and verify these findings
themselves, even automatically if they desire

• ensuring complete transparency of the results

Concretely, this means that the original SPSS files as delivered by
the agencies are not changed at all. All recoding, data cleaning,
omission of cases etc. is carried out in syntax. In fact the report
document itself (tables, graphics and statistics mentioned within the
text) is produced entirely by the following procedure:

A word processing document (“source file”) is prepared which is
essentially the final report, complete with introduction, chapter
headings, commentary etc., together with blocks of syntax wherever
statistical results are required, in particular tables, graphics
and inline results.

A single syntax file is run which takes the source file and creates a
second document, the present report, which is identical to the source
file except that the blocks of syntax are replaced by the results of
the syntax (tables, graphics, etc.). So there is no cutting-and-pasting
or editing of data in the data files, nor is there, for example, any
manual editing of table data or graphics.
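
To make the idea concrete, here is a toy sketch of that "replace the blocks of syntax by their results" step. It is written in Python purely as an illustration and is not the code used in the compendium, which relies on R and Sweave. The `<<...>>` chunk markers, the `weave` function and the example figures are all assumptions invented for this sketch.

```python
import re

# Hypothetical chunk markers, chosen for this sketch only; Sweave itself
# uses a different chunk syntax and evaluates R, not Python.
CHUNK = re.compile(r"<<(.*?)>>", re.DOTALL)


def weave(source, env):
    """Return a copy of `source` with each <<expression>> replaced by its value."""
    def run_chunk(match):
        # Evaluate the embedded expression against the shared environment,
        # so every number in the report is computed rather than typed by hand.
        return str(eval(match.group(1).strip(), {}, env))
    return CHUNK.sub(run_chunk, source)


# Made-up example data: number of respondents per (hypothetical) survey.
env = {"respondents": {"survey_a": 120, "survey_b": 95}}

source_file = (
    "The combined dataset has <<sum(respondents.values())>> respondents "
    "from <<len(respondents)>> surveys."
)

report = weave(source_file, env)
print(report)
```

Changing the data and rerunning `weave` gives a new report with the new figures in place, which is the whole point: the text and the results cannot drift apart.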

So at each point in this report at which data preparation is
discussed, the interested reader will find, at the matching point in
the source file, the syntax which actually carries out that data
preparation. Likewise, at each point in this report at which tables,
graphics etc. are displayed, the interested reader will find, at the
matching point in the source file, the syntax which actually
constructs those tables and graphics.
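
As a rough picture of what such data-preparation syntax does, the sketch below keeps the raw records untouched and does all recoding and omission of cases in code, with the list of excluded surveys as a single setting near the top. It is a Python illustration under stated assumptions, not the R syntax from the compendium: the survey names, the fields, the `-1` missing-value code and the `prepare` function are all hypothetical.

```python
# The one line to edit in order to drop a constituent survey and rerun
# everything. The survey names are hypothetical.
EXCLUDED_SURVEYS = {"survey_b"}

# Stand-in for the raw data as delivered; in the real compendium these
# would be the original SPSS files, which are only ever read.
RAW_RECORDS = [
    {"survey": "survey_a", "age": 34, "score": 3},
    {"survey": "survey_b", "age": 51, "score": 5},
    {"survey": "survey_a", "age": -1, "score": 4},  # -1: made-up missing-value code
]


def prepare(raw, excluded):
    """Return a cleaned copy of `raw`; the input itself is never modified."""
    cleaned = []
    for record in raw:
        if record["survey"] in excluded:
            continue  # omission of cases happens here, in code
        row = dict(record)  # copy, so the raw record stays untouched
        if row["age"] < 0:
            row["age"] = None  # recode the missing-value code
        cleaned.append(row)
    return cleaned


analysis_data = prepare(RAW_RECORDS, EXCLUDED_SURVEYS)
print(len(analysis_data), "records kept of", len(RAW_RECORDS))
```

Because every change to the data is a line of code, a reader can see exactly which cases were dropped and why, and can undo any decision by editing that line and rerunning.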

So the source document and datasets are available to anyone interested,
who can then repeat these calculations, see exactly how they were
arrived at, and extend the analyses at will.

Unfortunately, to the best of my knowledge the statistics program most
familiar to social scientists, SPSS, does not fulfill all of these
requirements; in particular, it cannot produce a complete report
automatically. So the work is carried out using the package Sweave for
the open-source statistics program R. But intermediate datasets in
SPSS format, including all recoded and calculated variables, are also
provided, so that as much as possible of the above can also be
accomplished with SPSS.

In detail, the original word processing file is written using the free
program LyX, which is available for Windows, Mac and Linux, and is then
transformed into the final pdf report using the R statistics engine. If
you open the source file in LyX you can see all the R commands which
are embedded in the text and which produce the tables etc. in the pdf
file.



This blog by Steve Powell is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, syndicated on r-bloggers and powered by Blot.