I'm a big fan of using Snakemake to structure collaborative analyses with wetlab scientists and to make my code reproducible. While this works great overall, some aspects remain challenging. First of all, projects go through iterations: parameters and input and output files change over time. Even though Snakemake is 'smart enough' to re-run rules when something relevant has changed, I usually decouple working directories (which tend to have large data footprints) from results directories (smaller tables, figures, reports, ...), which makes it inevitable that the two go out of sync with each other. Deleting rules does not automatically remove their corresponding output files either. Secondly, and this is more of a pet peeve of mine, I'd prefer to run as little code as possible to generate all results of a project (read: a one-liner). While Snakemake already helps a lot here, I found I was still spending too much time dealing with the first issue.
I spent some time working on a solution to both issues and came up with pudding, which is essentially just a cookiecutter template that wraps some Python code around a Snakemake workflow.
Be aware that it is quite opinionated about how a project should be structured (I'm referring here to the bioproject within pudding). For me this has worked quite well, and I use it for every new project I start.
The template is used to create a project, and three parameters / variables need to be set in the generated run.py file:

- WORKDIR: The directory where the workflow will run. Will contain raw, intermediate and final results.
- RESULTSDIR: The directory where the final results will be copied to after a successful run.
- SMK_PROFILE: A Snakemake profile. Technically this can be omitted, but I have profiles set up on every system I use to run analyses, and I prefer to have my analyses submitted through these profiles.

After initiation, one (or more) Snakemake workflows can be set up in the workflows / rules directory, and defined within run.py to be run.
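To make this concrete, the top of a generated run.py might look roughly like the sketch below. Only the three variable names come from the template; the paths, the profile name and the WORKFLOWS variable are hypothetical placeholders.

# Hypothetical excerpt from a generated run.py; the exact layout may differ.
WORKDIR = "/scratch/projects/my_analysis"      # large raw/intermediate data
RESULTSDIR = "/shared/projects/my_analysis"    # small tables, figures, reports
SMK_PROFILE = "slurm"                          # optional Snakemake profile

# Hypothetical: the workflows under workflows/rules that should be executed.
WORKFLOWS = ["main.smk"]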
Also within run.py, one can define 'result files', that should be kept in sync with the RESULTSDIR. This is done in the form of a list of tuples:
RFS = [
('results/*',),
('some/path/in/WORKDIR/important_figure.png', 'figures/important_figure.png'),
]
The first element of the tuple is the path to a file (or a glob pattern) within the WORKDIR. The second element is optional and defines the path within the RESULTSDIR where the file(s) should be copied to.
If omitted, the folder structure for that specific file/pattern is preserved within the RESULTSDIR.
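To illustrate how these two tuple forms could be resolved, here is a minimal sketch, not pudding's actual code; the WORKDIR and RESULTSDIR paths are hypothetical:

from pathlib import Path

WORKDIR = Path("/scratch/my_project")    # hypothetical locations
RESULTSDIR = Path("/shared/my_project")

def resolve(rfs):
    """Map each matching file in WORKDIR to its destination in RESULTSDIR."""
    mapping = {}
    for entry in rfs:
        for src in WORKDIR.glob(entry[0]):
            if len(entry) == 2:
                # Explicit destination given as the second tuple element
                # (intended for single files rather than glob patterns).
                dst = RESULTSDIR / entry[1]
            else:
                # No destination: preserve the WORKDIR-relative path.
                dst = RESULTSDIR / src.relative_to(WORKDIR)
            mapping[src] = dst
    return mapping

With the RFS list above, everything matching results/* keeps its relative path under the RESULTSDIR, while important_figure.png ends up at figures/important_figure.png.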
Upon re-execution of run.py, all files defined as result files are validated (by their md5sum) and replaced if they are out-of-date or missing. If they no longer exist in the WORKDIR, they are deleted from the RESULTSDIR.
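Building on the resolve() sketch above, the validation step could look roughly like this; again an assumption about the mechanics, not pudding's implementation:

import hashlib
import shutil

def md5(path):
    """md5 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def sync(rfs):
    mapping = resolve(rfs)
    # Copy result files that are missing or out-of-date in RESULTSDIR.
    for src, dst in mapping.items():
        if not dst.exists() or md5(src) != md5(dst):
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
    # Remove files from RESULTSDIR whose source disappeared from WORKDIR
    # (sparing the generated README.md).
    wanted = set(mapping.values())
    for stale in RESULTSDIR.rglob("*"):
        if stale.is_file() and stale.name != "README.md" and stale not in wanted:
            stale.unlink()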
Additionally, a README.md is created inside the RESULTSDIR for every run, which records (among other things) when the results were generated and the exact repository state used to produce them.
This immediately solves the issues I mentioned above, but has the additional benefit that I typically make the RESULTSDIR directly accessible to my collaborators. This way they can always find the latest results, see when these were updated (e.g. after a meeting where we discussed changes to be made) and, more importantly, trace back the exact repository state that was used to generate them. Not only has this saved me a lot of time, it also makes it very easy to 'wrap up' a project once it's done and make the code and final results available for publication.