A major pain point in any bioinformatics analysis is maintaining all the different software and, above all, their dependencies. Virtual environments, and especially Conda, have made this substantially easier, but more often than not multiple environments are needed even within a single project. Workflow managers (I personally prefer Snakemake, but I reckon the story for Nextflow is similar) made dealing with multiple environments a bit easier by automating the installation, activation and updating of environments with simple directives.
A new project gets its own git(hub) repository, with a 'top-level' conda environment. The top-level environment typically looks something like this:
channels:
- conda-forge
- bioconda
dependencies:
- python = 3.12
- snakemake >= 8
- seaborn >= 0.13
- pandas >= 2
- deeptools >= 3.5.5
- graphviz = 8.1.0
- rich-click >= 1.8
- samtools = 1.20
- bedtools >= 2.31
- pip
- pip:
- snakemake-executor-plugin-cluster-generic >= 1.0.9
and contains libraries and tools that I use in almost all of my analyses. Environments that are specific to rules in the workflow get their own yaml files, placed relative to the snakemake files. This works quite well, but has one major drawback: while snakemake automates updating the rule-specific environments upon changes in their yaml files, it does not do so for the top-level environment. Additionally, I need to have the top-level environment installed on every system I want to run the analysis on. Pixi alleviates both issues by automatically keeping the top-level environment up-to-date and by making 'remembering' to create and activate the top-level environment obsolete.
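As an illustration of the rule-specific environments mentioned above, a Snakemake rule can point at its own yaml file via the conda directive, and snakemake will create and update that environment automatically (the rule, file names and paths here are made up for the example):

rule sort_bam:
    input:
        "results/{sample}.bam"
    output:
        "results/{sample}.sorted.bam"
    conda:
        # path is resolved relative to the snakemake file containing the rule
        "envs/samtools.yaml"
    shell:
        "samtools sort -o {output} {input}"

Whenever envs/samtools.yaml changes, snakemake rebuilds the environment before running the rule; the top-level environment enjoys no such treatment.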
The yaml file above translates to the following pixi.toml file:
[project]
name = "my-analysis"
channels = ["conda-forge", "bioconda"]
# pixi requires the target platforms to be listed explicitly:
platforms = ["linux-64"]
[dependencies]
python = "3.12"
snakemake = ">=8"
seaborn = ">=0.13"
pandas = ">=2"
deeptools = ">=3.5.5"
graphviz = "8.1.0"
rich-click = ">=1.8"
samtools = "1.20"
bedtools = ">=2.31"
[pypi-dependencies]
snakemake-executor-plugin-cluster-generic = ">=1.0.9"
While this is virtually identical to the conda yaml file, it reduces running the analysis from:
# Only on first time usage:
conda env create -f environment.yaml -n my-analysis
conda activate my-analysis
snakemake --software-deployment-method conda -s path/to/Snakefile --configfile path/to/config.yaml --profile my-profile
to
pixi run snakemake --software-deployment-method conda -s path/to/Snakefile --configfile path/to/config.yaml --profile my-profile
This recreates and updates the environment as needed, without having to do so explicitly.
Additionally, it is fast, and can be configured to install environments in the same location as the project, keeping things nicely self-contained. Finally, the global environment (≈ conda's base environment) can consist of multiple environments, making 'system-wide' installations easier and cleaner. I tried to keep my conda base environment as minimal as possible, but I'd be lying if I said I didn't nuke it from time to time after inevitably getting stuck in dependency hell from too many packages in my base. Things like linters (Ruff), git-related tools (pre-commit, git-lfs, gh), and other utilities (jq, bat, fzf, ...) can all be globally available while living in their own isolated environments. I made the switch from conda recently and so far have had no reason to look back.
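For example (assuming pixi is installed), each tool installed this way gets its own isolated environment while its binaries are exposed on the PATH:

pixi global install ruff
pixi global install pre-commit git-lfs gh
pixi global install jq bat fzf

Because the tools no longer share a single base environment, their dependencies can never conflict with each other, which is exactly what made a crowded conda base so fragile.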
An additional bonus is that pixi configurations can be embedded directly in pyproject.toml files, which similarly removes the need for a separate virtual environment or conda environment per Python project. The ability to define specific tasks that run inside a pixi environment only adds to the convenience of using pixi on a day-to-day basis. I'd even consider getting rid of conda entirely if snakemake supported pixi directly as a replacement for conda, but unfortunately we're not quite there yet.
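For a Python project, the same setup could live in pyproject.toml under [tool.pixi.*] tables. A minimal sketch (package name, dependency pins and the task are made up for the example):

[project]
name = "my-analysis"
requires-python = ">=3.12"

[tool.pixi.project]
channels = ["conda-forge", "bioconda"]
platforms = ["linux-64"]

[tool.pixi.dependencies]
samtools = "1.20"
bedtools = ">=2.31"

[tool.pixi.pypi-dependencies]
# install the project itself into the environment
my-analysis = { path = ".", editable = true }

[tool.pixi.tasks]
qc = "snakemake -s workflow/Snakefile --configfile config/config.yaml"

With this in place, pixi run qc executes the task inside the project environment, creating or updating it first if needed.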