Preprocessing

This page covers the full set of options available in PreprocessingConfig and how to configure them.

class slide2vec.PreprocessingConfig(*, backend='auto', requested_spacing_um=None, requested_tile_size_px=None, requested_region_size_px=None, region_tile_multiple=None, tolerance=0.05, overlap=0.0, read_coordinates_from=None, read_tiles_from=None, on_the_fly=True, gpu_decode=False, adaptive_batching=False, use_supertiles=True, jpeg_backend='turbojpeg', num_cucim_workers=4, resume=False, segmentation=<factory>, filtering=<factory>, preview=<factory>, masks=<factory>, independent_sampling=True)

Bases: object

Configuration for slide tiling and preprocessing.

backend: str = 'auto'

Slide reading backend. "auto" tries cucim → openslide → vips in order. Explicit choices: "cucim", "openslide", "vips", "asap".

requested_spacing_um: float | None = None

Target spacing in µm/px. Resolved from the model preset when None.

requested_tile_size_px: int | None = None

Tile side length in pixels at requested_spacing_um. Resolved from the model preset when None.

requested_region_size_px: int | None = None

Parent region side length in pixels (hierarchical mode). Auto-derived as requested_tile_size_px × region_tile_multiple when None.

region_tile_multiple: int | None = None

Region grid width/height in tiles (e.g. 6 → 6×6 = 36 tiles per region). Enables hierarchical extraction when set; must be ≥ 2.
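
The relationship between the tile size, the region multiple, and the derived region size can be sketched as follows. This is a minimal illustration of the rule stated above, not slide2vec's own code; the helper name resolve_region_size_px is hypothetical.

```python
def resolve_region_size_px(tile_size_px, region_tile_multiple, region_size_px=None):
    """Hypothetical helper mirroring the documented derivation rule."""
    if region_tile_multiple is None:
        # Hierarchical extraction is not enabled; nothing to derive.
        return region_size_px
    if region_tile_multiple < 2:
        raise ValueError("region_tile_multiple must be >= 2")
    if region_size_px is None:
        # Auto-derive: region side = tile side x tiles per region side.
        region_size_px = tile_size_px * region_tile_multiple
    return region_size_px


region = resolve_region_size_px(224, 6)
tiles_per_region = 6 * 6
print(region, tiles_per_region)  # 1344 36
```

With 224 px tiles and a multiple of 6, each region is 1344 px on a side and holds 36 tiles.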

tolerance: float = 0.05

Relative spacing tolerance used when selecting a pyramid level (0.05 = 5%).
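
One way to read a relative spacing tolerance is sketched below. It assumes the closest pyramid level is accepted only when its spacing is within the given fraction of the requested spacing; the function pick_level and the fallback of returning None are illustrative assumptions, and the actual selection logic may differ.

```python
def pick_level(level_spacings_um, target_um, tolerance=0.05):
    """Return the index of the closest level, or None if it is out of tolerance.

    Hypothetical sketch: the real level-selection code is not shown on this page.
    """
    best = min(
        range(len(level_spacings_um)),
        key=lambda i: abs(level_spacings_um[i] - target_um),
    )
    relative_error = abs(level_spacings_um[best] - target_um) / target_um
    return best if relative_error <= tolerance else None


print(pick_level([0.2525, 0.505, 1.01], 0.5))  # 1 (1% off, within 5%)
print(pick_level([0.25, 1.0], 0.5))            # None (closest level is 50% off)
```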

overlap: float = 0.0

Fractional tile overlap (0.0 = no overlap).
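
As a rough illustration of what a fractional overlap means for the tiling grid, the sketch below assumes overlap is the fraction of a tile side shared with its neighbour, so the stride is tile_size × (1 − overlap). The helper tile_origins, the rounding, and the edge handling are assumptions made for the example.

```python
def tile_origins(extent_px, tile_size_px, overlap=0.0):
    """List tile start offsets along one axis (hypothetical helper)."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # Neighbouring tiles share `overlap` of their side length.
    stride = max(1, round(tile_size_px * (1.0 - overlap)))
    # Only keep tiles that fit entirely inside the extent.
    return list(range(0, extent_px - tile_size_px + 1, stride))


print(tile_origins(1000, 224))               # [0, 224, 448, 672]
print(len(tile_origins(1000, 224, 0.5)))     # 7
```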

read_coordinates_from: Path | None = None

Directory containing pre-extracted tile coordinates to reuse, skipping tiling.

read_tiles_from: Path | None = None

Directory containing pre-extracted tile images to skip the tiling step entirely.

on_the_fly: bool = True

Read and decode tiles on demand rather than pre-loading into memory.

gpu_decode: bool = False

Decode tiles on the GPU via cuCIM / nvImageCodec when True.

adaptive_batching: bool = False

Dynamically adjust batch size based on tile count.

use_supertiles: bool = True

Group adjacent tiles into supertile batches for faster I/O.

jpeg_backend: str = 'turbojpeg'

JPEG decode library — "turbojpeg" (default) or "pillow".

num_cucim_workers: int = 4

Number of cuCIM reader threads.

resume: bool = False

Skip slides already present in the output directory when True.

segmentation: dict[str, Any]

Forwarded to hs2p segmentation config. Supported keys: method, downsample, sam2_device. See Preprocessing for details.

filtering: dict[str, Any]

Forwarded to hs2p tile-filtering config.

preview: dict[str, Any]

Controls whether hs2p writes mask and tiling preview images. Keys: save_mask_preview, save_tiling_preview, downsample.

masks: dict[str, Any]

Annotation-mask vocabulary forwarded to hs2p’s sampling resolver. Keys: output_mode, pixel_mapping, colors, min_coverage. A partial mapping is deep-merged over DEFAULT_MASKS, so callers only state what they override (e.g. {"min_coverage": {"tissue": 0.1}}). The default {background, tissue} block is plain tissue tiling; min_coverage.tissue is the single source of truth for the tissue threshold.
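
The deep-merge behaviour can be pictured with the sketch below. The DEFAULT_MASKS dictionary here is a stand-in: only the binary {background: 0, tissue: 1} vocabulary comes from this page, and every other default value is a placeholder. The deep_merge helper is likewise hypothetical and only illustrates the idea of a recursive merge.

```python
import copy

# Stand-in defaults. Only the pixel_mapping vocabulary is documented;
# the remaining values are placeholders for illustration.
DEFAULT_MASKS = {
    "output_mode": "per_annotation",
    "pixel_mapping": {"background": 0, "tissue": 1},
    "colors": {"background": None, "tissue": None},
    "min_coverage": {"background": None, "tissue": 0.5},
}


def deep_merge(base, override):
    """Recursively merge `override` into a copy of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # merge nested dicts
        else:
            merged[key] = copy.deepcopy(value)            # replace leaf values
    return merged


masks = deep_merge(DEFAULT_MASKS, {"min_coverage": {"tissue": 0.1}})
print(masks["min_coverage"])   # {'background': None, 'tissue': 0.1}
print(masks["pixel_mapping"])  # {'background': 0, 'tissue': 1}
```

Only the overridden leaf changes; sibling keys and untouched blocks keep their default values.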

independent_sampling: bool = True

When annotation sampling is active, tile each class independently (True) vs jointly across classes (False).

Backends

The backend field controls which slide-reading library is used:

  • "auto" — tries cucim → openslide → vips in order and picks the first available one

  • "cucim" — NVIDIA cuCIM (fastest for SVS/TIFF on GPU-equipped machines)

  • "openslide" — broad format support, CPU-only

  • "vips" — libvips, good for large TIFF files

  • "asap" — ASAP reader (requires separate installation)
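
The "auto" fallback order can be sketched as a first-available search. The import names used below (cucim, openslide, pyvips) are the usual Python module names for these libraries, but how slide2vec actually detects them is an assumption; resolve_backend is a hypothetical helper.

```python
import importlib.util

# (backend name, Python module assumed to provide it), in fallback order.
AUTO_ORDER = (("cucim", "cucim"), ("openslide", "openslide"), ("vips", "pyvips"))


def resolve_backend(backend="auto", is_available=None):
    """Return the backend name that would be used (illustrative only)."""
    if is_available is None:
        def is_available(module):
            # find_spec returns None when a top-level module is not installed.
            return importlib.util.find_spec(module) is not None
    if backend != "auto":
        return backend  # explicit choices are taken as-is
    for name, module in AUTO_ORDER:
        if is_available(module):
            return name
    raise RuntimeError("no slide-reading backend is installed")


# Simulate a machine where only pyvips is installed.
print(resolve_backend("auto", is_available=lambda m: m == "pyvips"))  # vips
```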

Tissue Segmentation

segmentation is forwarded directly to hs2p’s segmentation pipeline. The method key selects the algorithm:

  • hsv - heuristic based on the HSV colour space. Fast and robust for H&E slides.

  • otsu - thresholds the saturation channel using Otsu’s method.

  • threshold - applies a fixed saturation threshold.

  • sam2 - runs the AtlasPatch SAM2 tissue segmentation model on an internal 8.0 µm/px thumbnail. Requires the atlaspatch package and a compatible GPU. Additional key: sam2_device — device string for SAM2 inference (e.g. "cuda:0" or "cpu").

Example:

from slide2vec import Model, PreprocessingConfig

model = Model.from_preset("virchow2")
preprocessing = PreprocessingConfig(
    segmentation={"method": "sam2", "sam2_device": "cuda"},
)
embedded = model.embed_slide("/path/to/slide.svs", preprocessing=preprocessing)

Or in a YAML config:

tiling:
  seg_params:
    method: "sam2"
    sam2_device: "cuda"
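
For intuition about the simpler threshold method listed earlier in this section, a fixed saturation threshold can be sketched with the standard library alone. The threshold value of 0.1 and the saturation_mask helper are illustrative assumptions, not the values or code used by hs2p.

```python
import colorsys


def saturation_mask(rgb_rows, threshold=0.1):
    """Mark pixels whose HSV saturation exceeds a fixed threshold as tissue.

    `rgb_rows` is a list of rows of (r, g, b) tuples with 0-255 channels.
    """
    return [
        [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1] > threshold
         for (r, g, b) in row]
        for row in rgb_rows
    ]


# White glass background vs. a pink, H&E-like pixel.
thumbnail = [[(255, 255, 255), (230, 150, 200)]]
print(saturation_mask(thumbnail))  # [[False, True]]
```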

Annotation-Aware Sampling

By default slide2vec tiles tissue: the masks vocabulary is the binary {background: 0, tissue: 1}, and embeddings cover every tissue tile. You can instead restrict tiling and feature extraction to specific annotated classes (tumor-only, stroma-only, …) by customizing the masks block. Any divergence from the default vocabulary opts the run into annotation-aware sampling; the plain tissue path is otherwise byte-for-byte unchanged.

In annotation mode the per-slide mask_path is read as a multi-label raster: each class occupies a distinct integer pixel value. The masks block maps that vocabulary and is deep-merged over the default, so you only state what you add:

  • pixel_mapping: {class_name: integer pixel value in the raster}

  • min_coverage: {class_name: float | null}; the minimum fraction of a tile that must be covered by that class to keep it. null means don’t sample that class. The tissue entry is the single source of truth for the tissue threshold.

  • colors: {class_name: [r, g, b] | null}; used when rendering previews.

  • output_mode: per_annotation (one artifact set per sampled class) or merged (one set per slide over the union of tiles passing any class).

  • independent_sampling (a top-level tiling flag, exposed on PreprocessingConfig) — True samples each class against its own mask; False samples once over the union, then post-filters per class by coverage.
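
The interaction between pixel_mapping and min_coverage can be illustrated on a toy tile. The sketch below assumes a tile is kept for a class when the fraction of its pixels carrying that class value is at least the threshold; the helpers class_coverage and classes_kept are hypothetical, and the independent_sampling distinction is not modelled.

```python
def class_coverage(tile, pixel_value):
    """Fraction of pixels in `tile` (rows of ints) equal to `pixel_value`."""
    total = sum(len(row) for row in tile)
    hits = sum(value == pixel_value for row in tile for value in row)
    return hits / total


def classes_kept(tile, pixel_mapping, min_coverage):
    """Names of the classes for which this tile would be sampled."""
    kept = []
    for name, value in pixel_mapping.items():
        threshold = min_coverage.get(name)
        if threshold is None:
            continue  # null / None: this class is not sampled at all
        if class_coverage(tile, value) >= threshold:
            kept.append(name)
    return kept


# 2 x 4 crop of a multi-label raster: 1 = tissue, 2 = tumor, 3 = stroma.
tile = [[2, 2, 2, 3],
        [2, 1, 3, 3]]
pixel_mapping = {"tissue": 1, "tumor": 2, "stroma": 3}
min_coverage = {"tissue": None, "tumor": 0.5, "stroma": 0.5}
print(classes_kept(tile, pixel_mapping, min_coverage))  # ['tumor']
```

Tumor covers 4 of 8 pixels and passes the 0.5 threshold; stroma covers 3 of 8 and is dropped; tissue is skipped because its threshold is None.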

Tumor-only bag features. Add a tumor class, set its min_coverage, and disable tissue sampling with min_coverage.tissue: null. Only tumor tiles are sampled, so the returned EmbeddedSlide carries the tumor-only bag directly:

from slide2vec import Model, PreprocessingConfig

model = Model.from_preset("virchow2")
preprocessing = PreprocessingConfig(
    requested_spacing_um=0.5,
    requested_tile_size_px=224,
    masks={
        "pixel_mapping": {"tumor": 2},        # tumor pixels == 2 in the raster
        "min_coverage": {"tissue": None, "tumor": 0.5},
        "colors": {"tumor": [255, 0, 0]},     # preview color (optional)
    },
)

embedded = model.embed_slide(
    "/path/to/slide.svs",
    mask_path="/path/to/annotation_mask.tif",  # multi-label raster
    preprocessing=preprocessing,
)

assert embedded.annotation == "tumor"  # every bag is stamped with its label
tumor_bag = embedded.tile_embeddings   # shape (N_tumor, D) — tumor tiles only

Because this run produces exactly one bag, bare embed_slide(...) returns that single EmbeddedSlide directly. You can also name the class explicitly with the annotation selector — embed_slide(..., annotation="tumor") returns the same single bag, and requesting a class the run did not produce raises a ValueError listing the available bags.

The mask_path argument (or the mask_path column in a manifest) is required in annotation mode — it is the raster the class vocabulary indexes into.

Multiple classes per slide. To extract several classes in one run, give each its own min_coverage and use a Pipeline so the per-class artifacts are persisted and namespaced on disk under a <class>/ subdirectory (e.g. tile_embeddings/tumor/<sample_id>.pt and tile_embeddings/stroma/<sample_id>.pt):

from slide2vec import Model, Pipeline, PreprocessingConfig, ExecutionOptions

pipeline = Pipeline(
    model=Model.from_preset("virchow2"),
    preprocessing=PreprocessingConfig(
        requested_spacing_um=0.5,
        requested_tile_size_px=224,
        masks={
            "pixel_mapping": {"tumor": 2, "stroma": 3},
            "min_coverage": {"tissue": None, "tumor": 0.5, "stroma": 0.5},
        },
    ),
    execution=ExecutionOptions(output_dir="outputs/run"),
)

# The manifest's mask_path column points at each slide's multi-label raster.
result = pipeline.run(manifest_path="/path/to/slides.csv")

This fans out per (sample_id, annotation) across tile, slide, and hierarchical embeddings — and across multiple GPUs — recording each class’s own feature_path in process_list.csv. See Output Layout for the namespaced directory structure.

The merged output mode. With masks={"output_mode": "merged"} the run does not fan out per class: it samples the union of tiles passing any class and produces one bag per slide. That bag is labelled "merged" rather than a class name — its annotation is "merged", the inner embed_slides key is "merged", and on disk it lands at the flat output root (tile_embeddings/<sample_id>.pt) with no <class>/ subdirectory, exactly like the default tissue case:

results = model.embed_slides(slides, ...)   # masks with output_mode "merged"
merged = results["slide-1"]["merged"]
assert merged.annotation == "merged"        # not "tumor"/"stroma"

# Single bag per slide, so embed_slide returns it directly:
merged = model.embed_slide(slide, ...)      # output_mode "merged"
assert merged.annotation == "merged"

A named class ("tumor") is keyed by its class name and namespaced under <class>/ on disk; the "merged" and "tissue" labels are keyed by that label and sit at the flat root.
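
The namespacing rule above can be written down as a small path helper. This is only a sketch of the documented layout for tile embeddings; the function name tile_embedding_path is hypothetical, and other artifact types are not covered.

```python
from pathlib import PurePosixPath

# Labels that sit at the flat output root rather than under <class>/.
FLAT_LABELS = {"tissue", "merged"}


def tile_embedding_path(sample_id, label):
    """Relative path of a bag's tile embeddings under the output directory."""
    root = PurePosixPath("tile_embeddings")
    if label in FLAT_LABELS:
        return root / f"{sample_id}.pt"
    return root / label / f"{sample_id}.pt"


print(tile_embedding_path("slide-1", "tumor"))   # tile_embeddings/tumor/slide-1.pt
print(tile_embedding_path("slide-1", "merged"))  # tile_embeddings/slide-1.pt
```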

To work with several classes in memory instead of on disk, call embed_slides (not embed_slide). It returns a nested mapping, {sample_id: {label: EmbeddedSlide}} — the outer key is the sample_id and the inner key is each bag’s informative annotation label (a class name, "tissue", or "merged"; never None). Every EmbeddedSlide is also stamped with its annotation, so you can tell the bags apart by inner key or by attribute:

results = model.embed_slides(
    [{"sample_id": "slide-1",
      "image_path": "/path/to/slide.svs",
      "mask_path": "/path/to/annotation_mask.tif"}],
    preprocessing=preprocessing,   # masks with tumor + stroma, as above
)

# Index straight in by slide, then by class.
tumor_bag = results["slide-1"]["tumor"].tile_embeddings    # shape (N_tumor, D)
stroma_bag = results["slide-1"]["stroma"].tile_embeddings  # shape (N_stroma, D)

assert results["slide-1"]["tumor"].annotation == "tumor"
assert results["slide-1"]["stroma"].annotation == "stroma"

Pass annotations=[...] to restrict the inner keys to the classes you care about; omit it to receive every bag the run produced:

results = model.embed_slides(slides, annotations=["tumor"])
results["slide-1"].keys()  # dict_keys(['tumor']) — stroma is dropped
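
The effect of annotations=[...] on the nested mapping can be mimicked with plain dictionaries. In this sketch the bags are placeholder strings rather than EmbeddedSlide objects, filter_annotations is a hypothetical helper, and the behaviour for a requested class that a slide lacks is not modelled.

```python
def filter_annotations(results, annotations=None):
    """Keep only the requested inner keys of {sample_id: {label: bag}}."""
    if annotations is None:
        return results  # no filter: every bag the run produced
    wanted = set(annotations)
    return {
        sample_id: {label: bag for label, bag in bags.items() if label in wanted}
        for sample_id, bags in results.items()
    }


all_bags = {"slide-1": {"tumor": "tumor-bag", "stroma": "stroma-bag"}}
print(filter_annotations(all_bags, ["tumor"]))  # {'slide-1': {'tumor': 'tumor-bag'}}
```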

Selecting bags with embed_slide. embed_slide returns one bag (or a list of bags) for a single slide via the same annotation selector:

# One class → one EmbeddedSlide.
tumor = model.embed_slide(slide, annotation="tumor")
assert tumor.annotation == "tumor"

# A list of classes → a list of EmbeddedSlide in the requested order.
tumor, stroma = model.embed_slide(slide, annotation=["tumor", "stroma"])
assert [b.annotation for b in (tumor, stroma)] == ["tumor", "stroma"]

Bare embed_slide(slide) (no annotation) returns the single bag when the run produced exactly one; if the run fanned the slide out into several bags it raises a ValueError naming the available bags and directing you to embed_slides, so per-class results are never silently dropped.
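
The selector rules described in this section can be summarised in a few lines of plain Python. The sketch below operates on a {label: bag} mapping for one slide, with placeholder strings standing in for EmbeddedSlide objects; select_bags and its error messages are hypothetical and only mirror the documented behaviour.

```python
def select_bags(bags, annotation=None):
    """Pick bags from {label: bag} following the documented selector rules."""
    available = sorted(bags)
    if annotation is None:
        if len(bags) == 1:
            return next(iter(bags.values()))  # single bag: return it directly
        raise ValueError(
            f"slide produced several bags {available}; "
            "pass annotation=... or use embed_slides"
        )
    names = [annotation] if isinstance(annotation, str) else list(annotation)
    missing = [name for name in names if name not in bags]
    if missing:
        raise ValueError(f"unknown annotation(s) {missing}; available: {available}")
    picked = [bags[name] for name in names]  # preserves the requested order
    return picked[0] if isinstance(annotation, str) else picked


bags = {"tumor": "tumor-bag", "stroma": "stroma-bag"}
print(select_bags(bags, "tumor"))               # tumor-bag
print(select_bags(bags, ["stroma", "tumor"]))   # ['stroma-bag', 'tumor-bag']
```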

Preview Images

slide2vec can write a tissue mask preview and a tiling preview for each slide. These are particularly useful for quality control. Both are disabled by default. Enable them via the preview dict:

preprocessing = PreprocessingConfig(
    preview={
        "save_mask_preview": True,
        "save_tiling_preview": True,
        "downsample": 32,
    }
)

Preview images are written to <output_dir>/preview/mask/<sample_id>.jpg and <output_dir>/preview/tiling/<sample_id>.jpg. Their paths are also recorded in process_list.csv and on the returned EmbeddedSlide (mask_preview_path, tiling_preview_path).
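
For quality-control scripts that need to locate the previews without reading process_list.csv, the documented layout can be reconstructed as below. The helper preview_paths is hypothetical; it only joins the path components stated above and does not check that the files exist.

```python
from pathlib import PurePosixPath


def preview_paths(output_dir, sample_id):
    """Expected preview locations for one slide, per the documented layout."""
    root = PurePosixPath(output_dir) / "preview"
    return {
        "mask_preview_path": root / "mask" / f"{sample_id}.jpg",
        "tiling_preview_path": root / "tiling" / f"{sample_id}.jpg",
    }


paths = preview_paths("outputs/run", "slide-1")
print(paths["mask_preview_path"])    # outputs/run/preview/mask/slide-1.jpg
print(paths["tiling_preview_path"])  # outputs/run/preview/tiling/slide-1.jpg
```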

When a run is resumed, process_list.csv keeps the existing preview paths for slides whose tiling artifacts completed successfully and are unchanged, provided the preview files still exist on disk.