Preprocessing¶

This page covers the full set of options available in PreprocessingConfig and how to configure them.

class slide2vec.PreprocessingConfig(*, backend='auto', requested_spacing_um=None, requested_tile_size_px=None, requested_region_size_px=None, region_tile_multiple=None, tolerance=0.05, overlap=0.0, read_coordinates_from=None, read_tiles_from=None, on_the_fly=True, gpu_decode=False, adaptive_batching=False, use_supertiles=True, jpeg_backend='turbojpeg', num_cucim_workers=4, resume=False, segmentation=<factory>, filtering=<factory>, preview=<factory>, masks=<factory>, independent_sampling=True)¶

Bases: object

Configuration for slide tiling and preprocessing.

backend: str = 'auto'¶: Slide reading backend. "auto" tries cucim → openslide → vips in order. Explicit choices: "cucim", "openslide", "vips", "asap".

requested_spacing_um: float | None = None¶: Target spacing in µm/px. Resolved from the model preset when None.

requested_tile_size_px: int | None = None¶: Tile side length in pixels at requested_spacing_um. Resolved from the model preset when None.

requested_region_size_px: int | None = None¶: Parent region side length in pixels (hierarchical mode). Auto-derived as requested_tile_size_px × region_tile_multiple when None.

region_tile_multiple: int | None = None¶: Region grid width/height in tiles (e.g. 6 → 6×6 = 36 tiles per region). Enables hierarchical extraction when set; must be ≥ 2.

tolerance: float = 0.05¶: Relative spacing tolerance used when selecting a pyramid level for requested_spacing_um.

overlap: float = 0.0¶: Fractional tile overlap (0.0 = no overlap).

read_coordinates_from: Path | None = None¶: Directory containing pre-extracted tile coordinates to reuse, skipping tiling.

read_tiles_from: Path | None = None¶: Directory containing pre-extracted tile images. When set, the tiling step is skipped entirely.

on_the_fly: bool = True¶: Read and decode tiles on demand rather than pre-loading into memory.

gpu_decode: bool = False¶: Decode tiles on the GPU via CuCIM / nvImageCodec when True.

adaptive_batching: bool = False¶: Dynamically adjust batch size based on tile count.

use_supertiles: bool = True¶: Group adjacent tiles into supertile batches for faster I/O.

jpeg_backend: str = 'turbojpeg'¶: JPEG decode library — "turbojpeg" (default) or "pillow".

num_cucim_workers: int = 4¶: Number of CuCIM reader threads.

resume: bool = False¶: Skip slides already present in the output directory when True.

segmentation: dict[str, Any]¶: Forwarded to hs2p segmentation config. Supported keys: method, downsample, sam2_device. See Preprocessing for details.

filtering: dict[str, Any]¶: Forwarded to hs2p tile-filtering config.

preview: dict[str, Any]¶: Controls whether hs2p writes mask and tiling preview images. Keys: save_mask_preview, save_tiling_preview, downsample.

masks: dict[str, Any]¶: Annotation-mask vocabulary forwarded to hs2p’s sampling resolver. Keys: output_mode, pixel_mapping, colors, min_coverage. A partial mapping is deep-merged over DEFAULT_MASKS, so callers only state what they override (e.g. {"min_coverage": {"tissue": 0.1}}). The default {background, tissue} block is plain tissue tiling; min_coverage.tissue is the single source of truth for the tissue threshold.

independent_sampling: bool = True¶: When annotation sampling is active, tile each class independently (True) vs jointly across classes (False).
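
The geometry fields above interact in simple ways that a few lines of arithmetic can make concrete. The sketch below is not slide2vec code: the helper names are hypothetical, the stride formula assumes that overlap is the fraction of a tile shared with its neighbour, and the tolerance check assumes a relative comparison against the requested spacing. Only the region-size rule (tile size times region_tile_multiple) is taken directly from the field descriptions.

```python
def region_size_px(tile_size_px: int, region_tile_multiple: int) -> int:
    """Region side length as stated above: tile size x region_tile_multiple."""
    if region_tile_multiple < 2:
        raise ValueError("region_tile_multiple must be >= 2")
    return tile_size_px * region_tile_multiple


def stride_px(tile_size_px: int, overlap: float) -> int:
    """Assumed step between tile origins for a fractional overlap in [0, 1)."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    return max(1, round(tile_size_px * (1.0 - overlap)))


def within_tolerance(level_spacing_um: float, requested_um: float,
                     tolerance: float = 0.05) -> bool:
    """Assumed check: is a pyramid level's spacing close enough to the request?"""
    return abs(level_spacing_um - requested_um) <= tolerance * requested_um


print(region_size_px(224, 6))       # 1344, i.e. a 6x6 grid of 224 px tiles
print(stride_px(224, 0.25))         # 168
print(within_tolerance(0.51, 0.5))  # True  (2 % off)
print(within_tolerance(0.25, 0.5))  # False (50 % off)
```

With the default overlap of 0.0 the assumed stride equals the tile size, so tiles abut without sharing pixels.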

Backends¶

The backend field controls which slide-reading library is used:

"auto" — tries cucim → openslide → vips in order and picks the first available one
"cucim" — NVIDIA cuCIM (fastest for SVS/TIFF on GPU-equipped machines)
"openslide" — broad format support, CPU-only
"vips" — libvips, good for large TIFF files
"asap" — ASAP reader (requires separate installation)

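The fallback order of "auto" can be pictured as a first-match search over importable packages. This is a minimal sketch, not the library's resolver: the function name resolve_backend is hypothetical, and the Python module names it probes (cucim, openslide, pyvips) are assumptions about what each backend imports.

```python
from importlib.util import find_spec

# Assumed mapping from backend name to the Python module it needs.
BACKEND_MODULES = {"cucim": "cucim", "openslide": "openslide", "vips": "pyvips"}


def resolve_backend(requested="auto", is_available=None):
    """Return the first usable backend in cucim -> openslide -> vips order."""
    if is_available is None:
        is_available = lambda module: find_spec(module) is not None
    if requested != "auto":
        return requested  # explicit choices are passed through unchanged
    for name, module in BACKEND_MODULES.items():
        if is_available(module):
            return name
    raise RuntimeError("no slide-reading backend is installed")


# Simulate a machine where only openslide and pyvips are importable.
print(resolve_backend(is_available=lambda m: m in {"openslide", "pyvips"}))  # openslide
```

The injectable is_available callback is only there so the order can be demonstrated without installing any of the backends.
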
Tissue Segmentation¶

segmentation is forwarded directly to hs2p’s segmentation pipeline. The method key selects the algorithm:

hsv - heuristic based on the HSV colour space. Fast and robust for H&E slides.
otsu - thresholds the saturation channel using Otsu’s method.
threshold - applies a fixed saturation threshold.
sam2 - runs the AtlasPatch SAM2 tissue segmentation model on an internal 8.0 µm/px thumbnail. Requires the atlaspatch package and a compatible GPU. Additional key: sam2_device — device string for SAM2 inference (e.g. "cuda:0" or "cpu").
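
To give a feel for what a saturation-based method does, here is a toy version of the threshold idea applied to a handful of RGB pixels. It is an illustration only: hs2p's actual implementation, its default threshold, and any pre- or post-processing are not described on this page, and the 0.1 cut-off below is an arbitrary assumption.

```python
import colorsys


def tissue_mask(pixels, saturation_threshold=0.1):
    """Mark a pixel as tissue when its HSV saturation exceeds the threshold."""
    mask = []
    for r, g, b in pixels:
        _, s, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        mask.append(s > saturation_threshold)
    return mask


pixels = [
    (255, 255, 255),  # white slide background, saturation 0
    (240, 240, 240),  # grey background, saturation 0
    (200, 120, 180),  # pink/purple H&E-like pixel, saturation 0.4
]
print(tissue_mask(pixels))  # [False, False, True]
```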

Example:

from slide2vec import Model, PreprocessingConfig

model = Model.from_preset("virchow2")
preprocessing = PreprocessingConfig(
    segmentation={"method": "sam2", "sam2_device": "cuda"},
)
embedded = model.embed_slide("/path/to/slide.svs", preprocessing=preprocessing)

Or in a YAML config:

tiling:
  seg_params:
    method: "sam2"
    sam2_device: "cuda"

Annotation-Aware Sampling¶

By default slide2vec tiles tissue: the masks vocabulary is the binary {background: 0, tissue: 1}, and embeddings cover every tissue tile. You can instead restrict tiling and feature extraction to specific annotated classes (tumor-only, stroma-only, …) by customizing the masks block. Any divergence from the default vocabulary opts the run into annotation-aware sampling; the plain tissue path is otherwise byte-for-byte unchanged.

In annotation mode the per-slide mask_path is read as a multi-label raster: each class occupies a distinct integer pixel value. The masks block maps that vocabulary and is deep-merged over the default, so you only state what you add:

pixel_mapping — {class_name: integer pixel value in the raster}
min_coverage — {class_name: float | null}; the minimum fraction of a tile that must be covered by that class to keep it. null means don’t sample that class. The tissue entry is the single source of truth for the tissue threshold.
colors — {class_name: [r, g, b] | null} used when rendering previews.
output_mode — per_annotation (one artifact set per sampled class) or merged (one set per slide over the union of tiles passing any class).
independent_sampling (a top-level tiling flag, exposed on PreprocessingConfig) — True samples each class against its own mask; False samples once over the union, then post-filters per class by coverage.
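
The deep-merge behaviour described above can be sketched in a few lines. This is an illustration, not the library's merge routine: the deep_merge helper is hypothetical, and DEFAULT_MASKS_STANDIN is a stand-in for the real DEFAULT_MASKS. Its pixel_mapping follows the binary default quoted above, but its min_coverage.tissue value of 0.1 is an assumed placeholder.

```python
import copy


def deep_merge(base, override):
    """Return a copy of base with override merged in, recursing into dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Stand-in for the real defaults; the 0.1 threshold is an assumed placeholder.
DEFAULT_MASKS_STANDIN = {
    "pixel_mapping": {"background": 0, "tissue": 1},
    "min_coverage": {"tissue": 0.1},
}

masks = deep_merge(
    DEFAULT_MASKS_STANDIN,
    {"pixel_mapping": {"tumor": 2}, "min_coverage": {"tissue": None, "tumor": 0.5}},
)
print(masks["pixel_mapping"])  # {'background': 0, 'tissue': 1, 'tumor': 2}
print(masks["min_coverage"])   # {'tissue': None, 'tumor': 0.5}
```

Note that the override adds the tumor entry without restating background or tissue, and that setting tissue to None replaces the default threshold rather than deleting the key.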

Tumor-only bag features. Add a tumor class, set its min_coverage, and disable tissue sampling with min_coverage.tissue: null. Only tumor tiles are sampled, so the returned EmbeddedSlide carries the tumor-only bag directly:

from slide2vec import Model, PreprocessingConfig

model = Model.from_preset("virchow2")
preprocessing = PreprocessingConfig(
    requested_spacing_um=0.5,
    requested_tile_size_px=224,
    masks={
        "pixel_mapping": {"tumor": 2},        # tumor pixels == 2 in the raster
        "min_coverage": {"tissue": None, "tumor": 0.5},
        "colors": {"tumor": [255, 0, 0]},     # preview color (optional)
    },
)

embedded = model.embed_slide(
    "/path/to/slide.svs",
    mask_path="/path/to/annotation_mask.tif",  # multi-label raster
    preprocessing=preprocessing,
)

assert embedded.annotation == "tumor"  # every bag is stamped with its label
tumor_bag = embedded.tile_embeddings   # shape (N_tumor, D) — tumor tiles only

Because this run produces exactly one bag, bare embed_slide(...) returns that single EmbeddedSlide directly. You can also name the class explicitly with the annotation selector — embed_slide(..., annotation="tumor") returns the same single bag, and requesting a class the run did not produce raises a ValueError listing the available bags.

The mask_path argument (or the mask_path column in a manifest) is required in annotation mode — it is the raster the class vocabulary indexes into.
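
To make the roles of pixel_mapping and min_coverage concrete, the sketch below computes per-class coverage for one tile of a multi-label raster and applies the thresholds. It is a plain-Python illustration under stated assumptions, not hs2p's sampler: the function names are hypothetical, coverage is assumed to be the fraction of tile pixels carrying the class value, and a tile is assumed to be kept when coverage is at least the threshold.

```python
def class_coverage(tile, pixel_value):
    """Fraction of pixels in a 2D tile (list of rows) equal to pixel_value."""
    total = sum(len(row) for row in tile)
    hits = sum(value == pixel_value for row in tile for value in row)
    return hits / total


def classes_kept(tile, pixel_mapping, min_coverage):
    """Names of classes for which this tile passes the coverage threshold."""
    kept = []
    for name, threshold in min_coverage.items():
        if threshold is None:  # None (null in YAML): do not sample this class
            continue
        if class_coverage(tile, pixel_mapping[name]) >= threshold:
            kept.append(name)
    return kept


# A 4x4 tile: 0 = background, 1 = tissue, 2 = tumor, 3 = stroma.
tile = [
    [2, 2, 2, 1],
    [2, 2, 2, 1],
    [2, 2, 3, 3],
    [0, 2, 3, 3],
]
pixel_mapping = {"background": 0, "tissue": 1, "tumor": 2, "stroma": 3}
min_coverage = {"tissue": None, "tumor": 0.5, "stroma": 0.5}

print(class_coverage(tile, 2))                          # 0.5625 (9 of 16 pixels)
print(classes_kept(tile, pixel_mapping, min_coverage))  # ['tumor']
```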

Multiple classes per slide. To extract several classes in one run, give each its own min_coverage and use a Pipeline so the per-class artifacts are persisted and namespaced on disk under a <class>/ subdirectory (e.g. tile_embeddings/tumor/<sample_id>.pt and tile_embeddings/stroma/<sample_id>.pt):

from slide2vec import Model, Pipeline, PreprocessingConfig, ExecutionOptions

pipeline = Pipeline(
    model=Model.from_preset("virchow2"),
    preprocessing=PreprocessingConfig(
        requested_spacing_um=0.5,
        requested_tile_size_px=224,
        masks={
            "pixel_mapping": {"tumor": 2, "stroma": 3},
            "min_coverage": {"tissue": None, "tumor": 0.5, "stroma": 0.5},
        },
    ),
    execution=ExecutionOptions(output_dir="outputs/run"),
)

# The manifest's mask_path column points at each slide's multi-label raster.
result = pipeline.run(manifest_path="/path/to/slides.csv")

This fans out per (sample_id, annotation) across tile, slide, and hierarchical embeddings — and across multiple GPUs — recording each class’s own feature_path in process_list.csv. See Output Layout for the namespaced directory structure.

The merged output mode. With masks={"output_mode": "merged"} the run does not fan out per class: it samples the union of tiles passing any class and produces one bag per slide. That bag is labelled "merged" rather than a class name — its annotation is "merged", the inner embed_slides key is "merged", and on disk it lands at the flat output root (tile_embeddings/<sample_id>.pt) with no <class>/ subdirectory, exactly like the default tissue case:

results = model.embed_slides(slides, ...)   # masks with output_mode "merged"
merged = results["slide-1"]["merged"]
assert merged.annotation == "merged"        # not "tumor"/"stroma"

# Single bag per slide, so embed_slide returns it directly:
merged = model.embed_slide(slide, ...)      # output_mode "merged"
assert merged.annotation == "merged"

A named class ("tumor") is keyed by its class name and namespaced under <class>/ on disk; the "merged" and "tissue" labels are keyed by that label and sit at the flat root.
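
The keying and on-disk rules can be summarised as a small path helper. This is a sketch of the layout described above, not a slide2vec function: tile_embedding_path is a hypothetical name, only the tile_embeddings/ paths quoted on this page are modelled, and placing that directory directly under the output directory is an assumption.

```python
from pathlib import PurePosixPath

FLAT_LABELS = {"tissue", "merged"}  # labels that sit at the flat output root


def tile_embedding_path(output_dir, sample_id, label):
    """Where a bag's tile embeddings land, following the rules described above."""
    root = PurePosixPath(output_dir) / "tile_embeddings"
    if label in FLAT_LABELS:
        return root / f"{sample_id}.pt"
    return root / label / f"{sample_id}.pt"  # named classes get a <class>/ subdirectory


print(tile_embedding_path("outputs/run", "slide-1", "tumor"))   # outputs/run/tile_embeddings/tumor/slide-1.pt
print(tile_embedding_path("outputs/run", "slide-1", "merged"))  # outputs/run/tile_embeddings/slide-1.pt
```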

To work with several classes in memory instead of on disk, call embed_slides (not embed_slide). It returns a nested mapping, {sample_id: {label: EmbeddedSlide}} — the outer key is the sample_id and the inner key is each bag’s informative annotation label (a class name, "tissue", or "merged"; never None). Every EmbeddedSlide is also stamped with its annotation, so you can tell the bags apart by inner key or by attribute:

results = model.embed_slides(
    [{"sample_id": "slide-1",
      "image_path": "/path/to/slide.svs",
      "mask_path": "/path/to/annotation_mask.tif"}],
    preprocessing=preprocessing,   # masks with tumor + stroma, as above
)

# Index straight in by slide, then by class.
tumor_bag = results["slide-1"]["tumor"].tile_embeddings    # shape (N_tumor, D)
stroma_bag = results["slide-1"]["stroma"].tile_embeddings  # shape (N_stroma, D)

assert results["slide-1"]["tumor"].annotation == "tumor"
assert results["slide-1"]["stroma"].annotation == "stroma"

Pass annotations=[...] to restrict the inner keys to the classes you care about; omit it to receive every bag the run produced:

results = model.embed_slides(slides, annotations=["tumor"])
results["slide-1"].keys()  # dict_keys(['tumor']) — stroma is dropped

Selecting bags with embed_slide. embed_slide returns one bag (or a list of bags) for a single slide via the same annotation selector:

# One class → one EmbeddedSlide.
tumor = model.embed_slide(slide, annotation="tumor")
assert tumor.annotation == "tumor"

# A list of classes → a list of EmbeddedSlide in the requested order.
tumor, stroma = model.embed_slide(slide, annotation=["tumor", "stroma"])
assert [b.annotation for b in (tumor, stroma)] == ["tumor", "stroma"]

Bare embed_slide(slide) (no annotation) returns the single bag when the run produced exactly one; if the run fanned the slide out into several bags it raises a ValueError naming the available bags and directing you to embed_slides, so per-class results are never silently dropped.
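
The selection rules for embed_slide can be restated as a small function over a {label: bag} mapping. This mimics the documented behaviour for illustration only: select_bags is a hypothetical helper, the error messages are invented, and plain strings stand in for EmbeddedSlide objects.

```python
def select_bags(bags, annotation=None):
    """Pick bags from a {label: bag} mapping the way embed_slide is described to."""
    available = sorted(bags)
    if annotation is None:
        if len(bags) == 1:
            return next(iter(bags.values()))
        raise ValueError(f"several bags available {available}; use embed_slides")
    if isinstance(annotation, str):
        if annotation not in bags:
            raise ValueError(f"unknown annotation {annotation!r}; available: {available}")
        return bags[annotation]
    # A list of names yields a list of bags in the requested order.
    return [select_bags(bags, name) for name in annotation]


bags = {"tumor": "tumor-bag", "stroma": "stroma-bag"}
print(select_bags(bags, "tumor"))              # tumor-bag
print(select_bags(bags, ["stroma", "tumor"]))  # ['stroma-bag', 'tumor-bag']
```

Calling select_bags(bags) with no annotation raises ValueError here because two bags are available, which mirrors the guard against silently dropping per-class results.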

Preview Images¶

slide2vec can write a tissue mask preview and a tiling preview for each slide. These are particularly useful for quality control. Both are disabled by default. Enable them via the preview dict:

preprocessing = PreprocessingConfig(
    preview={
        "save_mask_preview": True,
        "save_tiling_preview": True,
        "downsample": 32,
    }
)

Preview images are written to <output_dir>/preview/mask/<sample_id>.jpg and <output_dir>/preview/tiling/<sample_id>.jpg. Their paths are also recorded in process_list.csv and on the returned EmbeddedSlide (mask_preview_path, tiling_preview_path).
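
For scripted quality control it can help to compute the expected preview locations up front. The helper below only encodes the two path patterns stated above; preview_paths is a hypothetical name, not a slide2vec function, and in practice the paths recorded in process_list.csv or on the EmbeddedSlide remain the authoritative source.

```python
from pathlib import PurePosixPath


def preview_paths(output_dir, sample_id):
    """Expected preview image locations for one slide, per the patterns above."""
    root = PurePosixPath(output_dir) / "preview"
    return {
        "mask_preview_path": root / "mask" / f"{sample_id}.jpg",
        "tiling_preview_path": root / "tiling" / f"{sample_id}.jpg",
    }


paths = preview_paths("outputs/run", "slide-1")
print(paths["mask_preview_path"])    # outputs/run/preview/mask/slide-1.jpg
print(paths["tiling_preview_path"])  # outputs/run/preview/tiling/slide-1.jpg
```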

When a run is resumed, process_list.csv keeps the existing preview paths for slides whose tiling artifacts succeeded and are unchanged, provided the preview files still exist on disk.