API Guide

Reference for the Python API. See Getting Started for introductory examples.

slide2vec exposes two main workflows:

direct in-memory embedding with Model.embed_slide() / Model.embed_slides()
artifact generation with Pipeline.run()

EmbeddedSlide

Model.embed_slide() and Model.embed_slides() return EmbeddedSlide objects:

class slide2vec.EmbeddedSlide(*, sample_id, tile_embeddings, slide_embedding, x, y, tile_size_lv0, image_path, mask_path=None, annotation=None, num_tiles=None, mask_preview_path=None, tiling_preview_path=None, latents=None)

Bases: object

In-memory result of embedding a single slide.

sample_id: str: Unique slide identifier.

tile_embeddings: Any: Tile embeddings, a torch.Tensor of shape (N, D).

slide_embedding: Any | None: Slide-level embedding, a torch.Tensor of shape (D,) for slide-level encoders; None for tile-only encoders.

x: Any: x coordinate (pixels at level 0) of each tile’s top-left corner, as an array of shape (N,).

y: Any: y coordinate (pixels at level 0) of each tile’s top-left corner, as an array of shape (N,).

tile_size_lv0: int: Tile side length in pixels at level 0.

image_path: Path: Path to the source slide file.

mask_path: Path | None = None: Path to the tissue mask used for tiling, if any.

annotation: str | None = None: Annotation class this bag of tiles was sampled for. "tissue" for the default tissue-only path, "merged" for the union output mode, or the class name (e.g. "tumor") when annotation-aware sampling fans a slide out into one bag per class. See the annotation-aware sampling documentation.

num_tiles: int | None = None: Number of tiles extracted from the slide.

mask_preview_path: Path | None = None: Path to the mask preview image, if generated.

tiling_preview_path: Path | None = None: Path to the tiling preview image, if generated.

latents: Any | None = None: Encoder latent representations when available; None otherwise.
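Because x, y and tile_size_lv0 are all expressed in level-0 pixels, each tile's footprint on the slide can be recovered with simple arithmetic. The sketch below is illustrative only: tile_boxes_lv0 is a hypothetical helper, not part of slide2vec, and the coordinates are stand-in values rather than output from a real slide.

```python
from __future__ import annotations

from typing import Sequence


def tile_boxes_lv0(
    x: Sequence[int], y: Sequence[int], tile_size_lv0: int
) -> list[tuple[int, int, int, int]]:
    """Return one (left, top, right, bottom) level-0 pixel box per tile."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [
        (int(xi), int(yi), int(xi) + tile_size_lv0, int(yi) + tile_size_lv0)
        for xi, yi in zip(x, y)
    ]


# Stand-in values; with a real result these would come from
# embedded.x, embedded.y and embedded.tile_size_lv0.
boxes = tile_boxes_lv0([0, 448], [0, 896], 448)
print(boxes)
```

Row i of tile_embeddings then corresponds to box i, which is what keeps embeddings registered to slide coordinates.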

PreprocessingConfig

class slide2vec.PreprocessingConfig(*, backend='auto', requested_spacing_um=None, requested_tile_size_px=None, requested_region_size_px=None, region_tile_multiple=None, tolerance=0.05, overlap=0.0, read_coordinates_from=None, read_tiles_from=None, on_the_fly=True, gpu_decode=False, adaptive_batching=False, use_supertiles=True, jpeg_backend='turbojpeg', num_cucim_workers=4, resume=False, segmentation=<factory>, filtering=<factory>, preview=<factory>, masks=<factory>, independent_sampling=True)

Bases: object

Configuration for slide tiling and preprocessing.

backend: str = 'auto': Slide reading backend. "auto" tries cucim → openslide → vips in order. Explicit choices: "cucim", "openslide", "vips", "asap".

requested_spacing_um: float | None = None: Target spacing in µm/px. Resolved from the model preset when None.

requested_tile_size_px: int | None = None: Tile side length in pixels at requested_spacing_um. Resolved from the model preset when None.

requested_region_size_px: int | None = None: Parent region side length in pixels (hierarchical mode). Auto-derived as requested_tile_size_px × region_tile_multiple when None.

region_tile_multiple: int | None = None: Region grid width/height in tiles (e.g. 6 → 6×6 = 36 tiles per region). Enables hierarchical extraction when set; must be ≥ 2.

tolerance: float = 0.05: Relative spacing tolerance for pyramid level selection (default 0.05).

overlap: float = 0.0: Fractional tile overlap (0.0 = no overlap).

read_coordinates_from: Path | None = None: Directory containing pre-extracted tile coordinates to reuse, skipping tiling.

read_tiles_from: Path | None = None: Directory containing pre-extracted tile images to skip the tiling step entirely.

on_the_fly: bool = True: Read and decode tiles on demand rather than pre-loading into memory.

gpu_decode: bool = False: Decode tiles on the GPU via CuCIM / nvImageCodec when True.

adaptive_batching: bool = False: Dynamically adjust batch size based on tile count.

use_supertiles: bool = True: Group adjacent tiles into supertile batches for faster I/O.

jpeg_backend: str = 'turbojpeg': JPEG decode library — "turbojpeg" (default) or "pillow".

num_cucim_workers: int = 4: Number of CuCIM reader threads.

resume: bool = False: Skip slides already present in the output directory when True.

segmentation: dict[str, Any]: Forwarded to hs2p segmentation config. Supported keys: method, downsample, sam2_device. See Preprocessing for details.

filtering: dict[str, Any]: Forwarded to hs2p tile-filtering config.

preview: dict[str, Any]: Controls whether hs2p writes mask and tiling preview images. Keys: save_mask_preview, save_tiling_preview, downsample.

masks: dict[str, Any]: Annotation-mask vocabulary forwarded to hs2p’s sampling resolver. Keys: output_mode, pixel_mapping, colors, min_coverage. A partial mapping is deep-merged over DEFAULT_MASKS, so callers only state what they override (e.g. {"min_coverage": {"tissue": 0.1}}). The default {background, tissue} block is plain tissue tiling; min_coverage.tissue is the single source of truth for the tissue threshold.

independent_sampling: bool = True: When annotation sampling is active, tile each class independently (True) vs jointly across classes (False).

For a full breakdown of backends, segmentation methods, and preview options, see Preprocessing.
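The deep-merge behaviour described for masks can be pictured with a small sketch. This is not slide2vec's implementation: deep_merge is a hypothetical helper, and EXAMPLE_DEFAULT_MASKS contains placeholder values, since the real contents of DEFAULT_MASKS are not shown on this page.

```python
from copy import deepcopy
from typing import Any, Mapping

# Placeholder defaults for illustration only; the real DEFAULT_MASKS may differ.
EXAMPLE_DEFAULT_MASKS: dict[str, Any] = {
    "pixel_mapping": {"background": 0, "tissue": 1},
    "min_coverage": {"tissue": 0.5},
}


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Return a copy of base with override applied recursively, key by key."""
    merged = deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            # Both sides are mappings: merge them instead of replacing wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# The caller states only the key being overridden.
masks = deep_merge(EXAMPLE_DEFAULT_MASKS, {"min_coverage": {"tissue": 0.1}})
print(masks)
```

Only min_coverage.tissue changes; sibling keys such as pixel_mapping keep their default values, and the defaults themselves are left unmodified.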

ExecutionOptions

class slide2vec.ExecutionOptions(*, output_dir=None, output_format='pt', batch_size=32, num_workers_per_gpu=None, num_preprocessing_workers=None, num_gpus=None, precision=None, prefetch_factor=4, save_tile_embeddings=False, save_slide_embeddings=False, save_latents=False)

Bases: object

Runtime execution and output settings.

output_dir: Path | None = None: Directory where artifacts are written. Required for Pipeline runs.

output_format: str = 'pt': Tensor serialization format: "pt" (PyTorch, default) or "npz" (NumPy).

batch_size: int = 32: Number of tiles per forward pass.

num_workers_per_gpu: int | None = None: DataLoader worker count per GPU rank. None means auto (capped by CPU / SLURM limit, then split across the resolved GPU count).

num_preprocessing_workers: int | None = None: Tiling worker count. None means auto (capped by CPU / SLURM limit).

num_gpus: int | None = None: Number of GPUs to use. None defaults to all available GPUs.

precision: str | None = None: Forward-pass dtype: "fp16", "bf16", "fp32", or None (auto-determined from the model preset).

prefetch_factor: int = 4: DataLoader prefetch queue depth per worker (default 4).

save_tile_embeddings: bool = False: Persist tile embeddings to disk when running a slide-level model.

save_slide_embeddings: bool = False: Persist slide embeddings to disk when running a patient-level model.

save_latents: bool = False: Persist encoder latent representations when available.

resolved_num_workers_per_gpu()

Return type: int
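The auto rule for num_workers_per_gpu (cap by the CPU / SLURM limit, then split across GPUs) might look roughly like the sketch below. It is an assumption-laden illustration, not the library's code: auto_workers_per_gpu is a hypothetical function, reading SLURM_CPUS_PER_TASK is a guess at which SLURM variable applies, and the floor of one worker is also assumed.

```python
import os
from typing import Optional


def auto_workers_per_gpu(
    num_gpus: int,
    cpu_count: Optional[int] = None,
    slurm_cpus: Optional[int] = None,
) -> int:
    """Hypothetical sketch: cap by CPU and SLURM limits, then split across GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    if slurm_cpus is None:
        # Assumption: the SLURM allocation is exposed through this variable.
        raw = os.environ.get("SLURM_CPUS_PER_TASK")
        slurm_cpus = int(raw) if raw else None
    budget = cpu_count if slurm_cpus is None else min(cpu_count, slurm_cpus)
    # Assumption: never return fewer than one worker per GPU.
    return max(1, budget // num_gpus)


# 16 visible cores, a SLURM allocation of 8, two GPUs.
print(auto_workers_per_gpu(2, cpu_count=16, slurm_cpus=8))
```

Passing cpu_count and slurm_cpus explicitly, as in the usage line, keeps the result independent of the machine the sketch runs on.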

Patient-level embedding

For patient-level models, use Model.embed_patient() for a single patient or Model.embed_patients() for a batch.

Single patient

from slide2vec import Model

model = Model.from_preset("moozy")
result = model.embed_patient(
    ["/data/slide_1a.svs", "/data/slide_1b.svs"],
    patient_id="patient_1",
)

print(result.patient_id)              # "patient_1"
print(result.patient_embedding.shape) # torch.Size([768])
print(result.slide_embeddings)        # {"slide_1a": tensor, "slide_1b": tensor}

Multiple patients

results = model.embed_patients(
    [
        {"sample_id": "slide_1a", "image_path": "/data/slide_1a.svs", "patient_id": "patient_1"},
        {"sample_id": "slide_1b", "image_path": "/data/slide_1b.svs", "patient_id": "patient_1"},
        {"sample_id": "slide_2a", "image_path": "/data/slide_2a.svs", "patient_id": "patient_2"},
    ]
)

for r in results:
    print(r.patient_id, r.patient_embedding.shape)

embed_patients(...) returns one EmbeddedPatient per unique patient, ordered by first appearance.
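The "one result per unique patient, ordered by first appearance" rule can be illustrated without the library. The sketch below only mimics the grouping step on manifest-style records; group_by_patient is a hypothetical helper and says nothing about how slide2vec aggregates the embeddings themselves.

```python
from __future__ import annotations


def group_by_patient(records: list[dict[str, str]]) -> dict[str, list[str]]:
    """Map patient_id to its sample_ids, keeping first-appearance order."""
    groups: dict[str, list[str]] = {}
    for record in records:
        # Python dicts preserve insertion order, so a patient keeps the
        # position of its first slide even if later slides are interleaved.
        groups.setdefault(record["patient_id"], []).append(record["sample_id"])
    return groups


records = [
    {"sample_id": "slide_1a", "patient_id": "patient_1"},
    {"sample_id": "slide_2a", "patient_id": "patient_2"},
    {"sample_id": "slide_1b", "patient_id": "patient_1"},
]
groups = group_by_patient(records)
print(list(groups))
```

Here patient_1 comes first even though its second slide is listed after patient_2's slide.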

class slide2vec.EmbeddedPatient(*, patient_id, patient_embedding, slide_embeddings)

Bases: object

In-memory result of embedding a single patient.

patient_id: str: Unique patient identifier.

patient_embedding: Any: Aggregated patient embedding, a torch.Tensor of shape (D,).

slide_embeddings: dict[str, Any]: Slide-level embeddings keyed by sample_id, each a torch.Tensor of shape (D,).

Hierarchical Feature Extraction

Enable hierarchical mode by setting region_tile_multiple in PreprocessingConfig:

from slide2vec import PreprocessingConfig

preprocessing = PreprocessingConfig(
    requested_spacing_um=0.5,
    requested_tile_size_px=224,
    region_tile_multiple=6,   # 6×6 = 36 tiles per region
)

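With hierarchical mode enabled, the tile embeddings tensor has shape (R, T, D) instead of (N, D): R regions, each holding T tile embeddings of dimension D (T = 36 for region_tile_multiple=6). See Hierarchical Features for the full explanation.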
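The shape arithmetic for the configuration above can be sketched as follows. It assumes that T is the number of tiles per region (region_tile_multiple squared) and that the region side defaults to requested_tile_size_px × region_tile_multiple, as described under requested_region_size_px. hierarchical_layout is a hypothetical helper, and the region count and embedding dimension are made-up values.

```python
from __future__ import annotations


def hierarchical_layout(tile_size_px: int, region_tile_multiple: int) -> tuple[int, int]:
    """Return (region_size_px, tiles_per_region) for a square region grid."""
    if region_tile_multiple < 2:
        raise ValueError("region_tile_multiple must be >= 2")
    region_size_px = tile_size_px * region_tile_multiple
    tiles_per_region = region_tile_multiple**2
    return region_size_px, tiles_per_region


region_size_px, tiles_per_region = hierarchical_layout(224, 6)

# Made-up values: 10 regions on the slide, 768-dimensional embeddings.
num_regions, embed_dim = 10, 768
shape = (num_regions, tiles_per_region, embed_dim)
print(region_size_px, tiles_per_region, shape)
```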

Dense Tile Feature Extraction

Some tile encoders can return the spatial grid of ViT patch-token features instead of a single pooled vector per tile. This is useful for dense downstream tasks where patch-token features must stay registered to the input tile.

Dense extraction is a low-level encoder API:

get_dense_transform() applies the encoder’s photometric normalization without resize or center-crop, so tile geometry is preserved.
encode_tiles_dense(batch) accepts a normalized (B, C, H, W) tensor and returns (B, d, h, w).
h and w are resolved from the input size and encoder patch size (for example, a 224 px tile with an 8 px patch size returns a 28 x 28 grid).

Example:

import torch
from PIL import Image

from slide2vec.encoders import encoder_registry

encoder = encoder_registry.require("lunit")().to("cuda")
transform = encoder.get_dense_transform()

tile = Image.open("/data/tile.png").convert("RGB")
batch = transform(tile).unsqueeze(0).to(encoder.device)

with torch.no_grad():
    dense = encoder.encode_tiles_dense(batch)

print(dense.shape)  # torch.Size([1, 384, 28, 28]) for a 224 px Lunit tile

The dense transform deliberately does not resize, crop, or pad. The input height and width passed to encode_tiles_dense must be divisible by the encoder patch size, unless the specific encoder is pinned to a native input size. Unsupported encoders raise NotImplementedError.
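The relationship between input size, patch size and the returned grid can be checked before calling the encoder. The sketch below is a plain-Python illustration of that rule; dense_grid_shape is a hypothetical helper and does not cover encoders pinned to a native input size.

```python
from __future__ import annotations


def dense_grid_shape(height: int, width: int, patch_size: int) -> tuple[int, int]:
    """Return the (h, w) patch grid for an input of height x width pixels."""
    if height % patch_size or width % patch_size:
        raise ValueError(
            f"input {height}x{width} is not divisible by patch size {patch_size}"
        )
    return height // patch_size, width // patch_size


grid = dense_grid_shape(224, 224, 8)
print(grid)
```

A 224 px tile with an 8 px patch size gives the 28 x 28 grid mentioned above, while a 225 px tile would be rejected by the check.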

For H-Optimus encoders, non-native dense extraction requires opting into the variable-size model setting:

encoder = encoder_registry.require("h-optimus-0")(
    dynamic_img_size=True,
    allow_non_recommended_settings=True,
).to("cuda")

Dense Attention Map Extraction

Most ViT tile encoders can also return their frozen per-head prefix-token self-attention as a dense spatial grid. A frozen ViT’s CLS-token attention doubles as a per-pixel feature (Ramchandani et al., arXiv:2602.18747); this is the attention analog of encode_tiles_dense and reuses the same get_dense_transform() (normalization only, geometry preserved).

encode_tiles_attention(batch, *, blocks=(-1,), include_registers=False) accepts a normalized (B, C, H, W) tensor and returns (B, K, h, w).
K = len(blocks) * (1 + M·include_registers) * nh, where nh is the head count and M the model’s register-token count (0 for models without registers). Each channel is one prefix-token query row’s attention over the patch grid for one head — heads are never reduced.
Channels are stacked in the deterministic order [block][cls, reg…][head] (block outer, in the order requested; then CLS, then any register tokens; head innermost). The CLS block (the first nh channels of each block) does not depend on include_registers — registers only append channels.
blocks selects transformer blocks (negative indices count from the end); include_registers adds the register-token query rows (Darcet et al.) as extra channels for models that carry them (e.g. Hibou).
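The channel count and ordering rules above reduce to a little index arithmetic. The following sketch is a worked illustration under the stated layout [block][cls, reg…][head]; both functions are hypothetical helpers, and the head and register counts are example values rather than those of any particular encoder.

```python
def attention_channel_count(
    num_blocks: int, num_heads: int, num_registers: int = 0, include_registers: bool = False
) -> int:
    """K = len(blocks) * (1 + M * include_registers) * nh."""
    queries = 1 + (num_registers if include_registers else 0)
    return num_blocks * queries * num_heads


def attention_channel_index(
    block_pos: int,
    query_pos: int,
    head: int,
    num_heads: int,
    num_registers: int = 0,
    include_registers: bool = False,
) -> int:
    """Flat channel index for one (block, query row, head) triple.

    block_pos is the position within the requested blocks tuple;
    query_pos is 0 for CLS and 1..M for register tokens.
    """
    queries = 1 + (num_registers if include_registers else 0)
    if not 0 <= query_pos < queries:
        raise ValueError("query_pos out of range")
    if not 0 <= head < num_heads:
        raise ValueError("head out of range")
    # Block is the outer axis, then query row, then head innermost.
    return (block_pos * queries + query_pos) * num_heads + head


# Example values: 12 heads, 4 register tokens, blocks=(-2, -1).
k_default = attention_channel_count(1, 12)
k_full = attention_channel_count(2, 12, num_registers=4, include_registers=True)
idx = attention_channel_index(1, 0, 3, 12, num_registers=4, include_registers=True)
print(k_default, k_full, idx)
```

With these example values, the default call yields 12 channels, the two-block call with registers yields 2 × 5 × 12 = 120, and the CLS row of the second requested block starts at channel 60.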

Example:

import torch
from PIL import Image

from slide2vec.encoders import encoder_registry

encoder = encoder_registry.require("lunit")().to("cuda")
transform = encoder.get_dense_transform()

tile = Image.open("/data/tile.png").convert("RGB")
batch = transform(tile).unsqueeze(0).to(encoder.device)

with torch.no_grad():
    attn = encoder.encode_tiles_attention(batch)  # last block, CLS only

print(attn.shape)  # (1, nh, 28, 28) for a 224 px Lunit tile

Each value is a softmax weight: a slice of one query row over the patch keys, so values are non-negative and a channel’s spatial sum is <= 1 (the prefix-token key columns carry the remaining mass). As with dense extraction, the input must be divisible by the encoder patch size, and unsupported encoders raise NotImplementedError.
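The "spatial sum <= 1" property follows directly from softmax normalization over all keys. The sketch below demonstrates it on a single made-up query row with one prefix (CLS) key and four patch keys; the scores are arbitrary and no encoder is involved.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a 1-D list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One query row for one head: scores over [CLS key, 4 patch keys] (made up).
num_prefix_tokens = 1
scores = [2.0, 0.5, 1.0, -0.3, 0.1]
weights = softmax(scores)

prefix_mass = sum(weights[:num_prefix_tokens])
patch_weights = weights[num_prefix_tokens:]  # the slice kept as a channel
patch_mass = sum(patch_weights)
print(patch_mass, prefix_mass)
```

The full row sums to 1, so dropping the prefix-token key columns leaves a patch-grid channel whose sum is 1 minus the prefix mass.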

Implementation note: timm ViTs run a fused SDPA kernel that never materializes the attention matrix, so it is recomputed from each block’s own projection (bit-equivalent to the weights the fused kernel applies). HuggingFace encoders read the weights via output_attentions=True, but modern transformers default to an SDPA implementation that silently ignores that flag (it warns and returns no attentions); extraction therefore temporarily switches the model to the eager attention implementation for the forward pass and restores the previous setting afterwards.

Pipeline

Use Pipeline for manifest-driven batch processing and disk outputs:

from slide2vec import ExecutionOptions, Model, Pipeline, PreprocessingConfig

model = Model.from_preset("virchow2")
pipeline = Pipeline(
    model=model,
    preprocessing=PreprocessingConfig(
        requested_spacing_um=0.5,
        requested_tile_size_px=224,
        masks={"min_coverage": {"tissue": 0.1}},
    ),
    execution=ExecutionOptions(output_dir="outputs/demo", num_gpus=2),
)

result = pipeline.run(manifest_path="/path/to/slides.csv")

See Input Manifest for the full manifest schema.

Pipeline.run(...) returns a RunResult:

class slide2vec.RunResult(*, tile_artifacts, hierarchical_artifacts, slide_artifacts, patient_artifacts=<factory>, process_list_path=None)

Bases: object

Return value of Pipeline.run().

tile_artifacts: list[TileEmbeddingArtifact]: Tile embedding artifacts written to disk.

hierarchical_artifacts: list[HierarchicalEmbeddingArtifact]: Hierarchical embedding artifacts; empty when hierarchical mode is disabled.

slide_artifacts: list[SlideEmbeddingArtifact]: Slide embedding artifacts written to disk.

patient_artifacts: list[PatientEmbeddingArtifact]: Patient embedding artifacts; empty when no patient-level model is used.

process_list_path: Path | None = None: Path to process_list.csv, which tracks processing status per sample.

See Output Layout for the full on-disk directory structure and file schemas.