API Guide
Reference for the Python API. See Getting Started for introductory examples.
slide2vec exposes two main workflows:
- direct in-memory embedding with Model.embed_slide() / Model.embed_slides()
- artifact generation with Pipeline.run()
EmbeddedSlide
Model.embed_slide() and Model.embed_slides() return
EmbeddedSlide objects:
- class slide2vec.EmbeddedSlide(*, sample_id, tile_embeddings, slide_embedding, x, y, tile_size_lv0, image_path, mask_path=None, annotation=None, num_tiles=None, mask_preview_path=None, tiling_preview_path=None, latents=None)
Bases: object
In-memory result of embedding a single slide.
- tile_embeddings: Any
Tile embeddings: torch.Tensor of shape (N, D).
- slide_embedding: Any | None
Slide-level embedding: torch.Tensor of shape (D,) for slide-level encoders; None for tile-only encoders.
- annotation: str | None = None
Annotation class this bag of tiles was sampled for: "tissue" for the default tissue-only path, "merged" for the union output mode, or the class name (e.g. "tumor") when annotation-aware sampling fans a slide out into one bag per class. See the annotation-aware sampling documentation.
PreprocessingConfig
- class slide2vec.PreprocessingConfig(*, backend='auto', requested_spacing_um=None, requested_tile_size_px=None, requested_region_size_px=None, region_tile_multiple=None, tolerance=0.05, overlap=0.0, read_coordinates_from=None, read_tiles_from=None, on_the_fly=True, gpu_decode=False, adaptive_batching=False, use_supertiles=True, jpeg_backend='turbojpeg', num_cucim_workers=4, resume=False, segmentation=<factory>, filtering=<factory>, preview=<factory>, masks=<factory>, independent_sampling=True)
Bases: object
Configuration for slide tiling and preprocessing.
- backend: str = 'auto'
Slide reading backend. "auto" tries cucim → openslide → vips in order. Explicit choices: "cucim", "openslide", "vips", "asap".
- requested_spacing_um: float | None = None
Target spacing in µm/px. Resolved from the model preset when None.
- requested_tile_size_px: int | None = None
Tile side length in pixels at requested_spacing_um. Resolved from the model preset when None.
- requested_region_size_px: int | None = None
Parent region side length in pixels (hierarchical mode). Auto-derived as requested_tile_size_px × region_tile_multiple when None.
- region_tile_multiple: int | None = None
Region grid width/height in tiles (e.g. 6 → 6×6 = 36 tiles per region). Enables hierarchical extraction when set; must be ≥ 2.
- tolerance: float = 0.05
Relative spacing tolerance for pyramid level selection (default 0.05).
- overlap: float = 0.0
Fractional tile overlap (0.0 = no overlap).
- read_coordinates_from: Path | None = None
Directory containing pre-extracted tile coordinates to reuse, skipping tiling.
- read_tiles_from: Path | None = None
Directory containing pre-extracted tile images to skip the tiling step entirely.
- on_the_fly: bool = True
Read and decode tiles on demand rather than pre-loading into memory.
- gpu_decode: bool = False
Decode tiles on the GPU via CuCIM / nvImageCodec when True.
- adaptive_batching: bool = False
Dynamically adjust batch size based on tile count.
- use_supertiles: bool = True
Group adjacent tiles into supertile batches for faster I/O.
- jpeg_backend: str = 'turbojpeg'
JPEG decode library: "turbojpeg" (default) or "pillow".
- num_cucim_workers: int = 4
Number of CuCIM reader threads.
- resume: bool = False
Skip slides already present in the output directory when True.
- segmentation: dict[str, Any]
Forwarded to hs2p segmentation config. Supported keys: method, downsample, sam2_device. See Preprocessing for details.
- preview: dict[str, Any]
Controls whether hs2p writes mask and tiling preview images. Keys: save_mask_preview, save_tiling_preview, downsample.
- masks: dict[str, Any]
Annotation-mask vocabulary forwarded to hs2p’s sampling resolver. Keys: output_mode, pixel_mapping, colors, min_coverage. A partial mapping is deep-merged over DEFAULT_MASKS, so callers only state what they override (e.g. {"min_coverage": {"tissue": 0.1}}). The default {background, tissue} block is plain tissue tiling; min_coverage.tissue is the single source of truth for the tissue threshold.
- independent_sampling: bool = True
When annotation sampling is active, tile each class independently (True) vs jointly across classes (False).
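The "auto" order given for backend (cucim → openslide → vips) can be read as a first-available-wins search. The sketch below only illustrates that idea: resolve_backend, the is_available hook, and the import names in AUTO_ORDER are assumptions made for this example, not slide2vec internals.

```python
from importlib.util import find_spec
from typing import Callable

# (backend name, assumed Python import name), in the documented "auto" order.
AUTO_ORDER = (("cucim", "cucim"), ("openslide", "openslide"), ("vips", "pyvips"))


def _module_installed(module_name: str) -> bool:
    """Return True when the module can be located, without importing it."""
    return find_spec(module_name) is not None


def resolve_backend(
    backend: str = "auto",
    is_available: Callable[[str], bool] = _module_installed,
) -> str:
    """Hypothetical helper: map "auto" to the first backend whose module is found."""
    if backend != "auto":
        return backend  # explicit choices pass through unchanged
    for name, module_name in AUTO_ORDER:
        if is_available(module_name):
            return name
    raise RuntimeError("no slide reading backend is installed")


# Pretend only pyvips is installed: the search falls through to "vips".
print(resolve_backend("auto", is_available=lambda module: module == "pyvips"))  # vips
```

Passing the availability check as a parameter keeps the ordering logic testable on machines where none of the backends are installed.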
For a full breakdown of backends, segmentation methods, and preview options, see Preprocessing.
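The deep-merge behaviour described for masks can be sketched in a few lines. This is a minimal illustration, not the library's implementation: deep_merge is a hypothetical helper, and the values in DEFAULT_MASKS below are placeholders rather than the real defaults.

```python
from copy import deepcopy
from typing import Any

# Placeholder defaults for illustration only; the real DEFAULT_MASKS may differ.
DEFAULT_MASKS: dict[str, Any] = {
    "pixel_mapping": {"background": 0, "tissue": 1},
    "min_coverage": {"tissue": 0.01},
}


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of base with override merged in, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# The caller states only the key it overrides; everything else is inherited.
masks = deep_merge(DEFAULT_MASKS, {"min_coverage": {"tissue": 0.1}})
print(masks["min_coverage"])   # {'tissue': 0.1}
print(masks["pixel_mapping"])  # {'background': 0, 'tissue': 1}
```

Merging into a copy leaves the defaults untouched, so repeated configurations do not leak overrides into each other.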
ExecutionOptions
- class slide2vec.ExecutionOptions(*, output_dir=None, output_format='pt', batch_size=32, num_workers_per_gpu=None, num_preprocessing_workers=None, num_gpus=None, precision=None, prefetch_factor=4, save_tile_embeddings=False, save_slide_embeddings=False, save_latents=False)
Bases: object
Runtime execution and output settings.
- num_workers_per_gpu: int | None = None
DataLoader worker count per GPU rank. None means auto (capped by CPU / SLURM limit, then split across the resolved GPU count).
- num_preprocessing_workers: int | None = None
Tiling worker count. None means auto (capped by CPU / SLURM limit).
- precision: str | None = None
Forward-pass dtype: "fp16", "bf16", "fp32", or None (auto-determined from the model preset).
- save_tile_embeddings: bool = False
Persist tile embeddings to disk when running a slide-level model.
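One plausible reading of the "auto" rule for num_workers_per_gpu (cap by the CPU / SLURM limit, then split across GPUs) is sketched below. resolve_workers_per_gpu is a hypothetical helper, and using the SLURM_CPUS_PER_TASK environment variable as the SLURM limit is an assumption; the library may resolve the budget differently.

```python
import os
from typing import Optional


def resolve_workers_per_gpu(num_gpus: int, cpu_limit: Optional[int] = None) -> int:
    """Hypothetical helper: cap by the CPU budget, then split across GPU ranks."""
    if cpu_limit is None:
        cpu_limit = os.cpu_count() or 1
        # SLURM exports SLURM_CPUS_PER_TASK when --cpus-per-task is requested.
        slurm_cpus = os.environ.get("SLURM_CPUS_PER_TASK")
        if slurm_cpus:
            cpu_limit = min(cpu_limit, int(slurm_cpus))
    # Always keep at least one worker per rank, even on small allocations.
    return max(1, cpu_limit // max(1, num_gpus))


print(resolve_workers_per_gpu(num_gpus=2, cpu_limit=16))  # 8
```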
Patient-level embedding
For patient-level models, use Model.embed_patient() for a single patient
or Model.embed_patients() for a batch.
Single patient
from slide2vec import Model

model = Model.from_preset("moozy")
result = model.embed_patient(
    ["/data/slide_1a.svs", "/data/slide_1b.svs"],
    patient_id="patient_1",
)
print(result.patient_id)  # "patient_1"
print(result.patient_embedding.shape)  # torch.Size([768])
print(result.slide_embeddings)  # {"slide_1a": tensor, "slide_1b": tensor}
Multiple patients
results = model.embed_patients(
    [
        {"sample_id": "slide_1a", "image_path": "/data/slide_1a.svs", "patient_id": "patient_1"},
        {"sample_id": "slide_1b", "image_path": "/data/slide_1b.svs", "patient_id": "patient_1"},
        {"sample_id": "slide_2a", "image_path": "/data/slide_2a.svs", "patient_id": "patient_2"},
    ]
)
for r in results:
    print(r.patient_id, r.patient_embedding.shape)
embed_patients(...) returns one EmbeddedPatient per unique patient,
ordered by first appearance.
- class slide2vec.EmbeddedPatient(*, patient_id, patient_embedding, slide_embeddings)
Bases: object
In-memory result of embedding a single patient.
- patient_embedding: Any
Aggregated patient embedding: torch.Tensor of shape (D,).
- slide_embeddings: dict[str, Any]
Slide-level embeddings keyed by sample_id, each a torch.Tensor of shape (D,).
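The grouping rule stated for embed_patients(...) (one result per unique patient, ordered by first appearance) can be illustrated without the library. group_by_patient is a hypothetical helper written for this sketch; it only shows the ordering behaviour, not how embeddings are aggregated.

```python
def group_by_patient(records: list[dict[str, str]]) -> dict[str, list[str]]:
    """Map patient_id to its sample_ids, keeping patients in first-appearance order."""
    groups: dict[str, list[str]] = {}
    for record in records:
        # dicts preserve insertion order, so a patient's first record fixes its rank.
        groups.setdefault(record["patient_id"], []).append(record["sample_id"])
    return groups


records = [
    {"sample_id": "slide_1a", "image_path": "/data/slide_1a.svs", "patient_id": "patient_1"},
    {"sample_id": "slide_2a", "image_path": "/data/slide_2a.svs", "patient_id": "patient_2"},
    {"sample_id": "slide_1b", "image_path": "/data/slide_1b.svs", "patient_id": "patient_1"},
]
print(group_by_patient(records))
# {'patient_1': ['slide_1a', 'slide_1b'], 'patient_2': ['slide_2a']}
```

Note that patient_1 stays first even though its second slide appears after patient_2's slide.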
Hierarchical Feature Extraction
Enable hierarchical mode by setting region_tile_multiple in
PreprocessingConfig:
from slide2vec import PreprocessingConfig

preprocessing = PreprocessingConfig(
    requested_spacing_um=0.5,
    requested_tile_size_px=224,
    region_tile_multiple=6,  # 6×6 = 36 tiles per region
)
The tile embeddings tensor will have shape (R, T, D) instead of (N, D).
See Hierarchical Features for the full explanation.
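The arithmetic behind the configuration above follows from the PreprocessingConfig field descriptions: the region side is the tile size times region_tile_multiple, and each region holds the square of that multiple in tiles. region_geometry is a hypothetical helper for illustration; reading T in (R, T, D) as the tiles-per-region count is an assumption here.

```python
def region_geometry(tile_size_px: int, region_tile_multiple: int) -> tuple[int, int]:
    """Return (region side length in px, tiles per region) for a square tile grid."""
    if region_tile_multiple < 2:
        raise ValueError("region_tile_multiple must be >= 2")
    region_size_px = tile_size_px * region_tile_multiple
    tiles_per_region = region_tile_multiple ** 2
    return region_size_px, tiles_per_region


region_size_px, tiles_per_region = region_geometry(224, 6)
print(region_size_px, tiles_per_region)  # 1344 36
```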
Dense Tile Feature Extraction
Some tile encoders can return the spatial grid of ViT patch-token features instead of a single pooled vector per tile. This is useful for dense downstream tasks where patch-token features must stay registered to the input tile.
Dense extraction is a low-level encoder API:
- get_dense_transform() applies the encoder’s photometric normalization without resize or center-crop, so tile geometry is preserved.
- encode_tiles_dense(batch) accepts a normalized (B, C, H, W) tensor and returns (B, d, h, w). h and w are resolved from the input size and encoder patch size (for example, a 224 px tile with an 8 px patch size returns a 28 x 28 grid).
Example:
import torch
from PIL import Image
from slide2vec.encoders import encoder_registry

encoder = encoder_registry.require("lunit")().to("cuda")
transform = encoder.get_dense_transform()
tile = Image.open("/data/tile.png").convert("RGB")
batch = transform(tile).unsqueeze(0).to(encoder.device)
with torch.no_grad():
    dense = encoder.encode_tiles_dense(batch)
print(dense.shape)  # torch.Size([1, 384, 28, 28]) for a 224 px Lunit tile
The dense transform deliberately does not resize, crop, or pad. The input
height and width passed to encode_tiles_dense must be divisible by the
encoder patch size, unless the specific encoder is pinned to a native input
size. Unsupported encoders raise NotImplementedError.
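Because the dense transform never resizes or pads, it can help to check tile dimensions before building a batch. The sketch below mirrors the divisibility rule stated above; dense_grid_shape is a hypothetical helper, not part of slide2vec, and it ignores the native-input-size exception mentioned for pinned encoders.

```python
def dense_grid_shape(height: int, width: int, patch_size: int) -> tuple[int, int]:
    """Return the (h, w) patch grid for an input, or raise if it is not divisible."""
    if height % patch_size or width % patch_size:
        raise ValueError(
            f"input {height}x{width} is not divisible by patch size {patch_size}"
        )
    return height // patch_size, width // patch_size


print(dense_grid_shape(224, 224, 8))  # (28, 28)
```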
For H-Optimus encoders, non-native dense extraction requires opting into the variable-size model setting:
encoder = encoder_registry.require("h-optimus-0")(
    dynamic_img_size=True,
    allow_non_recommended_settings=True,
).to("cuda")
Dense Attention Map Extraction
Most ViT tile encoders can also return their frozen per-head prefix-token
self-attention as a dense spatial grid. A frozen ViT’s CLS-token attention
doubles as a per-pixel feature (Ramchandani et al.,
arXiv:2602.18747); this is the attention
analog of encode_tiles_dense and reuses the same get_dense_transform()
(normalization only, geometry preserved).
- encode_tiles_attention(batch, *, blocks=(-1,), include_registers=False) accepts a normalized (B, C, H, W) tensor and returns (B, K, h, w).
- K = len(blocks) * (1 + M·include_registers) * nh, where nh is the head count and M the model’s register-token count (0 for models without registers). Each channel is one prefix-token query row’s attention over the patch grid for one head; heads are never reduced.
- Channels are stacked in the deterministic order [block][cls, reg…][head] (block outer, in the order requested; then CLS, then any register tokens; head innermost). The CLS block (the first nh channels of each block) does not depend on include_registers; registers only append channels.
- blocks selects transformer blocks (negative indices count from the end); include_registers adds the register-token query rows (Darcet et al.) as extra channels for models that carry them (e.g. Hibou).
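The channel count and ordering described above reduce to simple index arithmetic. The helpers below are hypothetical and exist only to make the [block][cls, reg…][head] layout concrete; the head and register counts used in the usage lines are example numbers, not properties of any particular encoder.

```python
def attention_channel_count(
    num_blocks: int, num_heads: int, num_registers: int = 0, include_registers: bool = False
) -> int:
    """K = len(blocks) * (1 + M * include_registers) * nh."""
    tokens = 1 + (num_registers if include_registers else 0)
    return num_blocks * tokens * num_heads


def attention_channel_index(
    block_pos: int,
    token_pos: int,
    head: int,
    *,
    num_heads: int,
    num_registers: int = 0,
    include_registers: bool = False,
) -> int:
    """Flat channel index; block_pos is the position within the requested blocks tuple.

    token_pos 0 is the CLS query row; 1..M are register query rows (if included).
    """
    tokens = 1 + (num_registers if include_registers else 0)
    return (block_pos * tokens + token_pos) * num_heads + head


# Example numbers: 2 requested blocks, 6 heads, 4 register tokens.
print(attention_channel_count(2, 6, num_registers=4, include_registers=True))  # 60
print(attention_channel_index(1, 0, 0, num_heads=6, num_registers=4, include_registers=True))  # 30
```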
Example:
import torch
from PIL import Image
from slide2vec.encoders import encoder_registry

encoder = encoder_registry.require("lunit")().to("cuda")
transform = encoder.get_dense_transform()
tile = Image.open("/data/tile.png").convert("RGB")
batch = transform(tile).unsqueeze(0).to(encoder.device)
with torch.no_grad():
    attn = encoder.encode_tiles_attention(batch)  # last block, CLS only
print(attn.shape)  # (1, nh, 28, 28) for a 224 px Lunit tile
Each value is a softmax weight: a slice of one query row over the patch keys, so
values are non-negative and a channel’s spatial sum is <= 1 (the prefix-token
key columns carry the remaining mass). As with dense extraction, the input must
be divisible by the encoder patch size, and unsupported encoders raise
NotImplementedError.
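The reason a channel sums to at most 1 can be seen on a toy attention row. The sketch below uses made-up logits for one query row of one head, with a single prefix (CLS) key followed by four patch keys; it is plain Python, not the library's extraction code.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one attention row."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


num_prefix = 1  # key columns belonging to prefix tokens (here: CLS only)
row = softmax([2.0, 0.5, 1.0, -1.0, 0.0])  # arbitrary example logits

patch_weights = row[num_prefix:]     # the slice that becomes one (h, w) channel
prefix_mass = sum(row[:num_prefix])  # the mass left on the prefix-token keys

print(sum(patch_weights) <= 1.0)  # True
```

The full row sums to 1, so the patch slice can only reach 1 when the prefix keys receive no attention at all.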
Implementation note: timm ViTs run a fused SDPA kernel that never materializes
the attention matrix, so it is recomputed from each block’s own projection
(bit-equivalent to the weights the fused kernel applies). HuggingFace encoders
read the weights via output_attentions=True, but modern transformers
default to an SDPA implementation that silently ignores that flag (it warns and
returns no attentions); extraction therefore temporarily switches the model to
the eager attention implementation for the forward pass and restores the
previous setting afterwards.
Pipeline
Use Pipeline for manifest-driven batch processing and disk
outputs:
from slide2vec import ExecutionOptions, Model, Pipeline, PreprocessingConfig

model = Model.from_preset("virchow2")
pipeline = Pipeline(
    model=model,
    preprocessing=PreprocessingConfig(
        requested_spacing_um=0.5,
        requested_tile_size_px=224,
        masks={"min_coverage": {"tissue": 0.1}},
    ),
    execution=ExecutionOptions(output_dir="outputs/demo", num_gpus=2),
)
result = pipeline.run(manifest_path="/path/to/slides.csv")
See Input Manifest for the full manifest schema.
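A manifest file can be produced with the standard csv module. The sketch below assumes the manifest uses sample_id and image_path columns, mirroring the record keys shown for embed_patients(...); that column set is an assumption, and the Input Manifest page remains the authority on the actual schema. write_manifest is a hypothetical helper.

```python
import csv
import tempfile
from pathlib import Path

# Assumed column names; check the Input Manifest page for the real schema.
FIELDNAMES = ["sample_id", "image_path"]


def write_manifest(rows: list[dict[str, str]], path: Path) -> Path:
    """Write one CSV row per slide and return the manifest path."""
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
    return path


rows = [
    {"sample_id": "slide_1a", "image_path": "/data/slide_1a.svs"},
    {"sample_id": "slide_2a", "image_path": "/data/slide_2a.svs"},
]

# Round-trip through a temporary directory to confirm the file parses back.
with tempfile.TemporaryDirectory() as tmp:
    manifest = write_manifest(rows, Path(tmp) / "slides.csv")
    with manifest.open(newline="") as handle:
        loaded = list(csv.DictReader(handle))

print(loaded[0]["sample_id"])  # slide_1a
```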
Pipeline.run(...) returns a RunResult:
- class slide2vec.RunResult(*, tile_artifacts, hierarchical_artifacts, slide_artifacts, patient_artifacts=<factory>, process_list_path=None)
Bases: object
Return value of Pipeline.run().
- hierarchical_artifacts: list[HierarchicalEmbeddingArtifact]
Hierarchical embedding artifacts; empty when hierarchical mode is disabled.
See Output Layout for the full on-disk directory structure and file schemas.