A curated list of open-source packages I develop and maintain.

Packages


Model2Vec

State-of-the-art static embedding models distilled from sentence transformers, designed for extremely fast CPU inference. Model2Vec produces tiny models (as small as 4 MB) that deliver high-quality text embeddings and can process thousands of texts per second.


SemHash

A lightweight, multimodal library for semantic deduplication and filtering of datasets based on embedding similarity. SemHash can be used to remove near-duplicates, detect overlap between splits, or clean large corpora before training. Text works out of the box, and images, audio, and other modalities are supported with custom encoders.
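
To make the idea of similarity-based deduplication concrete, here is a minimal NumPy sketch. It is not SemHash's API or implementation: the `deduplicate` function, the greedy keep-first strategy, and the `threshold` value are assumptions chosen for illustration only.

```python
import numpy as np


def deduplicate(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Return the indices of rows to keep (hypothetical helper, not SemHash).

    Greedy strategy: a row is kept unless its cosine similarity to an
    already-kept row reaches the threshold.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    kept: list[int] = []
    for i, vec in enumerate(unit):
        # Skip this row if it is a near-duplicate of anything kept so far.
        if kept and np.max(unit[kept] @ vec) >= threshold:
            continue
        kept.append(i)
    return kept


# Toy embeddings: the second row is almost identical to the first.
vectors = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
kept_indices = deduplicate(vectors, threshold=0.95)
print(kept_indices)
```

The same comparison can be run between two datasets (for example a train and a test split) to detect overlap; this sketch covers only the single-dataset case and compares every pair, which would not scale to large corpora.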


Pyversity

A fast, lightweight library for diversifying retrieval results using classical diversification strategies. Pyversity provides a unified API for methods such as MMR, MSD, DPP, COVER, and SSD, with NumPy as its only dependency.
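
As an illustration of one of the strategies named above, the following is a minimal NumPy sketch of Maximal Marginal Relevance (MMR). It does not use Pyversity's API: the `mmr` function, its parameters, and the `lambda_` trade-off value are assumptions made for this example.

```python
import numpy as np


def mmr(relevance, embeddings, k, lambda_=0.5):
    """Select k item indices with MMR (hypothetical helper, not Pyversity).

    Each step picks the candidate maximizing
    lambda_ * relevance - (1 - lambda_) * max similarity to selected items.
    """
    relevance = np.asarray(relevance, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)

    # Cosine similarity between all pairs of items.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    selected: list[int] = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        if selected:
            # Redundancy: highest similarity to anything already selected.
            redundancy = sim[np.ix_(candidates, selected)].max(axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        scores = lambda_ * relevance[candidates] - (1 - lambda_) * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected


# Items 0 and 1 are near-duplicates; item 2 is less relevant but different.
picked = mmr(
    relevance=[0.9, 0.85, 0.5],
    embeddings=[[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]],
    k=2,
)
print(picked)
```

With `lambda_=0.5`, the second pick skips the near-duplicate of the top result in favor of the dissimilar item; setting `lambda_=1.0` reduces the procedure to plain relevance ranking.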


Potion

Tiny state-of-the-art static embedding models built with Model2Vec. Potion models are optimized for speed and efficiency, with pre-trained models available for English, multilingual, and retrieval tasks.


Vicinity

A unified nearest-neighbor search interface that provides a consistent API over multiple vector search backends. Vicinity is designed to make it easy to switch or compare ANN implementations without changing application code.
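
The design idea of one interface over interchangeable backends can be sketched as follows. This is not Vicinity's API: the `Backend` protocol, both backend classes, and the `nearest` function are hypothetical names, and both backends are exact searches standing in for real ANN implementations.

```python
from typing import Protocol

import numpy as np


class Backend(Protocol):
    """Hypothetical interface every backend must satisfy."""

    def query(self, vector: np.ndarray, k: int) -> list[int]: ...


class FullSortBackend:
    """Exact search that sorts all distances."""

    def __init__(self, vectors) -> None:
        self.vectors = np.asarray(vectors, dtype=float)

    def query(self, vector: np.ndarray, k: int) -> list[int]:
        distances = np.linalg.norm(self.vectors - vector, axis=1)
        return [int(i) for i in np.argsort(distances)[:k]]


class PartitionBackend:
    """Exact search that partitions first, then sorts only the top k."""

    def __init__(self, vectors) -> None:
        self.vectors = np.asarray(vectors, dtype=float)

    def query(self, vector: np.ndarray, k: int) -> list[int]:
        k = min(k, len(self.vectors))
        distances = np.linalg.norm(self.vectors - vector, axis=1)
        top = np.argpartition(distances, k - 1)[:k]
        return [int(i) for i in top[np.argsort(distances[top])]]


def nearest(backend: Backend, vector, k: int = 2) -> list[int]:
    """Application code: depends only on the interface, not the backend."""
    return backend.query(np.asarray(vector, dtype=float), k)


data = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
result_a = nearest(FullSortBackend(data), [0.9, 0.9])
result_b = nearest(PartitionBackend(data), [0.9, 0.9])
print(result_a, result_b)
```

Because `nearest` only relies on the `query` method, swapping or benchmarking backends requires changing the object passed in and nothing else.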


Agentcheck

A fast, read-only tool that scans your shell and reports what an AI agent could access: cloud IAM, API keys, Kubernetes, local tools, and more. Every finding is tagged by severity, and Agentcheck can be used as a safety hook or integrated into CI/CD pipelines.


Tokenlearn

A pre-training method and tooling for learning compact static embeddings used in distillation pipelines. Tokenlearn focuses on efficiently learning token-level representations that transfer well to downstream static models.


Model2Vec-rs

A Rust implementation of Model2Vec for performance-critical and native Rust use cases. This package mirrors the Python version while emphasizing speed, memory efficiency, and Rust ecosystem integration.