Summary
Head of AI Engineering at Springer Nature, where I lead a team building AI-driven solutions that improve scientific publishing. I specialize in recommender systems, information retrieval, and building practical ML systems that scale.
Co-founder of Minish, an open-source ML lab focused on efficient, eco-friendly models. Creator of Model2Vec, SemHash, Vicinity, and the Potion embedding models.
Experience
Springer Nature — Head of AI Engineering
Dec 2024 - Present • Groningen, Netherlands
Leading a team of engineers building AI-driven solutions for scientific publishing. Currently working on research discovery tools, compliance screening systems, and driving the adoption of generative AI across the organization. Responsible for team growth, technical strategy, and bringing ML products from research to production.
Springer Nature — Senior Machine Learning Engineer
Oct 2023 - Dec 2024 • Groningen
The Slimmer AI science division was acquired by Springer Nature after the two had worked together successfully on ML products for a number of years. I continued in my role as technical lead of the recommenders team.
Minish — Co-founder
Sep 2024 - Present (part-time) • Groningen, Netherlands
Developing open-source ML software with a focus on efficiency and eco-friendly models. Currently working on Model2Vec, SemHash, Pyversity, Potion, Vicinity, and other packages. See Projects page for details.
GitHub org: MinishLab
Slimmer AI — Machine Learning Engineer
May 2021 - Oct 2023 • Groningen
Worked as a machine learning engineer and tech lead of the recommenders team, developing ML products in the scientific domain.
Gronalytica — Co-founder
Jun 2019 - May 2021 • Groningen, Netherlands
Developed a platform for automatic grading of pharmaceutical exams at pharmagrader.com. Funded by SNN.
iChoice4U BV — Data Scientist
Dec 2017 - May 2019 • UMCG, Groningen
Worked as a data scientist/research assistant on data from Pscribe.
Education
University of Groningen — Master’s degree, Artificial Intelligence
2018 - 2020
Graduated cum laude (9.5 thesis, 8.4 average grade).
Thesis title: “Quality Prediction of Scientific Documents Using Textual and Visual Content”
University of Groningen — Bachelor’s degree, Information Science
2015 - 2018
Open Source Projects
Model2Vec — A distillation framework for creating state-of-the-art static embeddings from Sentence Transformers. Encodes up to 20k embeddings/s on CPU with a 10% MTEB improvement over prior static embedding baselines.
SemHash — A package for semantic deduplication and filtering of large datasets based on embedding similarity. Can be used to remove near-duplicates, detect overlap between splits, or clean large corpora before training.
Pyversity — A unified interface for optimized diversity algorithms in Python, designed for improving search & retrieval results.
Potion models — Tiny state-of-the-art static embedding models. Pre-trained models include English, Multilingual, and retrieval models.
Vicinity — A unified interface for many approximate nearest neighbor (ANN) algorithms, with built-in evaluation tools.
Publications
Thomas van Dongen, Gideon Maillette de Buy Wenniger, Lambert Schomaker. MultiSChuBERT: Effective Multimodal Fusion for Scholarly Document Quality Prediction. arXiv:2308.07971, 2023. DOI
Thomas van Dongen, Gideon Maillette de Buy Wenniger, Lambert Schomaker. SChuBERT: Scholarly Document Chunks with BERT-encoding Boost Citation Count Prediction. ACL SDP Workshop, 2020. DOI
Gideon Maillette de Buy Wenniger, Thomas van Dongen, et al. Structure-Tags Improve Text Classification for Scholarly Document Quality Prediction. ACL SDP Workshop, 2020. DOI
Technical Writing
- Model2Vec: Distill a Small Fast Model from any Sentence Transformer — Hugging Face Blog, 2024
- Demystifying Efficient Self-Attention — Towards Data Science, 2022
- Overcoming Input Length Constraints of Transformers — Towards Data Science, 2020
- Minish technical blog — Technical posts about our open-source work
Media & Talks
Weaviate Podcast — Pyversity with Thomas van Dongen
Podcast about Pyversity and diversity in recommender systems.
YouTube • Spotify
Probabl Podcast — Time for some (extreme) distillation
Podcast about Model2Vec and embedding models.
YouTube • Spotify
Contact
- Email: thomasvdongen@proton.me
- GitHub: @Pringled
- LinkedIn: thomas-van-dongen
- Google Scholar: Profile
- X/Twitter: @thomas_v_dongen