# 0010 — Vector embeddings + image search via OpenSearch k-NN
- Status: accepted
- Date: 2026-04-26
- Deciders: @kackey621, Willen Federation contributors
## Context and Problem Statement
The M6 scope requires:
- Image-based item search — operators upload a photo of a product, the system returns the closest matches (an image-embedding nearest-neighbour query).
- Similar-product suggestions at registration time — when an operator enters a new item, surface candidates likely to be duplicates or variants.
- Keyword search — full-text queries across item titles, descriptions, attribute values, and storage location.
We need an indexing tier that handles both vector similarity and full-text scoring, ships as a single component, and runs reasonably on shared / single-host deployments.
## Decision Drivers
- One index, both queries — running a vector DB next to a search engine doubles operations.
- Self-hostable — operators must be able to run it on their own boxes; no SaaS dependencies on the data path.
- MariaDB compatibility — our primary store is MariaDB (cf. ADR 0007) and it has no native vector type. Embeddings must live elsewhere.
- Composes with ADR 0009 — embedding production is independent of storage.
## Considered Options
### Option A — pgvector (would require migrating to PostgreSQL)
- (+) Mature, single-store solution.
- (−) Requires migrating off MariaDB. Out of scope and would invalidate every existing migration.
### Option B — Dedicated vector DB (Qdrant, Milvus, Weaviate)
- (+) Best-in-class vector performance.
- (−) Adds a second deployable component. Full-text search still needs another tool (or we accept poor LIKE-based queries).
- (−) Two operational dependencies for the operator.
### Option C — OpenSearch with k-NN plugin
OpenSearch ships native k-NN (HNSW) and Lucene full-text in one engine. Indexing an item produces both a `text` field (for BM25 scoring on title/description/attributes) and a `knn_vector` field (for image / text embedding similarity). One query API, one operational footprint.
- (+) Single component for both query types.
- (+) Apache 2.0, fully self-hostable, supports JVM-tunable resource limits.
- (+) Mature PHP client (`opensearch-project/opensearch-php`).
- (+) Composes with the OpenFeature flag for gradual rollout — when the index is unavailable the application falls back to a `NullSearchIndex` (MariaDB `LIKE`).
- (−) JVM dependency means another container in `docker-compose.yml`. Acceptable cost; we already run MariaDB and (optionally) Keycloak.
- (−) Index rebuilds are slow for large catalogues. M6 ships an offline reindex script (`scripts/reindex_opensearch.php`).
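To illustrate the single-API claim: the image-search path is one `_search` call against the same index that serves BM25 keyword queries. A sketch of the k-NN query body (field names follow the index design below; the vector is truncated for illustration):

```json
POST /saso_items/_search
{
  "size": 10,
  "query": {
    "knn": {
      "image_embedding": {
        "vector": [0.12, -0.04, 0.33],
        "k": 10
      }
    }
  }
}
```

A keyword query is the standard `match`/`multi_match` DSL over `title` and `description` on the same index, so one client and one endpoint cover both query types.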
## Decision Outcome
Chosen option: C — OpenSearch with k-NN.
### Index design
Two primary indices:
- `saso_items` — one document per `Item`. Fields:
  - `id` (long, pk)
  - `title` (text + keyword sub-field)
  - `description` (text)
  - `barcode` (keyword) — ISBN / EAN / JAN / UPC
  - `category_path` (keyword array)
  - `storage_location_code` (keyword)
  - `attributes` (nested: `name` keyword + `value` text)
  - `text_embedding` (knn_vector, dim = `embedding.dim` from `system_setting`)
  - `image_embedding` (knn_vector, dim same as above; multi-valued for items with multiple photos)
- `saso_storage_locations` — one document per location code (warehouse → row → column → bin). Used by location autocomplete.
Embedding dimensionality is configurable so operators can switch models (e.g. 1536 for `text-embedding-3-small`, 768 for `gemini-embedding-001`) without altering the index by hand. The reindex script reads `ai.embedding.dim` and rebuilds.
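A sketch of the resulting `saso_items` mapping (console format, abbreviated to a few fields; `768` stands in for the configured `ai.embedding.dim`, and the HNSW parameters are illustrative, not decided here):

```json
PUT /saso_items
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "title":       { "type": "text", "fields": { "raw": { "type": "keyword" } } },
      "description": { "type": "text" },
      "barcode":     { "type": "keyword" },
      "image_embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": { "name": "hnsw", "engine": "lucene", "space_type": "cosinesimil" }
      }
    }
  }
}
```

Because `dimension` is baked into the mapping, switching models is exactly the rebuild the reindex script performs — the mapping cannot be altered in place.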
### Code shape
```text
src/
├── Domain/
│   └── Search/
│       ├── SearchIndex.php        # interface
│       ├── SearchQuery.php        # value object
│       ├── SearchResult.php       # value object
│       └── SimilarityRequest.php  # vector + k
└── Infrastructure/
    └── Search/
        ├── OpenSearchSearchIndex.php
        ├── NullSearchIndex.php    # fallback when OpenSearch is down
        └── OpenSearchClientFactory.php
```
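A minimal sketch of the `SearchIndex` contract and the `NullSearchIndex` fallback. Only `upsert()`/`delete()` are named elsewhere in this ADR; the `search()` signature and the substring matching are assumptions standing in for the real MariaDB `LIKE` path:

```php
<?php

interface SearchIndex
{
    /** @return string[] matching item ids, best match first */
    public function search(string $query, int $limit = 20): array;

    public function upsert(string $id, array $document): void;

    public function delete(string $id): void;
}

/** Degraded fallback used when OpenSearch is unreachable. */
final class NullSearchIndex implements SearchIndex
{
    /** @var array<string, array> in-memory stand-in for the MariaDB rows */
    private array $rows = [];

    public function upsert(string $id, array $document): void
    {
        $this->rows[$id] = $document;
    }

    public function delete(string $id): void
    {
        unset($this->rows[$id]);
    }

    public function search(string $query, int $limit = 20): array
    {
        // Naive case-insensitive substring match — the moral
        // equivalent of WHERE title LIKE '%query%' in MariaDB.
        $hits = [];
        foreach ($this->rows as $id => $doc) {
            $haystack = ($doc['title'] ?? '') . ' ' . ($doc['description'] ?? '');
            if (stripos($haystack, $query) !== false) {
                $hits[] = (string) $id;
            }
        }

        return array_slice($hits, 0, $limit);
    }
}
```

`OpenSearchSearchIndex` implements the same interface, which is what lets the controller swap implementations behind the OpenFeature flag without touching callers.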
### Indexing pipeline
Item writes (insert, update, delete) emit a domain event; a Symfony Messenger handler (cf. ADR 0013) calls `SearchIndex::upsert()` / `delete()`. The embedding payload is produced via `AiAssistant::embed()` (ADR 0009) — async, so the user-facing transaction commits without waiting on the embedding call.
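The handler shape, sketched. The event and handler class names, repository, and document fields are illustrative assumptions; only `SearchIndex::upsert()`/`delete()` and `AiAssistant::embed()` come from this ADR:

```php
<?php

// Domain event emitted on item writes (hypothetical shape).
final class ItemChanged
{
    public function __construct(
        public readonly int $itemId,
        public readonly bool $deleted = false,
    ) {}
}

// Messenger handler: runs async, after the item transaction committed.
final class IndexItemOnChange
{
    public function __construct(
        private object $items, // repository: find(int): array document
        private object $ai,    // AiAssistant (ADR 0009): embed(string): float[]
        private object $index, // SearchIndex: upsert()/delete()
    ) {}

    public function __invoke(ItemChanged $event): void
    {
        if ($event->deleted) {
            $this->index->delete((string) $event->itemId);
            return;
        }

        $doc = $this->items->find($event->itemId);
        // The embedding call happens here, off the user-facing request,
        // so a slow or down provider never blocks the item write.
        $doc['text_embedding'] = $this->ai->embed(
            ($doc['title'] ?? '') . ' ' . ($doc['description'] ?? '')
        );
        $this->index->upsert((string) $event->itemId, $doc);
    }
}
```

The delete-before-enrich ordering matters: a deletion needs no embedding, so it short-circuits before any provider call.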
### Cache layer
A Redis cache (cf. ADR 0012) sits in front of the most common queries (top-N popular keyword searches, recent similarity searches). Cache misses fall through to OpenSearch.
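The cache-aside flow, as a sketch. The function, key scheme, and TTL are assumptions; `$cache` stands in for the ADR 0012 Redis client:

```php
<?php

/**
 * Cache-aside wrapper for keyword search: serve hot queries from Redis,
 * fall through to OpenSearch on a miss, then populate the cache.
 *
 * @param object   $cache            get(string): ?array, set(string, array, int): void
 * @param callable $searchOpenSearch fn (string $query): array of result ids
 */
function cachedSearch(object $cache, callable $searchOpenSearch, string $query, int $ttl = 300): array
{
    $key = 'search:' . sha1($query);

    $hit = $cache->get($key);
    if ($hit !== null) {
        return $hit; // hot query served without touching OpenSearch
    }

    $results = $searchOpenSearch($query); // miss falls through to the index
    $cache->set($key, $results, $ttl);

    return $results;
}
```

A short TTL keeps the staleness window bounded; the indexing pipeline does not invalidate these keys, so cached rankings can lag an item write by up to one TTL.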
### Failure modes
- OpenSearch unreachable → controller falls back to `NullSearchIndex`, which serves results from MariaDB `LIKE` (degraded but available).
- Embedding provider unreachable → similarity search returns empty with a warning banner; full-text still works.
- Index drift (item updated but indexer not yet processed) → operators can trigger a row-level reindex from the admin UI.
## Consequences
- OpenSearch becomes a runtime dependency of M6 features. The `docker-compose.yml` gains an `opensearch` service behind a `--profile search` flag.
- Embeddings live in OpenSearch only — MariaDB never stores vectors. Rotating the embedding model is a reindex, not a schema migration.
- Two query types (keyword + similarity) hit one engine, one client, one set of operational concerns.
- Operators who don't want OpenSearch get the `NullSearchIndex` path. The product still works; search ranking just falls back to `LIKE`.
- Future work — multi-tenant isolation (per-organisation indices) is not addressed here; M6 assumes single-tenant deployments.
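The compose wiring might look like the fragment below. The service name and profile flag come from this ADR; the image tag and environment settings are assumptions for a single-host dev setup, to be pinned in the real file:

```yaml
services:
  opensearch:
    image: opensearchproject/opensearch:2
    profiles: [search]           # omitted from default `docker compose up`
    environment:
      - discovery.type=single-node
      - DISABLE_SECURITY_PLUGIN=true  # dev-only shortcut; revisit before prod
    ports:
      - "9200:9200"
```

Operators opt in with `docker compose --profile search up -d`; without the profile, the application wires the `NullSearchIndex` path instead.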