Content-Based Image Retrieval (CBIR) describes a system where, given a query image, we search a datastore of image features to find and return images with similar content. The image features are generated by a model trained on the dataset in an autoencoder architecture, and the embeddings are taken from the encoder, normally just before the last fully-connected layer.
There are two main approaches to generating image embeddings:

- Feature extraction, whereby we use a pretrained image model such as ResNet and extract the image features from the layer before the fully-connected (FC) layers.
- Fine-tuning, whereby we remove the FC layers from the pretrained model, freeze its weights, attach a new FC layer, and retrain the model with a lower learning rate (LR) on our custom dataset.
From my own experience, fine-tuning yields better results, but it is a time-consuming process: the hyperparameters need to be tuned carefully for the decoder to learn to reconstruct the image features properly before we can use the encoder to generate image embeddings. The entire process also needs to be repeated whenever new images are added to the distribution.
As an experiment, I decided to try the DINOv2 model to test its effectiveness in image retrieval without any fine-tuning involved.
I also wanted to try using the Weaviate vector database to store and retrieve image features.
1. Setting up Weaviate
Weaviate is an open-source vector database designed for AI workloads. It has pre-built modules to support text and image retrieval, but we are not using them in this example since we generate our own embeddings. We keep the configuration defaults as they are, which means cosine distance is used as the similarity metric.
We can run it using docker compose as follows:
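The compose file below is a minimal sketch; the image version tag is illustrative, so substitute the latest Weaviate release:

```yaml
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.19.6  # illustrative version tag
    ports:
      - "8080:8080"
    volumes:
      # Named volume so the DB data persists between restarts
      - weaviate_data:/var/lib/weaviate
    environment:
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      # We generate our own vectors, so no vectorizer module is needed
      DEFAULT_VECTORIZER_MODULE: 'none'
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      QUERY_DEFAULTS_LIMIT: 25
volumes:
  weaviate_data:
```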
We attach a new docker volume to store the DB data so it persists between restarts.
To test that it's running, we can access http://localhost:8080 or use the Weaviate client in a Python script:
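The sketch below assumes the v3 weaviate-client package:

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

print(client.is_ready())    # True once the instance is reachable
print(client.schema.get())  # lists the classes currently defined
```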
Since the database is empty, it should return an empty response, along the lines of:
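```
{'classes': []}
```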
We will create the DB schema next.
2. Create DB schema
Data is stored in Weaviate as objects, and each object belongs to a collection (a class in the schema). We can create both at the same time by defining the properties in a dict and passing it to the client during schema creation:
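A sketch of the schema-creation script, again assuming the v3 client API:

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

class_obj = {
    "class": "Image",
    "properties": [
        # Path to the original file on disk
        {"name": "filepath", "dataType": ["string"]},
        # The raw image content, stored base64-encoded
        {"name": "image", "dataType": ["blob"]},
    ],
}

client.schema.create_class(class_obj)
```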
The above creates an Image class to store our image embeddings. It stores the filepath as a string and the actual image content as a blob, which needs to be base64-encoded before storage.
Run the above script in a separate terminal to create the schema. Once complete, we can move on to creating our image embeddings.
3. Generating image features
The DINOv2 model is trained with self-supervised learning (SSL) on a specially curated dataset, using a combination of SSL strategies and loss functions, which makes it capable of learning strong image features without supervised fine-tuning.
For this example, we are using the Caltech 101 dataset.
To obtain the image features, we first load the model in a custom class with the required preprocessing:
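A minimal sketch of such a class; the class and method names are my own, and the ViT-S/14 variant is assumed since it matches the 384-dimensional embedding described below:

```python
import torch
from PIL import Image
from torchvision import transforms

class DINOv2FeatureExtractor:
    def __init__(self, device="cpu"):
        self.device = device
        # ViT-S/14 variant of DINOv2; its embedding dimension is 384
        self.model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        self.model.to(device).eval()
        self.preprocess = transforms.Compose([
            transforms.Resize(244),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            # Standard ImageNet normalization
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])

    def embed(self, image_path):
        img = Image.open(image_path).convert("RGB")
        batch = self.preprocess(img).unsqueeze(0).to(self.device)
        with torch.no_grad():
            embedding = self.model(batch)  # shape: (1, 384)
        return embedding.squeeze(0).cpu().numpy()
```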
The above loads the pretrained DINOv2 model from torch hub. We pass the input image through a preprocessor that resizes it to 244, center-crops it to 224, and applies normalization. The result is passed directly to the model. The output is the image embedding: the output of the last transformer block, after layer normalization, as a vector of size 384. This is important, as the image embedding needs to be a flat vector to be stored in the vector database.
Next, we create a custom script that can iterate over our image directory and store the embedding into our database:
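A sketch, reusing the DINOv2FeatureExtractor class from the previous step; the module name in the import is hypothetical, and the batch-delete call assumes a recent server and v3 client:

```python
import base64
import os

import weaviate

from feature_extractor import DINOv2FeatureExtractor  # hypothetical module name

client = weaviate.Client("http://localhost:8080")
extractor = DINOv2FeatureExtractor()

def img_to_base64(path):
    # Blobs must be stored base64-encoded
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def delete_images():
    # Remove all existing Image objects so reruns start from a clean slate
    client.batch.delete_objects(
        class_name="Image",
        where={"path": ["filepath"], "operator": "Like", "valueString": "*"},
    )

def store_images(root_dir):
    for subdir, _, files in os.walk(root_dir):
        for fname in files:
            if not fname.lower().endswith((".jpg", ".jpeg", ".png")):
                continue
            path = os.path.join(subdir, fname)
            # Attach our own vector since no vectorizer module is configured
            client.data_object.create(
                data_object={"filepath": path, "image": img_to_base64(path)},
                class_name="Image",
                vector=extractor.embed(path).tolist(),
            )

if __name__ == "__main__":
    delete_images()
    store_images("101_ObjectCategories")  # Caltech 101 image root
```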
The above script iterates over each subdirectory in a given directory and stores the images in the database. Note that each image needs to be converted to base64 to be stored as a blob, which is the role of the img_to_base64 function. The delete_images function clears the database every time the script is run.
4. Create webapp to visualize
To test this out, I decided to create a simple webapp using Flask to visualize the output of image retrieval.
Note that the webapp doesn't have any authentication or security and is meant only as a demo. In a real deployment, the model inference should also run in a separate service.
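A trimmed-down sketch of the search route; the template name, form field, and distance cutoff are assumptions for illustration:

```python
import os
import tempfile

import weaviate
from flask import Flask, render_template, request

from feature_extractor import DINOv2FeatureExtractor  # hypothetical module name

app = Flask(__name__)
client = weaviate.Client("http://localhost:8080")
extractor = DINOv2FeatureExtractor()

@app.route("/search", methods=["POST"])
def search():
    # Save the uploaded image to a temp location before embedding it
    upload = request.files["image"]
    with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
        upload.save(tmp.name)
    embedding = extractor.embed(tmp.name)
    os.unlink(tmp.name)

    # Nearest-neighbour query; the optional "distance" key filters out
    # results whose cosine distance to the query vector exceeds the cutoff
    result = (
        client.query.get("Image", ["filepath", "image"])
        .with_near_vector({"vector": embedding.tolist(), "distance": 0.2})
        .with_limit(12)
        .do()
    )
    images = result["data"]["Get"]["Image"]
    return render_template("results.html", images=images)
```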
The image search happens within the search function. It takes an image uploaded in the browser, stores it in a temporary location, and computes the DINOv2 embedding for it. Using the Weaviate client, it then runs a query with this embedding as the query vector. The query can be further customized with a maximum distance filter, which removes dissimilar images based on the computed vector distance.
Below are some screenshots of running some similarity searches.
Based on the visual outputs alone, the embeddings produced by DINOv2 are superior to those from a pretrained ResNet50 model. DINOv2 is able to recognise image features at an angle, as in the first example. It is also able to pick up on the image content itself, regardless of the background. This is not the case with a pretrained convnet, which tends to retrieve similar images based on the background alone.
In conclusion, it is possible to build a reliable image retrieval service using DINOv2 with the Weaviate vector database.
There is still a lot to learn about the Weaviate database and how the DINOv2 model works, so further exploration is left as an exercise for the reader.
H4PPY H4CK1NG