Hi,
Some time ago, I did a small project on information retrieval. I think it’s a good idea to share it, with all its shortcomings. Here is the code. Sadly, the LLM part doesn’t work with the quantized model, so I commented it out. The project is a small information retrieval system for an FAQ, where I want to map the correct answer to a question. In my example, it’s a 1:1 mapping between question and answer, but it also works with multiple answers.
I try to solve this via:
- lexical search
- semantic search
- LLM search
This is a low-effort post. Sorry for that. I just want to share my project.
Lexical Search
Lexical search is a simple keyword-based search. Most search engines use this method by default (e.g. Elasticsearch). It works by matching the keywords in the query against the keywords in the documents; the more keywords match, the higher the score. BM25 is a popular algorithm for this. One problem is that BM25 needs exact word matches. On the other hand, it is fast and easy to implement.
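To make this concrete, here is a minimal sketch with the rank_bm25 package. The FAQ entries and the naive whitespace tokenizer are placeholders, not exactly what my project does:

```python
# Minimal BM25 sketch (pip install rank-bm25).
# The FAQ answers below are made-up placeholders.
from rank_bm25 import BM25Okapi

faq_answers = [
    "You can reset your password in the account settings.",
    "Shipping usually takes three to five business days.",
    "Refunds are processed within two weeks of the return.",
]

# BM25 expects token lists, so tokenize naively on whitespace.
tokenized = [answer.lower().split() for answer in faq_answers]
bm25 = BM25Okapi(tokenized)

query = "how long does shipping take".split()
scores = bm25.get_scores(query)  # one relevance score per answer

best = max(range(len(faq_answers)), key=lambda i: scores[i])
print(faq_answers[best])
```

Note that the query only scores well because "shipping" and "take" appear verbatim in the answer; a paraphrase like "delivery time" would score zero. That is exactly the weakness semantic search addresses.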
Semantic Search
Semantic search is a more advanced method that uses embeddings to capture meaning instead of exact keywords. It works by mapping the query into a vector space and finding the document vectors closest to it. This method is more flexible and can handle synonyms and related words. However, it is slower and requires more resources. In my example, I used two methods:
- similarity search
- ranking
For similarity search, I used cosine similarity to find the closest vectors; for ranking, I used the dot product to find the best match.
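Here is a minimal sketch with sentence-transformers showing both scores side by side. The model name and data are placeholders, not necessarily what I used:

```python
# Minimal semantic search sketch (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

# Placeholder model; any sentence-embedding model works here.
model = SentenceTransformer("all-MiniLM-L6-v2")

faq_answers = [
    "You can reset your password in the account settings.",
    "Shipping usually takes three to five business days.",
]
question = "How long is the delivery time?"

answer_emb = model.encode(faq_answers, convert_to_tensor=True)
question_emb = model.encode(question, convert_to_tensor=True)

# Similarity search: cosine similarity between question and answers.
cos_scores = util.cos_sim(question_emb, answer_emb)[0]
# Ranking: raw dot product instead of the normalized cosine.
dot_scores = util.dot_score(question_emb, answer_emb)[0]

best = int(cos_scores.argmax())
print(faq_answers[best], float(cos_scores[best]), float(dot_scores[best]))
```

Unlike BM25, this matches "delivery time" to the shipping answer even though no keyword overlaps. Cosine similarity and dot product only differ when the embeddings are not normalized; for normalized embeddings they produce the same ranking.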
LLM Search
This doesn’t make sense at all. I just wanted to try it out. It works with a small LLM locally, but failed on Kaggle with a quantized model. I use a prompt and let the LLM judge each question-answer combination. I provide a few examples, so it’s few-shot learning. Overall, it was just playing around with the LLM, and the result is not good. Besides that, it is also extremely slow, since I have to call the LLM for each question-answer combination. I think it is not worth the effort.
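For completeness, here is a sketch of the idea with a few-shot prompt via the transformers pipeline. The model name is a placeholder, and this mirrors the commented-out part rather than reproducing it exactly:

```python
# Few-shot LLM judging sketch (pip install transformers torch).
# Model name is a placeholder; in the project this part is commented
# out because the quantized model failed on Kaggle.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

FEW_SHOT = """Decide if the answer fits the question. Reply with yes or no.

Question: How long does shipping take?
Answer: Shipping usually takes three to five business days.
Match: yes

Question: How do I reset my password?
Answer: Refunds are processed within two weeks of the return.
Match: no
"""

def matches(question: str, answer: str) -> bool:
    prompt = f"{FEW_SHOT}\nQuestion: {question}\nAnswer: {answer}\nMatch:"
    out = generator(prompt, max_new_tokens=3, return_full_text=False)
    return "yes" in out[0]["generated_text"].lower()

print(matches("When will my order arrive?",
              "Shipping usually takes three to five business days."))
```

Since every question-answer combination costs one full LLM call, the number of calls grows with questions × answers, which is why this is so much slower than the other two methods.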
Thank you for your attention.