Preview VM available for download
- Research outreach without hosting expensive servers
- Values privacy, doesn't want to pay server fees
- Likes the VM technology for a different stack
- Likes to build (and sell) solutions on top
- Knows PHP
- Values Free software
- Interested in AI
Retrieval-Augmented Generation (RAG): combine LLMs with existing information
Retrieve information
- Using search engines, including embedding-based ones
Put it in the prompt
- For example, to answer a question
- I really want to...
- eat empanadas
- learn about LLMs
- Some of those texts contain exams and their answers
- Or directly "instructions" to make the LLM more useful
- How to generate these embeddings?
- Siamese neural networks
- What type of semantic information is being captured?
- How big should the span of text be for comparing the generated vectors to make sense?
- What if one of the texts is short (like a question) and the other is long (like a paragraph)?
- Asymmetric embeddings
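As a sketch of how asymmetric embeddings are used in practice: E5-family models (like the multilingual-e5-small model mentioned later) mark short queries and long passages with different prefixes before embedding, so their vectors stay comparable. Whether SentenceRopherta expects these prefixes is an assumption here; the embedding call mirrors the API shown later in this deck, and the cosine helper is illustrative.

use \Textualization\SentenceTransphormers\SentenceRopherta;

$model = new SentenceRopherta();

// E5-style prefixes (an assumption for this model): queries and passages
// are embedded differently despite sharing one encoder.
$queryVec   = $model->embeddings("query: why is 42 famous?");
$passageVec = $model->embeddings("passage: The Hitchhiker's Guide to the Galaxy, " .
                                 "the comedy work by Douglas Adams, features the number 42.");

// Cosine similarity between the two vectors.
function cosine(array $a, array $b): float {
    $dot = $na = $nb = 0.0;
    foreach ($a as $i => $v) {
        $dot += $v * $b[$i];
        $na  += $v * $v;
        $nb  += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($na) * sqrt($nb));
}

echo cosine($queryVec, $passageVec), "\n";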
LLMs can do many things (most of them badly)
Answer extraction is one they are good at. Given the text:
This text talks about many things. Among them is how 42 is a number featured in the Hitchhiker's Guide to the Galaxy, the comedy work by Douglas Adams.
People are obsessed with the number 42 because it is featured in the Hitchhiker's Guide to the Galaxy, a comedy work by Douglas Adams.
The obsession with the number 42 largely comes from its significance in the popular science fiction novel "The Hitchhiker's Guide to the Galaxy" by Douglas Adams. In the story, a group of hyper-intelligent beings builds a supercomputer named Deep Thought to calculate the "Answer to the Ultimate Question of Life, the Universe, and Everything." After much anticipation, Deep Thought reveals that the answer is simply the number 42.
Since the publication of the book, ...
- One of the reasons we wanted computers, to begin with
- RAG hinges on good IR
- If the information to answer the question is not retrieved, there is not much the LLM can do about it
- That doesn't mean the LLM will admit defeat; getting it to say "I don't know" is a tall order
Currently we have IR systems using:
- keyword search
- complex search queries (keywords plus operators)
- embeddings
The best performance uses a combination of these approaches
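One simple way to combine keyword and embedding results is reciprocal rank fusion; a minimal sketch (the function is illustrative, not part of the project):

// Hypothetical sketch: reciprocal rank fusion (RRF) over two ranked lists.
// A document near the top of either list gets a high fused score;
// $k = 60 is the constant from the original RRF paper.
function rrf(array $rankedLists, int $k = 60): array {
    $scores = [];
    foreach ($rankedLists as $list) {              // each list: doc ids, best first
        foreach ($list as $rank => $docId) {
            $scores[$docId] = ($scores[$docId] ?? 0.0) + 1.0 / ($k + $rank + 1);
        }
    }
    arsort($scores);                               // best fused score first
    return array_keys($scores);
}

$fused = rrf([
    ["doc3", "doc1", "doc7"],   // keyword (BM25) results
    ["doc1", "doc9", "doc3"],   // embedding results
]);
// doc1 and doc3 rise to the top because both searches rank them highly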
- This segment is called a "chunk" of the document
- Local LLMs can take between 400 and 1,500 words of input
- Commercial LLMs available through APIs can process full books
- The size should be large enough to answer questions
- But small enough to fit into the LLM input and be semantically coherent to produce viable embeddings
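A minimal word-window chunker along these lines (the sizes follow the guidance above; the function is illustrative, not the project's segmenter):

// Hypothetical sketch: split a document into overlapping word windows.
// Overlap keeps a sentence that straddles a boundary answerable from
// at least one chunk.
function chunkWords(string $text, int $size = 400, int $overlap = 50): array {
    $words  = preg_split('/\s+/', trim($text));
    $chunks = [];
    for ($i = 0; $i < count($words); $i += $size - $overlap) {
        $chunks[] = implode(' ', array_slice($words, $i, $size));
        if ($i + $size >= count($words)) break;    // final window reached
    }
    return $chunks;
}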
- Provide multiple chunks to the LLM at once
- That might exhaust the input and confuse the LLM
- Some questions need information from multiple sources
- They can be sensitive to minuscule changes (like a carriage return character at the end of the prompt)
Use the following pieces of context to answer the question
at the end. If you don't know the answer, just say that you
don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:
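Filling the template is plain string interpolation; a minimal sketch with a hypothetical helper:

// Hypothetical helper: interpolate retrieved chunks and the question
// into the prompt template above.
function buildPrompt(array $chunks, string $question): string {
    $template = "Use the following pieces of context to answer the question\n"
              . "at the end. If you don't know the answer, just say that you\n"
              . "don't know, don't try to make up an answer.\n\n"
              . "{context}\n\n"
              . "Question: {question}\nAnswer:";
    return str_replace(["{context}", "{question}"],
                       [implode("\n\n", $chunks), $question],
                       $template);
}

$prompt = buildPrompt(
    ["People are obsessed with the number 42 because it is featured in ..."],
    "Why are people obsessed with the number 42?"
);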
- There is no back-end, nothing runs "in the cloud"
- An embedding model and associated execution code
- A local LLM and its associated server
- A standalone search engine supporting both keywords and embeddings
- Web-accessible software to upload documents, index them and query them
- System Python upgrades break virtual environments
- Download a tool that will remain useful for years
- Docker
- Debian stable
- Composer
from PHP Semantic Search:
Open Neural Network Exchange (ONNX)
- Executes without large dependencies
- The runtime phones home unless given special parameters
Transformer implementation in C++
- llama-server allows for local API calls
- gguf format allows for mixed execution on low-RAM GPUs
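Calling llama-server locally from PHP looks roughly like this (the /completion endpoint and n_predict parameter come from llama-server; the port and payload details here are assumptions about the VM's configuration):

// Hypothetical sketch: POST a prompt to a local llama-server instance.
// Nothing leaves the box; the server runs on the VM itself.
$prompt = "Question: Why is 42 famous?\nAnswer:"; // or the RAG template above

$ctx = stream_context_create(["http" => [
    "method"  => "POST",
    "header"  => "Content-Type: application/json",
    "content" => json_encode([
        "prompt"    => $prompt,
        "n_predict" => 256,   // cap on generated tokens
    ]),
]]);

// 8080 is llama-server's default port; the VM may use another.
$response = file_get_contents("http://127.0.0.1:8080/completion", false, $ctx);
$answer   = json_decode($response, true)["content"] ?? "";
echo $answer, "\n";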
- The parsing/compilation/execution/cleanup cycle is a functional paradigm; side effects go to the DB
- Funded through an AI innovation grant from the Quebec government
KeywordIndex.php
- Okapi BM25 implemented on top of SQLite3 text search
VectorIndex.php
- embedding search using the SQLite3 Vector Search (FAISS) extension
\Textualization\SemanticSearch\Ingester::ingest([
    "location" => "index.db",
    "class" => "\\Textualization\\SemanticSearch\\VectorIndex"
], [], "docs.jsonl");
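For the keyword side, the underlying idea can be sketched with SQLite's built-in FTS5 bm25() ranking (assuming the SQLite build includes FTS5; the table and file names here are hypothetical, and KeywordIndex.php implements Okapi BM25 itself rather than relying on this function):

// Hypothetical sketch: BM25 ranking via SQLite FTS5 from PHP.
$db = new SQLite3("demo.db");
$db->exec("CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING fts5(text)");
$db->exec("INSERT INTO chunks(text) VALUES " .
          "('Deep Thought reveals that the answer is the number 42')");

$stmt = $db->prepare(
    "SELECT text, bm25(chunks) AS score
       FROM chunks WHERE chunks MATCH :q
      ORDER BY score LIMIT 5"   // lower bm25() scores are better matches
);
$stmt->bindValue(":q", "answer 42");
$res = $stmt->execute();
while ($row = $res->fetchArray(SQLITE3_ASSOC)) {
    echo $row["score"], "\t", $row["text"], "\n";
}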
use \Textualization\SentenceTransphormers\SentenceRopherta;

$model = new SentenceRopherta();
$emb = $model->embeddings("Text");

// alt. using the semantic search classes

$e = new \Textualization\SemanticSearch\SentenceTransphormerEmbedder();
$emb = $e->encode("Text");
Index.php
- download and change the box/launch-llama.sh script to use it
site/upload.php
Dockerfile
make-image.sh
- Specific document segmentation and detagging
- Improved IR using faceted search
- Handling additional file formats
- Plugging in more performant IR engines (e.g., Manticore/MariaDB vector)
- Deletion
- Hybrid embeddings + keywords
- Keyword search most probably doesn't work due to chunk size
- Upgrade
- API
- A complex tokenizer handling 100 languages (SentencePiece) was also migrated to PHP:
- https://huggingface.co/intfloat/multilingual-e5-small
- The current VM does not have these requirements installed
- Or at least in Spanish
- Ideas?
Contributors of non-trivial PRs will be listed as vetted ISVs
It is time to go back to the P in NLP
- Natural Language Processing
Successful LLM deployments need a lot of programming and smarts outside the LLM bits
The RAGged Edge Box project allows new players versed in traditional programming to join the field