RAGged Edge Box:
A private, single-user
AI-based IR system

Presenter Notes

Project page

Presenter Notes

Screenshots: (1) Upload

Upload The Art of Feature Engineering

Presenter Notes

Screenshots: (2) Upload results

4 minutes later...

Presenter Notes

Screenshots: (3) Query

When it is advisable to use feature normalization?

Presenter Notes

Screenshots: (4) Answered

Half a minute

Presenter Notes

Screenshots: (5) Other passages

Answer from here

Presenter Notes

Attendee personas

  • Scientist [Sci]
    • Research outreach without hosting expensive servers
  • Potential user [Usr]
    • Values privacy, doesn't want to pay server fees
  • Potential re-user [Tech]
    • Likes the VM technology for a different stack

Presenter Notes

Attendee personas (cont.)

  • Potential ISV [Tech$]
    • Likes to build (and sell) solutions on top
    • Knows PHP
  • Potential contributor [Tech]
    • Values Free software
    • Interested in AI

Presenter Notes

This talk

  1. RAGged Edge Box demo [User]
  2. AI Concepts (RAG, LLMs, Embeddings) [User, Tech]
  3. RAG Concepts (IR, chunk, prompt) [User, Tech]
  4. RAGged Edge Box (concept, advantages) [User]
  5. RAGged Edge Box Architecture [Tech]
  6. Enabling Technology Bits (ONNX, PHP Semantic Search) [Sci, Tech]
  7. Extension Points [Sci, Tech$]
  8. VM Packaging [Tech]
  9. RAGged Edge Box as a Platform [Tech$]

Presenter Notes

AI Concepts
[User, Tech]

Presenter Notes

What is RAG

  • Retrieval Augmented Generation: combine LLMs with existing information

  • Retrieve information

    • Search engines, including embedding-based ones
  • Give it to the LLM as input
    • In the prompt
  • Ask the LLM to do things with that information
    • For example, to answer a question
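
A rough sketch of that flow in PHP (illustrative only: the retrieved chunks are hard-coded here; in the real system they come from the search engine and the prompt goes to the local LLM):

// RAG flow sketch: retrieve, augment, generate.
$question = "Why are people obsessed with the number 42?";

// 1. Retrieve: in the real system these chunks come from the index.
$chunks = [
    "42 is a number featured in the Hitchhiker's Guide to the Galaxy,",
    "the comedy work by Douglas Adams.",
];

// 2. Augment: place the retrieved text inside the prompt.
$prompt = "Use the following pieces of context to answer the question.\n\n"
        . implode("\n", $chunks)
        . "\n\nQuestion: $question\nAnswer:";

// 3. Generate: send $prompt to the LLM (here we just show what would be sent).
echo $prompt;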

Presenter Notes

What are LLMs

  • Cellphone autocomplete is a language model
  • It "knows" things
    • I really want to...
      • eat empanadas
      • learn about LLMs
  • A large language model is similar, but it has seen tons of text
    • Some of those texts contain exams and their answers
    • Or directly "instructions" to make the LLM more useful

Presenter Notes

What are Embeddings

  • Encode information by projecting it into a fixed-dimensional space
  • In the case of RAG and "semantic search", the input is a text span and the output is a fixed-length vector of floating-point numbers (e.g., 384 of them)
  • The hope is that texts conveying similar meaning will be represented by vectors close to each other in Euclidean space
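
A minimal PHP sketch of that closeness test, using made-up 4-dimensional vectors instead of real 384-dimensional embeddings:

// Cosine similarity between two embedding vectors (toy 4-d vectors here).
function cosine(array $a, array $b): float {
    $dot = 0.0; $na = 0.0; $nb = 0.0;
    foreach ($a as $i => $v) {
        $dot += $v * $b[$i];
        $na  += $v * $v;
        $nb  += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($na) * sqrt($nb));
}

$cat = [0.8, 0.1, 0.3, 0.0];   // "the cat sat on the mat"
$dog = [0.7, 0.2, 0.3, 0.1];   // "a dog lay on the rug"
$tax = [0.0, 0.9, 0.1, 0.8];   // "file your taxes by April"

echo cosine($cat, $dog), "\n"; // high: similar meaning
echo cosine($cat, $tax), "\n"; // low: unrelated meaning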

Presenter Notes

Embeddings Issues

  • Many open questions
    • How to generate these embeddings?
      • Siamese neural networks
    • What type of semantic information is being captured?
    • How big should the span of text be for comparing the generated vectors to make sense?
    • What if one of the texts is short (like a question) and the other is long (like a paragraph)?
      • Asymmetric embeddings
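
One common workaround, as a sketch: encode the two sides differently. The "query:"/"passage:" prefixes below are a convention used by some asymmetric embedding models, not necessarily by the model shipped in the box:

// Asymmetric embeddings sketch: short queries and long passages are
// encoded with different role prefixes before calling the embedder.
function embedAsymmetric(callable $embed, string $text, bool $isQuery): array {
    $prefix = $isQuery ? "query: " : "passage: ";
    return $embed($prefix . $text);
}

// Usage with any embedder exposing a text -> vector callable, e.g.:
// $qVec = embedAsymmetric(fn($t) => $model->embeddings($t), "What is 42?", true);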

Presenter Notes

RAG Concepts
[User, Tech]

Presenter Notes

Answer extraction using LLMs

  • LLMs can do many things (most of them badly)

  • Answer extraction is one they are good at. Given the text:

This text talks about many things. Among them is how 42 is a number featured in the Hitchhiker's Guide to the Galaxy, the comedy work by Douglas Adams.

  • We can ask the LLM to extract the answer to the question "Why people are obsessed with the number 42?"

People are obsessed with the number 42 because it is featured in the Hitchhiker's Guide to the Galaxy, a comedy work by Douglas Adams.

Presenter Notes

RAG vs 'Hallucinations'

  • Asked GPT-4 "Why people are obsessed with the number 42?"

The obsession with the number 42 largely comes from its significance in the popular science fiction novel "The Hitchhiker's Guide to the Galaxy" by Douglas Adams. In the story, a group of hyper-intelligent beings builds a supercomputer named Deep Thought to calculate the "Answer to the Ultimate Question of Life, the Universe, and Everything." After much anticipation, Deep Thought reveals that the answer is simply the number 42.


Since the publication of the book, ...

Presenter Notes

IR

  • Information retrieval has been around since the 1950s
    • One of the reasons we wanted computers, to begin with
  • GenAI is great but it hasn't improved IR as much
    • RAG hinges on good IR
    • If the information to answer the question is not retrieved, there is not much the LLM can do about it
    • That doesn't mean the LLM will admit defeat: getting it to say "I don't know" is a tall order
  • It's the GenAI revolution, not the IR revolution

Presenter Notes

IR SOTA

  • Currently we have IR systems using:

    • keyword search
    • complex search queries (keywords plus operators)
    • embeddings
  • The best performance uses a combination of these approaches
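
A sketch of one simple combination: a weighted sum of a squashed keyword score and an embedding similarity (the weight and the squashing are illustrative, not what any particular engine does):

// Hybrid retrieval sketch: mix a keyword score (e.g., BM25) with an
// embedding similarity for the same document.
function hybridScore(float $keywordScore, float $embeddingSim, float $alpha = 0.5): float {
    $kw = $keywordScore / (1.0 + $keywordScore);   // squash into [0, 1)
    return $alpha * $kw + (1.0 - $alpha) * $embeddingSim;
}

// Rank documents by the combined score.
$docs = [
    "doc1" => hybridScore(7.2, 0.61),
    "doc2" => hybridScore(1.3, 0.83),
];
arsort($docs);
print_r($docs);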

Presenter Notes

Chunk

  • The information provided to the LLM in a RAG system is usually a segment of the relevant document
    • This segment is called a "chunk" of the document
  • The full document usually cannot be processed by the LLM, which has a maximum processing size (context window) covering both the input information and the answer output
    • Local LLMs handle roughly 400 to 1,500 words
    • Commercial LLMs available through APIs can process full books
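
A naive word-window chunker as a PHP sketch (real systems usually prefer sentence or section boundaries; the sizes are illustrative):

// Naive chunker: split a document into windows of at most $maxWords words,
// repeating $overlap words between consecutive chunks to keep context.
function chunkText(string $text, int $maxWords = 400, int $overlap = 50): array {
    $words  = preg_split('/\s+/', trim($text));
    $chunks = [];
    for ($i = 0; $i < count($words); $i += $maxWords - $overlap) {
        $chunks[] = implode(' ', array_slice($words, $i, $maxWords));
    }
    return $chunks;
}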

Presenter Notes

Chunk Issues

  1. Chunk size:
    • The size should be large enough to answer questions
    • But small enough to fit into the LLM input and be semantically coherent to produce viable embeddings
  2. Multi-chunk processing
    • Provide multiple chunks to the LLM at once
    • That might exhaust the input and confuse the LLM
    • Some questions need information from multiple sources
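
For point 2, one simple strategy is a fixed word budget when packing chunks into the prompt (a sketch; the 1,000-word budget stands in for whatever the local LLM actually supports):

// Pack as many retrieved chunks as fit into a word budget, in retrieval
// order, so the prompt stays within the LLM's input limit.
function packChunks(array $rankedChunks, int $wordBudget = 1000): string {
    $context = [];
    $used    = 0;
    foreach ($rankedChunks as $chunk) {
        $len = str_word_count($chunk);
        if ($used + $len > $wordBudget) {
            break;                       // stop before exhausting the input
        }
        $context[] = $chunk;
        $used     += $len;
    }
    return implode("\n\n", $context);
}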

Presenter Notes

Prompt

  • How to structure the input is called the "prompt" of the LLM
  • Different LLMs need different prompts
    • They can be sensitive to minuscule changes (like a carriage return character at the end of the prompt)

Presenter Notes

RAG Prompt Example

Use the following pieces of context to answer the question
at the end. If you don't know the answer, just say that you
don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:
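
Filling the template is plain string substitution; a sketch:

// Fill the RAG prompt template with the retrieved context and the question.
$template = "Use the following pieces of context to answer the question\n"
          . "at the end. If you don't know the answer, just say that you\n"
          . "don't know, don't try to make up an answer.\n\n"
          . "{context}\n\nQuestion: {question}\nAnswer:";

$context  = "42 is a number featured in the Hitchhiker's Guide to the Galaxy.";
$question = "Why are people obsessed with the number 42?";

echo strtr($template, [
    '{context}'  => $context,
    '{question}' => $question,
]);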

Presenter Notes

RAGged Edge Box
[User]

Presenter Notes

RAGged Edge Box

  • RAGged Edge Box is a RAG system implemented as edge computing
    • There is no back-end, nothing runs "in the cloud"
  • A self-contained virtual machine with a bare-bones Linux setup containing the key components for RAG:
    • An embedding model and associated execution code
    • A local LLM and its associated server
    • A standalone search engine supporting both keywords and embeddings
    • Web-accessible software to upload documents, index them and query them

Presenter Notes

Privacy

  • Searching in a document collection should not involve giving access to the documents to a third party
  • "Cloud" is a misnomer: it is just someone else's computer

Presenter Notes

Technical Sovereignty

  • Solutions relying on APIs hosted in other countries are not particularly sovereign
  • Solutions relying on expensive hardware are not particularly sovereign

Presenter Notes

Against Planned Obsolescence

  • Tired of tools that no longer work after a few months?
    • System Python upgrades break virtual environments
  • VM life cycle independent of the operating system life cycle
    • Download a tool that will remain useful for years
  • Dependencies packed in long-term storage solutions:
    • Docker
    • Debian stable
    • Composer

Presenter Notes

RAGged Edge Box Architecture [Tech]

Presenter Notes

Enabling Technology Bits
[Sci, Tech]

Presenter Notes

ONNX

  • https://onnx.ai/

  • Open Neural Network Exchange

  • Specify the neural network graph in a vendor-independent manner
  • Train using any framework, execute using a different framework
    • Execute without large dependencies
  • Open source ONNX runtime created and maintained by Microsoft
    • The runtime phones home unless special parameters are set
  • Packaged for PHP by Andrew Kane: https://packagist.org/packages/ankane/onnxruntime
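
Usage looks roughly like this (a sketch based on the package documentation; the model file and the input name depend on the exported model, so check $model->inputs() for the real names):

// Load an ONNX model and run inference with ankane/onnxruntime.
require 'vendor/autoload.php';

$model = new OnnxRuntime\Model('model.onnx');   // path is an example
print_r($model->inputs());                      // inspect expected inputs
$output = $model->predict(['input' => [[1.0, 2.0, 3.0]]]);
print_r($output);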

Presenter Notes

llama.cpp

  • https://github.com/ggerganov/llama.cpp

  • Transformer implementation in C++

  • High performance execution on CPU
  • llama-server allows for local API calls
  • Supports LLMs in GGUF format
    • Allows for mixed CPU/GPU execution on low-RAM GPUs
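
Calling a locally running llama-server from PHP might look like this (a sketch; the port, the /completion endpoint and the JSON fields follow llama.cpp's server defaults at the time of writing and should be checked against your install):

// Ask a local llama-server for a completion over its HTTP API.
$payload = json_encode([
    'prompt'    => "Question: Why are people obsessed with 42?\nAnswer:",
    'n_predict' => 128,                 // maximum tokens to generate
]);

$ch = curl_init('http://127.0.0.1:8080/completion');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $payload,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_RETURNTRANSFER => true,
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

echo $response['content'] ?? '';        // the generated answer text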

Presenter Notes

Why PHP

  • Large installed code bases (Wikimedia, WordPress, Nextcloud)
  • Used by nearly 80% of all websites whose server-side language is known
  • Brainpower available worldwide
  • A PHP install is simpler and smaller than a Python/Java/etc install
    • The parse/compile/execute/cleanup request cycle approximates a functional paradigm; side effects go to the DB
  • 11th most popular programming language in 2023

Presenter Notes

PHP Semantic Search classes

Presenter Notes

Example

\Textualization\SemanticSearch\Ingester::ingest([
    "location" => "index.db",
    "class"    => "\\Textualization\\SemanticSearch\\VectorIndex"
], [], "docs.jsonl");

Presenter Notes

Sentence Transformers (Embeddings)

use \Textualization\SentenceTransphormers\SentenceRopherta;

$model = new SentenceRopherta();
$emb = $model->embeddings("Text");

// alt. using the semantic search classes

$e = new \Textualization\SemanticSearch\SentenceTransphormerEmbedder();
$emb = $e->encode("Text");

Presenter Notes

Reverse Engineering Huggingface Components

  • This was by far the most time-consuming aspect of the work last year

Presenter Notes

Extending RAGged Edge Box
[Sci, Tech]

Presenter Notes

New IR

Presenter Notes

New LLM

Presenter Notes

Better document handling

Presenter Notes

VM Packaging
[Tech]

Presenter Notes

Generating VirtualBox images programmatically

Presenter Notes

RAGged Edge Box as a Platform
[Tech]

Presenter Notes

Business

  • My hope is that the project enables ISVs to adapt the PHP code for customer-specific needs:
    • Specific document segmentation and detagging
    • Improved IR using faceted search
    • Handling additional file formats
    • Plugging in more performant IR engines (e.g., Manticore/MariaDB vector)
  • Not all GenAI money should find its way back to Nvidia/Microsoft

Presenter Notes

Status

  • Fully automatic VM creation
  • Missing functionality:
    • Deletion
    • Hybrid embeddings + keywords
    • Keyword search most probably doesn't work well due to the chunk size
    • Upgrade
    • API

Presenter Notes

Multilinguality

Presenter Notes

Multilinguality: Needs

  • We need a small local LLM that can do multilingual answer extraction
    • Or at least in Spanish
    • Ideas?

Presenter Notes

Contributing to the Project

Presenter Notes

Other Announcements

Presenter Notes

Conclusions

  • It is time to go back to the P in NLP

    • Natural Language Processing
  • Successful LLM deployments need a lot of programming and smarts outside the LLM bits

  • The RAGged Edge Box project allows new players versed in traditional programming to join the field

Presenter Notes