
FEATURE: PDF support for rag pipeline #1118

Open · SamSaffron wants to merge 27 commits into main
Conversation

@SamSaffron (Member) commented Feb 7, 2025

This PR introduces several enhancements and refactorings to the AI Persona and RAG (Retrieval-Augmented Generation) functionalities within the discourse-ai plugin. Here's a breakdown of the changes:

1. LLM Model Association for RAG and Personas:

  • New Database Columns: Adds rag_llm_model_id to both ai_personas and ai_tools tables. This allows specifying a dedicated LLM for RAG indexing, separate from the persona's primary LLM. Adds default_llm_id and question_consolidator_llm_id to ai_personas.
  • Migration: Includes a migration (20250210032345_migrate_persona_to_llm_model_id.rb) that populates the new default_llm_id and question_consolidator_llm_id columns in ai_personas from the existing default_llm and question_consolidator_llm string columns, plus a post-migration that drops the old string columns (a sketch of the data migration follows this list).
  • Model Changes: The AiPersona and AiTool models now belong_to an LlmModel via rag_llm_model_id. The LlmModel.proxy method now accepts an LlmModel instance instead of just an identifier. AiPersona now has default_llm_id and question_consolidator_llm_id attributes.
  • UI Updates: The AI Persona and AI Tool editors in the admin panel now allow selecting an LLM for RAG indexing (if PDF/image support is enabled). The RAG options component displays an LLM selector.
  • Serialization: The serializers (AiCustomToolSerializer, AiCustomToolListSerializer, LocalizedAiPersonaSerializer) have been updated to include the new rag_llm_model_id, default_llm_id and question_consolidator_llm_id attributes.
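
A minimal sketch of what the data migration could look like, assuming the legacy string columns store identifiers in a "custom:<id>" form; the parsing logic and Rails version here are illustrative assumptions, not the PR's exact code:

class MigratePersonaToLlmModelId < ActiveRecord::Migration[7.2]
  def up
    # Copy each legacy string column into its new *_id integer column.
    %w[default_llm question_consolidator_llm].each do |old_column|
      execute <<~SQL
        UPDATE ai_personas
        SET #{old_column}_id = REPLACE(#{old_column}, 'custom:', '')::bigint
        WHERE #{old_column} LIKE 'custom:%'
      SQL
    end
  end

  def down
    raise ActiveRecord::IrreversibleMigration
  end
end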

2. PDF and Image Support for RAG:

  • Site Setting: Introduces a new hidden site setting, ai_rag_pdf_images_enabled, to control whether PDF and image files can be indexed for RAG. This defaults to false.
  • File Upload Validation: The RagDocumentFragmentsController now checks the ai_rag_pdf_images_enabled setting and allows PDF, PNG, JPG, and JPEG files if enabled. Error handling is included for cases where PDF/image indexing is attempted with the setting disabled.
  • PDF Processing: Adds a new utility class, DiscourseAi::Utils::PdfToImages, which uses ImageMagick (magick) to convert PDF pages into individual PNG images. A maximum PDF size and a conversion timeout are enforced (see the sketch after this list).
  • Image Processing: A new utility class, DiscourseAi::Utils::ImageToText, handles OCR for uploaded images and for the page images extracted from PDFs.
  • RAG Digestion Job: The DigestRagUpload job now handles PDF and image uploads. It uses PdfToImages and ImageToText to extract text and create document fragments.
  • UI Updates: The RAG uploader component now accepts PDF and image file types if ai_rag_pdf_images_enabled is true. The UI text is adjusted to indicate supported file types.
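
A minimal sketch of the conversion step, assuming DiscourseAi::Utils::PdfToImages shells out to ImageMagick's magick binary; the method names, density, size limit, and timeout below are illustrative assumptions:

require "tmpdir"

module DiscourseAi
  module Utils
    class PdfToImages
      MAX_PDF_SIZE = 100.megabytes # assumed limit
      CONVERT_TIMEOUT = 30.seconds # assumed timeout

      def initialize(upload:)
        @upload = upload
      end

      # Renders every PDF page to its own PNG and returns the file paths.
      def extract_pages
        raise "PDF too large" if @upload.filesize > MAX_PDF_SIZE

        temp_dir = Dir.mktmpdir("discourse-pdf-")
        pdf_path = Discourse.store.download(@upload).path

        # magick exits non-zero on failure; execute_command raises in that case.
        Discourse::Utils.execute_command(
          "magick", "-density", "150", pdf_path,
          "#{temp_dir}/page-%04d.png",
          timeout: CONVERT_TIMEOUT,
        )

        Dir.glob("#{temp_dir}/page-*.png").sort
      end
    end
  end
end

Per the job description above, DigestRagUpload then feeds each page image through ImageToText (OCR) and splits the resulting text into document fragments.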

3. Refactoring and Improvements:

  • LLM Enumeration: The DiscourseAi::Configuration::LlmEnumerator now provides a values_for_serialization method, which returns a simplified array of LLM data (id, name, vision_enabled) for use in serializers, avoiding exposing unnecessary details to the frontend (a sketch follows this list).
  • AI Helper: The AiHelper::Assistant now takes optional helper_llm and image_caption_llm parameters in its constructor, allowing for greater flexibility.
  • Bot and Persona Updates: Call sites across the codebase were updated to use the new model-based LLM association in place of the old string-based one.
  • Audit Logs: The DiscourseAi::Completions::Endpoints::Base now formats raw request payloads as pretty JSON for easier auditing.
  • Eval Script: An evaluation script is included.
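
A minimal sketch of values_for_serialization, matching the signature visible in the diff further down; the seeded-model filtering and the column names are assumptions:

# returns an array of hashes (id:, name:, vision_enabled:)
def self.values_for_serialization(allowed_seeded_llm_ids: nil)
  scope = LlmModel.all

  # Seeded (bundled) models are assumed to use reserved ids and are only
  # exposed when explicitly allowed.
  if allowed_seeded_llm_ids.is_a?(Array) && !allowed_seeded_llm_ids.empty?
    scope = scope.where("id > 0 OR id IN (?)", allowed_seeded_llm_ids)
  else
    scope = scope.where("id > 0")
  end

  scope.pluck(:id, :display_name, :vision_enabled).map do |id, name, vision_enabled|
    { id: id, name: name, vision_enabled: vision_enabled }
  end
end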

4. Testing:

  • The PR introduces a new eval system for LLMs, which lets us test how functionality behaves across various LLM providers. It lives in /evals.

@SamSaffron SamSaffron marked this pull request as draft February 7, 2025 04:40
@SamSaffron SamSaffron marked this pull request as ready for review February 12, 2025 00:55
@llm_model.vision_enabled
end

private

Contributor commented: probably a forgotten private

def system_message
<<~MSG
OCR the following page into Markdown. Tables should be formatted as Github flavored markdown.
Do not sorround your output with triple backticks.

Contributor commented with a suggested change:

- Do not sorround your output with triple backticks.
+ Do not surround your output with triple backticks.

@uploaded_pages = uploads
ensure
FileUtils.rm_rf(temp_dir) if Dir.exist?(temp_dir)

Contributor commented: FileUtils.rm_rf(path) already handles non-existent paths gracefully.

Suggested change:

- FileUtils.rm_rf(temp_dir) if Dir.exist?(temp_dir)
+ FileUtils.rm_rf(temp_dir)

Comment on lines +22 to +23
temp_dir = File.join(Dir.tmpdir, "discourse-pdf-#{SecureRandom.hex(8)}")
FileUtils.mkdir_p(temp_dir)

Contributor commented with a suggested change:

- temp_dir = File.join(Dir.tmpdir, "discourse-pdf-#{SecureRandom.hex(8)}")
- FileUtils.mkdir_p(temp_dir)
+ Dir.mktmpdir("discourse-pdf-#{SecureRandom.hex(8)}")

@@ -50,6 +47,28 @@ def self.valid_value?(val)
true
end

# returns an array of hashes (id: , name:, vision_enabled:)
def self.values_for_serialization(allowed_seeded_llm_ids: nil)
#if allowed_seeded_llms.is_a?(Array) && !allowed_seeded_llms.empty?

Contributor commented: did you mean to keep this?

Comment on lines +30 to +32
get visionLlmId() {
return this.args.model.rag_llm_model_id || "blank";
}

Contributor commented with a suggested change:

- return this.args.model.rag_llm_model_id || "blank";
+ return this.args.model.rag_llm_model_id ?? "blank";

I don't know if you expect 0 or negative ids, but I recommend this pattern for these kinds of defaults; it's safer: || falls back to "blank" for any falsy id (including 0), while ?? does so only when the id is null or undefined.

}
}

  get mappedDefaultLlm() {
-   return this.editingModel?.default_llm || "blank";
+   return this.editingModel?.default_llm_id || "blank";

Contributor commented with a suggested change:

- return this.editingModel?.default_llm_id || "blank";
+ return this.editingModel?.default_llm_id ?? "blank";

Same reasoning as in the comment above.

@@ -167,27 +167,27 @@ export default class PersonaEditor extends Component {
}

  get mappedQuestionConsolidatorLlm() {
-   return this.editingModel?.question_consolidator_llm || "blank";
+   return this.editingModel?.question_consolidator_llm_id || "blank";

Contributor commented with a suggested change:

- return this.editingModel?.question_consolidator_llm_id || "blank";
+ return this.editingModel?.question_consolidator_llm_id ?? "blank";