
FEATURE: PDF support for rag pipeline #1118

Open · SamSaffron wants to merge 27 commits into main
Conversation

@SamSaffron (Member) commented Feb 7, 2025

This PR introduces several enhancements and refactorings to the AI Persona and RAG (Retrieval-Augmented Generation) functionalities within the discourse-ai plugin. Here's a breakdown of the changes:

1. LLM Model Association for RAG and Personas:

  • New Database Columns: Adds rag_llm_model_id to both ai_personas and ai_tools tables. This allows specifying a dedicated LLM for RAG indexing, separate from the persona's primary LLM. Adds default_llm_id and question_consolidator_llm_id to ai_personas.
  • Migration: Includes a migration (20250210032345_migrate_persona_to_llm_model_id.rb) that populates the new default_llm_id and question_consolidator_llm_id columns in ai_personas from the existing default_llm and question_consolidator_llm string columns, plus a post-migration that drops the old string columns (a sketch of the data migration follows this list).
  • Model Changes: The AiPersona and AiTool models now belong_to an LlmModel via rag_llm_model_id. The LlmModel.proxy method now accepts an LlmModel instance instead of just an identifier. AiPersona now has default_llm_id and question_consolidator_llm_id attributes.
  • UI Updates: The AI Persona and AI Tool editors in the admin panel now allow selecting an LLM for RAG indexing (if PDF/image support is enabled). The RAG options component displays an LLM selector.
  • Serialization: The serializers (AiCustomToolSerializer, AiCustomToolListSerializer, LocalizedAiPersonaSerializer) have been updated to include the new rag_llm_model_id, default_llm_id and question_consolidator_llm_id attributes.
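
A minimal sketch of what the data migration could look like, assuming the legacy string columns store identifiers in a "custom:<id>" form; the parsing logic and Rails version here are illustrative assumptions, not the PR's exact code:

class MigratePersonaToLlmModelId < ActiveRecord::Migration[7.2]
  def up
    # Copy each legacy string column into its new *_id integer column.
    %w[default_llm question_consolidator_llm].each do |old_column|
      execute <<~SQL
        UPDATE ai_personas
        SET #{old_column}_id = REPLACE(#{old_column}, 'custom:', '')::bigint
        WHERE #{old_column} LIKE 'custom:%'
      SQL
    end
  end

  def down
    raise ActiveRecord::IrreversibleMigration
  end
end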

2. PDF and Image Support for RAG:

  • Site Setting: Introduces a new hidden site setting, ai_rag_pdf_images_enabled, to control whether PDF and image files can be indexed for RAG. This defaults to false.
  • File Upload Validation: The RagDocumentFragmentsController now checks the ai_rag_pdf_images_enabled setting and allows PDF, PNG, JPG, and JPEG files if enabled. Error handling is included for cases where PDF/image indexing is attempted with the setting disabled.
  • PDF Processing: Adds a new utility class, DiscourseAi::Utils::PdfToImages, which uses ImageMagick (magick) to convert PDF pages into individual PNG images. A maximum PDF size and a conversion timeout are enforced (see the sketch after this list).
  • Image Processing: A new utility class, DiscourseAi::Utils::ImageToText, handles OCR for uploaded images and for the page images extracted from PDFs.
  • RAG Digestion Job: The DigestRagUpload job now handles PDF and image uploads. It uses PdfToImages and ImageToText to extract text and create document fragments.
  • UI Updates: The RAG uploader component now accepts PDF and image file types if ai_rag_pdf_images_enabled is true. The UI text is adjusted to indicate supported file types.
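
A minimal sketch of the conversion step, assuming DiscourseAi::Utils::PdfToImages shells out to ImageMagick's magick binary; the method names, density, size limit, and timeout below are illustrative assumptions:

require "tmpdir"

module DiscourseAi
  module Utils
    class PdfToImages
      MAX_PDF_SIZE = 100.megabytes # assumed limit
      CONVERT_TIMEOUT = 30.seconds # assumed timeout

      def initialize(upload:)
        @upload = upload
      end

      # Renders every PDF page to its own PNG and returns the file paths.
      def extract_pages
        raise "PDF too large" if @upload.filesize > MAX_PDF_SIZE

        temp_dir = Dir.mktmpdir("discourse-pdf-")
        pdf_path = Discourse.store.download(@upload).path

        # magick exits non-zero on failure; execute_command raises in that case.
        Discourse::Utils.execute_command(
          "magick", "-density", "150", pdf_path,
          "#{temp_dir}/page-%04d.png",
          timeout: CONVERT_TIMEOUT,
        )

        Dir.glob("#{temp_dir}/page-*.png").sort
      end
    end
  end
end

Per the job description above, DigestRagUpload then feeds each page image through ImageToText (OCR) and splits the resulting text into document fragments.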

3. Refactoring and Improvements:

  • LLM Enumeration: The DiscourseAi::Configuration::LlmEnumerator now provides a values_for_serialization method, which returns a simplified array of LLM data (id, name, vision_enabled) for use in serializers, avoiding exposing unnecessary details to the frontend (a sketch follows this list).
  • AI Helper: The AiHelper::Assistant now takes optional helper_llm and image_caption_llm parameters in its constructor, allowing for greater flexibility.
  • Bot and Persona Updates: Call sites across the codebase were updated to use the new model-based LLM association in place of the old string-based one.
  • Audit Logs: The DiscourseAi::Completions::Endpoints::Base now formats raw request payloads as pretty JSON for easier auditing.
  • Eval Script: An evaluation script is included.
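
A minimal sketch of values_for_serialization, matching the signature visible in the diff further down; the seeded-model filtering and the column names are assumptions:

# returns an array of hashes (id:, name:, vision_enabled:)
def self.values_for_serialization(allowed_seeded_llm_ids: nil)
  scope = LlmModel.all

  # Seeded (bundled) models are assumed to use reserved ids and are only
  # exposed when explicitly allowed.
  if allowed_seeded_llm_ids.is_a?(Array) && !allowed_seeded_llm_ids.empty?
    scope = scope.where("id > 0 OR id IN (?)", allowed_seeded_llm_ids)
  else
    scope = scope.where("id > 0")
  end

  scope.pluck(:id, :display_name, :vision_enabled).map do |id, name, vision_enabled|
    { id: id, name: name, vision_enabled: vision_enabled }
  end
end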

4. Testing:

  • The PR introduces a new eval system for LLMs, which lets us test how functionality behaves across various LLM providers. It lives in /evals.

@SamSaffron SamSaffron marked this pull request as draft February 7, 2025 04:40
@SamSaffron SamSaffron marked this pull request as ready for review February 12, 2025 00:55
@llm_model.vision_enabled
end

private

Contributor commented: probably a forgotten private

def system_message
<<~MSG
OCR the following page into Markdown. Tables should be formatted as Github flavored markdown.
Do not sorround your output with triple backticks.

Contributor commented with a suggested change:

- Do not sorround your output with triple backticks.
+ Do not surround your output with triple backticks.

@uploaded_pages = uploads
ensure
FileUtils.rm_rf(temp_dir) if Dir.exist?(temp_dir)

Contributor commented: FileUtils.rm_rf(path) already handles non-existent paths gracefully.

Suggested change:

- FileUtils.rm_rf(temp_dir) if Dir.exist?(temp_dir)
+ FileUtils.rm_rf(temp_dir)

Comment on lines +22 to +23
temp_dir = File.join(Dir.tmpdir, "discourse-pdf-#{SecureRandom.hex(8)}")
FileUtils.mkdir_p(temp_dir)

Contributor commented with a suggested change:

- temp_dir = File.join(Dir.tmpdir, "discourse-pdf-#{SecureRandom.hex(8)}")
- FileUtils.mkdir_p(temp_dir)
+ Dir.mktmpdir("discourse-pdf-#{SecureRandom.hex(8)}")

@@ -50,6 +47,28 @@ def self.valid_value?(val)
true
end

# returns an array of hashes (id: , name:, vision_enabled:)
def self.values_for_serialization(allowed_seeded_llm_ids: nil)
#if allowed_seeded_llms.is_a?(Array) && !allowed_seeded_llms.empty?

Contributor commented: did you mean to keep this?

Comment on lines +30 to +32
get visionLlmId() {
return this.args.model.rag_llm_model_id || "blank";
}

Contributor commented with a suggested change:

- return this.args.model.rag_llm_model_id || "blank";
+ return this.args.model.rag_llm_model_id ?? "blank";

I don't know if you expect 0 or negative ids, but I recommend this pattern for these kinds of defaults; it's safer: || falls back to "blank" for any falsy id (including 0), while ?? does so only when the id is null or undefined.

}
}

  get mappedDefaultLlm() {
-   return this.editingModel?.default_llm || "blank";
+   return this.editingModel?.default_llm_id || "blank";

Contributor commented with a suggested change:

- return this.editingModel?.default_llm_id || "blank";
+ return this.editingModel?.default_llm_id ?? "blank";

Same reasoning as in the comment above.

@@ -167,27 +167,27 @@ export default class PersonaEditor extends Component {
}

  get mappedQuestionConsolidatorLlm() {
-   return this.editingModel?.question_consolidator_llm || "blank";
+   return this.editingModel?.question_consolidator_llm_id || "blank";

Contributor commented with a suggested change:

- return this.editingModel?.question_consolidator_llm_id || "blank";
+ return this.editingModel?.question_consolidator_llm_id ?? "blank";