
FEATURE: PDF support for rag pipeline #1118

Open · wants to merge 27 commits into main

Commits (27)
2d9c72c
FEATURE: PDF support for rag pipeline
SamSaffron Feb 7, 2025
3c7dd74
OK this now sort of works, need to extract llm selector
SamSaffron Feb 7, 2025
b64511e
work in progress, eval
SamSaffron Feb 8, 2025
34b9521
lets add a case that attempts to jailbreak proofread
SamSaffron Feb 8, 2025
2032f5f
better output
SamSaffron Feb 8, 2025
d4695ec
introduce a log
SamSaffron Feb 8, 2025
4d231c3
allow regex
SamSaffron Feb 8, 2025
0875406
this is a jailbreak that intentionally breaks our prompt
SamSaffron Feb 8, 2025
18c6a80
moving evals to own repo, then we can have huge ones
SamSaffron Feb 9, 2025
ace9f94
infra for pdf evals
SamSaffron Feb 9, 2025
4d1798c
add new rag_llm_model_id which is used for ocr
SamSaffron Feb 10, 2025
fdd4a9b
move llm to id column - work in progress
SamSaffron Feb 10, 2025
2181e2a
fix various specs... a bunch left
SamSaffron Feb 10, 2025
72e9576
fix more specs
SamSaffron Feb 10, 2025
4ba0d5c
more experimental columns removed
SamSaffron Feb 10, 2025
e2f71f1
another spec fixed
SamSaffron Feb 10, 2025
938a445
fix more cases where default_llm was used
SamSaffron Feb 10, 2025
848692c
fix more specs
SamSaffron Feb 11, 2025
2f4276a
reduce mocking to make test more stable
SamSaffron Feb 11, 2025
5e9cb80
specs passing, system specs next
SamSaffron Feb 11, 2025
5736da6
tests are passing, but stuff is not working yet..
SamSaffron Feb 11, 2025
db6e28a
mostly working now, need better progress story and error handling
SamSaffron Feb 11, 2025
f5ce2db
we need more time
SamSaffron Feb 11, 2025
25c97ca
Move allowing or disallowing pdf/images to a site setting
SamSaffron Feb 12, 2025
b0a549b
fix tests
SamSaffron Feb 12, 2025
10ea742
refactor eval framework into a simpler structure
SamSaffron Feb 12, 2025
bcb7cdf
image to text support
SamSaffron Feb 12, 2025
2 changes: 2 additions & 0 deletions .gitignore
@@ -2,3 +2,5 @@ node_modules
/gems
/auto_generated
.env
evals/log
evals/cases
@@ -14,5 +14,7 @@ export default class DiscourseAiToolsEditRoute extends DiscourseRoute {

controller.set("allTools", toolsModel);
controller.set("presets", toolsModel.resultSetMeta.presets);
controller.set("llms", toolsModel.resultSetMeta.llms);
controller.set("settings", toolsModel.resultSetMeta.settings);
}
}
@@ -11,5 +11,7 @@ export default class DiscourseAiToolsNewRoute extends DiscourseRoute {

controller.set("allTools", toolsModel);
controller.set("presets", toolsModel.resultSetMeta.presets);
controller.set("llms", toolsModel.resultSetMeta.llms);
controller.set("settings", toolsModel.resultSetMeta.settings);
}
}
@@ -3,5 +3,7 @@
@tools={{this.allTools}}
@model={{this.model}}
@presets={{this.presets}}
@llms={{this.llms}}
@settings={{this.settings}}
/>
</section>
@@ -3,5 +3,7 @@
@tools={{this.allTools}}
@model={{this.model}}
@presets={{this.presets}}
@llms={{this.llms}}
@settings={{this.settings}}
/>
</section>
22 changes: 16 additions & 6 deletions app/controllers/discourse_ai/admin/ai_personas_controller.rb
@@ -32,10 +32,19 @@ def index
}
end
llms =
DiscourseAi::Configuration::LlmEnumerator
.values(allowed_seeded_llms: SiteSetting.ai_bot_allowed_seeded_models)
.map { |hash| { id: hash[:value], name: hash[:name] } }
render json: { ai_personas: ai_personas, meta: { tools: tools, llms: llms } }
DiscourseAi::Configuration::LlmEnumerator.values_for_serialization(
allowed_seeded_llm_ids: SiteSetting.ai_bot_allowed_seeded_models_map,
)
render json: {
ai_personas: ai_personas,
meta: {
tools: tools,
llms: llms,
settings: {
rag_pdf_images_enabled: SiteSetting.ai_rag_pdf_images_enabled,
},
},
}
end

def new
@@ -187,15 +196,16 @@ def ai_persona_params
:priority,
:top_p,
:temperature,
:default_llm,
:default_llm_id,
:user_id,
:max_context_posts,
:vision_enabled,
:vision_max_pixels,
:rag_chunk_tokens,
:rag_chunk_overlap_tokens,
:rag_conversation_chunks,
:question_consolidator_llm,
:rag_llm_model_id,
:question_consolidator_llm_id,
:allow_chat_channel_mentions,
:allow_chat_direct_messages,
:allow_topic_mentions,
1 change: 1 addition & 0 deletions app/controllers/discourse_ai/admin/ai_tools_controller.rb
@@ -90,6 +90,7 @@ def ai_tool_params
:summary,
:rag_chunk_tokens,
:rag_chunk_overlap_tokens,
:rag_llm_model_id,
rag_uploads: [:id],
parameters: [:name, :type, :description, :required, enum: []],
)
@@ -49,6 +49,7 @@ def upload_file
def validate_extension!(filename)
extension = File.extname(filename)[1..-1] || ""
authorized_extensions = %w[txt md]
authorized_extensions.concat(%w[pdf png jpg jpeg]) if SiteSetting.ai_rag_pdf_images_enabled
if !authorized_extensions.include?(extension)
raise Discourse::InvalidParameters.new(
I18n.t(
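The extension gate added to `validate_extension!` above can be sketched as a standalone method. This is a hypothetical stand-in: in the plugin the flag comes from `SiteSetting.ai_rag_pdf_images_enabled` and the failure raises `Discourse::InvalidParameters`; here a plain keyword argument and `ArgumentError` take their places.

```ruby
# Sketch of the RAG upload extension gate (hypothetical standalone version).
# pdf_images_enabled: stands in for SiteSetting.ai_rag_pdf_images_enabled.
def validate_extension!(filename, pdf_images_enabled:)
  extension = File.extname(filename)[1..-1] || ""
  authorized = %w[txt md]
  authorized.concat(%w[pdf png jpg jpeg]) if pdf_images_enabled
  unless authorized.include?(extension)
    # the real code raises Discourse::InvalidParameters with an I18n message
    raise ArgumentError, "#{extension.inspect} is not allowed for RAG uploads"
  end
  extension
end
```

With the setting off, `pdf` behaves exactly like any other unknown extension, so existing txt/md-only installs see no behavior change.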
35 changes: 33 additions & 2 deletions app/jobs/regular/digest_rag_upload.rb
@@ -28,7 +28,7 @@ def execute(args)

# Check if this is the first time we process this upload.
if fragment_ids.empty?
document = get_uploaded_file(upload)
document = get_uploaded_file(upload: upload, target: target)
return if document.nil?

RagDocumentFragment.publish_status(upload, { total: 0, indexed: 0, left: 0 })
@@ -163,7 +163,38 @@ def first_chunk(text, chunk_tokens:, tokenizer:, splitters: ["\n\n", "\n", ".",
[buffer, split_char]
end

def get_uploaded_file(upload)
def get_uploaded_file(upload:, target:)
if %w[pdf png jpg jpeg].include?(upload.extension) && !SiteSetting.ai_rag_pdf_images_enabled
raise Discourse::InvalidAccess.new(
"The setting ai_rag_pdf_images_enabled is false, can not index images and pdfs.",
)
end
if upload.extension == "pdf"
pages =
DiscourseAi::Utils::PdfToImages.new(
upload: upload,
user: Discourse.system_user,
).uploaded_pages

return(
DiscourseAi::Utils::ImageToText.as_fake_file(
uploads: pages,
llm_model: target.rag_llm_model,
user: Discourse.system_user,
)
)
end

if %w[png jpg jpeg].include?(upload.extension)
return(
DiscourseAi::Utils::ImageToText.as_fake_file(
uploads: [upload],
llm_model: target.rag_llm_model,
user: Discourse.system_user,
)
)
end

store = Discourse.store
@file ||=
if store.external?
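The new routing in `get_uploaded_file` has three paths: PDFs are rasterized into page images by `DiscourseAi::Utils::PdfToImages` and those pages are OCR'd into a text "fake file" by `DiscourseAi::Utils::ImageToText`; plain image uploads go straight to the OCR step; everything else falls through to the normal store read. A minimal sketch of just that dispatch — the returned symbols are illustrative labels, not plugin API:

```ruby
# Dispatch sketch for get_uploaded_file. Symbols label the three code
# paths in the PR; they are not real return values of the plugin.
def rag_document_source(extension, pdf_images_enabled:)
  if %w[pdf png jpg jpeg].include?(extension) && !pdf_images_enabled
    # mirrors the Discourse::InvalidAccess guard in the PR
    raise ArgumentError, "ai_rag_pdf_images_enabled is false"
  end

  case extension
  when "pdf"
    :pdf_to_images_then_ocr # PdfToImages pages fed to ImageToText.as_fake_file
  when "png", "jpg", "jpeg"
    :image_ocr              # ImageToText.as_fake_file on the single upload
  else
    :store_read             # txt/md read from Discourse.store as before
  end
end
```

Note that the guard runs before the dispatch, so flipping the site setting off cannot leave a half-indexed PDF path reachable.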
88 changes: 46 additions & 42 deletions app/models/ai_persona.rb
@@ -1,8 +1,8 @@
# frozen_string_literal: true

class AiPersona < ActiveRecord::Base
# TODO remove this line 01-1-2025
self.ignored_columns = %i[commands allow_chat mentionable]
# TODO remove this line 01-10-2025
self.ignored_columns = %i[default_llm question_consolidator_llm]

# places a hard limit, so per site we cache a maximum of 500 classes
MAX_PERSONAS_PER_SITE = 500
@@ -12,7 +12,7 @@ class AiPersona < ActiveRecord::Base
validates :system_prompt, presence: true, length: { maximum: 10_000_000 }
validate :system_persona_unchangeable, on: :update, if: :system
validate :chat_preconditions
validate :allowed_seeded_model, if: :default_llm
validate :allowed_seeded_model, if: :default_llm_id
validates :max_context_posts, numericality: { greater_than: 0 }, allow_nil: true
# leaves some room for growth but sets a maximum to avoid memory issues
# we may want to revisit this in the future
@@ -30,6 +30,10 @@ class AiPersona < ActiveRecord::Base
belongs_to :created_by, class_name: "User"
belongs_to :user

belongs_to :default_llm, class_name: "LlmModel"
belongs_to :question_consolidator_llm, class_name: "LlmModel"
belongs_to :rag_llm_model, class_name: "LlmModel"

has_many :upload_references, as: :target, dependent: :destroy
has_many :uploads, through: :upload_references

@@ -62,7 +66,7 @@ def self.persona_users(user: nil)
user_id: persona.user_id,
username: persona.user.username_lower,
allowed_group_ids: persona.allowed_group_ids,
default_llm: persona.default_llm,
default_llm_id: persona.default_llm_id,
force_default_llm: persona.force_default_llm,
allow_chat_channel_mentions: persona.allow_chat_channel_mentions,
allow_chat_direct_messages: persona.allow_chat_direct_messages,
@@ -157,12 +161,12 @@ def class_instance
user_id
system
mentionable
default_llm
default_llm_id
max_context_posts
vision_enabled
vision_max_pixels
rag_conversation_chunks
question_consolidator_llm
question_consolidator_llm_id
allow_chat_channel_mentions
allow_chat_direct_messages
allow_topic_mentions
@@ -302,7 +306,7 @@ def chat_preconditions
if (
allow_chat_channel_mentions || allow_chat_direct_messages || allow_topic_mentions ||
force_default_llm
) && !default_llm
) && !default_llm_id
errors.add(:default_llm, I18n.t("discourse_ai.ai_bot.personas.default_llm_required"))
end
end
@@ -332,13 +336,12 @@ def ensure_not_system
end

def allowed_seeded_model
return if default_llm.blank?
return if default_llm_id.blank?

llm = LlmModel.find_by(id: default_llm.split(":").last.to_i)
return if llm.nil?
return if !llm.seeded?
return if default_llm.nil?
return if !default_llm.seeded?

return if SiteSetting.ai_bot_allowed_seeded_models.include?(llm.id.to_s)
return if SiteSetting.ai_bot_allowed_seeded_models_map.include?(default_llm.id.to_s)

errors.add(:default_llm, I18n.t("discourse_ai.llm.configuration.invalid_seeded_model"))
end
@@ -348,36 +351,37 @@ def allowed_seeded_model
#
# Table name: ai_personas
#
# id :bigint not null, primary key
# name :string(100) not null
# description :string(2000) not null
# system_prompt :string(10000000) not null
# allowed_group_ids :integer default([]), not null, is an Array
# created_by_id :integer
# enabled :boolean default(TRUE), not null
# created_at :datetime not null
# updated_at :datetime not null
# system :boolean default(FALSE), not null
# priority :boolean default(FALSE), not null
# temperature :float
# top_p :float
# user_id :integer
# default_llm :text
# max_context_posts :integer
# vision_enabled :boolean default(FALSE), not null
# vision_max_pixels :integer default(1048576), not null
# rag_chunk_tokens :integer default(374), not null
# rag_chunk_overlap_tokens :integer default(10), not null
# rag_conversation_chunks :integer default(10), not null
# question_consolidator_llm :text
# tool_details :boolean default(TRUE), not null
# tools :json not null
# forced_tool_count :integer default(-1), not null
# allow_chat_channel_mentions :boolean default(FALSE), not null
# allow_chat_direct_messages :boolean default(FALSE), not null
# allow_topic_mentions :boolean default(FALSE), not null
# allow_personal_messages :boolean default(TRUE), not null
# force_default_llm :boolean default(FALSE), not null
# id :bigint not null, primary key
# name :string(100) not null
# description :string(2000) not null
# system_prompt :string(10000000) not null
# allowed_group_ids :integer default([]), not null, is an Array
# created_by_id :integer
# enabled :boolean default(TRUE), not null
# created_at :datetime not null
# updated_at :datetime not null
# system :boolean default(FALSE), not null
# priority :boolean default(FALSE), not null
# temperature :float
# top_p :float
# user_id :integer
# max_context_posts :integer
# vision_enabled :boolean default(FALSE), not null
# vision_max_pixels :integer default(1048576), not null
# rag_chunk_tokens :integer default(374), not null
# rag_chunk_overlap_tokens :integer default(10), not null
# rag_conversation_chunks :integer default(10), not null
# tool_details :boolean default(TRUE), not null
# tools :json not null
# forced_tool_count :integer default(-1), not null
# allow_chat_channel_mentions :boolean default(FALSE), not null
# allow_chat_direct_messages :boolean default(FALSE), not null
# allow_topic_mentions :boolean default(FALSE), not null
# allow_personal_messages :boolean default(TRUE), not null
# force_default_llm :boolean default(FALSE), not null
# rag_llm_model_id :bigint
# default_llm_id :bigint
# question_consolidator_llm_id :bigint
#
# Indexes
#
3 changes: 2 additions & 1 deletion app/models/ai_tool.rb
@@ -8,6 +8,7 @@ class AiTool < ActiveRecord::Base
validates :script, presence: true, length: { maximum: 100_000 }
validates :created_by_id, presence: true
belongs_to :created_by, class_name: "User"
belongs_to :rag_llm_model, class_name: "LlmModel"
has_many :rag_document_fragments, dependent: :destroy, as: :target
has_many :upload_references, as: :target, dependent: :destroy
has_many :uploads, through: :upload_references
@@ -371,4 +372,4 @@ def self.presets
# rag_chunk_tokens :integer default(374), not null
# rag_chunk_overlap_tokens :integer default(10), not null
# tool_name :string(100) default(""), not null
#
# rag_llm_model_id :bigint
2 changes: 1 addition & 1 deletion app/models/llm_model.rb
@@ -70,7 +70,7 @@ def self.provider_params
end

def to_llm
DiscourseAi::Completions::Llm.proxy(identifier)
DiscourseAi::Completions::Llm.proxy(self)
end

def identifier
8 changes: 7 additions & 1 deletion app/serializers/ai_custom_tool_list_serializer.rb
@@ -6,7 +6,13 @@ class AiCustomToolListSerializer < ApplicationSerializer
has_many :ai_tools, serializer: AiCustomToolSerializer, embed: :objects

def meta
{ presets: AiTool.presets }
{
presets: AiTool.presets,
llms: DiscourseAi::Configuration::LlmEnumerator.values_for_serialization,
settings: {
rag_pdf_images_enabled: SiteSetting.ai_rag_pdf_images_enabled,
},
}
end

def ai_tools
1 change: 1 addition & 0 deletions app/serializers/ai_custom_tool_serializer.rb
@@ -10,6 +10,7 @@ class AiCustomToolSerializer < ApplicationSerializer
:script,
:rag_chunk_tokens,
:rag_chunk_overlap_tokens,
:rag_llm_model_id,
:created_by_id,
:created_at,
:updated_at
5 changes: 3 additions & 2 deletions app/serializers/localized_ai_persona_serializer.rb
@@ -14,15 +14,16 @@ class LocalizedAiPersonaSerializer < ApplicationSerializer
:allowed_group_ids,
:temperature,
:top_p,
:default_llm,
:default_llm_id,
:user_id,
:max_context_posts,
:vision_enabled,
:vision_max_pixels,
:rag_chunk_tokens,
:rag_chunk_overlap_tokens,
:rag_conversation_chunks,
:question_consolidator_llm,
:rag_llm_model_id,
:question_consolidator_llm_id,
:tool_details,
:forced_tool_count,
:allow_chat_channel_mentions,