
Manager Implementation Details


Predefined Requirements

Our goal was to have Manager be more capable than any other AI assistant service in the domain of task & schedule management. To achieve that, we expected Manager to be capable of the following:

  • Handle any kind of prompt that has even the slightest possibility of being interpreted as an operation on plans
  • Perform any type of operation that can be done in the application
  • Automatically categorize new plans into the user-defined categories based on context
  • Be more efficient than manual operation by the user (e.g., operate with lower latency than the time it takes for the user to type in all the details when adding a plan)
  • Understand and utilize user-specific information such as records of existing plans & categories
    • “과제 카테고리 모두 삭제해줘.” (“Delete everything in the ‘과제’ (assignments) category.”)
    • “’소개원실 기말고사’보다 1시간 뒤에 ‘치킨 시켜먹기’ 추가해줘.” (“Add ‘치킨 시켜먹기’ (order chicken) one hour after ‘소개원실 기말고사’ (the 소개원실 final exam).”)
  • Give a brief summary of plans that meet the condition contained in the prompt

For this, we decided to use a large language model (LLM) to convert user prompts into a format that could be directly used for operations in the local application. Prompting an LLM to perform tasks with text and performing application-specific tasks based on the output of an LLM are two completely different things. We needed to decide on a highly expressive output format that was:

  1. Capable of expressing all kinds of requests
  2. Accurately generable by the selected LLM

Output Format & Base Model Selection

Output format and base language model were chosen based on the requirements above.

SQL vs. Custom Format

We first considered using a custom format more minimal and task-specific than SQL, defining functions in the local application and parsing prompts into their arguments. However, despite our domain being narrow, the required expressivity was still extremely high, because our domain mainly limited the type of data being handled, not the type of operations.

SQL (specifically the SQL dialect of SQLite) met our requirements, e.g. searching for records based on a condition (WHERE), operating on time data like numeric types, referencing existing records (SELECT), and so on. It also proved generable by LLMs and was selected as the output format.
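To make the later examples concrete, here is a minimal sketch of the kind of schema and generated query this implies. The table and column names (TODO, CATEGORY, due_time, category_id) are inferred from the exemplars further down this page, not taken from the actual Calendy schema:

```python
import sqlite3

# Assumed, simplified schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CATEGORY (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE TODO (
    id INTEGER PRIMARY KEY,
    title TEXT,
    due_time TEXT,                                -- ISO-8601 datetime string
    category_id INTEGER REFERENCES CATEGORY(id)
);
""")

# The kind of query the LLM is expected to generate: a condition on time
# (WHERE + date()/LIKE) combined with a reference to an existing record (SELECT).
generated_query = """
DELETE FROM TODO
WHERE category_id = (SELECT id FROM CATEGORY WHERE title = '과제')
  AND date(due_time) LIKE '2023-11%';
"""
conn.executescript(generated_query)
```

Later sections explain why, in the final design, conditions on titles are expressed with ids provided in the prompt rather than with a WHERE condition on the title itself.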

GPT-4 vs. Fine-tunable Model

Our initial consideration was to fine-tune a model for the specific task of converting user prompts into SQL queries and to perform inference on a cloud-based service. However, testing with a KoAlpaca model fine-tuned on a dataset of 550 input-output pairs generated by GPT-4 showed severely lacking performance in both latency and accuracy. The lack of accuracy was due to training with limited data and using the lightest (7B) model. The latter problem, along with the high latency, was inevitable given our resource limitations, so a change of plans was necessary. We then turned to GPT from OpenAI, which offered two current versions: GPT-4 (fine-tuning unsupported) and GPT-3.5 (fine-tunable). Testing both base models, GPT-4 showed far better accuracy and stability than GPT-3.5, with similar latency. Although GPT-4 did not support fine-tuning, it supported few-shot prompting, which proved sufficient for our case in testing. GPT-4 was selected as our base model.

Prompt Engineering

There are three “roles” you can use when calling OpenAI’s Chat Completions API, which GPT-4 is served through. The “system” role carries the base instruction for generating output, the “user” role carries input from the user, and the “assistant” role carries output generated by the model. “user” and “assistant” messages can also be used to provide context from past conversations, or to provide exemplars for few-shot prompting. We kept the temperature value at 0 to ensure consistent output.
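For illustration, a minimal sketch of how the three roles and the temperature setting map onto a Chat Completions call; the system prompt text here is illustrative, not the actual Calendy prompt, and the exemplar is the one reused from the Latency section below:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    # "system": the base instruction for generating output (illustrative text)
    {"role": "system",
     "content": "Convert the user's request into a single SQLite query."},
    # "user"/"assistant" pair: one few-shot exemplar
    {"role": "user", "content": "이번달 과제 모두 삭제해줘."},
    {"role": "assistant",
     "content": "DELETE FROM TODO WHERE category_id=1 AND date(due_time) LIKE '2023-11%';"},
    # the actual request
    {"role": "user", "content": "민석이랑 밥먹기 삭제해줘."},
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    temperature=0,  # temperature 0 keeps the output consistent
)
print(response.choices[0].message.content)
```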

We experimented with prompts structured in various ways (the text itself, the role in the API call, and the data provided) to increase accuracy while keeping latency as low as possible.

Latency

Latency was mainly dependent on the size of the output, with surprisingly little correlation to the size of the input regardless of the role. So we focused on keeping the output as short as possible by guiding the model to output the shortest of the many equivalent queries. Along with instructing the model to output only the necessary queries and to choose the shortest query possible, we identified cases where the model still did not give the shortest output and set up exemplars to fix them, for example:

  • USER: //When using WHERE condition on month, use LIKE to minimize query length
    이번달 과제 모두 삭제해줘. (“Delete all of this month’s assignments.”)

    ASSISTANT: "DELETE FROM TODO WHERE category_id=1 AND date(due_time) LIKE '2023-11%';"

Data Provided to & Handled by GPT-4

Along with the basic user prompt, information such as the current time and user-specific data needs to be provided as context. User-specific data includes records of existing plans & categories and past conversations. Including all of this data was not an option due to the token limit and latency. Since SQL supports referencing values of existing records (SELECT), the model could in principle generate queries that handle context without being given that context directly. However, doing so would forgo the LLM's strengths in handling the data and allow only operations supported by SQL.

Consider the user prompt “민석이랑 밥먹기 삭제해줘” (“Delete ‘eating with 민석’”). Without any user-specific data provided, the output query would be of the form “… WHERE title=’민석이랑 밥먹기’”. This would not produce the correct result if the plan was actually titled “민석이랑 밥”. Such inconsistencies were hard to handle using SQL operations alone, especially when the data type was not numeric. So for the title attribute of plans, which is both the most frequently used attribute and text-typed, we provided the titles along with their ids to the model. We also provided the title and id of each category. The model was prompted and trained to search the given titles directly and use the corresponding id for conditions on the title attribute, instead of a WHERE condition in SQL. Along with enabling various user prompts relating to the title, this also enabled intelligent and automatic categorization of new plans based on the titles of the categories and the values of the new plan.
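A rough sketch of how this context could be serialized into the prompt, and the kind of query the model is then expected to produce. The serialization format, helper name, and ids here are assumptions for illustration; the actual data block is the one linked on the Few-shot Prompting Exemplars page below:

```python
from datetime import datetime

def build_context(todos, categories):
    """Serialize the current time and (id, title) pairs into a single block
    of text for the "user" role. The exact format here is an assumption;
    only the idea of passing ids alongside titles comes from this page."""
    lines = [f"current time: {datetime.now().isoformat(timespec='minutes')}"]
    lines.append("TODO: " + ", ".join(f"({i}, '{t}')" for i, t in todos))
    lines.append("CATEGORY: " + ", ".join(f"({i}, '{t}')" for i, t in categories))
    return "\n".join(lines)

todos = [(12, "민석이랑 밥"), (13, "소개원실 기말고사")]
categories = [(1, "과제"), (2, "약속")]
print(build_context(todos, categories))

# Given this context, for the prompt "민석이랑 밥먹기 삭제해줘" the model is
# expected to match the title itself and condition on the id, e.g.
#   DELETE FROM TODO WHERE id=12;
# rather than emitting WHERE title='민석이랑 밥먹기', which would match nothing.
```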

SQLite also supports retrieval of the current time. However, we decided to provide the current time as part of the prompt for the same reason as above: the time operations supported by SQLite were not sufficient to handle conditions such as “11월 매주 화요일” (“every Tuesday in November”), which could only be handled correctly by giving the model the current time explicitly.
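For example, given the current time, such a relative condition can be expanded into explicit dates that SQLite date functions could not derive from the text alone; a hedged illustration (the query shape is an assumption, not taken from our exemplars):

```python
from datetime import date, timedelta

# With the current time known to be in November 2023, "11월 매주 화요일"
# expands into four concrete dates: Nov 7, 14, 21, 28.
tuesdays = [date(2023, 11, 1) + timedelta(days=d) for d in range(30)
            if (date(2023, 11, 1) + timedelta(days=d)).weekday() == 1]

# A query the model might then emit over those days (illustrative shape only):
query = ("SELECT * FROM TODO WHERE date(due_time) IN "
         "('2023-11-07','2023-11-14','2023-11-21','2023-11-28');")
```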

Then why not provide as much information as possible and utilize the LLM to the fullest? This proved problematic even without the token limit, because as the instructions grew while implementing each new data type, the rules in the prompt became harder and harder to enforce. An example of this was the case above of enforcing direct search for conditions on title attributes instead of a WHERE condition, which turned out challenging as the model behaved stubbornly. Also, SQL operations were sufficient to handle data types other than text. For these reasons, we provided only the title values along with ids and the current time to the model.

System Prompt

The instructions in our prompt require the model to go through a procedural process. Such prompting is a known strategy for achieving the desired output from an LLM, but due to the nature of inference, the procedures are not followed in order and the rules within them are not strictly enforced. This key difference between traditional programming and prompting an LLM posed a challenge for us, since the conversion task we required was quite deterministic. When the instructions became even slightly long or complicated, the model would adhere to only a subset of the rules while ignoring the procedures. The following is the workflow we used to overcome this:

  1. Construct the prompt with minimal set of rules and simple decision cases.
  2. Test the output for compliance with the rules.
  3. Adjust the prompt by putting more weight on the rules for stubborn cases.
  4. Add exemplars in the prompt for unstable cases.

Returning to the case of enforcing direct search on title attributes instead of a WHERE condition: GPT-4 kept trying to use a WHERE condition on titles even when the instructions said otherwise. In cases like these, emphatic expressions such as “NEVER”, “ALWAYS”, and “NOT” were used in the prompt to guide the model in the correct direction. One interesting thing to note is that using the uppercase “WHERE” to refer to the WHERE condition in SQL resulted in a dramatic performance improvement over using the lowercase “where” in the prompt.

In cases where exaggerated expressions alone were insufficient to fix GPT-4’s stubborn behavior, we used the strategy of framing the stubborn case as the exception and the neglected case as the default choice. For example, consider a case where you want the model to choose between two options A and B, but the model stubbornly chooses A. Here, prompting the model to consider B as the only choice and A as an exceptional case balances the output as desired.
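For illustration only, a fragment written in the style of the rules described above; this is a reconstruction, not the actual Calendy system prompt (which is linked under “System Prompt” below):

```python
# Reconstructed wording in the style described above, NOT the actual prompt.
SYSTEM_PROMPT_RULES = """
- NEVER use a WHERE condition on the title column. ALWAYS look up the
  matching id in the provided (id, title) pairs and use that id instead.
- By default, output B. Treat A as an exceptional case that is allowed
  ONLY when <condition> holds; otherwise B is the only valid choice.
"""
```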

As such, engineering the system prompt was an iterative process of experimenting with various test cases to check GPT-4’s tendencies for each rule and adjusting the prompt accordingly to achieve the desired behavior.

System Prompt

Few-shot Prompting GPT-4

Few-shot prompting was used in the following two cases:

  1. The output is unstable in adhering to the rules provided in the system prompt.
  2. A rule needs to be introduced but is too trivial to include in the system prompt.

Since the exemplars provided in the API call also count toward the token limit, we tried to keep their number as small as possible by formulating the system prompt to handle as wide a range of cases as possible. We also used compound exemplars that demonstrate multiple desired behaviors in a single case.

The placement of user-specific data is crucial in few-shot prompting because the output needs to be independent of the data used in the exemplars. So we provide the user-specific data (current time and id-title pairs of each table) as a “user”-role message, keeping the system prompt unchanged regardless of the request. Using this strategy, we were able to confirm that GPT-4 was not overfitting to the user-specific data used in the exemplars. Following is the user-specific data and the list of exemplars used, with a justification for each exemplar.

Few-shot Prompting Exemplars

API Call Process & Format

A list of messages, each labeled with a role, needs to be provided to the API. In the Calendy server, the list of messages contains a single system prompt, a single “user”-role message containing the user-specific data, and a series of exemplars given as pairs of user prompts and desired outputs. When a request is made, the provided user prompt and the user-specific data are concatenated before being appended to the list of messages as a “user”-role message. Finally, the API call is made.
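A condensed sketch of this assembly in the order described above; the function names and the concatenation format are assumptions, and the actual system prompt, user-specific data block, and exemplars are the ones linked elsewhere on this page:

```python
from openai import OpenAI

client = OpenAI()

def build_messages(system_prompt, exemplar_user_data, exemplars,
                   user_data, user_prompt):
    messages = [
        {"role": "system", "content": system_prompt},
        # fixed user-specific data that the exemplars refer to
        {"role": "user", "content": exemplar_user_data},
    ]
    for prompt, query in exemplars:          # (user prompt, desired SQL) pairs
        messages.append({"role": "user", "content": prompt})
        messages.append({"role": "assistant", "content": query})
    # the actual request: current user-specific data + the user's prompt
    messages.append({"role": "user", "content": user_data + "\n" + user_prompt})
    return messages

def prompt_to_sql(messages):
    response = client.chat.completions.create(
        model="gpt-4", messages=messages, temperature=0)
    return response.choices[0].message.content
```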

Summary Generation

For summary generation, we leveraged the existing SQL-conversion module along with a separate server endpoint whose API call uses a prompt dedicated to summary generation. The flow is as follows (a sketch follows the list):

  1. The user sends a prompt including ‘요약’ (summary) or ‘브리핑’ (briefing).
  2. The user prompt is sent to the SQL conversion endpoint, which returns a SELECT query.
  3. The data of the selected plan records is extracted locally.
  4. The data (plus the current time) is sent to the summary generation endpoint, which returns the summary.
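A rough sketch of this flow; the endpoint paths, payload format, server URL, and the keyword check are placeholders and assumptions, and only the four-step structure is taken from the list above:

```python
import sqlite3
from datetime import datetime

import requests

SERVER = "https://calendy.example.com"  # placeholder, not the real server URL

def handle_prompt(user_prompt: str, db_path: str) -> str:
    # 1. Detect a summary request (assumed simple keyword check; the page only
    #    states that the prompt includes '요약' (summary) or '브리핑' (briefing))
    if "요약" in user_prompt or "브리핑" in user_prompt:
        # 2. Convert the prompt into a SELECT query via the SQL endpoint
        select_query = requests.post(f"{SERVER}/manager/sql",
                                     json={"prompt": user_prompt}).text
        # 3. Extract the selected plan records locally
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(select_query).fetchall()
        # 4. Send the records plus the current time to the summary endpoint
        payload = {"plans": rows, "now": datetime.now().isoformat()}
        return requests.post(f"{SERVER}/manager/summary", json=payload).text
    # otherwise, fall back to the normal SQL-conversion path described above
    ...
```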

The prompt for summary generation contains information about the input format, the output format, and the desired length.

Summary Generation Prompt

Our initial intent was to utilize GPT-4 for detecting summary requests. The simplest way to achieve this would have been to include summary generation in the original prompt. This was not feasible in our case because the model is not provided with enough information about the plans, and such a large prompt would be very unstable. The second consideration was to modify the original prompt to raise a flag when it detects a summary request and handle the summary locally. This required changing the output format to include information other than the queries, which turned out to be unstable and inaccurate; given the size of the prompt, a single line was not sufficient to guide the model to apply the summary-request detection logic consistently. There was also the option of using the “function” feature recently introduced by OpenAI to raise flags more directly, but testing showed it to be impractically inaccurate and unstable for prompts as large as ours.