Add alternatives that accept more correct answers; enable OpenAI, Anthropic, and Gemini outputs to be viewed in eval-visualizer #237

Merged 30 commits on Feb 2, 2025
ff2090c
add cost_in_cents to oai model output
rishsriv Feb 2, 2025
2f3644e
include alternative answer to date question, since question does not …
rishsriv Feb 2, 2025
4dd5afc
last 7 days inclusive of today = CURRENT_DATE - INTERVAL('6 days')
rishsriv Feb 2, 2025
549490a
(part 2) last 7 days inclusive of today = CURRENT_DATE - INTERVAL('6 …
rishsriv Feb 2, 2025
8e93be0
alternative acceptable datetime syntax
rishsriv Feb 2, 2025
d3fc693
(part 2) alternative acceptable datetime syntax
rishsriv Feb 2, 2025
13bda5d
all customers technically includes those with 0 transactions
rishsriv Feb 2, 2025
7fdcc83
include salespeople who made 0 sales
rishsriv Feb 2, 2025
46ea3e1
include doctors who have made 0 prescriptions
rishsriv Feb 2, 2025
dff0090
include alternative date format
rishsriv Feb 2, 2025
777f35f
add alternative answers (formatting)
rishsriv Feb 2, 2025
9478c4c
add alternative answer
rishsriv Feb 2, 2025
e3a3827
alternative correct answer (date formatting)
rishsriv Feb 2, 2025
57ce3d3
changed phrasing for greater clarity
rishsriv Feb 2, 2025
066cd22
add alternative answer (datetime format)
rishsriv Feb 2, 2025
7d7bf55
phrasing, align with data
rishsriv Feb 2, 2025
a869e1b
add alternative answer (datetime format)
rishsriv Feb 2, 2025
f186f9b
add alternative answer (datetime format - )
rishsriv Feb 2, 2025
c668e9f
update README.md
rishsriv Feb 2, 2025
98bd0ba
add results of openai runner to eval visualizer
rishsriv Feb 2, 2025
1be92af
add anthropic and gemini results to eval visualizer
rishsriv Feb 2, 2025
e9dd1ed
add alternate answer
rishsriv Feb 2, 2025
c8204ab
add alternate answer (current_timestamp vs current_date)
rishsriv Feb 2, 2025
29aecef
alternate answer (datetime format)
rishsriv Feb 2, 2025
ff163d2
fix typo
rishsriv Feb 2, 2025
146517c
accept alternative answer
rishsriv Feb 2, 2025
1555d29
accept alternative answer
rishsriv Feb 2, 2025
f29ba99
add eval-visualizer logging to deepseek runner
rishsriv Feb 2, 2025
bda1f41
lint
rishsriv Feb 2, 2025
3bc9063
added alternate correct answer
rishsriv Feb 2, 2025
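
Commits 4dd5afc and 549490a turn on an off-by-one detail: a window of "the last 7 days inclusive of today" starts 6 days before today, not 7. The sketch below checks that arithmetic in Python rather than SQL; the helper name `last_n_days_inclusive` is hypothetical and not part of this repository.

```python
from datetime import date, timedelta


def last_n_days_inclusive(today: date, n: int) -> list[date]:
    """Return the n calendar days ending on (and including) `today`."""
    # Counting today as day 1 means the window starts n - 1 days back.
    start = today - timedelta(days=n - 1)
    return [start + timedelta(days=i) for i in range(n)]


window = last_n_days_inclusive(date(2025, 2, 2), 7)
print(window[0], window[-1], len(window))  # 2025-01-27 2025-02-02 7
```

Subtracting a full 7 days instead would yield an 8-day window once today is included, which is the mismatch these commits address.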
8 changes: 4 additions & 4 deletions README.md
@@ -138,7 +138,7 @@ python main.py \
-o results/openai_classic.csv results/openai_basic.csv results/openai_advanced.csv \
-g oa \
-f prompts/prompt_openai.json \
-  -m gpt-4-turbo \
+  -m o3-mini \
-p 5 \
-c 0
```
@@ -462,7 +462,7 @@ You can use the following flags in the command line to change the configurations
| CLI Flags | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| -g, --model_type | Model type used. Make sure this matches the model used. Currently defined options in `main.py` are `oa` for OpenAI models, `anthropic` for Anthropic models, `hf` for Hugging Face models, `vllm` for a vllm runner, `api` for API endpoints, `llama_cpp` for llama cpp, `mlx` for mlx, `bedrock` for AWS bedrock API, `together` for together.ai's API |
-| -m, --model      | Model that will be tested and used to generate the queries. Some options for OpenAI models are chat models `gpt-3.5-turbo-0613` and `gpt-4-0613`. Options for Anthropic include the latest claude-3 family of models (e.g. `claude-3-opus-20240229`). For Hugging Face, and VLLM models, simply use the path of your chosen model (e.g. `defog/sqlcoder`). |
+| -m, --model      | Model that will be tested and used to generate the queries. Some options for OpenAI models are chat models `gpt-4o` and `o3-mini`. Options for Anthropic include the latest claude-3 family of models (e.g. `claude-3-opus-20240229`). For Hugging Face, and VLLM models, simply use the path of your chosen model (e.g. `defog/sqlcoder`). |
| -a, --adapter | Path to the relevant adapter model you're using. Only available for the `hf_runner`. |
| --api_url | The URL of the custom API you want to send the prompt to. Only used when model_type is `api`. |
| -qz, --quantized | Indicate whether the model is an AWQ quantized model. Only available for `vllm_runner`. |
@@ -532,7 +532,7 @@ python main.py \
-o results/test.csv \
-g oa \
-f prompts/prompt_openai.json \
-  -m gpt-3.5-turbo-0613 \
+  -m gpt-4o-mini \
-n 1 \
--upload_url <your cloud function url>
```
@@ -550,7 +550,7 @@ python main.py \
-o results/test.csv \
-g oa \
-f prompts/prompt_openai.json \
-  -m gpt-3.5-turbo-0613 \
+  -m gpt-4o-mini \
-n 1 \
--upload_url http://127.0.0.1:8080/
```
18 changes: 9 additions & 9 deletions data/instruct_advanced_postgres.csv

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion data/instruct_basic_postgres.csv
@@ -30,7 +30,7 @@ derm_treatment,basic_group_order_limit,What are the top 2 specialties by number
derm_treatment,basic_left_join,"Return the patient IDs, first names and last names of patients who have not received any treatments.","SELECT p.patient_id, p.first_name, p.last_name FROM patients p LEFT JOIN treatments t ON p.patient_id = t.patient_id WHERE t.patient_id IS NULL"
derm_treatment,basic_left_join,Return the drug IDs and names of drugs that have not been used in any treatments.,"SELECT d.drug_id, d.drug_name FROM drugs d LEFT JOIN treatments t ON d.drug_id = t.drug_id WHERE t.drug_id IS NULL"
ewallet,basic_join_date_group_order_limit,"Who are the top 2 merchants (receiver type 1) by total transaction amount in the past 150 days (inclusive of 150 days ago)? Return the merchant name, total number of transactions, and total transaction amount.","SELECT m.name AS merchant_name, COUNT(t.txid) AS total_transactions, SUM(t.amount) AS total_amount FROM consumer_div.merchants m JOIN consumer_div.wallet_transactions_daily t ON m.mid = t.receiver_id WHERE t.receiver_type = 1 AND t.created_at >= CURRENT_DATE - INTERVAL '150 days' GROUP BY m.name ORDER BY total_amount DESC LIMIT 2"
-ewallet,basic_join_date_group_order_limit,"How many distinct active users sent money per month in 2023? Return the number of active users per month (as a date), starting from the earliest date. Do not include merchants in the query. Only include successful transactions.","SELECT DATE_TRUNC('month', t.created_at) AS MONTH, COUNT(DISTINCT t.sender_id) AS active_users FROM consumer_div.wallet_transactions_daily t JOIN consumer_div.users u ON t.sender_id = u.uid WHERE t.sender_type = 0 AND t.status = 'success' AND u.status = 'active' AND t.created_at >= '2023-01-01' AND t.created_at < '2024-01-01' GROUP BY MONTH ORDER BY MONTH"
+ewallet,basic_join_date_group_order_limit,"How many distinct active users sent money per month in 2023? Return the number of active users per month (as a date), starting from the earliest date. Do not include merchants in the query. Only include successful transactions.","SELECT DATE_TRUNC('month', t.created_at) AS MONTH, COUNT(DISTINCT t.sender_id) AS active_users FROM consumer_div.wallet_transactions_daily t JOIN consumer_div.users u ON t.sender_id = u.uid WHERE t.sender_type = 0 AND t.status = 'success' AND u.status = 'active' AND t.created_at >= '2023-01-01' AND t.created_at < '2024-01-01' GROUP BY MONTH ORDER BY MONTH;SELECT DATE_TRUNC('month', w.created_at)::DATE AS MONTH, COUNT(DISTINCT w.sender_id) AS active_user_count FROM consumer_div.wallet_transactions_daily w JOIN consumer_div.users u ON w.sender_id = u.uid WHERE w.sender_type = 0 AND w.status = 'success' AND w.created_at >= '2023-01-01' AND w.created_at < '2024-01-01' AND u.status = 'active' GROUP BY DATE_TRUNC('month', w.created_at) ORDER BY MONTH ASC;"
ewallet,basic_join_group_order_limit,"What are the top 3 most frequently used coupon codes? Return the coupon code, total number of redemptions, and total amount redeemed.","SELECT c.code AS coupon_code, COUNT(t.txid) AS redemption_count, SUM(t.amount) AS total_discount FROM consumer_div.coupons c JOIN consumer_div.wallet_transactions_daily t ON c.cid = t.coupon_id GROUP BY c.code ORDER BY redemption_count DESC LIMIT 3"
ewallet,basic_join_group_order_limit,"Which are the top 5 countries by total transaction amount sent by users, sender_type = 0? Return the country, number of distinct users who sent, and total transaction amount.","SELECT u.country, COUNT(DISTINCT t.sender_id) AS user_count, SUM(t.amount) AS total_amount FROM consumer_div.users u JOIN consumer_div.wallet_transactions_daily t ON u.uid = t.sender_id WHERE t.sender_type = 0 GROUP BY u.country ORDER BY total_amount DESC LIMIT 5"
ewallet,basic_join_distinct,"Return the distinct list of merchant IDs that have received money from a transaction. Consider all transaction types in the results you return, but only include the merchant ids in your final answer.",SELECT DISTINCT m.mid AS merchant_id FROM consumer_div.merchants m JOIN consumer_div.wallet_transactions_daily t ON m.mid = t.receiver_id WHERE t.receiver_type = 1
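
The updated ewallet row stores two accepted queries in one gold-query cell, joined by `;`. As a rough sketch of how such a cell might be consumed, the code below assumes each `;`-separated fragment is an independent alternative and that a generated query passes if its result set equals any alternative's. The helpers `split_alternatives` and `matches_any` are hypothetical, the toy table is invented for illustration, and SQLite stands in for Postgres; the repository's actual comparison logic may differ.

```python
import sqlite3


def split_alternatives(gold: str) -> list[str]:
    """Split a gold-query cell on ';' and drop empty fragments (e.g. after a trailing ';').

    Assumption: no ';' appears inside a string literal in the gold queries.
    """
    return [part.strip() for part in gold.split(";") if part.strip()]


def matches_any(conn: sqlite3.Connection, generated: str, gold: str) -> bool:
    """Return True if the generated query's rows equal the rows of any gold alternative."""
    got = sorted(conn.execute(generated).fetchall())
    return any(
        got == sorted(conn.execute(alt).fetchall())
        for alt in split_alternatives(gold)
    )


# Toy data: the same date may be reported in full or truncated to the month.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (d TEXT, n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [("2023-01-15", 1), ("2023-02-20", 2)])

gold = "SELECT d, n FROM t;SELECT substr(d, 1, 7) AS d, n FROM t;"
print(len(split_alternatives(gold)))                                  # 2
print(matches_any(conn, "SELECT substr(d, 1, 7), n FROM t", gold))    # True
print(matches_any(conn, "SELECT n FROM t", gold))                     # False
```

Comparing sorted row lists ignores row order and column names, which suits formatting-only alternatives like the date variants in this PR but would be too lenient for questions where ordering matters.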