
fix, chore: Improve Groq translation speed and better translation with init prompt, update readme
vTuanpham committed Aug 11, 2024
1 parent 0b8b56c commit 44bc8db
Showing 5 changed files with 100 additions and 59 deletions.
34 changes: 24 additions & 10 deletions README.md
@@ -12,25 +12,25 @@
</a>
</p>

The Large Dataset Translator is a robust solution crafted to effectively translate sizable datasets into diverse languages. It provides a smooth and parallelized translation process, guaranteeing swift outcomes without the need for an API key. The tool facilitates multithreaded processing, allowing users to translate extensive datasets in significantly less time. Additionally, it features an automatic fail-restart mechanism, ensuring the seamless continuation of the translation process in case of any interruptions.
The Large Dataset Translator is a powerful solution designed to efficiently translate large datasets into various languages. It offers a streamlined and parallelized translation process, ensuring fast results without the need for an API key. The tool supports multithreaded processing, enabling users to translate extensive datasets in less time. It also includes an automatic fail-restart mechanism, ensuring uninterrupted translation in case of any issues.

### Features
### Key Features

- **Parallelized Translation**: Utilizes multithread processing to split large datasets into chunks and translate in parallel, significantly reducing processing time.
- **Parallelized Translation**: Utilizes multithreaded processing to divide large datasets into chunks and translate them in parallel, significantly reducing processing time.

- **Handling Large Lists**: Efficiently handles datasets containing large lists (e.g., dialogs) by splitting them into sub-lists and translating each sub-list in parallel.
- **Handling Large Lists**: Efficiently handles datasets with large lists (e.g., dialogs) by splitting them into sub-lists and translating each sub-list in parallel.

- **Automatic Retry Mechanism**: Any thread that fails during translation will automatically restart with its specific chunk until all data points are fully translated.
- **Automatic Retry Mechanism**: Automatically restarts any failed translation threads, ensuring all data points are fully translated.

- **Data Format Compatibility**: Converts datasets into a format supported by pyarrow and huggingface-datasets for seamless integration.
- **Data Format Compatibility**: Converts datasets into formats supported by pyarrow and huggingface-datasets for seamless integration.

- **Pre-Translation Filters**: Filters can be applied before translation, such as removing examples that might contain code.
- **Pre-Translation Filters**: Apply filters before translation, such as removing examples that may contain code.

- **GIL Resilience**: Python Global Interpreter Lock (GIL) won't affect speed, as tasks consist of purely I/O-bound operations.
- **GIL Resilience**: Python Global Interpreter Lock (GIL) does not impact speed, as tasks primarily involve I/O-bound operations.

- **Automatic Download**: Automatically downloads the converted dataset and the translated dataset on Colab upon completion.
- **Automatic Download**: Automatically downloads the converted dataset and translated dataset on Colab upon completion.

- **Unlimited Translation**: No API key is required, making it ideal for translating large datasets without limitations.
- **Unlimited Translation**: No API key required, making it ideal for translating large datasets without limitations.
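The chunked, multithreaded flow the features above describe can be sketched as follows. This is a minimal illustration, not the library's actual implementation; `fake_translate` is a stand-in for a real I/O-bound translator call:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_translate(chunk):
    # Stand-in for a real, I/O-bound translation call; because the real
    # work is network-bound, threads overlap despite the GIL.
    return [text.upper() for text in chunk]

def translate_in_chunks(examples, chunk_size=3, max_workers=4):
    # Split the dataset into chunks, translate the chunks in parallel,
    # then reassemble the results in their original order.
    chunks = [examples[i:i + chunk_size]
              for i in range(0, len(examples), chunk_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        translated = list(executor.map(fake_translate, chunks))
    return [text for chunk in translated for text in chunk]

print(translate_in_chunks(["hello", "world", "good", "morning", "friend"]))
# → ['HELLO', 'WORLD', 'GOOD', 'MORNING', 'FRIEND']
```

`executor.map` preserves input order, so the flattened output lines up with the original examples even though chunks finish at different times.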

### Demonstration

@@ -79,6 +79,20 @@ python examples/YahmaAlpaca/AlpacaCleaned_Parser.py
```
Check the examples/YahmaAlpaca directory when the script has finished; there should be a parsed dataset and a Vietnamese dataset.

#### LLM-based Translation
For higher-quality, context-aware translation using an LLM, you can use the following script:

```sh
%run examples/argilla-magpie-ultra-v0.1-groq/MagpieUltraV01.py
```
or locally with:
```sh
python examples/argilla-magpie-ultra-v0.1-groq/MagpieUltraV01.py
```

This script is capable of translating approximately 100 examples every 6-7 minutes using Groq. To use it, you will need to obtain a free [API key](https://console.groq.com/keys) and set the environment variable by executing `export GROQ_API_KEY=<your_api_key>`.
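The start-up check that guards this requirement can be sketched as a small stand-alone helper (a sketch mirroring the check in `providers/groq_provider.py`, not the provider's exact code):

```python
import os

def require_groq_key() -> str:
    # Fail fast with a helpful message when the key is missing,
    # mirroring the check GroqProvider performs at start-up.
    key = os.environ.get("GROQ_API_KEY")
    if not key:
        raise KeyError(
            "Please set GROQ_API_KEY (export GROQ_API_KEY=<your_api_key>); "
            "a free key is available at https://console.groq.com/keys"
        )
    return key
```

Checking the variable once up front is preferable to letting the first API call fail minutes into a long translation run.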


## Usage
### To translate your own dataset:
1. Inherit the DataParser class and implement your read and convert logic.
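A minimal sketch of that pattern is shown below. `Base` here is only a stand-in for the repo's `DataParser` so the example is self-contained; the field names mirror the target schema used in the bundled examples:

```python
class Base:
    # Stand-in for the repo's DataParser, which also handles translation,
    # chunking, and saving; only the read/convert contract is shown here.
    def __init__(self, file_path, output_path):
        self.file_path = file_path
        self.output_path = output_path
        self.data_read = None
        self.converted_data = None

class MyDatasetParser(Base):
    def read(self):
        # The read step must assign the raw examples to self.data_read.
        self.data_read = [{"instruction": "Say hi", "response": "Hi!"}]

    def convert(self):
        # The convert step maps raw records into the target schema; the
        # real parser then translates the fields listed in target_fields.
        self.converted_data = [
            {"question_text": d["instruction"],
             "orig_answer_texts": d["response"]}
            for d in self.data_read
        ]

parser = MyDatasetParser("dummy.txt", "out/")
parser.read()
parser.convert()
print(parser.converted_data)
# → [{'question_text': 'Say hi', 'orig_answer_texts': 'Hi!'}]
```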
@@ -13,7 +13,7 @@

PARSER_NAME = "MagpieUltraV01"

# Patience is the key since the data is large and using LLM based translator
# Patience is key since the data is large and an LLM-based translator is used
class MagpieUltraV01Parser(DataParser):
def __init__(self, file_path: str, output_path: str):
super().__init__(file_path, output_path,
@@ -23,18 +23,17 @@ def __init__(self, file_path: str, output_path: str):
target_fields=['question_text', 'orig_answer_texts'], # The data fields to be translated (these fields belong to BaseConfig)
do_translate=True,
no_translated_code=False,
translator=GroqProvider,
parser_callbacks=[VerboseCallback],
max_example_per_thread=400,
large_chunks_threshold=2000) # The callback to be called after the data has been converted and translated
translator=GroqProvider, # Groq is very slow but it is a high quality translator
parser_callbacks=[VerboseCallback], # The callback to be called after the data has been converted and translated
max_example_per_thread=50, # Set this to a lower number since a failed translation will cause the whole thread to restart, losing all the progress of the thread
large_chunks_threshold=3000)

# Read function must assign data that has been read to self.data_read
def read(self) -> None:
# The read function must call the read function in DataParser class
# I just want to be sure that the file path is correct
super(MagpieUltraV01Parser, self).read()

# OpenOcra is pretty large, so adjust accordingly
self.data_read = load_dataset("argilla/magpie-ultra-v0.1")
self.system_prompts = load_dataset("teilomillet/system_prompt")

@@ -51,7 +50,12 @@ def convert(self) -> None:
for data in tqdm(self.data_read[split], desc=f"Converting {split} data"):
data_dict = {}
random_index = random.randint(0, len(self.system_prompts['train']) - 1)
data_dict['system_prompt'] = self.system_prompts['train'][random_index]['prompt']

if random.random() < 0.5:
data_dict['system_prompt'] = ""
else:
data_dict['system_prompt'] = self.system_prompts['train'][random_index]['prompt']

data_dict['qas_id'] = self.id_generator()
data_dict['question_text'] = data['instruction']
data_dict['orig_answer_texts'] = data['response']
@@ -60,14 +64,14 @@ def convert(self) -> None:
data_converted.append(data_dict)

# Be sure to assign the final data list to self.converted_data
self.converted_data = data_converted[:2000]
self.converted_data = data_converted[:6000]

return None


if __name__ == '__main__':
magpie_ultra_v01_parser = MagpieUltraV01Parser(r"examples/argilla-magpie-ultra-v0.1/dummy.txt",
r"examples/argilla-magpie-ultra-v0.1")
magpie_ultra_v01_parser = MagpieUltraV01Parser(r"examples/argilla-magpie-ultra-v0.1-groq/dummy.txt",
r"examples/argilla-magpie-ultra-v0.1-groq")
magpie_ultra_v01_parser.read()
magpie_ultra_v01_parser.convert()
magpie_ultra_v01_parser.save
Empty file.
5 changes: 5 additions & 0 deletions install.sh
@@ -4,6 +4,11 @@ echo "Installing dependencies..."

pip install -r requirements.txt
pip install groq==0.9.0

if [ -z "$GROQ_API_KEY" ]; then
    echo "GROQ_API_KEY environment variable is not set. Please set it to your Groq API key to use the Groq provider."
fi

pip install httpx==1.0.0.beta0 --force-reinstall

echo "Installation completed successfully!"
96 changes: 57 additions & 39 deletions providers/groq_provider.py
@@ -2,8 +2,7 @@
import sys
import json

from typing import Union, List, Optional
import concurrent.futures
from typing import Union, List

from pydantic import Field
sys.path.insert(0,r'./')
@@ -12,26 +11,32 @@
try:
from .base_provider import Provider
from .utils import *
from .google_provider import GoogleProvider
except ImportError:
from base_provider import Provider
from utils import *
from google_provider import GoogleProvider

# Cache the fail prompt to avoid running translation again for subsequent calls
CACHE_FAIL_PROMPT = []

# Max list length of 5, cache all prompt and remove old prompt if the length is greater than 5, fuzzy match the current prompt with the cache prompt and return the fail_translation_code if the similarity is greater than 0.8
CACHE_PROMPT = []
# Use GoogleProvider to translate the prefix system prompt and the postfix prompt, steering the model to translate the input data into the corresponding language
INIT_PROMPT_TRANSLATOR = GoogleProvider()
# Cache the init prompt to avoid running translation again for subsequent calls
CACHE_INIT_PROMPT = {}


# The GroqProvider class uses the Groq API to translate text from one language to another via an LLM; expect a high-quality translation, but it is very slow (~100 examples every 6-7 minutes)
class GroqProvider(Provider):
def __init__(self):

try:
os.environ["GROQ_API_KEY"]
self.groq_client = Groq(
api_key=os.environ.get("GROQ_API_KEY"),
)
except KeyError:
raise KeyError("Please set the environment variable GROQ_API_KEY")
raise KeyError("Please set the environment variable GROQ_API_KEY by running `export GROQ_API_KEY=<your_api_key>`. A free API key can be obtained from https://console.groq.com/keys.")

self.groq_client = Groq(
api_key=os.environ.get("GROQ_API_KEY"),
)
self.translator = self.groq_client.chat.completions.create

def construct_schema_prompt(self, schema: dict) -> str:
@@ -41,7 +46,7 @@ def construct_schema_prompt(self, schema: dict) -> str:

return schema_prompt + json_prompt

@throttle(calls_per_minute=30, verbose=False)
@throttle(calls_per_minute=200, verbose=False)
def _do_translate(self, input_data: Union[str, List[str]],
src: str, dest: str,
fail_translation_code:str = "P1OP1_F", # Pass in this code to replace the input_data if the exception is *unavoidable*; any example that contains this code will be removed post-translation
@@ -61,32 +66,54 @@ def _do_translate(self, input_data: Union[str, List[str]],

Translation = create_dynamic_model("Translation", translation_fields)

system_prompt = f"You are a helpful assistant that translates text from {from_language_name} to {dest_language_name}. You must consider things that should not be translated like names, places, code variables, etc. You should also consider the context of the text to provide the most accurate translation. You will only reply with the translation text and nothing else in JSON. \n\n{self.construct_schema_prompt(Translation.model_json_schema()['properties'])}"
postfix_prompt = f"Translate all the text above from {from_language_name} to {dest_language_name} and return the translations in the corresponding fields of the JSON object."

prompt += f"\n\nTranslate all the text above from {from_language_name} to {dest_language_name} and return the translations the corresonding fields in the JSON object."
system_prompt = f"You are a helpful assistant that translates text from {from_language_name} to {dest_language_name}. You must consider things that should not be translated like names, places, code variables, latex, etc. You should also consider the context of the text to provide the most accurate translation. You will only reply with the **translation text** and nothing else in JSON."
postfix_system_prompt = f"{self.construct_schema_prompt(Translation.model_json_schema()['properties'])}"

else:
system_prompt = f"You are a helpful assistant that translates text from {from_language_name} to {dest_language_name}. You must consider things that should not be translated like names, places, code variables, latex, etc. You should also consider the context of the text to provide the most accurate translation. Only reply with the **translation text** and nothing else as this will be used directly, this is very important."

postfix_system_prompt = ""

prompt = input_data

postfix_prompt = f"Translate the above text from {from_language_name} to {dest_language_name}."

# Check if the init prompt for this language pair is already in the cache
cache_key = (src, dest, "list") if data_type == "list" else (src, dest)
if cache_key not in CACHE_INIT_PROMPT:
    translated_system_prompt = INIT_PROMPT_TRANSLATOR.translate(system_prompt, src=src, dest=dest)
    translated_postfix_prompt = INIT_PROMPT_TRANSLATOR.translate(postfix_prompt, src=src, dest=dest)
    # Cache the translated init prompt for subsequent calls
    CACHE_INIT_PROMPT[cache_key] = (translated_system_prompt, translated_postfix_prompt)

translated_system_prompt, translated_postfix_prompt = CACHE_INIT_PROMPT[cache_key]

translated_system_prompt += "\n\n" + postfix_system_prompt if postfix_system_prompt else ""
translated_prompt = prompt + "\n\n" + translated_postfix_prompt

prompt = f"{input_data}\n\n Translate the above text from {from_language_name} to {dest_language_name}."

chat_args = {
"messages": [
{
"role": "system",
"content": system_prompt,
"content": translated_system_prompt,
},
{
"role": "user",
"content": prompt
"content": translated_prompt
}
],
"model": "gemma2-9b-it",
"temperature": 0.7,
"model": "llama3-8b-8192",
"temperature": 0.45,
"top_p": 0.5,
"max_tokens": 8000,
"frequency_penalty": 0.25,
"frequency_penalty": 0.4,
"presence_penalty": 0.25,
"stream": False,
}
@@ -98,19 +125,22 @@ def _do_translate(self, input_data: Union[str, List[str]],
if data_type == "list": return [fail_translation_code, fail_translation_code]
return fail_translation_code

if len(CACHE_PROMPT) > 5:
CACHE_PROMPT.pop(0)
# Clear the cache if the cache is too large
if len(CACHE_FAIL_PROMPT) > 5:
CACHE_FAIL_PROMPT.pop(0)
if len(CACHE_INIT_PROMPT) > 5:
    # CACHE_INIT_PROMPT is a dict, so evict its oldest-inserted key
    CACHE_INIT_PROMPT.pop(next(iter(CACHE_INIT_PROMPT)))

try:
output = self.translator(**chat_args)
except Exception as e:
# Check if the exception is unavoidable by fuzzy matching the prompt with the cache prompt
if fuzzy_match(input_data, CACHE_PROMPT, threshold=80):
if fuzzy_match(input_data, CACHE_FAIL_PROMPT, threshold=80):
print(f"Unavoidable exception: {e}")
if data_type == "list": return [fail_translation_code, fail_translation_code]
return fail_translation_code
else:
CACHE_PROMPT.append(input_data)
CACHE_FAIL_PROMPT.append(input_data)
raise e

if data_type == "list":
@@ -121,23 +151,11 @@ def _do_translate(self, input_data: Union[str, List[str]],
else:
return output.choices[0].message.content

def run_test_translation(test):
print(test.translate(["Hello", "How are you today ?"], src="en", dest="vi"))
print(test.translate("Hello", src="en", dest="vi"))

if __name__ == '__main__':
test = GroqProvider()
print(test.translate(["Hello", "How are you today ?"], src="en", dest="vi"))
print(test.translate("Hello", src="en", dest="vi"))
input_data = ["Hello", "How are you today ?"]
src = "en"
dest = "vi"

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
futures = [executor.submit(run_test_translation, GroqProvider()) for _ in range(30)]
for future in concurrent.futures.as_completed(futures):
try:
result = future.result()
except Exception as e:
print(f"Translation failed: {e}")

print(test.translate(["VIETNAMESE", "JAPANESE"], src="en", dest="vi"))
print(test.translate("HELLO IN VIETNAMESE", src="en", dest="vi"))
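The fail-prompt cache in the provider above can be sketched in isolation. The repo imports its `fuzzy_match` helper from `providers/utils.py`; this stand-alone version substitutes `difflib.SequenceMatcher`, which is an assumption about that helper's behavior, not its actual implementation:

```python
from difflib import SequenceMatcher

FAIL_CACHE = []   # most recent failing prompts
MAX_CACHE = 5     # bound the cache, like the provider's length-5 cap

def seen_similar_failure(prompt, threshold=0.8):
    # True when the prompt closely matches a previously failed one,
    # signalling an "unavoidable" failure that should not be retried.
    return any(SequenceMatcher(None, prompt, old).ratio() >= threshold
               for old in FAIL_CACHE)

def record_failure(prompt):
    if len(FAIL_CACHE) >= MAX_CACHE:
        FAIL_CACHE.pop(0)  # drop the oldest entry to keep the cache bounded
    FAIL_CACHE.append(prompt)

record_failure("Translate this paragraph")
print(seen_similar_failure("Translate this paragraph!"))  # → True
print(seen_similar_failure("Completely different text"))  # → False
```

The point of the pattern is to distinguish transient errors (retry the chunk) from prompts that deterministically fail (skip them with the fail code instead of restarting the thread forever).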
