`examples/sentence-transformers-training/nli/README.md` (+26 −2)
@@ -4,6 +4,8 @@ Given two sentences (premise and hypothesis), the task of Natural Language Infer
The paper by [Conneau et al.](https://arxiv.org/abs/1705.02364) shows that NLI data can be quite useful when training sentence embedding methods. The [Sentence-BERT paper](https://arxiv.org/abs/1908.10084) uses NLI as a first fine-tuning step for sentence embedding methods.
# General Models
## Single-card Training
To pre-train on the NLI task:
@@ -46,7 +48,29 @@ For multi-card training you can use the script of [gaudi_spawn.py](https://githu
## Single-card Training with LoRA+gradient_checkpointing
Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 130GB of memory, which exceeds the capacity of a single HPU (Gaudi 2 with 98GB memory). To address this, we can utilize LoRA and gradient checkpointing techniques to reduce the memory requirements, making it feasible to train the model on a single HPU.
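In rough terms, and assuming the generic `transformers`/`peft` APIs rather than the example's actual training script (the rank, alpha, and target modules below are illustrative assumptions), the combination looks like this:

```python
# Minimal sketch: wrap the base model with LoRA adapters and enable gradient
# checkpointing, so only a small set of adapter weights is trained and
# activations are recomputed instead of stored.
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

model = AutoModel.from_pretrained("intfloat/e5-mistral-7b-instruct", torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=8,                                  # assumption: low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections of the Mistral backbone
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only a small fraction of the 7B parameters is trainable

model.gradient_checkpointing_enable()     # trade extra compute for lower activation memory
```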
## Multi-card Training with DeepSpeed

Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 130GB of memory, which exceeds the capacity of a single HPU (Gaudi 2 with 98GB memory). To address this, we can use the Zero2/Zero3 stages of DeepSpeed, which partition optimizer states, gradients, and (with Zero3) model parameters across devices to reduce the memory requirements.
Our tests have shown that training this model requires at least four HPUs when using DeepSpeed Zero2.
In the above command, we need to enable lazy mode with a learning rate of `1e-7` and configure DeepSpeed using the `ds_config.json` file. To further reduce memory usage, change the stage to 3 (DeepSpeed Zero3) in the `ds_config.json` file.
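As an illustration of that stage change (this is not the repository's actual `ds_config.json`; every value other than `zero_optimization.stage` is an assumption), a minimal config could be generated like so:

```python
# Hypothetical helper: write a minimal DeepSpeed config with ZeRO stage 3.
# Only "zero_optimization.stage" is the point being illustrated here.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # assumption: small per-device batch
    "gradient_accumulation_steps": 8,      # assumption
    "bf16": {"enabled": True},             # assumption: bf16 mixed precision
    "zero_optimization": {
        "stage": 3,                        # 2 = Zero2; set to 3 (Zero3) for lower memory usage
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```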
# Dataset
We combine [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) and [MultiNLI](https://huggingface.co/datasets/nyu-mll/multi_nli) into a dataset we call [AllNLI](https://huggingface.co/datasets/sentence-transformers/all-nli). These two datasets contain sentence pairs and one of three labels: entailment, neutral, contradiction:
@@ -58,7 +82,7 @@ We combine [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) and [MultiNL
We format AllNLI in a few different subsets, compatible with different loss functions. See the [triplet subset of AllNLI](https://huggingface.co/datasets/sentence-transformers/all-nli/viewer/triplet) as an example.
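For instance, assuming the public `datasets` and `sentence-transformers` APIs (the base model below is only illustrative), the triplet subset pairs naturally with `MultipleNegativesRankingLoss`:

```python
# Minimal sketch: load the triplet subset of AllNLI and pick a loss that
# expects (anchor, positive, negative) columns.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss

train_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train")
print(train_dataset[0])  # {'anchor': ..., 'positive': ..., 'negative': ...}

model = SentenceTransformer("distilbert-base-uncased")  # illustrative base model
loss = MultipleNegativesRankingLoss(model)              # compatible with the triplet subset
```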
`examples/sentence-transformers-training/sts/README.md` (+27 −2)
@@ -5,6 +5,8 @@ Semantic Textual Similarity (STS) assigns a score on the similarity of two texts
- **[training_stsbenchmark.py](training_stsbenchmark.py)** - This example shows how to create a SentenceTransformer model from scratch by using a pre-trained transformer model (e.g. [`distilbert-base-uncased`](https://huggingface.co/distilbert/distilbert-base-uncased)) together with a pooling layer (see the sketch below this list).
- **[training_stsbenchmark_continue_training.py](training_stsbenchmark_continue_training.py)** - This example shows how to continue training on STS data for a previously created & trained SentenceTransformer model (e.g. [`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)).
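A minimal sketch of the from-scratch construction described in the first item, using the standard `sentence-transformers` modules API (the mean-pooling choice and sequence length are assumptions):

```python
# Build a SentenceTransformer from a plain pre-trained transformer plus a pooling layer.
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer("distilbert/distilbert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",  # assumption: mean pooling over token embeddings
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```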
# General Models
## Single-card Training
To fine-tune on the STS task:
@@ -33,7 +35,30 @@ For multi-card training you can use the script of [gaudi_spawn.py](https://githu
# Large Models (intfloat/e5-mistral-7b-instruct Model)
## Single-card Training with LoRA+gradient_checkpointing
Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 130GB of memory, which exceeds the capacity of a single HPU (Gaudi 2 with 98GB memory). To address this, we can utilize LoRA and gradient checkpointing techniques to reduce the memory requirements, making it feasible to train the model on a single HPU.
## Multi-card Training with DeepSpeed

Pretraining the `intfloat/e5-mistral-7b-instruct` model requires approximately 130GB of memory, which exceeds the capacity of a single HPU (Gaudi 2 with 98GB memory). To address this, we can use the Zero2/Zero3 stages of DeepSpeed, which partition optimizer states, gradients, and (with Zero3) model parameters across devices to reduce the memory requirements.
Our tests have shown that training this model requires at least four HPUs when using DeepSpeed Zero2.
In the above command, we need to enable lazy mode with a learning rate of `1e-7` and configure DeepSpeed using the `ds_config.json` file. To further reduce memory usage, change the stage to 3 (DeepSpeed Zero3) in the `ds_config.json` file.
# Training data
Here is a simplified version of our training data: