praxis.txt

Καλή η θεωρεία ας περάσουμε στην πράξη τώρα. 
Ας υποθέσουμε ότι ο φάκελος all_data είναι το dataset με όλα τα αρχεία που θέλουμε να εκπαιδεύσουμε το μοντέλο.
Τα αρχεία αυτά στο σιγκεκριμένο κώδικα step-4teliko_arheia.py υποθέτουμε ότι είναι txt. Μπορείτε να βάλετε ότι θέλετε μέσα φωτογραφιες, βιντεο κτλ αρκεί να επεξεργαστείτε κατάλληλα το step-4teliko_arheia.py.
Εκτελώντας το step-4teliko_arheia.py θα γίνουν τα παρακάτω:

α)  Ανάγνωση των ονομάτων των αρχείων από τον φάκελο all_data
β)  Τυχαία ανάκατεμα των αρχείων
γ)  Υπολογισμός του συνολικού αριθμού αρχείων
δ)  Υπολογισμός του μεγέθους κάθε τμήματος (33%)
ε)  Δημιουργία φακέλων για τα 3 τμήματα
στ) Διαχωρισμός των αρχείων και μεταφορά στους αντίστοιχους φακέλους. folders = ['data1', 'data2', 'data3']
Ο φάκελος data2 θα χρησιμοποιηθεί ως train folder  μια που έχει αντιπροσωπευτικό δείγμα του 1/3 όλης της βασης. Όποιον και να επιλέξουμε το ίδιο κάνει φυσικά.  


Στην συνέχεια τρέχουμε το step-5create_test.py, όπου εκεί θα γίνουν τα παρακάτω:

α)  Θα δημιουργηθούν οι φάκελοι merged, test_data και val_data
β)  Οι φάκελοι 'data1' και 'data3 θα συγχωνευτούν σε ένα φάκελο merged. (Δηλαδή αυτην την στιγμή έχουμε το φακελο data2 και merged)
γ)  μετράμε τα αρχεία στον φάκελο merged και εντελώς τυχαία διαλέγουμε το 1/5 για τεστ δηλαδή για τον φάκελο test και άλλο 1/5 για τον φάκελο val

Αυτό ήταν, μπορούμε τώρα να δοκιμάσουμε να εκπαιδεύσουμε το μοντέλο μας και να δούμε αν το Triadic-Optimization αξίζει τον κόπο, αν χρήζει βελτίωσης ή είναι για τα σκουπήδια! 

train_texts = process_files("data2")
val_texts = process_files("val_data")
test_texts = process_files("test_data")

===========

Let's move on to practice now.

Assume that the all_data folder represents the dataset with all the files we want to use to train the model. In this particular code, step-4teliko_arheia.py, we assume that these files are text files. You can include any type of files you want within this folder (e.g., images, videos, etc.), as long as you appropriately modify step-4teliko_arheia.py.

Running step-4teliko_arheia.py will perform the following:

a) Read the names of the files from the all_data folder.
b) Randomly shuffle the files.
c) Calculate the total number of files.
d) Calculate the size of each segment (33%).
e) Create folders for the 3 segments.
f) Separate the files and move them to the respective folders. folders = ['data1', 'data2', 'data3']. The data2 folder will be used as the train folder since it represents a representative sample of one third of the entire dataset. It doesn't matter which one we choose; the same principle applies.

Next, we run step-5create_test.py, where the following actions take place:

a) Create the folders merged, test_data, and val_data.
b) Merge the data1 and data3 folders into one folder named merged. (So at this point, we have the data2 and merged folders.)
c) Count the files in the merged folder and randomly select one fifth for testing and another one fifth for validation.

That's it! Now, we can try training our model and see if Triadic Optimization is worth it, if it needs improvement, or if it's just garbage!

train_texts = process_files("data2")
val_texts = process_files("val_data")
test_texts = process_files("test_data")