FSL is based on the "Formulate" library, available at https://github.com/l0d0v1c/formulate, and on the local module FSL.py
#!pip install "https://github.com/l0d0v1c/formulate/blob/main/dist/formulate-1.3-py3-none-any.whl?raw=true"
# uncomment the line above to install Formulate
from formulate.components import components
from FSL import formulationsymboliclanguage
FSL is a language focused on formulation description and deep learning. A formulation is a list of ingredients and quantities. FSL transforms this recipe into a string inspired by the SMILES language used to represent molecules. These strings may be used, for instance, to train a deep autoencoder and generate new formulations from existing ones.
Ingredients can be either major or minor. Major components are the ones usually present in significant amounts; minor ones are usually additives used to modify properties of the formulation, such as colouring or viscosity agents. Major ingredients are encoded with the Latin alphabet and minor ones with the Greek alphabet. To be included in FSL, each FORMULATE object must embed a minor <True|False> property.
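The letter assignment can be sketched as follows (a minimal illustration of the idea; `build_dict` and its argument format are hypothetical, not the library's API):

```python
# Alphabets used by FSL: Latin capitals for major ingredients,
# Greek letters for minor ones.
MAJOR = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
MINOR = list("αβγδεζηθικλμνξοπρστυφχψω")

def build_dict(ingredients):
    """Assign the next free Latin letter to each major ingredient
    and the next free Greek letter to each minor one."""
    latin, greek = iter(MAJOR), iter(MINOR)
    return {name: next(greek) if minor else next(latin)
            for name, minor in ingredients}

print(build_dict([("Nitrogen", False), ("Oxygen", False), ("Water", True)]))
# {'Nitrogen': 'A', 'Oxygen': 'B', 'Water': 'α'}
```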
Consider air as a formulation with oxygen and nitrogen as major ingredients and water as a minor additive:
c=components(physical={"∆Hf":True,"rho":None,"minor":None})
c.add("Water","H2O",{'∆Hf':-285.83,"rho":1.0,'minor':True})
c.add("Nitrogen","N2",{'∆Hf':0,"rho":0.01,'minor':False})
c.add("Oxygen","O2",{'∆Hf':0,"rho":0.01,'minor':False})
c.setrates({"Water":0.01,"Oxygen":0.19,'Nitrogen':0.8})
c.mixing()
| | Component | Rate | N | O | H | ∆Hf | rho | minor |
|---|---|---|---|---|---|---|---|---|
0 | Water | 0.01 | 0.0000 | 55.50800 | 111.01700 | -15865.9700 | 1.0 | 1 |
1 | Nitrogen | 0.80 | 71.3940 | 0.00000 | 0.00000 | 0.0000 | 0.01 | 0 |
2 | Oxygen | 0.19 | 0.0000 | 62.50200 | 0.00000 | 0.0000 | 0.01 | 0 |
3 | Formulation | 1.00 | 57.1152 | 12.43046 | 1.11017 | -158.6597 | Non additive | Non additive |
We can now encode the air formulation:
from IPython.display import display, HTML
f=formulationsymboliclanguage([c])
e=f.encode([c])
display(HTML(f"<span style='font-size:3em'>{e[0]}</span>"))
ABα
The dictionary of ingredients is
f.dict
{'Water': 'α', 'Nitrogen': 'A', 'Oxygen': 'B'}
To train an autoencoder we need a list of formulations with the same ingredients in varying quantities. During FSL initialisation you can define a "dose". In formulation recipes, the quantity of each component is often given in units (oz, parts, etc.), and FSL uses the same representation:
formulationsymboliclanguage(formulae,granulo=5)
means that, for each ingredient, the delta between the maximum and the minimum quantity is split into 5 doses. So CCCD means 3 doses of C and one dose of D. Minor components are represented by a single letter.
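The dose arithmetic can be sketched like this (an illustrative helper, not part of the library; `lo` and `hi` stand for an ingredient's minimum and maximum quantities across the recipe set):

```python
def doses(quantity, lo, hi, granulo=5):
    """Number of doses for a quantity: the [lo, hi] range of the
    ingredient is split into `granulo` doses of size delta."""
    delta = (hi - lo) / granulo
    return round((quantity - lo) / delta)

# An ingredient ranging from 0 to 5 units with granulo=5:
# one dose = 1 unit, so 3 units of "C" encode as "CCC".
print("C" * doses(3, lo=0, hi=5))  # CCC
```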
Let's try encoding a recipe book of cocktails.
import pandas as pd
df=pd.read_excel("cocktails.xlsx")
df.head()
| | Unnamed: 0 | nom | categ | i1 | d1 | i2 | d2 | i3 | d3 | i4 | d4 | i5 | d5 | i6 | d6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Gauguin | Cocktail Classics | Light Rum | 2.0 | Passion Fruit Syrup | 1.0 | Lemon Juice | 1.00 | Lime Juice | 1.00 | NaN | NaN | NaN | NaN |
1 | 1 | Fort Lauderdale | Cocktail Classics | Light Rum | 1.5 | Sweet Vermouth | 0.5 | Juice of Orange | 0.25 | Juice of a Lime | 0.25 | NaN | NaN | NaN | NaN |
2 | 2 | Apple Pie | Cordials and Liqueurs | Apple schnapps | 3.0 | Cinnamon schnapps | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 3 | Cuban Cocktail No. 1 | Cocktail Classics | Juice of a Lime | 0.5 | Powdered Sugar | 0.5 | Light Rum | 2.00 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 4 | Cool Carlos | Cocktail Classics | Dark rum | 1.5 | Cranberry Juice | 2.0 | Pineapple Juice | 2.00 | Orange curacao | 1.00 | Sour Mix | 1.0 | NaN | NaN |
Now we have to transform this sheet into a list of formulations. Since many ingredients are used only a few times, they are not usable for deep learning training, so we limit the list of major ingredients to those used in more than 30 recipes. The rare ingredients are represented as minors.
from collections import Counter
ingredients=[]
for i in range(1,7):
    for j in df[f"i{i}"].tolist():
        ingredients.append(j)
ingredients=Counter(ingredients)
composant={}
for name,cnt in ingredients.items():
    if cnt>30:
        composant[name]={'minor':False}
print(f"based on {len(composant)} ingredients")
listcompo=[]
for i,j in df.iterrows():
    try:
        cp=components(physical={"minor":None})
        rates={}
        for k in range(1,7):
            if j[f"d{k}"]==j[f"d{k}"] and j[f"i{k}"]==j[f"i{k}"]:  # NaN != NaN, so this skips empty cells
                name=j[f"i{k}"]
                if name in composant:
                    rate=j[f"d{k}"]
                    cp.add(name,"",{'minor':False})
                    rates[name]=rate
                else:
                    cp.add(name,"",{'minor':True})
                    rates[name]=0.001
        cp.setrates(rates)
        cp.mixing()
    except:
        pass
    listcompo.append(cp)
based on 23 ingredients
For instance, we can inspect the first cocktail:
listcompo[0].formulationlist
| | Component | Rate | minor |
|---|---|---|---|
0 | Light Rum | 0.666 | 0 |
1 | Passion Fruit Syrup | 0.000 | 1 |
2 | Lemon Juice | 0.333 | 0 |
3 | Lime Juice | 0.000 | 1 |
4 | Formulation | 1.000 | Non additive |
Then we encode the full recipe book:
cocktails=formulationsymboliclanguage(listcompo,granulo=10,verbose=False)
As the number of minor ingredients is limited by the length of the Greek alphabet, some of them are not encoded. It is possible to use longer alphabets by changing the class lists:
formulationsymboliclanguage.major=list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
formulationsymboliclanguage.minor=list('αβγδεζηθικλμνξοπρστυφχψω')
So we can now get an encoded training set. You may display unencoded ingredients by specifying verbose=True.
encoded=cocktails.encode(listcompo)
The first cocktail is encoded as
display(HTML(f"<div style='font-size:3em;'>Encoded recipe 0 : {encoded[0]}</div>"))
Let's check what A means:
name={j:i for i,j in cocktails.dict.items()}['A']
print(f"Ingredient: {name}")
print(f"Minimum in recipes : {cocktails.min[name]}, maximum: {cocktails.max[name]}")
print(f"One dose of {name} is {cocktails.delta[name]}")
Ingredient: Light Rum
Minimum in recipes : 0.04, maximum: 1.0
One dose of Light Rum is 0.096
Encoding is a balance between accuracy (quantities are encoded as a number of doses) and the number of available recipes: long encoded FSL strings give good accuracy but require many recipes to train a deep autoencoder. For instance, let's decode the encoded recipe:
display(HTML("<span style='font-size:2em;'>FSL encoded recipe is:</span>"))
display(cocktails.decode([encoded[0]])[0].formulationlist)
display(HTML("<span style='font-size:2em;'>And the original recipe was:</span>"))
display(listcompo[0].formulationlist)
FSL encoded recipe is:
| | Component | Rate | minor |
|---|---|---|---|
0 | Light Rum | 0.633 | False |
1 | Lemon Juice | 0.365 | False |
2 | Passion Fruit Syrup | 0.001 | True |
3 | Lime Juice | 0.001 | True |
4 | Formulation | 1.000 | Non additive |
And the original recipe was:
| | Component | Rate | minor |
|---|---|---|---|
0 | Light Rum | 0.666 | 0 |
1 | Passion Fruit Syrup | 0.000 | 1 |
2 | Lemon Juice | 0.333 | 0 |
3 | Lime Juice | 0.000 | 1 |
4 | Formulation | 1.000 | Non additive |
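The residual gap between the rebuilt and original rates above comes from dose rounding. A small self-contained sketch of the trade-off (the `quantize` helper is hypothetical and merely mimics the dose rounding, it is not the library's code): the finer the granulo, the smaller the rounding error, but the longer the FSL strings, and hence the more recipes needed for training.

```python
def quantize(q, lo, hi, granulo):
    """Round a quantity to the nearest whole number of doses."""
    delta = (hi - lo) / granulo
    return lo + round((q - lo) / delta) * delta

# Rounding error for a 0.33 rate, at increasing granulo:
for granulo in (5, 10, 20):
    err = abs(quantize(0.33, 0.0, 1.0, granulo) - 0.33)
    print(f"granulo={granulo:2d}: rounding error {err:.2f}")
```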
Assessment of the FSL language in an autoencoder. In this version, we reload a pretrained neural network.
#!pip install tensorflow pandas textdistance
import pickle,gzip,sys
from rdmediationvaert import AE
import pandas as pd
cocktails,encodeur=pickle.load(gzip.open("cocktails.pklz"))
dataset=[]
for m in encodeur:
if len(m)>2:
dataset.append(m)
print(f"{len(dataset)} formulae for training")
model=AE(name='cocktailsvae')
model.reload('cocktailsvae')
c=dataset[0]
print(f"FSL encoded formula : {c}")
print("Decoded formula:")
cocktails.decode([c])[0].formulationlist
FSL encoded formula : AAAAAAAABBBBαβ
Decoded formula:
| | Component | Rate | minor |
|---|---|---|---|
0 | Light Rum | 0.633 | False |
1 | Lemon Juice | 0.365 | False |
2 | Passion Fruit Syrup | 0.001 | True |
3 | Lime Juice | 0.001 | True |
4 | Formulation | 1.000 | Non additive |
latent=model.encode(c)
latent
array([[ 1.3890834 , -0.13870159, -0.00822407, -0.00487889, -0.46605322,
-0.79323816, 0.38904732, 0.3041486 , 0.11699133, 0.273327 ,
-0.09223687, 0.1689527 , 0.15887997, -0.02809681, -0.21979149,
1.4856585 , 2.5984235 , 0.10420097, -0.10993379, 0.44843948,
0.31948787, -0.09654102, 0.31869823, -0.6928068 , -0.618227 ,
-1.1512997 , -0.58362055, 0.09300974, 0.04692227, -0.29087883,
0.08301675, -0.15936494]], dtype=float32)
model.decode(latent)
'AAAAAAAABBBBαβ'
rebuilt=[model.decode(model.encode(formula)) for formula in dataset]
comparison=pd.DataFrame([[original,new] for original,new in zip(dataset,rebuilt)],
columns=["Formula","Rebuilt"])
comparison.head(20)
| | Formula | Rebuilt |
|---|---|---|
0 | AAAAAAAABBBBαβ | AAAAAAAABBBBαβ |
1 | AAAAAAAACCDγ | AAAAAAAACCDγ |
2 | AAAAAAAADDEE | AAAAAAAADDDE |
3 | FFFFFFFFFFFζηθι | FFFFFFFFFFFζηθι |
4 | GGGGGGHHHIIκλ | GGGGGGHHHIIκλ |
5 | AAAAAAAAAAAμν | AAAAAAAAAAAον |
6 | AAAAAAAJJJJJβξ | AAAAAAAJJJJJβξ |
7 | AAAAAAAAAAAοπ | AAAAAAAAAAAοπ |
8 | HHHHHIIIIIIρ | HHHHHIIIIIIρ |
9 | HHHHHHHHHHHστυφ | HHHHHHHHHHHστυφ |
10 | DDDDDKKKKKKχ | DDDDDKKKKKKχ |
11 | AAABBBBCCKKψ | AAABBBBCKKKψ |
12 | CCCKKKKLMω | CCCKKKKLMω |
13 | CCMMMMMNNN | CCMMMMMNNN |
14 | BBBBMMMMNN | BBBBMMMMNN |
15 | BBBBMMMNNNN | BBBBMMMNNNN |
16 | DDDMMMMMNNN | DDDMMMMMNNN |
17 | KKKMMMMMMMM | KKKMMMMMMMM |
18 | GGGGGGIIOOO | GGGGGGIIOOO |
19 | CCCMMMNNNN | CCCMMMNNNN |
from statistics import mean
import textdistance
train=mean([textdistance.sorensen(orig,new)
for orig,new in zip(dataset[:663],rebuilt[:663])])
test=mean([textdistance.sorensen(orig,new)
for orig,new in zip(dataset[663:],rebuilt[663:])])
print(f"Sørensen similarity for training set: {train*100:.2f} %")
print(f"Sørensen similarity for test set: {test*100:.2f} %")
Sørensen similarity for training set: 97.79 %
Sørensen similarity for test set: 97.95 %
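As a reminder of what that metric measures: the Sørensen–Dice coefficient compares the character multisets of two strings (a minimal sketch; `textdistance.sorensen` may differ in implementation details):

```python
from collections import Counter

def sorensen_dice(a, b):
    """Sørensen–Dice similarity: 2*|A ∩ B| / (|A| + |B|)
    on character multisets."""
    overlap = sum((Counter(a) & Counter(b)).values())
    return 2 * overlap / (len(a) + len(b))

# Recipe 2 in the comparison above differs by one letter (DDEE vs DDDE):
print(round(sorensen_dice("AAAAAAAADDEE", "AAAAAAAADDDE"), 4))  # 0.9167
```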
Select a Formula
c=dataset[2]
print(f"FSL encoded formula : {c}")
print("Decoded formula:")
cocktails.decode([c])[0].formulationlist
FSL encoded formula : AAAAAAAADDEE
Decoded formula:
| | Component | Rate | minor |
|---|---|---|---|
0 | Light Rum | 0.594 | False |
1 | Juice of a Lime | 0.206 | False |
2 | Powdered Sugar | 0.200 | False |
3 | Formulation | 1.000 | Non additive |
cc="EEEEE"
cocktails.decode([cc])[0].formulationlist
| | Component | Rate | minor |
|---|---|---|---|
0 | Powdered Sugar | 1.0 | False |
1 | Formulation | 1.0 | Non additive |
B_latent=model.encode(cc)
B_latent
array([[-6.45386398e-01, 4.85044532e-02, 8.29209983e-02,
3.41801457e-02, 7.71266997e-01, 5.36016107e-01,
2.08929375e-01, 7.18495250e-02, -3.53245795e-01,
1.99218929e-01, 4.12274413e-02, -8.70564654e-02,
1.17326975e-01, -2.18493879e-01, -2.59110242e-01,
-4.27905977e-01, -2.94935942e-01, -1.74721386e-02,
6.90681040e-02, -2.25325441e+00, -1.64082974e-01,
-7.02380240e-02, 4.02717918e-01, 6.12576544e-01,
-1.44361891e-03, 1.13856137e+00, 2.85031438e-01,
5.24719916e-02, -2.52416462e-01, 6.97316080e-02,
2.07967505e-01, -2.75261998e-02]], dtype=float32)
new=model.decode(latent-B_latent)
new=''.join(sorted(new))
new
'AAAAAAABBBBFαβ'
cocktails.decode([new])[0].formulationlist
| | Component | Rate | minor |
|---|---|---|---|
0 | Light Rum | 0.511 | False |
1 | Lemon Juice | 0.335 | False |
2 | Pineapple Juice | 0.153 | False |
3 | Passion Fruit Syrup | 0.001 | True |
4 | Lime Juice | 0.001 | True |
5 | Formulation | 1.000 | Non additive |
brandnew=model.generate()
cocktails.decode([brandnew])[0].formulationlist
| | Component | Rate | minor |
|---|---|---|---|
0 | Sweet Vermouth | 0.230 | False |
1 | Triple Sec | 0.124 | False |
2 | Powdered Sugar | 0.141 | False |
3 | Gin | 0.505 | False |
4 | Formulation | 1.000 | Non additive |
This published version has the following limitations:
- Ingredients are unordered: a development version is in progress to take a complete sequential manufacturing process into account.
- Cocktail generation by exploring the autoencoder's latent space has been successfully tested for cocktails, but it remains to be assessed in other contexts.
A MyBinder instance allows you to run this version:
MIT
2021/2022 https://www.rd-mediation.com
@misc{Brunet2021,
  author = {Brunet, L.E.},
  title = {Symbolic formulation: an encoder for formulations focused on deep autoencoders},
  year = {2021},
  doi = {10.17601/rdmediation.2021.2},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/l0d0v1c/SymbolicFormulation}},
  commit = {bd34a46e2581e7e73878d5826ca272c1231df0fa}
}