

SymbolicFormulation

FSL (Formulation Symbolic Language) is based on the "Formulate" library, available at https://github.com/l0d0v1c/formulate, and on the local module FSL.py

#!pip install "https://github.com/l0d0v1c/formulate/blob/main/dist/formulate-1.3-py3-none-any.whl?raw=true"
# uncomment the line above to install Formulate
from formulate.components import components
from FSL import formulationsymboliclanguage

Purpose

FSL is a language focused on formulation description and deep learning. A formulation is a list of ingredients and quantities. FSL transforms this recipe into a string inspired by the SMILES language used to represent molecules. These strings may be used, for instance, to train a deep autoencoder and generate new formulations from existing ones.

Encoding process

Ingredients can be either major or minor. Major components are the ones usually present in significant amounts; minor ones are usually additives used to modify the properties of the formulation, such as colouring or viscosity agents. Major ingredients are encoded with the Latin alphabet and minor ones with the Greek alphabet. To be included in FSL, each Formulate object must embed a minor <True|False> property.

Example

Consider air as oxygen and nitrogen major ingredients, plus a minor water additive:

c=components(physical={"∆Hf":True,"rho":None,"minor":None})
c.add("Water","H2O",{'∆Hf':-285.83,"rho":1.0,'minor':True})
c.add("Nitrogen","N2",{'∆Hf':0,"rho":0.01,'minor':False})
c.add("Oxygen","O2",{'∆Hf':0,"rho":0.01,'minor':False})
c.setrates({"Water":0.01,"Oxygen":0.19,'Nitrogen':0.8})
c.mixing()
Component Rate N O H ∆Hf rho minor
0 Water 0.01 0.0000 55.50800 111.01700 -15865.9700 1.0 1
1 Nitrogen 0.80 71.3940 0.00000 0.00000 0.0000 0.01 0
2 Oxygen 0.19 0.0000 62.50200 0.00000 0.0000 0.01 0
3 Formulation 1.00 57.1152 12.43046 1.11017 -158.6597 Non additive Non additive

We can now encode the air formulation as

from IPython.display import display, HTML
f=formulationsymboliclanguage([c])
e=f.encode([c])
display(HTML(f"<span style='font-size:3em'>{e[0]}</span>"))

ABα

The dictionary of ingredients is

f.dict
{'Water': 'α', 'Nitrogen': 'A', 'Oxygen': 'B'}

Formulation list with several quantities

To train an autoencoder we need a list of formulations having the same ingredients at varying quantities. During FSL initialisation you can define a "dose". In formulation recipes, the quantity of each component is often given in units (oz, parts, ...). FSL uses the same representation:

formulationsymboliclanguage(formulae,granulo=5)

means that for each ingredient the delta between the maximum and the minimum quantity is split into 5 doses. So CCCD means 3 doses of C and one dose of D. Minor components are represented by a single letter.
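
To make the dose idea concrete, here is a minimal sketch of the quantisation (a hypothetical illustration, not FSL's internal implementation):

# Hypothetical sketch of the dose encoding (not the actual FSL code):
# an ingredient's quantity range [min, max] over the recipe book is split
# into `granulo` doses, and its letter is repeated once per dose.
def to_doses(quantity, q_min, q_max, granulo=5):
    dose=(q_max-q_min)/granulo          # size of one dose
    return max(1, round((quantity-q_min)/dose))

# e.g. if ingredients "C" and "D" both range from 0 to 5 over the recipe book,
# quantities of 3 and 1 give "CCC" + "D" -> "CCCD"
print("C"*to_doses(3, 0, 5) + "D"*to_doses(1, 0, 5))  # CCCD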

Let's try encoding a recipe book of cocktails.

import pandas as pd
df=pd.read_excel("cocktails.xlsx")
df.head()
Unnamed: 0 nom categ i1 d1 i2 d2 i3 d3 i4 d4 i5 d5 i6 d6
0 0 Gauguin Cocktail Classics Light Rum 2.0 Passion Fruit Syrup 1.0 Lemon Juice 1.00 Lime Juice 1.00 NaN NaN NaN NaN
1 1 Fort Lauderdale Cocktail Classics Light Rum 1.5 Sweet Vermouth 0.5 Juice of Orange 0.25 Juice of a Lime 0.25 NaN NaN NaN NaN
2 2 Apple Pie Cordials and Liqueurs Apple schnapps 3.0 Cinnamon schnapps 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
3 3 Cuban Cocktail No. 1 Cocktail Classics Juice of a Lime 0.5 Powdered Sugar 0.5 Light Rum 2.00 NaN NaN NaN NaN NaN NaN
4 4 Cool Carlos Cocktail Classics Dark rum 1.5 Cranberry Juice 2.0 Pineapple Juice 2.00 Orange curacao 1.00 Sour Mix 1.0 NaN NaN

Now we have to transform this sheet into a list of formulations. As many ingredients are only used a few times, they are not usable for deep learning training. So we limit the major ingredient list to the ones used in more than 30 recipes. The rare ingredients are represented as minors.

from collections import Counter

# Count how many recipes use each ingredient (columns i1..i6)
ingredients=[]
for i in range(1,7):
    for j in df[f"i{i}"].tolist():
        ingredients.append(j)
ingredients=Counter(ingredients)

# Keep as major ingredients only those used in more than 30 recipes
composant={}
for name,cnt in ingredients.items():
    if cnt>30:
        composant[name]={'minor':False}
print(f"based on {len(composant)} ingredients")

# Build one Formulate components object per recipe
listcompo=[]
for i,j in df.iterrows():
    try:
        cp=components(physical={"minor":None})
        rates={}
        for k in range(1,7):
            if j[f"d{k}"]==j[f"d{k}"] and j[f"i{k}"]==j[f"i{k}"]: # skip NaN cells
                name=j[f"i{k}"]
                if name in composant:
                    # frequent ingredient: keep its real quantity as a major
                    rate=j[f"d{k}"]
                    cp.add(name,"",{'minor':False})
                    rates[name]=rate
                else:
                    # rare ingredient: represent it as a minor with a token quantity
                    cp.add(name,"",{'minor':True})
                    rates[name]=0.001
        cp.setrates(rates)
        cp.mixing()
    except Exception:
        pass
    listcompo.append(cp)
based on 23 ingredients

For instance, we can inspect the first cocktail.

listcompo[0].formulationlist
Component Rate minor
0 Light Rum 0.666 0
1 Passion Fruit Syrup 0.000 1
2 Lemon Juice 0.333 0
3 Lime Juice 0.000 1
4 Formulation 1.000 Non additive

Then encode the full recipe book.

cocktails=formulationsymboliclanguage(listcompo,granulo=10,verbose=False)

As the number of minor ingredients is limited to the length of the Greek alphabet, some of them are not encoded. It is possible to use longer alphabets by changing the class lists:

formulationsymboliclanguage.major=list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
formulationsymboliclanguage.minor=list('αβγδεζηθικλμνξοπρστυφχψω')

So we can now get an encoded training set. You may display the unencoded ingredients by specifying verbose=True.

encoded=cocktails.encode(listcompo)

The first cocktail is encoded as

display(HTML(f"<div style='font-size:3em;'>Encoded recipe 0 : {encoded[0]}</div>"))
Encoded recipe 0 : AAAAAAAABBBBαβ

If you check what A means:

name={j:i for i,j in cocktails.dict.items()}['A']
print(f"Ingredient: {name}")
print(f"Minimum in recipes : {cocktails.min[name]}, maximum: {cocktails.max[name]}")
print(f"One dose of {name} is {cocktails.delta[name]}")
Ingredient:  Light Rum
Minimum in recipes : 0.04, maximum: 1.0
One dose of  Light Rum is 0.096
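
The dose size reported above is consistent with (max − min) / granulo (granulo was 10 here); a quick check under that assumption:

# Assumption: one dose = (max - min) / granulo; with granulo=10 this matches
# the value reported by FSL above (0.096)
print((cocktails.max[name] - cocktails.min[name]) / 10)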

Encoding is a balance between accuracy (since quantities are encoded as a number of doses) and the number of available recipes: long encoded FSL strings give good accuracy but require a lot of recipes to train a deep autoencoder. For instance, let's decode the encoded recipe.

display(HTML("<span style='font-size:2em;'>FSL encoded recipe is:</span>"))
display(cocktails.decode([encoded[0]])[0].formulationlist)
display(HTML("<span style='font-size:2em;'>And the original recipe was:</span>"))
display(listcompo[0].formulationlist)

FSL encoded recipe is:

Component Rate minor
0 Light Rum 0.633 False
1 Lemon Juice 0.365 False
2 Passion Fruit Syrup 0.001 True
3 Lime Juice 0.001 True
4 Formulation 1.000 Non additive

And the original recipe was:

Component Rate minor
0 Light Rum 0.666 0
1 Passion Fruit Syrup 0.000 1
2 Lemon Juice 0.333 0
3 Lime Juice 0.000 1
4 Formulation 1.000 Non additive

Assessment in a variational autoencoder

We now assess the FSL language in an autoencoder. In this version, we reload a pretrained neural network.

Reloading the pretrained neural network

#!pip install tensorflow pandas textdistance
import pickle,gzip,sys
from rdmediationvaert import AE
import pandas as pd
cocktails,encodeur=pickle.load(gzip.open("cocktails.pklz"))
dataset=[]
for m in encodeur:
    if len(m)>2:
        dataset.append(m)
print(f"{len(dataset)} formulae for training")
model=AE(name='cocktailsvae')
model.reload('cocktailsvae')

Load a formula

c=dataset[0]
print(f"FSL encoded formula : {c}")
print("Decoded formula:")
cocktails.decode([c])[0].formulationlist
FSL encoded formula : AAAAAAAABBBBαβ
Decoded formula:
Component Rate minor
0 Light Rum 0.633 False
1 Lemon Juice 0.365 False
2 Passion Fruit Syrup 0.001 True
3 Lime Juice 0.001 True
4 Formulation 1.000 Non additive

Find it in the latent space

latent=model.encode(c)
latent
array([[ 1.3890834 , -0.13870159, -0.00822407, -0.00487889, -0.46605322,
        -0.79323816,  0.38904732,  0.3041486 ,  0.11699133,  0.273327  ,
        -0.09223687,  0.1689527 ,  0.15887997, -0.02809681, -0.21979149,
         1.4856585 ,  2.5984235 ,  0.10420097, -0.10993379,  0.44843948,
         0.31948787, -0.09654102,  0.31869823, -0.6928068 , -0.618227  ,
        -1.1512997 , -0.58362055,  0.09300974,  0.04692227, -0.29087883,
         0.08301675, -0.15936494]], dtype=float32)
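
The vector above shows that the pretrained VAE maps each FSL string to 32 latent dimensions:

latent.shape  # (1, 32)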

Rebuild it

model.decode(latent)
'AAAAAAAABBBBαβ'

Assess performance

rebuilt=[model.decode(model.encode(formula)) for formula in dataset]

comparison=pd.DataFrame([[original,new] for original,new in zip(dataset,rebuilt)],
                       columns=["Formula","Rebuilt"])
comparison.head(20)
Formula Rebuilt
0 AAAAAAAABBBBαβ AAAAAAAABBBBαβ
1 AAAAAAAACCDγ AAAAAAAACCDγ
2 AAAAAAAADDEE AAAAAAAADDDE
3 FFFFFFFFFFFζηθι FFFFFFFFFFFζηθι
4 GGGGGGHHHIIκλ GGGGGGHHHIIκλ
5 AAAAAAAAAAAμν AAAAAAAAAAAον
6 AAAAAAAJJJJJβξ AAAAAAAJJJJJβξ
7 AAAAAAAAAAAοπ AAAAAAAAAAAοπ
8 HHHHHIIIIIIρ HHHHHIIIIIIρ
9 HHHHHHHHHHHστυφ HHHHHHHHHHHστυφ
10 DDDDDKKKKKKχ DDDDDKKKKKKχ
11 AAABBBBCCKKψ AAABBBBCKKKψ
12 CCCKKKKLMω CCCKKKKLMω
13 CCMMMMMNNN CCMMMMMNNN
14 BBBBMMMMNN BBBBMMMMNN
15 BBBBMMMNNNN BBBBMMMNNNN
16 DDDMMMMMNNN DDDMMMMMNNN
17 KKKMMMMMMMM KKKMMMMMMMM
18 GGGGGGIIOOO GGGGGGIIOOO
19 CCCMMMNNNN CCCMMMNNNN

Sørensen text distance

from statistics import mean 
import textdistance
train=mean([textdistance.sorensen(orig,new) 
            for orig,new in zip(dataset[:663],rebuilt[:663])])
test=mean([textdistance.sorensen(orig,new) 
            for orig,new in zip(dataset[663:],rebuilt[663:])])
print(f"Sørensen similarity for training set: {train*100:.2f} %")
print(f"Sørensen similarity for test set: {test*100:.2f} %")
Sørensen similarity for training set: 97.79 %
Sørensen similarity for test set: 97.95 %
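
For reference, the Sørensen (Dice) similarity used above compares the two strings as multisets of characters; a minimal illustrative implementation (textdistance's own version may differ in edge-case handling):

from collections import Counter

def dice_similarity(a,b):
    # Sørensen–Dice coefficient on character multisets: 2·|A ∩ B| / (|A| + |B|)
    overlap=sum((Counter(a) & Counter(b)).values())
    return 2*overlap/(len(a)+len(b))

# Row 2 of the comparison table above: one D was decoded instead of an E
print(dice_similarity("AAAAAAAADDEE","AAAAAAAADDDE"))  # ≈ 0.92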

Examples of use

Ingredient replacement

Select a formula

c=dataset[2]
print(f"FSL encoded formula : {c}")
print("Decoded formula:")
cocktails.decode([c])[0].formulationlist
FSL encoded formula : AAAAAAAADDEE
Decoded formula:
Component Rate minor
0 Light Rum 0.594 False
1 Juice of a Lime 0.206 False
2 Powdered Sugar 0.200 False
3 Formulation 1.000 Non additive

Find an ingredient in the latent space

cc="EEEEE"
cocktails.decode([cc])[0].formulationlist
Component Rate minor
0 Powdered Sugar 1.0 False
1 Formulation 1.0 Non additive
B_latent=model.encode(cc)
B_latent
array([[-6.45386398e-01,  4.85044532e-02,  8.29209983e-02,
         3.41801457e-02,  7.71266997e-01,  5.36016107e-01,
         2.08929375e-01,  7.18495250e-02, -3.53245795e-01,
         1.99218929e-01,  4.12274413e-02, -8.70564654e-02,
         1.17326975e-01, -2.18493879e-01, -2.59110242e-01,
        -4.27905977e-01, -2.94935942e-01, -1.74721386e-02,
         6.90681040e-02, -2.25325441e+00, -1.64082974e-01,
        -7.02380240e-02,  4.02717918e-01,  6.12576544e-01,
        -1.44361891e-03,  1.13856137e+00,  2.85031438e-01,
         5.24719916e-02, -2.52416462e-01,  6.97316080e-02,
         2.07967505e-01, -2.75261998e-02]], dtype=float32)

Remove the ingredient and brew a new cocktail

new=model.decode(latent-B_latent)
new=''.join(sorted(new))
new
'AAAAAAABBBBFαβ'
cocktails.decode([new])[0].formulationlist
Component Rate minor
0 Light Rum 0.511 False
1 Lemon Juice 0.335 False
2 Pineapple Juice 0.153 False
3 Passion Fruit Syrup 0.001 True
4 Lime Juice 0.001 True
5 Formulation 1.000 Non additive
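
The same latent arithmetic can be wrapped in a small helper (a sketch reusing the model.encode / model.decode calls shown above; remove_ingredient is a hypothetical name, not part of the library):

def remove_ingredient(model,fsl_formula,fsl_ingredient):
    # Subtract the ingredient's latent vector from the formula's latent vector,
    # then decode and sort the resulting FSL string
    new_latent=model.encode(fsl_formula)-model.encode(fsl_ingredient)
    return ''.join(sorted(model.decode(new_latent)))

# e.g. remove_ingredient(model, dataset[0], "EEEEE")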

Create a new cocktail

Sample a random latent space vector

brandnew=model.generate()
cocktails.decode([brandnew])[0].formulationlist
Component Rate minor
0 Sweet Vermouth 0.230 False
1 Triple Sec 0.124 False
2 Powdered Sugar 0.141 False
3 Gin 0.505 False
4 Formulation 1.000 Non additive

Limits of the current version

This published version is limited to:

  • unordered ingredients: a development version is in progress to take a complete sequential manufacturing process into account
  • generation by exploration of the autoencoder's latent space: it has been successfully tested on cocktails but still has to be assessed in other contexts

Running a test

A MyBinder instance allows you to run this version:

Binder

Licence

MIT

2021/2022 https://www.rd-mediation.com

Cite

@misc{Brunet2021,
  author = {Brunet, L.E.},
  title = {Symbolic formulation: an encoder for formulations focused on deep autoencoders},
  year = {2021},
  doi = {10.17601/rdmediation.2021.2},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/l0d0v1c/SymbolicFormulation}},
  commit = {bd34a46e2581e7e73878d5826ca272c1231df0fa}
}