forked from wkew/FTMSVisualization
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path1-FormulaAssignment.py
423 lines (360 loc) · 20.2 KB
/
1-FormulaAssignment.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
# -*- coding: utf-8 -*-
"""
Created on Sat Sep 12 14:58:25 2015
@author: Will Kew
will.kew@gmail.com
Copyright Will Kew, 2016
This file is part of FTMS Visualisation (also known as i-van Krevelen).
FTMS Visualisation is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
FTMS Visualisation is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with FTMS Visualisation. If not, see <http://www.gnu.org/licenses/>.
This code will automatically assign an input peak list to chemical formula.
It is designed for FTMS data.
The tool is based on the Kendrick mass defect and z* approach to formulae assignment.
See: Hsu et al. Analytica Chimica Acta, 264 (1992) 79-89
The tool reads in pre-compiled lists of possible chemical IONIC formulae (generated by 0-FormulaGenerator.py) based on the user defined ionisation mode.
That means the assigned formulae may be IONIC formulae, and this needs proper review.
The peak list is read in, and Kendrick mass properties calculated for CH2 unit.
Z stars are calculated, and homologous series with a minimum (user defined) number of members are grouped and checked against the formulae lists.
User definied errors - both relative (ppm) and absolute (Da/Th) - are used to limit false positives.
Unassigned peaks are then checked against the OH2, and finally H2 Kendrick mass units. NB: More mass units could readily be added.
Remaining unassigned peaks are then checked against the assigned peaks for being isotopologues.
i.e. For unassigned peak X m/z , is there a peak at X - 1.003355 m/z? If so, X is likely the 13C isotopologue.
This method could be expanded in a future version for more isotope peaks. Or a fuller re-write would check for isotope peaks on the initial assignment to improve accuracy.
This tool has been tested against manually assigned peaklists and PetroOrg S10.2 (The Florida State University) assigned peak lists, and
for the small test set, had similar performance levels. A rigorous benchmark of this tool has not been undertaken, and, therefore,
USE OF THIS TOOL IS AT THE USERS OWN RISK; the authors cannot be held responsible for incorrectly assigned formulae.
This tool is offered up freely, with the preceding copyright notice, for educational benefit to the user,
and will require strict evaluation and potential modification by any user for their own purposes.
################
Changelog below
#####
VERSION 0.0.4!
This version no longer uses a formula calculator, but instead consults pre-built libraries of compounds.
It may be faster to iterate through these than to loop through a formula calculator?
Version 0.0.5 sees major changes:
Switch to Pandas DataFrame instead of numpy array for the storing of assignments etc (more flexible structure)
Resolve the "which homologous series is correct" issue
Also cleaning up code by removing unnecessary sections (for now - see previous versions to get them back)
-Version 0.2 - Switched to Python3 and rewrote significant chunks, now uses Pandas extensively.
Also cut out lots of unused functions (see v0.1 for them)
-Version 0.3 - Isotopes and Other Kendrick Mass classes?
- We now calculate 13C isotopologues. Error is not calculated. Other isotopes can be added if need be.
"""
#Here we import our functions.
#import numpy as np
import pandas as pd
from datetime import datetime
from collections import Counter
from itertools import dropwhile
import sys, re, os
"""
# We import also the FTMSVizProcessingModule which contains a few useful functions.
# here we define where the scripts are stored.
# Make sure to change this to where you have saved these scripts.
"""
try: #test if running in ipython
__IPYTHON__
except NameError: #if not running in ipython....
import FTMSVizProcessingModule as FTPM
path = os.getcwd()+"\\data\\" #example data location
dictionarypath = os.getcwd()+"\\FormulaDictionaries\\"
else: #if running in ipython
#OfficeDesktopPath
scriptlocation = "F:\\Will\\Dropbox\\Documents\\University\\Edinburgh/Coding/Python3/FTMS/FTMSVisToolkit/Scripts/"
sys.path.append(scriptlocation)
import FTMSVizProcessingModule as FTPM
#OfficeDesktopPath
path = "F:/Will/Dropbox/Documents/University/Edinburgh/Coding/Python3/FTMS/FTMSVisToolkit/data/"
dictionarypath= "F:/Will/Dropbox/Documents/University/Edinburgh/Coding/Python3/FTMS/FTMSVisToolkit/FormulaDictionaries/"
#This section checks what ionisation mode you wish to generate a dictionary for. ### Future versions, move this code to the ProcessingModule? -wk
ionisationmode = input("Is your data positive or negative mode? ")
while ionisationmode.lower() != "negative" and ionisationmode.lower() != "positive":
print("Please enter either negative or positive")
ionisationmode = input("Is your data positive or negative mode? ")
else:
if ionisationmode.lower() == "negative":
ionisationmode = "negative"
elif ionisationmode.lower() == "positive":
ionisationmode = "positive"
startTime = datetime.now()
#Define some parameters for our assignment thresholds.
maxgap = 0.0003 #gap between KMDs for separating series. Units are Da, hence the small numbers. This value will be a compromise, and depend on complexity of your dataset.
# Looking at Z* and KMD method compared to other methods it seems an appropriate range could be 0.0002 or even 0.00002. Look at closer in a future version
threshold = 0.005 #Error threshold for formulae in Da - i.e. absolute error threshold.
threshppm = 1.0 # Error threshold for formulae in ppm - i.e. relative error threshold.
#isothresh = 0.0001 # threshold for isotope peak checker #Function not currently active
highlmt = 800 #highest mass to import from peaklist, in case your peak list extends higher than you have generated formulae for.
minKMDseries = 1 #minimum number of peaks in a single homologous series to assign by zstar approach # 3 seems a good number.
precisionfactor = 1000#0#0000 # This is a work in progress. THis is used for isotope hunting. A value of 1000 = 1.003355*1000, is equal to a 1 mDa error threshold, at 500 m/z this is 0.2 ppm.
CH2 = (14.0, 14.01565) #kendrick nominal mass, kendrick exact mass
OH2 = (18.0, 18.010565) #kendrick nominal mass, kendrick exact mass
CO2 = (44.0, 43.989183) #as above
H2 = (2.0, 2.01565) #as above
Oseries = (16.0, 15.994915) #as above
## Future version - define a function to calculate these on the fly? - wk
#This section loads up our formulae lists, here known as dictionaries (however they are not pythonic dictionaries)
#We only need load them up once, so the try statement checks if we have loaded yet. Reading in can take 5-10 seconds
#Double check you have made your correct formulae list prior to start of this function!
try:
dict100
except NameError:
#dictnames = ["mass","abundance","c","h","o","n","s","na","k,","homo","homoval"]
dict100 = pd.read_csv(dictionarypath+ionisationmode[:3]+"\\dict100.csv")#,header=None,names=dictnames)
dict200 = pd.read_csv(dictionarypath+ionisationmode[:3]+"\\dict200.csv")#header=None,names=dictnames)
dict300 = pd.read_csv(dictionarypath+ionisationmode[:3]+"\\dict300.csv")#,header=None,names=dictnames)
dict400 = pd.read_csv(dictionarypath+ionisationmode[:3]+"\\dict400.csv")#,header=None,names=dictnames)
dict500 = pd.read_csv(dictionarypath+ionisationmode[:3]+"\\dict500.csv")#,header=None,names=dictnames)
dict600 = pd.read_csv(dictionarypath+ionisationmode[:3]+"\\dict600.csv")#,header=None,names=dictnames)
dict700 = pd.read_csv(dictionarypath+ionisationmode[:3]+"\\dict700.csv")#,header=None,names=dictnames)
# Timing function.
def timeprint(timetext):
timenow = datetime.now() - startTime
print(str(timenow) + timetext)
# This strips out KMDs with < minKMDseries entries.
# Test function tests if the value (KMD) is in our list of approved values (okvals) ## rename this function and variables in a future version? - wk
def testfunc(i,okvals):
if i not in okvals:
a = 1
else:
a = 0
return a
#This does the stripping. Returns our dataframe minus rows with KMDs we don't want to use.
def StripMinClass(x, maxgap): # this removes from each Z* values, based on their KMD, which don't belong to a large enough group (as defined by maxgap)
x.sort_values(by="KMD",inplace=True)
KMDint = x["KMD"].values.tolist()
KMDint = [int((i*(1/maxgap))) for i in KMDint]
KMDict = Counter(KMDint)
#this checks the frequency of occurency for KMD values. We can make a list of KMD values which occur more then our "minKMDseries" and delete the rest.
for key, count in dropwhile(lambda key_count: key_count[1] >= minKMDseries, KMDict.most_common()):
del KMDict[key]
x["KMDint"] = KMDint
okvals = list(KMDict.keys())
x["test"] = x.apply(lambda row: testfunc(row["KMDint"],okvals),axis=1)
x = x[x["test"] == 0]
x = x.drop("test",axis=1)
return x
#Simple multiplication function, this is where our precision factor comes in and helps with maths/precision
def multiplyprecision(inp):
return int(inp * precisionfactor)
#This begins the formula assignment section of the program
# Functions for Formulator
#this determines which homologous series/general formula the formula provided fits into, returning a number for simplicity
## This function is in file 0 - perhaps move to the ProcessingModule? -- wk
def homochecker(o,n,s,na):
homo = str(int(o)) + str(int(n)) + str(int(s))+ str(int(na))
homoval = o + n + s + na
return homo, homoval
#This is the body of the script
#First each given peak has the relevant "dictionary" loaded
#Then each formulae in the pre-compiled list is cross referenced. Formulae within error bounds (absolute) are then filtered against (relative) errors.
#If the errors are acceptable, the formulae is decided to be a hit, and the entry added to the assigned peak list.
#There is no redundnacy for assigning the same peak to multiple formulae in this version.
# This is not forseen to be a problem in high res, CHO spectra. IT may be an issue when more heteroatoms are considered and/or broader error ranges used.
def form_checker(low,high,mass,threshppm,intensity):
allposs = []
i=0
if 100.0 <= mass <= 700.0:
if 100 <= mass < 200:
dicta = dict100
elif 200.0 <= mass < 300.0:
dicta = dict200
elif 300.0 <= mass < 400.0:
dicta = dict300
elif 400.0 <= mass < 500.0:
dicta = dict400
elif 500.0 <= mass < 600.0:
dicta = dict500
elif 600.0 <= mass < 700.0:
dicta = dict600
elif 700.0 <= mass < 800.0:
dicta = dict700
for x in dicta.itertuples():
if low <= x[1] <=high:
error = ((mass - x[1])/x[1])*1000000
if abs(error) <= threshppm:
allposs.append(list(x[1:9]))
allposs[i].append(error)
allposs[i].append(intensity)
allposs[i].append(mass)
formulatemp = FTPM.formulator(int(x[3]),int(x[4]),int(x[6]),int(x[5]),int(x[7]),int(x[8]),int(x[9]),int(x[10]),ionisationmode)
dbe = FTPM.DBEcalc(int(x[3]),int(x[4]),int(x[6]),ionisationmode)
allposs[i].append(dbe)
allposs[i].append(formulatemp)
i = i +1
return allposs
# Calculates the kendrick mass, nominal kendrick mass, kendrick mass defect, and Z star.
def calcprop(xmass,kendrickseries): #Should now work with any kendrick mass series - CH2, OH2, etc.
kmass = xmass * (kendrickseries[0]/kendrickseries[1])
if kmass % 2 == 0:
nmass = int(kmass)
else:
nmass = int(kmass)+1
kmd = nmass - kmass
zstar = ((nmass % int(kendrickseries[0])) - int(kendrickseries[0]))*-1
return kmass,nmass,kmd,zstar
#This calculates the kendrick mass properties for a given peak. It calls the earlier kendrick mass function.
def kmdpart(output,peaksfloat,kendrickseries):
#This bit calculates the kendrick mass, Nominal mass, kmd, and z* for each input peak
KMD = [] #Kendrick mass defect
NKM = [] #Nominal kendrick mass
KM = [] #Kendrick Mass
Zstar = [] #Z*
i=0
for p in peaksfloat: #We have to work with our peaks as floats here
kmass,nmass,kmd,zstar = calcprop(p,kendrickseries)
KMD.append(kmd)
NKM.append(nmass)
KM.append(kmass)
Zstar.append(zstar)
i = i + 1
output["KM"]=KM
output["NKM"]=NKM
output["KMD"]=KMD
output["Z*"]=Zstar
timeprint(" Time to calculate the kendrick properties for " +str(kendrickseries[0]) +" = "+ str(kendrickseries[1]))
#This bit splits the output into the z*s. The zs are a list of lists. This allows n-z* groups, and thus any kendrick mass unit should work.
output.sort_values(by="Z*",inplace=True)
z = output.groupby('Z*')
zs = []
for x in z.groups.keys():
y = z.get_group((x))
y = StripMinClass(y.copy(),maxgap)
zs.append(y)
return zs #list of lists
#This section calls together a few functions as we process a given list of z-stars. It is called for each kendrick mass unit we are using.
def assigningpart(zs):
assigned = []
for z in zs:
for y in z.itertuples():
formulae=[]
mass = y[1]
intensity = y[2]
low = mass - threshold #error threshold (absolute)
high = mass + threshold
allposs = form_checker(low,high,mass,threshppm,intensity)
formulae.append(allposs) #Allposs is a list comprising the elements from the dictionary for matching formula - i.e. exact mass, kmd, C, H, O, N, S, Na, homo, homosum.
assigned.append(formulae)
new = [item for it in assigned for item in it] #variables may need to be cleaned up - wk
assigned = [item for it in new for item in it] #variables may need to be cleaned up - wk
#this set of outheaders is designed to fit the other scripts which were written prior to this. As such, certain columns may seem redunandt
outheaders = ["Theor. Mass","Isotopic Abundance","C","H","O","N","S","Na","Error","Rel. Abundance","Exp. m/z","DBE","Formula"]
assignedDF = pd.DataFrame(assigned,columns=outheaders)
newcols = ["Exp. m/z","Theor. Mass","Error","Rel. Abundance","DBE","C","H","N","O","S","Formula"]#,"Nano","Isotopic Abundance"]
if ionisationmode =="negative":
assignedDF["Hno"] = assignedDF["H"].add(1)
assignedDF = assignedDF[newcols]
assignedDF.sort_values(by="Exp. m/z",inplace=True)
formulae = assignedDF["Formula"]
heteroclasses = []
for x in formulae:
split = re.split('([H][\d]+)',x)
heteroclasses.append(split[-1])
assignedDF["HeteroClass"] = heteroclasses
return assignedDF
def isotopechecker(unassignedDF,assignedDF):
#this section with isotopes could be significant expanded and refined.
c13 =1.003355 #exact mass difference of replacing a 12C atom with a 13C atom.
c13 = int(c13*precisionfactor) #How precise are we
isotopeheaders = ["Exp. m/z","Recal m/z","Theor. Mass","Error","Rel. Abundance","Signal2Noise","DBE","C","H","N","O","S","13C","18O","Formula","HeteroClass"]
a= 0
newlist = []
for x in unassignedDF.itertuples():
Xmass = int(x[1]*precisionfactor)
for y in assignedDF.itertuples():
Ymass = int(y[1]*precisionfactor)
if Xmass-Ymass == c13:
formula = FTPM.isotopeformulator(y[6]-1,y[7],y[8],y[9],y[10],0,0,ionisationmode,1,0)
newlist.append([x[1],x[1],x[1],0,x[2],0,y[5],y[6]-1,y[7],y[8],y[9],y[10],1,0,formula,y[12]])
a= a+1
#isotopologues = pd.DataFrame(data=newlist,columns=isotopeheaders)
#print("There were " + str(a) + " isotopologues identified")
c13 =1.003355 #exact mass difference of replacing a 12C atom with a 13C atom.
c13 = int(2*c13*precisionfactor) #How precise are we
#isotopeheaders = ["Exp. m/z","Recal m/z","Theor. Mass","Error","Rel. Abundance","Signal2Noise","DBE","Cno","Hno","Nno","Ono","Sno","13Cno","Formula","HeteroClass"]
#a= 0
#newlist = []
for x in unassignedDF.itertuples():
Xmass = int(x[1]*precisionfactor)
for y in assignedDF.itertuples():
Ymass = int(y[1]*precisionfactor)
if Xmass-Ymass == c13:
formula = FTPM.isotopeformulator(y[6]-1,y[7],y[8],y[9],y[10],0,0,ionisationmode,2,0)
newlist.append([x[1],x[1],x[1],0,x[2],0,y[5],y[6]-1,y[7],y[8],y[9],y[10],2,0,formula,y[12]])
a= a+1
#isotopologues = pd.DataFrame(data=newlist,columns=isotopeheaders)
#print("There were " + str(a) + " isotopologues identified")
#this section with isotopes could be significant expanded and refined.
o18 =2.004244 #exact mass difference of replacing a 16O atom with a 18O atom.
o18 = int(o18*precisionfactor) #How precise are we
#isotopeheaders = ["Exp. m/z","Recal m/z","Theor. Mass","Error","Rel. Abundance","Signal2Noise","DBE","Cno","Hno","Nno","Ono","Sno","13Cno","Formula","HeteroClass"]
#a= 0
#newlist = []
for x in unassignedDF.itertuples():
Xmass = int(x[1]*precisionfactor)
for y in assignedDF.itertuples():
Ymass = int(y[1]*precisionfactor)
if Xmass-Ymass == o18:
formula = FTPM.isotopeformulator(y[6]-1,y[7],y[8],y[9],y[10],0,0,ionisationmode,0,1)
newlist.append([x[1],x[1],x[1],0,x[2],0,y[5],y[6]-1,y[7],y[8],y[9],y[10],0,1,formula,y[12]])
a= a+1
isotopologues = pd.DataFrame(data=newlist,columns=isotopeheaders)
print("There were " + str(a) + " isotopologues identified")
return isotopologues
#This main body of code performs the script for us. It will read in the data, and assign peaks.
def mainbody(filename):
data = pd.read_csv(filename,delimiter='\t')
data.dropna(inplace=True,axis=1)
#headers = ["m/z","I","Res.","KM","NKM","KMD","Z*"] #keeps track of what your columns mean
#npeaks = len(data) #calculates number of peaks
#nheaders = len(headers)
output = pd.DataFrame(data[data["m/z"]<highlmt])
peaksfloat = output["m/z"].tolist()
#output["m/zint"] = output["m/z"].apply(multiplyprecision)
#peaks = output["m/zint"].tolist()
timeprint(" Time to read in and sort the data")
zs = kmdpart(output,peaksfloat,CH2)
#zs = kmdpart(output,peaksfloat,CO2) #this build based on CO2 series - doesn't seem to get any hits?
#zs = kmdpart(output,peaksfloat,H2) #this build based on H2 series - Lots of hits/too many?
#This ends the reading and sorting of data
assignedDF = assigningpart(zs)
unassignedDF = data[~data["m/z"].isin(assignedDF["Exp. m/z"])].dropna()
output = pd.DataFrame(unassignedDF[unassignedDF["m/z"]<highlmt])
peaksfloat = output["m/z"].tolist()
zs = kmdpart(output,peaksfloat,OH2)
assignedDF = pd.concat((assignedDF,assigningpart(zs)), ignore_index=True)
unassignedDF = data[~data["m/z"].isin(assignedDF["Exp. m/z"])].dropna()
output = pd.DataFrame(unassignedDF[unassignedDF["m/z"]<highlmt])
peaksfloat = output["m/z"].tolist()
zs = kmdpart(output,peaksfloat,H2)
assignedDF = pd.concat((assignedDF,assigningpart(zs)), ignore_index=True)
assignedDF = assignedDF.rename(columns={"Rel. Abundance":"Abundance"})
assignedhitcount = str(len(assignedDF))
print("There were " +assignedhitcount +" monoisotopic formulae assigned")
isotopologues = isotopechecker(unassignedDF,assignedDF)
isotopologues = isotopologues.drop("Recal m/z",axis=1)
isotopologues = isotopologues.rename(columns={"Rel. Abundance":"Abundance"})
unassignedDF = data[~data["m/z"].isin(assignedDF["Exp. m/z"])].dropna()
unassignedDF = unassignedDF[~unassignedDF["m/z"].isin(isotopologues["Exp. m/z"])].dropna()
unassignedDF = unassignedDF.rename(columns={'m/z': 'Exp. m/z', 'I': 'Abundance'})
timeprint(" Time to complete")
return assignedDF, isotopologues, unassignedDF
def godo(fileloc,filen):
assignedDF,isotopologues, unassignedDF = mainbody(fileloc)
FTPM.make_sure_path_exists(path +"/OutputCSV/") #this function checks the output directory exists; if it doesnt, it creates it.
assignedDF.to_csv(path+"/OutputCSV/"+filen[:-4]+"-hits.csv")
isotopologues.to_csv(path+"/OutputCSV/"+filen[:-4]+"-isohits.csv")
unassignedDF.to_csv(path+"/OutputCSV/"+filen[:-4]+"-nohits.csv")
filesA = os.listdir(path+"InputPeaklist/")
for i in filesA:
if i[-4:] == ".txt":
fileloc = path +"InputPeaklist/" + i
godo(fileloc,i)
print("EOF")