Machine learning project module #746

Draft
wants to merge 80 commits into
base: master
Commits
a6b1639
correct Within calculation
Nov 18, 2021
fa4da8b
update unit tests
Nov 18, 2021
3246567
conflicts resolved back to upstream
Feb 4, 2022
a018d4d
Merge remote-tracking branch 'upstream/master'
Feb 15, 2022
15a37d0
Merge remote-tracking branch 'upstream/master'
Feb 17, 2022
892fa45
this is the spot
Feb 18, 2022
211013c
Merge remote-tracking branch 'upstream/master'
Feb 25, 2022
68104ee
Merge branch 'master' of https://github.com/trishorts/mzLib
trishorts Mar 9, 2022
d715a08
Merge remote-tracking branch 'upstream/master'
Mar 16, 2022
3565522
Merge remote-tracking branch 'upstream/master'
Mar 23, 2022
72e7b53
Merge remote-tracking branch 'upstream/master'
Mar 29, 2022
593872a
Merge remote-tracking branch 'upstream/master'
trishorts Apr 13, 2022
42dd034
Merge branch 'master' of https://github.com/trishorts/mzLib
trishorts Apr 13, 2022
fbeaec0
Merge remote-tracking branch 'upstream/master'
trishorts Jun 1, 2022
614ded7
Merge remote-tracking branch 'upstream/master'
Jun 14, 2022
47307c8
Merge branch 'master' of https://github.com/trishorts/mzLib
Jun 14, 2022
28e05ae
Merge remote-tracking branch 'upstream/master'
Jul 6, 2022
0a7c609
Merge remote-tracking branch 'upstream/master'
Jul 26, 2022
630d8c7
Merge remote-tracking branch 'upstream/master'
trishorts Jul 27, 2022
f6a386b
Merge branch 'master' of https://github.com/trishorts/mzLib
trishorts Jul 27, 2022
d673800
Merge remote-tracking branch 'upstream/master'
Sep 11, 2022
675a0ae
Merge branch 'master' of https://github.com/trishorts/mzLib
Sep 11, 2022
15d4baf
Merge remote-tracking branch 'upstream/master'
Sep 27, 2022
03ca9f7
Merge remote-tracking branch 'upstream/master'
Oct 4, 2022
d0a4c79
Merge remote-tracking branch 'upstream/master'
Jan 30, 2023
894b998
Merge remote-tracking branch 'upstream/master'
Mar 15, 2023
88269a1
Merge remote-tracking branch 'upstream/master'
trishorts Apr 24, 2023
9a9b24a
Merge remote-tracking branch 'upstream/master'
trishorts Jun 29, 2023
b4ad231
add space
trishorts Jun 29, 2023
bc59b38
Merge remote-tracking branch 'upstream/master'
trishorts Oct 10, 2023
f3c83ae
first move
trishorts Nov 6, 2023
d6d934b
psmFromTsv unit tests
trishorts Nov 6, 2023
2db71cd
moved library spectrum
trishorts Nov 6, 2023
562f69d
empty unit test for library spectrum
trishorts Nov 6, 2023
d3dcbe9
m
trishorts Nov 6, 2023
2c4334a
library spectrum unit tests
trishorts Nov 7, 2023
a86d68e
lib spec unit tests
trishorts Nov 7, 2023
c7ce32d
PSMTSV unit tests
trishorts Nov 7, 2023
c610791
add tests for variants and localized glycans
trishorts Nov 7, 2023
5e09c14
capitalization convention
trishorts Nov 7, 2023
9055644
read internal ions test
trishorts Nov 7, 2023
74b80ad
uncomment lines
trishorts Nov 7, 2023
d1bc75c
moved fragmentation and library spectrum to new project Omics
trishorts Nov 8, 2023
cec311a
Revert "moved fragmentation and library spectrum to new project Omics"
trishorts Nov 9, 2023
8d88b32
someInterfaces
trishorts Nov 9, 2023
df0f605
good midpont
trishorts Nov 9, 2023
cad0d1c
omics classes and interfaces seem tobe working
trishorts Nov 9, 2023
8991e14
move LibrarySpectrum class to Omics. Create SpectrumMatchFromTsvHeade…
trishorts Nov 10, 2023
02bf807
not working
trishorts Nov 15, 2023
b7d15d6
Fixed up the PR
nbollis Nov 15, 2023
2502322
Merge pull request #2 from trishorts/tempPsmFromTsv
trishorts Nov 16, 2023
924e99f
fix broken test
trishorts Nov 16, 2023
10f53a2
some unit tests
trishorts Nov 16, 2023
d0a55b2
dhg
trishorts Nov 16, 2023
81f9338
Expanded test coverage on file classes
nbollis Nov 16, 2023
382c0da
new header and xlink psmtsv reader unit tests
trishorts Nov 20, 2023
3abe9a3
CPU(windows, linux, and mac) dll
elaboy Nov 20, 2023
71c3ead
init
elaboy Nov 21, 2023
7a84810
Merge branch 'pr/737' into TrainingMethodsForChronologer
elaboy Nov 21, 2023
79e3d09
Custom Datasets and training functions
elaboy Nov 21, 2023
848f81c
cool progress
elaboy Nov 21, 2023
d8576aa
training working
elaboy Nov 22, 2023
81fe5b6
Working
elaboy Nov 24, 2023
d0c83ec
training working
elaboy Nov 27, 2023
cdba64f
MachineLearningProject/Abstraction
elaboy Nov 27, 2023
e5a6c67
Abstraction Progress
elaboy Nov 28, 2023
65e15f6
Chronologer model might be done
elaboy Nov 28, 2023
fafd9c6
some tests
elaboy Nov 28, 2023
9ac8a76
Before clean up
elaboy Nov 30, 2023
a250eb8
removing unused files
elaboy Nov 30, 2023
3cb9478
CPU
elaboy Nov 30, 2023
a5c104d
Merge branch 'master' into MachineLearningProjectModule
elaboy Nov 30, 2023
56af0b6
Revert "Merge branch 'master' into MachineLearningProjectModule"
elaboy Nov 30, 2023
5af0151
Update Test.csproj
elaboy Nov 30, 2023
b8b8089
Update TestChronologer.cs
elaboy Dec 4, 2023
f393237
fixed conflicts
elaboy Dec 7, 2023
992448b
nuspec changed
elaboy Dec 7, 2023
84de29c
Merge remote-tracking branch 'upstream/master' into MachineLearningPr…
elaboy Dec 8, 2023
18aecce
added unimod xml file for modification loading in the machine learnin…
elaboy Dec 11, 2023
cda0872
Class rename and diagram
elaboy Dec 11, 2023
155 changes: 155 additions & 0 deletions mzLib/MachineLearning/DeepTorch.cs
@@ -0,0 +1,155 @@
using MathNet.Numerics.Statistics;
using Proteomics.PSM;
using TorchSharp;
using TorchSharp.Modules;

namespace MachineLearning
{
/// <summary>
/// Abstract base class for TorchSharp models that take one-dimensional input.
/// </summary>
public abstract class DeepTorch : torch.nn.Module<torch.Tensor, torch.Tensor>
{
public DeepTorch(string preTrainedModelPath = null, bool evalMode = true,
DeviceType device = DeviceType.CPU)
: base(nameof(DeepTorch))
{
RegisterComponents();

if (preTrainedModelPath != null)
LoadPreTrainedModel(preTrainedModelPath);

if (evalMode)
EvaluationMode();
else
TrainingMode();

this.to(device);
}

/// <summary>
/// Defines the computation performed at every call. Do not call this directly for inference; use Predict() instead.
/// </summary>
/// <param name="input"></param>
/// <returns></returns>
public abstract override torch.Tensor forward(torch.Tensor input);

public virtual torch.Tensor Predict(torch.Tensor input)
{
return this.call(input);
}

/// <summary>
/// Defines the behavior of the module during training phase.
/// </summary>
public abstract void Train(string modelSavingPath, List<PsmFromTsv> trainingData,
Dictionary<(char, string), int> dictionary, DeviceType device, float validationFraction,
float testingFraction, int batchSize, int epochs, int patience);

/// <summary>
/// Defines the behavior of the model when loading pre-trained weights.
/// </summary>
/// <param name="weightsPath"></param>
/// <param name="strict"></param>
public virtual void LoadPreTrainedModel(string weightsPath, bool strict = true)
{
this.load(weightsPath, strict);
}

/// <summary>
/// Stipulates an early stop condition to prevent overfitting.
/// </summary>
protected virtual bool EarlyStop(double score, double currentBestScore, int currentPatience,
out double bestScore, out int patience)
{
if (score < currentBestScore && currentPatience > 0)
{
bestScore = score;
patience = currentPatience - 1;
return false;
}

bestScore = currentBestScore;
patience = 0;
return true;

}

/// <summary>
/// Generates a checkpoint to save the model at indicated step.
/// </summary>
protected virtual void Checkpoint(string checkPointPath, int epoch)
{
// Create the per-epoch directory (and any missing parents), then save the model inside it.
var epochDirectory = Path.Combine(checkPointPath, "Model" + epoch);
Directory.CreateDirectory(epochDirectory);
SaveModel(Path.Combine(epochDirectory, "checkpointModel"));
}

protected virtual void ModelPerformance(string savingPath, double bestScore, double accuracy,
List<double> predictions,
List<double> labels)
{
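using (var writer = new System.IO.StreamWriter(savingPath + ".txt"))
{
// Correlation.Pearson returns the coefficient r; square it to report R^2.
double pearson = Correlation.Pearson(predictions, labels);
writer.WriteLine($"Best score: {bestScore}");
writer.WriteLine($"Accuracy: {accuracy}");
writer.WriteLine($"R^2: {pearson * pearson}");
// Join the values explicitly; interpolating a List<double> prints only its type name.
writer.WriteLine($"Predictions: {string.Join(", ", predictions)}");
writer.WriteLine($"Labels: {string.Join(", ", labels)}");
}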
}
/// <summary>
/// Learning Rate Decay.
/// </summary>
protected virtual torch.optim.lr_scheduler.LRScheduler _scheduler =>
torch.optim.lr_scheduler.StepLR(new Adam(this.parameters()), 25, 0.1);

/// <summary>
/// Sets model into training mode.
/// </summary>
public virtual void TrainingMode()
{
this.train(true);
}

/// <summary>
/// Sets model into evaluation mode.
/// </summary>
public virtual void EvaluationMode()
{
this.eval();
this.train(false);
}

/// <summary>
/// Saves the trained weights as a .dat file (TorchSharp format).
/// </summary>
/// <param name="modelSavingPath"></param>
public virtual void SaveModel(string modelSavingPath)
{
EvaluationMode();

if (modelSavingPath.EndsWith(".dat"))
this.save(modelSavingPath);
else
this.save(modelSavingPath + ".dat");
}

protected abstract double Validate(DataLoader? validationDataLoader,
torch.nn.Module<torch.Tensor, torch.Tensor, torch.Tensor> criterion, DeviceType device);
protected abstract (double, List<double>, List<double>) Test(DataLoader? testingDataLoader,
torch.nn.Module<torch.Tensor, torch.Tensor, torch.Tensor> criterion, DeviceType device);
public abstract TorchDataset? TrainingDataset { get; set; }
public abstract TorchDataset? TestingDataset { get; set; }
public abstract TorchDataset? ValidationDataset { get; set; }
public abstract torch.Tensor Tensorize(object toTensorize);

// public abstract Dictionary<string, torch.Tensor> GetTensor(long index);

// Sets the nullable Dataset field to a new instance of TorchDataset.
public abstract void CreateDataSet(List<PsmFromTsv> data, float validationFraction, float testingFraction, int batchSize);

protected abstract void CreateDataLoader(int batchSize);
}
}
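Note on EarlyStop in DeepTorch.cs: as written, it decrements patience on every improving epoch and returns true on the first epoch that does not improve. A more common convention resets the counter on improvement and stops only after several consecutive non-improving epochs. The following is a minimal sketch of that convention, not the PR's implementation; the class name EarlyStoppingTracker and its members are hypothetical, and it assumes that lower scores are better (for example, a validation loss).

```csharp
using System;

// Hypothetical helper, not part of the PR: tracks the best (lowest) score and
// signals a stop after `patience` consecutive epochs without improvement.
public sealed class EarlyStoppingTracker
{
    private readonly int _patience;
    private int _epochsWithoutImprovement;

    // Lowest score seen so far; starts at +infinity so the first score always improves.
    public double BestScore { get; private set; } = double.PositiveInfinity;

    public EarlyStoppingTracker(int patience)
    {
        if (patience < 1)
            throw new ArgumentOutOfRangeException(nameof(patience));
        _patience = patience;
    }

    // Returns true when training should stop.
    public bool ShouldStop(double score)
    {
        if (score < BestScore)
        {
            BestScore = score;
            _epochsWithoutImprovement = 0; // an improvement resets the counter
            return false;
        }

        _epochsWithoutImprovement++;
        return _epochsWithoutImprovement >= _patience;
    }
}
```

A training loop would call ShouldStop once per epoch with the validation score and break out when it returns true. Keeping the counter inside the object avoids threading best-score and patience values through out parameters.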
118 changes: 118 additions & 0 deletions mzLib/MachineLearning/DictionaryBuilder.cs
@@ -0,0 +1,118 @@
using System;
using System.Collections.Generic;
using System.Linq;

namespace UsefulProteomicsDatabases
{
/// <summary>
/// This class provides static methods to build dictionaries for different models.
/// </summary>
public static class DictionaryBuilder
{
public static Dictionary<(char, string), int>
GetChronologerDictionary(TypeOfDictionary dict)
{
var dictionary = new Dictionary<(char, string), int>()
{
{ ('A', ""), 1 }, // Alanine
{ ('C', ""), 2 }, // Cysteine
{ ('D', ""), 3 }, // Aspartate
{ ('E', ""), 4 }, // Glutamate
{ ('F', ""), 5 }, // Phenylalanine
{ ('G', ""), 6 }, // Glycine
{ ('H', ""), 7 }, // Histidine
{ ('I', ""), 8 }, // Isoleucine
{ ('K', ""), 9 }, // Lysine
{ ('L', ""), 10 }, // Leucine
{ ('M', ""), 11 }, // Methionine
{ ('N', ""), 12 }, // Asparagine
{ ('P', ""), 13 }, // Proline
{ ('Q', ""), 14 }, // Glutamine
{ ('R', ""), 15 }, // Arginine
{ ('S', ""), 16 }, // Serine
{ ('T', ""), 17 }, // Threonine
{ ('V', ""), 18 }, // Valine
{ ('W', ""), 19 }, // Tryptophan
{ ('Y', ""), 20 }, // Tyrosine
{ ('C', "C-Terminus"), 38 },
{ ('N', "N-Terminus"), 44 }
};

//todo: change numbers to accession numbers from DB
if (dict == TypeOfDictionary.Unimod)
{
int aaCount = 20;

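// Load unimod.xml from the build output directory (MachineLearning.csproj copies it there)
// instead of a machine-specific absolute path.
var mods =
Loaders.LoadUnimod(Path.Combine(AppContext.BaseDirectory, "unimod.xml")).ToList();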

var groupedModsByOriginalID =
mods.GroupBy(x => x.Target.ToString()).ToList();

foreach (var target in groupedModsByOriginalID)
{
foreach (var mod in target)
{
aaCount = aaCount + 1;

//C-Terminus and N-Terminus
if(aaCount == 38 || aaCount == 44)
{
aaCount = aaCount + 1;
}

if (!dictionary.ContainsKey((Char.Parse(target.Key), mod.IdWithMotif)))
dictionary.Add((Char.Parse(target.Key), mod.IdWithMotif), aaCount);
}
}

return dictionary;
}

if (dict == TypeOfDictionary.Chronologer)
{
var chronologerModsDict = new Dictionary<(char, string), int>()
{
{ ('C', "Carbamidomethyl on C"), 21 }, // Carbamidomethyl
{ ('M', "Oxidation on M"), 22 }, // Oxidized
{ ('C', "Pyro-carbamidomethyl on C"), 23 }, // S-carbamidomethylcysteine
{ ('E', "Glu to PyroGlu"), 24 }, // Pyroglutamate
{ ('S', "Phosphorylation on S"), 25 }, // Phosphoserine
{ ('T', "Phosphorylation on T"), 26 }, // Phosphothreonine
{ ('Y', "Phosphorylation on Y"), 27 }, // Phosphotyrosine
{ ('K', "Acetylation on K"), 28 }, // Acetylated
{ ('K', "Succinylation on K"), 29 }, // Succinylated
{ ('K', "Ubiquitination on K"), 30 }, // Ubiquitinated
{ ('K', "Methylation on K"), 31 }, // Monomethyl
{ ('K', "Dimethylation on K"), 32 }, // Dimethyl
{ ('K', "Trimethylation on K"), 33 }, // Trimethyl
{ ('R', "Methylation on R"), 34 }, // Monomethyl
{ ('R', "Dimethylation on R"), 35 }, // Dimethyl
{ ('K', "TMT on K"), 36 }, // TMT0-modified lysine
{ ('K', "TMT6plex on K"), 37 }, // TMT10-modified lysine
};

foreach (var item in chronologerModsDict)
{
if (!dictionary.ContainsKey(item.Key))
dictionary.Add(item.Key, item.Value);
}
}

if (dict == TypeOfDictionary.CanonicalAA)
{
return dictionary;
}

return dictionary;
}

public enum TypeOfDictionary
{
CanonicalAA,
Chronologer,
Unimod
}
}
}
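Note on DictionaryBuilder.cs: the dictionary maps a (residue, modification ID) pair to an integer token, with the empty string standing for "no modification". The sketch below shows one way such a dictionary could be used to turn an unmodified sequence into tokens. It is illustrative only: SequenceEncoder and Encode are hypothetical names, and mapping unknown residues to 0 is an assumption, not something this diff specifies.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the PR.
public static class SequenceEncoder
{
    // Looks up each unmodified residue under the key (residue, "").
    // Residues missing from the dictionary are encoded as 0 (an assumed placeholder).
    public static int[] Encode(string baseSequence,
        IReadOnlyDictionary<(char, string), int> dictionary)
    {
        var tokens = new int[baseSequence.Length];
        for (int i = 0; i < baseSequence.Length; i++)
        {
            tokens[i] = dictionary.TryGetValue((baseSequence[i], ""), out int token)
                ? token
                : 0;
        }
        return tokens;
    }
}
```

With the canonical entries above, Encode("ACD", dictionary) would look up ('A', ""), ('C', "") and ('D', "") in turn. A modified residue would need its modification ID in the second tuple slot, which this sketch does not handle.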
24 changes: 24 additions & 0 deletions mzLib/MachineLearning/MachineLearning.csproj
@@ -0,0 +1,24 @@
<Project Sdk="Microsoft.NET.Sdk">

<PropertyGroup>
<TargetFramework>net6.0</TargetFramework>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
<Platforms>x64</Platforms>
</PropertyGroup>

<ItemGroup>
<ProjectReference Include="..\Proteomics\Proteomics.csproj" />
<ProjectReference Include="..\UsefulProteomicsDatabases\UsefulProteomicsDatabases.csproj" />
</ItemGroup>

<ItemGroup>
<None Update="RetentionTimePredictionModels\Chronologer_20220601193755_TorchSharp.dat">
<CopyToOutputDirectory>Always</CopyToOutputDirectory>
</None>
<None Update="unimod.xml">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
</ItemGroup>

</Project>
42 changes: 42 additions & 0 deletions mzLib/MachineLearning/MachineLearningDiagram.cd
@@ -0,0 +1,42 @@
<?xml version="1.0" encoding="utf-8"?>
<ClassDiagram MajorVersion="1" MinorVersion="1">
<Class Name="MachineLearning.DeepTorch">
<Position X="8" Y="2" Width="4.5" />
<TypeIdentifier>
<HashCode>AAAgAgAAAACAAgAAAAAxAAAIICgAIAQAAIAACABCAAg=</HashCode>
<FileName>DeepTorch.cs</FileName>
</TypeIdentifier>
</Class>
<Class Name="MachineLearning.TorchDataset">
<Position X="14.25" Y="2.5" Width="3" />
<TypeIdentifier>
<HashCode>AAAAAAACAAAAAAAAAAAAAAQAAAAAABAAAAAAQAAAAAA=</HashCode>
<FileName>TorchDataset.cs</FileName>
</TypeIdentifier>
</Class>
<Class Name="MachineLearning.RetentionTimePredictionModels.Chronologer">
<Position X="8.75" Y="8" Width="3.25" />
<TypeIdentifier>
<HashCode>ABIgAgAIAACAAAAAAAAwQAAIOCAAAEAAAIAACAACACg=</HashCode>
<FileName>RetentionTimePredictionModels\Chronologer.cs</FileName>
</TypeIdentifier>
</Class>
<Class Name="UsefulProteomicsDatabases.DictionaryBuilder">
<Position X="13.75" Y="8" Width="4.25" />
<Compartments>
<Compartment Name="Nested Types" Collapsed="false" />
</Compartments>
<NestedTypes>
<Enum Name="UsefulProteomicsDatabases.DictionaryBuilder.TypeOfDictionary" Collapsed="true">
<TypeIdentifier>
<NewMemberFileName>DictionaryBuilder.cs</NewMemberFileName>
</TypeIdentifier>
</Enum>
</NestedTypes>
<TypeIdentifier>
<HashCode>AAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=</HashCode>
<FileName>DictionaryBuilder.cs</FileName>
</TypeIdentifier>
</Class>
<Font Name="Segoe UI" Size="9" />
</ClassDiagram>