Skip to content

Commit

Permalink
Changes to MBR within FlashLFQ (#802)
Browse files Browse the repository at this point in the history
* Predict Retention Time function added

* breaking up mbr function, cont'd

* Decoy search is working

* Removed decoy toggle from GetIsotopicEnvelopes

* About to start MbrScorer refactor

* Refactor of MBR is working succesfully

* Deleted unused code

* got those decoys decoying naw mean

* refactore MBR to use concurrent dictionaries

* Test method with output

* Deleted unused fields and references to PearsonCorrelation in isotopicEnvelope class and CheckEnvelope method

* Deleted unused using statements in FlashLFQ engine

* Commented + removed unused variables for PredicteRetentionTime method

* Added comments, removed unused using statements

* Deleted unused variables

* Tests are still passing, but I'm about to break things

* Edited scoring method

* Umm, edited scoring, increased ppm tolerance, changed minimum rt width for variance and random rt transfers

* Did some weird filtering of donor files that should probably be reverted

* Peak Organism fix

* Basically works, QValue still underestimates

* Minor changes to peakfinding

* MSMS double checking

* Removed msms double checking

* Add check for MBR/MSMS peak collision

* Changed peak decoy selection method

* Optional PeptidesForMbr argument added, not covered in previous commit messages. Not actually this commit

* 5.3.2.4 - Changed randomRT selection method

* 5.3.2.5 - Minimum rtWindowWidth = 30 seconds

* Actually updated nuget

* 5.3.2.6 - Fixed decoy search. 2.4 and 2.5 are junk

* Changed MBR RT prediction method

* Fixed bug in rtPrediction

* Changed decoy search method and fixed issues with decoy rt scoring

* small changes to double check procedure

* Finished Merging in MbrFdr - mzLib 5341

* minor

* Actually changed nuspec

* Increased Rt Range

* Increased doublecheck count to 2500. mzLib 5343

* reduced window slightly. mzLib 5344

* amended rt scoring distribution in mbr scorer

* Mostly deleted stuff that will be introduced in a different commit

* Fixed all tests but one

* Fixed final test

* minor

* Merged in important parts from MbrDoubleCheck

* Changed decoy peak pairing

* finished update

* Updated anchors

* MbrThresholdAsParam

* Prevented decoys from being used for protein quant

* minor

* Deleted MSMS double check, changed scoring algo (removed intensity)

* reverted scoring function

* Fixed issue where decoy peaks were being used to calculate peptide intensities

* Fixed issue where low scoring peak traces could overwrite intensity values in MBR

* Removed minimum for RT window

* minor

* Fixed issue where MbrScorer would crash if too few anchor peptides pairs were passed in

* Finished test structure to use for creating censored mzMls

* Started and failed fragger stuff

* minor

* Added fragger experiment class

* Minor

* MaxQuant Reader started

* Fixed omics issue. Finished MaxQuantEvidence

* Added maxQuantEvidence, MqResultsFile, got it working

* minor

* minor

* changed nuspec

* minor

* Added renomalization function to FlashLFQ results, incremented nuspec

* minor

* idk

* minor

* Deleted unrelated changes

* Added FraggerMasterFile

* fixed errors introduced by merge

* idk

* Changing parameters

* about to break things

* Implemented better PIP donor selection

* Implemented PEP, and it's working

* Included ML.NET in nuspec

* idk

* Semi-supervised PEP

* Retrain within splits

* Working

* Added donor groups, pep is working

* Multiple peaks per donor

* Switched to 514, single donor training

* mzLib 315

* mzLib 315 for real

* Commit before decoy pairing

* Creating groups by decoy count only

* idk

* Implemented cursed algorithm for donorGroup partitioning

* mzLib 324

* mzLib 5.326

* 327

* mzLib 328

* mzLib 329

* nuspec

* Updates to window width calculations

* More changes to window width calculations

* nuspec update

* reverted donor exclusion window stuff

* Fixing merge conflicts one at a time

* minor

* Fixed errors, deleted tests with local file refs

* Merged master into Pip-ECHO FlashLFQ, fixed all the issues that arose.

* Reverted unneccessary changes

* minor

* minor

* minor

* Pep engine no longer static

* MlContext Seed

* xyz

* messed with the .yml

* More logs

* Even more logs

* dsf

* One million logs

* thread safe rng

* Fixed random issue, tests are passing

* More lgos

* Stabilized results?

* Fixed issues

* Deleted unused properties, changed ChromatographicPeak tostring

* Removed unintended changes

* started refactor

* Commit before breaking things

* Refactored donorGroup equalization, added test

* Finished tests for Pip-ECHO

* Added more unit tests, addressed Shortreed's comments

* Reverted changes to github actions .yml, updated MicrosoftML

* Update dotnet.yml

* Updated comments

* Revert nuspec

* Edited realDataMbr test to account for change to MbrScorer

* More changes to tests

* Minor

* Reverted unnecessary changes

* added additional comments

* More comments

* Fixed nuspec

---------

Co-authored-by: trishorts <mshort@chem.wisc.edu>
  • Loading branch information
Alexander-Sol and trishorts authored Nov 5, 2024
1 parent b055693 commit e5cf73e
Show file tree
Hide file tree
Showing 19 changed files with 2,036 additions and 353 deletions.
88 changes: 38 additions & 50 deletions mzLib/FlashLFQ/ChromatographicPeak.cs
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
using MzLibUtil;
using System;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using ClassExtensions = Chemistry.ClassExtensions;
using FlashLFQ.PEP;

namespace FlashLFQ
{
Expand All @@ -16,20 +16,23 @@ public class ChromatographicPeak
public int ScanCount => IsotopicEnvelopes.Count;
public double SplitRT;
public readonly bool IsMbrPeak;
public double PredictedRetentionTime { get; init; }

public double MbrScore;
public double PpmScore { get; set; }
public double IntensityScore { get; set; }
public double RtScore { get; set; }
public double ScanCountScore { get; set; }
public double IsotopicDistributionScore { get; set; }
/// <summary>
/// A score bounded by 100 and 0, with more confident MBR-detections receiving higher scores
/// Stores the pearson correlation between the apex isotopic envelope and the theoretical isotopic distribution
/// </summary>
public double MbrScore { get; private set; }

/// The four scores below are bounded by 0 and 1, with higher scores being better
public double PpmScore { get; private set; }
public double IntensityScore { get; private set; }
public double RtScore { get; private set; }
public double ScanCountScore { get; private set; }

public ChromatographicPeak(Identification id, bool isMbrPeak, SpectraFileInfo fileInfo)
public double IsotopicPearsonCorrelation => Apex?.PearsonCorrelation ?? -1;
public double RtPredictionError { get; set; }
public List<int> ChargeList { get; set; }
internal double MbrQValue { get; set; }
public ChromatographicPeakData PepPeakData { get; set; }
public double? MbrPep { get; set; }

public ChromatographicPeak(Identification id, bool isMbrPeak, SpectraFileInfo fileInfo, bool randomRt = false)
{
SplitRT = 0;
NumChargeStatesObserved = 0;
Expand All @@ -40,12 +43,14 @@ public ChromatographicPeak(Identification id, bool isMbrPeak, SpectraFileInfo fi
IsotopicEnvelopes = new List<IsotopicEnvelope>();
IsMbrPeak = isMbrPeak;
SpectraFileInfo = fileInfo;
RandomRt = randomRt;
}

public ChromatographicPeak(Identification id, bool isMbrPeak, SpectraFileInfo fileInfo, double predictedRetentionTime) :
this(id, isMbrPeak, fileInfo)
public bool Equals(ChromatographicPeak peak)
{
PredictedRetentionTime = predictedRetentionTime;
return SpectraFileInfo.Equals(peak.SpectraFileInfo)
&& Identifications.First().ModifiedSequence.Equals(peak.Identifications.First().ModifiedSequence)
&& ApexRetentionTime == peak.ApexRetentionTime;
}

public IsotopicEnvelope Apex { get; private set; }
Expand All @@ -54,13 +59,18 @@ public ChromatographicPeak(Identification id, bool isMbrPeak, SpectraFileInfo fi
public int NumIdentificationsByBaseSeq { get; private set; }
public int NumIdentificationsByFullSeq { get; private set; }
public double MassError { get; private set; }
/// <summary>
/// Bool that describes whether the retention time of this peak was randomized
/// If true, implies that this peak is a decoy peak identified by the MBR algorithm
/// </summary>
public bool RandomRt { get; }
public bool DecoyPeptide => Identifications.First().IsDecoy;

public void CalculateIntensityForThisFeature(bool integrate)
{
if (IsotopicEnvelopes.Any())
{
double maxIntensity = IsotopicEnvelopes.Max(p => p.Intensity);
Apex = IsotopicEnvelopes.First(p => p.Intensity == maxIntensity);
Apex = IsotopicEnvelopes.MaxBy(p => p.Intensity);

if (integrate)
{
Expand Down Expand Up @@ -123,25 +133,6 @@ public void ResolveIdentifications()
this.NumIdentificationsByBaseSeq = Identifications.Select(v => v.BaseSequence).Distinct().Count();
this.NumIdentificationsByFullSeq = Identifications.Select(v => v.ModifiedSequence).Distinct().Count();
}

/// <summary>
/// Calculates four component scores and one overarching Mbr score for an MBR peak.
/// MBR Score is equal to 100 * the geometric mean of the four component scores.
/// </summary>
/// <param name="scorer"> An MbrScorer specific to the file where this peak was found </param>
/// <param name="donorPeak"> The donor peak used as the basis for the MBR identification. </param>
internal void CalculateMbrScore(MbrScorer scorer, ChromatographicPeak donorPeak)
{
if (SpectraFileInfo != scorer.AcceptorFile) throw new MzLibException("Error when performing match-between-runs: Mismatch between scorer and peak.");

IntensityScore = scorer.CalculateIntensityScore(this, donorPeak);
RtScore = scorer.CalculateRetentionTimeScore(this, donorPeak);
PpmScore = scorer.CalculatePpmErrorScore(this);
ScanCountScore = scorer.CalculateScanCountScore(this);

MbrScore = 100 * Math.Pow(IntensityScore * RtScore * PpmScore * ScanCountScore, 0.25);
}

public static string TabSeparatedHeader
{
get
Expand All @@ -164,20 +155,18 @@ public static string TabSeparatedHeader
sb.Append("Peak Charge" + "\t");
sb.Append("Num Charge States Observed" + "\t");
sb.Append("Peak Detection Type" + "\t");
sb.Append("MBR Score" + "\t");
sb.Append("Ppm Score" + "\t");
sb.Append("Intensity Score" + "\t");
sb.Append("Rt Score" + "\t");
sb.Append("Scan Count Score" + "\t");
sb.Append("PIP Q-Value" + "\t");
sb.Append("PIP PEP" + "\t");
sb.Append("PSMs Mapped" + "\t");
sb.Append("Base Sequences Mapped" + "\t");
sb.Append("Full Sequences Mapped" + "\t");
sb.Append("Peak Split Valley RT" + "\t");
sb.Append("Peak Apex Mass Error (ppm)");
sb.Append("Peak Apex Mass Error (ppm)" + "\t");
sb.Append("Decoy Peptide" + "\t");
sb.Append("Random RT");
return sb.ToString();
}
}

public override string ToString()
{
StringBuilder sb = new StringBuilder();
Expand Down Expand Up @@ -260,17 +249,16 @@ public override string ToString()
sb.Append("" + "MSMS" + "\t");
}

sb.Append("" + (IsMbrPeak ? MbrScore.ToString() : "") + "\t");
sb.Append("" + (IsMbrPeak ? PpmScore.ToString() : "") + "\t");
sb.Append("" + (IsMbrPeak ? IntensityScore.ToString() : "") + "\t");
sb.Append("" + (IsMbrPeak ? RtScore.ToString() : "") + "\t");
sb.Append("" + (IsMbrPeak ? ScanCountScore.ToString() : "") + "\t");
sb.Append("" + (IsMbrPeak ? MbrQValue.ToString() : "") + "\t");
sb.Append("" + (IsMbrPeak ? MbrPep.ToString() : "") + "\t");

sb.Append("" + Identifications.Count + "\t");
sb.Append("" + NumIdentificationsByBaseSeq + "\t");
sb.Append("" + NumIdentificationsByFullSeq + "\t");
sb.Append("" + SplitRT + "\t");
sb.Append("" + MassError);
sb.Append("\t" + DecoyPeptide);
sb.Append("\t" + RandomRt);

return sb.ToString();
}
Expand Down
2 changes: 2 additions & 0 deletions mzLib/FlashLFQ/FlashLFQ.csproj
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@
<ItemGroup>
<PackageReference Include="CsvHelper" Version="32.0.3" />
<PackageReference Include="MathNet.Numerics" Version="5.0.0" />
<PackageReference Include="Microsoft.ML" Version="3.0.1" />
<PackageReference Include="Microsoft.ML.FastTree" Version="3.0.1" />
<PackageReference Include="NetSerializer" Version="4.1.2" />
<PackageReference Include="SharpLearning.Optimization" Version="[0.28.0]" />
</ItemGroup>
Expand Down
38 changes: 27 additions & 11 deletions mzLib/FlashLFQ/FlashLFQResults.cs
Original file line number Diff line number Diff line change
Expand Up @@ -15,27 +15,26 @@ public class FlashLfqResults
public readonly Dictionary<string, ProteinGroup> ProteinGroups;
public readonly Dictionary<SpectraFileInfo, List<ChromatographicPeak>> Peaks;
private readonly HashSet<string> _peptideModifiedSequencesToQuantify;
public string PepResultString { get; set; }
public double MbrQValueThreshold { get; set; }

public FlashLfqResults(List<SpectraFileInfo> spectraFiles, List<Identification> identifications, HashSet<string> peptides = null)
public FlashLfqResults(List<SpectraFileInfo> spectraFiles, List<Identification> identifications, double mbrQValueThreshold = 0.05,
HashSet<string> peptideModifiedSequencesToQuantify = null)
{
SpectraFiles = spectraFiles;
PeptideModifiedSequences = new Dictionary<string, Peptide>();
ProteinGroups = new Dictionary<string, ProteinGroup>();
Peaks = new Dictionary<SpectraFileInfo, List<ChromatographicPeak>>();
if(peptides == null || !peptides.Any())
{
peptides = identifications.Select(id => id.ModifiedSequence).ToHashSet();
}
_peptideModifiedSequencesToQuantify = peptides;
MbrQValueThreshold = mbrQValueThreshold;
_peptideModifiedSequencesToQuantify = peptideModifiedSequencesToQuantify ?? identifications.Where(id => !id.IsDecoy).Select(id => id.ModifiedSequence).ToHashSet();

foreach (SpectraFileInfo file in spectraFiles)
{
Peaks.Add(file, new List<ChromatographicPeak>());
}


// Only quantify peptides within the set of valid peptide modified (full) sequences. This is done to enable pepitde-level FDR control of reported results
foreach (Identification id in identifications.Where(id => peptides.Contains(id.ModifiedSequence)))
foreach (Identification id in identifications.Where(id => !id.IsDecoy & _peptideModifiedSequencesToQuantify.Contains(id.ModifiedSequence)))
{
if (!PeptideModifiedSequences.TryGetValue(id.ModifiedSequence, out Peptide peptide))
{
Expand All @@ -59,6 +58,17 @@ public FlashLfqResults(List<SpectraFileInfo> spectraFiles, List<Identification>
}
}

public void ReNormalizeResults(bool integrate = false, int maxThreads = 10, bool useSharedPeptides = false)
{
foreach(var peak in Peaks.SelectMany(p => p.Value))
{
peak.CalculateIntensityForThisFeature(integrate);
}
new IntensityNormalizationEngine(this, integrate, silent: true, maxThreads).NormalizeResults();
CalculatePeptideResults(quantifyAmbiguousPeptides: false);
CalculateProteinResultsMedianPolish(useSharedPeptides: useSharedPeptides);
}

public void MergeResultsWith(FlashLfqResults mergeFrom)
{
this.SpectraFiles.AddRange(mergeFrom.SpectraFiles);
Expand Down Expand Up @@ -128,6 +138,8 @@ public void CalculatePeptideResults(bool quantifyAmbiguousPeptides)
{
var groupedPeaks = filePeaks.Value
.Where(p => p.NumIdentificationsByFullSeq == 1)
.Where(p => !p.Identifications.First().IsDecoy)
.Where(p => !p.IsMbrPeak || (p.MbrQValue < MbrQValueThreshold && !p.RandomRt))
.GroupBy(p => p.Identifications.First().ModifiedSequence)
.Where(group => _peptideModifiedSequencesToQuantify.Contains(group.Key))
.ToList();
Expand Down Expand Up @@ -163,11 +175,15 @@ public void CalculatePeptideResults(bool quantifyAmbiguousPeptides)
// report ambiguous quantification
var ambiguousPeaks = filePeaks.Value
.Where(p => p.NumIdentificationsByFullSeq > 1)
.Where(p => !p.Identifications.First().IsDecoy)
.Where(p => !p.IsMbrPeak || (p.MbrQValue < MbrQValueThreshold && !p.RandomRt))
.ToList();
foreach (ChromatographicPeak ambiguousPeak in ambiguousPeaks)
{
foreach (Identification id in ambiguousPeak.Identifications)
foreach (Identification id in ambiguousPeak.Identifications.Where(id => !id.IsDecoy))
{
if (!_peptideModifiedSequencesToQuantify.Contains(id.ModifiedSequence)) continue; // Ignore the ids/sequences we don't want to quantify

string sequence = id.ModifiedSequence;

double alreadyRecordedIntensity = PeptideModifiedSequences[sequence].GetIntensity(filePeaks.Key);
Expand Down Expand Up @@ -224,7 +240,7 @@ private void HandleAmbiguityInFractions()

foreach (SpectraFileInfo file in sample)
{
foreach (ChromatographicPeak peak in Peaks[file])
foreach (ChromatographicPeak peak in Peaks[file].Where(p => !p.IsMbrPeak || p.MbrQValue < MbrQValueThreshold))
{
foreach (Identification id in peak.Identifications)
{
Expand Down Expand Up @@ -549,7 +565,7 @@ public void CalculateProteinResultsMedianPolish(bool useSharedPeptides)
}
}

public void WriteResults(string peaksOutputPath, string modPeptideOutputPath, string proteinOutputPath, string bayesianProteinQuantOutput, bool silent = true)
public void WriteResults(string peaksOutputPath, string modPeptideOutputPath, string proteinOutputPath, string bayesianProteinQuantOutput, bool silent)
{
if (!silent)
{
Expand Down
Loading

0 comments on commit e5cf73e

Please sign in to comment.