Commit: Feature/per search extension (#6)
* initial branch commit

* addition of per dimension expansions and tuning

* updated default params and tuning, also propagated new params

* refactored config and clippy suggestions

* propagated params change to tracing

* removed chimera search and final pre-asms readme changes
jspaezp authored Jun 3, 2024
1 parent fb269e9 commit 7d94a23
Showing 15 changed files with 641 additions and 207 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -2,6 +2,7 @@
/venv
/tmp
/run_1k
/benchmark
traces_debug.csv
pseudoscans_debug.json

32 changes: 25 additions & 7 deletions README.md
@@ -1,5 +1,5 @@

# Ionmesh
# IonMesh

## General functioning

@@ -49,6 +49,17 @@ ims_scaling = 0.02
min_n = 4
min_neighbor_intensity = 500
[sage_search_config]
static_mods = [[
"C",
57.0215,
]]
variable_mods = [[
"M",
[15.9949],
]]
fasta_path = "./tmp/UP000005640_9606.fasta"
[output_config] # These options can be missing; if missing, the corresponding files will not be written.
out_features_csv = "features.csv"
debug_traces_csv = "debug_traces.csv"
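For reference, the modification entries above follow a `[residue, mass]` pattern for static mods and `[residue, [masses]]` for variable mods. Below is a minimal sketch of how an additional variable modification could be declared under the same section; the phosphorylation entry is illustrative only and not part of this commit:

```toml
[sage_search_config]
static_mods = [["C", 57.0215]]   # carbamidomethylation of cysteine (from the snippet above)
variable_mods = [
    ["M", [15.9949]],            # methionine oxidation (from the snippet above)
    ["S", [79.9663]],            # e.g. phosphorylation; illustrative addition only
]
fasta_path = "./tmp/UP000005640_9606.fasta"
```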
@@ -70,12 +81,23 @@ DEBUG_TRACES_FROM_CACHE=1 # If set and non empty will load the traces from the c
## Roadmap

1. Use aggregation metrics to re-score sage search.
2. Do a two-pass pseudospec generation, where the first pass finds the centroids and the second pass aggregates around a radius. (This will prevent the issue where common ions, like b2 ions, are assigned only to the most intense spectrum in a window.)
2. [In progress] Do a two-pass pseudospec generation, where the first pass finds the centroids and the second pass aggregates around a radius. (This will prevent the issue where common ions, like b2 ions, are assigned only to the most intense spectrum in a window.)
    - Right now I believe it is over-aggregating peaks and leading to a lot of straggler peaks.
3. Re-define RT parameters in the config as a function of the cycle time and not raw seconds.
    - This could help with the observed fact that parameters perform very differently across instrument/method variants. (Some possible reparametrizations: % of frame intensity instead of absolute intensity, number of cycles instead of retention time,
      making more active use of the cluster intensity rather than the number of neighbors...)
4. Add targeted extraction.
5. Add detection of MS1 features + notched search instead of wide window search.
6. Clean up some of the features and decide which aggregation steps use internal parallelism. (In some steps, making multiple aggregations in parallel is better than doing parallel operations within the aggregation.)
    - Fix nomenclature ... I don't like how inconsistent it is (indexed, indexer, and index are used interchangeably ...).
7. Compilation warning cleanup.
8. Clean up dead/commented out code.
9. Refactor the `max_extension_distances` argument in the generic dbscan implementation to prevent the errors that might arise from mixing up the dimensions (see the sketch after this list).
    - Should that be a property of the converter?
10. Commit to f32/f64 in specific places ... instead of the harder to maintain generic types.
11. Add CICD to distribute the pre-compiled binaries.
12. Add semver checks to the CICD pipeline.
13. Add IMS output to the sage report.
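
A possible shape for the refactor in item 9 above: a minimal sketch that wraps the per-dimension extension limits in a named struct so the m/z and ion-mobility axes cannot be swapped by accident. None of these names exist in the codebase; they are illustrative only.

```rust
// Hypothetical sketch for roadmap item 9; not the crate's actual API.
#[derive(Debug, Clone, Copy)]
struct ExtensionLimits {
    max_mz_extension: f64,
    max_ims_extension: f64,
}

impl ExtensionLimits {
    // The dimension order is fixed in one place instead of at every call site.
    fn as_array(&self) -> [f64; 2] {
        [self.max_mz_extension, self.max_ims_extension]
    }
}

fn main() {
    let limits = ExtensionLimits {
        max_mz_extension: 0.02,
        max_ims_extension: 0.03,
    };
    assert_eq!(limits.as_array(), [0.02, 0.03]);
}
```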

## Maybe in the roadmap

@@ -85,8 +107,4 @@ DEBUG_TRACES_FROM_CACHE=1 # If set and non empty will load the traces from the c

## Where are we at?

- IDs are not great ... They do seem good via manual inspection but the number of IDs is low.




- IDs are pretty close to the equivalent DDA runs with the correct parameters ... They do seem good via manual inspection, but the number of IDs is low compared to peptide-centric searches.
16 changes: 8 additions & 8 deletions src/aggregation/chromatograms.rs
@@ -1,7 +1,7 @@

use log::warn;
use num_traits::AsPrimitive;
use rerun::external::arrow2::io::print;

use std::collections::BTreeMap;
use std::ops::{AddAssign, Mul};

@@ -24,9 +24,9 @@ pub struct ChromatogramArray <T: Mul<Output = T> + AddAssign + Default + AsPrimi

impl BTreeChromatogram {
/// Create a new BTreeChromatogram with the given RT binsize and bin offset.
///
///
/// The values in bin = 0 will be in the range [bin_offset, bin_offset + binsize)
///
///
pub fn new(rt_binsize: f32, rt_bin_offset: f32) -> Self {
BTreeChromatogram {
btree: BTreeMap::new(),
@@ -81,7 +81,7 @@ impl BTreeChromatogram {
match min {
Some(min) => {
let max = *self.btree.keys().last().unwrap();
Some((*min as i32, max as i32))
Some((*min, max))
}
None => None,
}
@@ -150,10 +150,10 @@ impl BTreeChromatogram {

// The chromatogram uses the bin size of the chromatogram btree
// but re-centers it to the mean RT of the trace
if self.btree.len() > 0 {
if !self.btree.is_empty() {
let int_center = ((center_rt.unwrap_or(0.) - self.rt_bin_offset.unwrap()) / self.rt_binsize) as i32;
let left_start = int_center - (NUM_LOCAL_CHROMATOGRAM_BINS / 2) as i32;

for i in 0..NUM_LOCAL_CHROMATOGRAM_BINS {
let bin = left_start + i as i32;
chromatogram_arr[i] = *self.btree.get(&bin).unwrap_or(&0) as f32;
@@ -223,7 +223,7 @@ mod chromatogram_tests {
assert_eq!(c.get_bin(&0), Some(&1));
assert!(c.get_bin(&-1).is_none());
let neg_one_rt = c.bin_to_rt(-1);
assert!(neg_one_rt >= -1.01 && neg_one_rt <= -0.99);
assert!((-1.01..=-0.99).contains(&neg_one_rt));

let mut c2 = BTreeChromatogram::new(1., 0.);
c2.add(0., 1);
@@ -277,7 +277,7 @@ mod chromatogram_tests {
c.chromatogram[4] = 20;
let cosine = c.cosine_similarity(&c2).unwrap();
assert!(cosine <= 0.9, "Cosine: {}", cosine);

}

#[test]
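As a note on the binning convention documented in `BTreeChromatogram::new` above (bin 0 covers `[rt_bin_offset, rt_bin_offset + rt_binsize)`), here is a small self-contained sketch of the RT-to-bin mapping; the helper functions are illustrative and not the crate's API:

```rust
// Illustrative sketch of the RT <-> bin convention described in the doc comment:
// bin 0 spans [rt_bin_offset, rt_bin_offset + rt_binsize).
fn rt_to_bin(rt: f32, rt_binsize: f32, rt_bin_offset: f32) -> i32 {
    ((rt - rt_bin_offset) / rt_binsize).floor() as i32
}

fn bin_to_rt(bin: i32, rt_binsize: f32, rt_bin_offset: f32) -> f32 {
    // Returns the left edge of the bin.
    bin as f32 * rt_binsize + rt_bin_offset
}

fn main() {
    assert_eq!(rt_to_bin(0.5, 1.0, 0.0), 0);   // inside bin 0
    assert_eq!(rt_to_bin(-0.5, 1.0, 0.0), -1); // just below the offset
    assert!((bin_to_rt(-1, 1.0, 0.0) + 1.0).abs() < 1e-6); // consistent with the test above
}
```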
130 changes: 114 additions & 16 deletions src/aggregation/dbscan.rs
@@ -17,7 +17,7 @@ use crate::mod_types::Float;
use crate::ms::frames;
use crate::space::space_generics::{HasIntensity, IndexedPoints, NDPoint};
use indicatif::ProgressIterator;
use log::{debug, info};
use log::{debug, info, trace};

use rayon::prelude::*;

@@ -168,6 +168,7 @@ fn _dbscan<
filter_fun: Option<FF>,
converter: C,
progress: bool,
max_extension_distances: &[Float;N],
) -> (u64, Vec<ClusterLabel<u64>>) {
let mut initial_candidates_counts = utils::RollingSDCalculator::default();
let mut final_candidates_counts = utils::RollingSDCalculator::default();
@@ -254,7 +255,6 @@ fn _dbscan<
let mut seed_set: Vec<&usize> = Vec::new();
seed_set.extend(neighbors);

const MAX_EXTENSION_DISTANCE: Float = 5.;
let mut internal_neighbor_additions = 0;

while let Some(neighbor) = seed_set.pop() {
@@ -304,9 +304,15 @@
let p = &quad_points[**i];
let query_point = query_elems.1.unwrap();
// Using minkowski distance with p = 1, manhattan distance.
let dist = (p.values[0] - query_point.values[0]).abs()
+ (p.values[1] - query_point.values[1]).abs();
let within_distance = dist <= MAX_EXTENSION_DISTANCE;
let mut within_distance = true;
for ((p, q), max_dist) in p.values.iter().zip(query_point.values).zip(max_extension_distances.iter()) {
let dist = (p - q).abs();
within_distance = within_distance && dist <= *max_dist;
if !within_distance {
break;
}
}

going_downhill && within_distance
});
local_neighbor_filter_timer.stop(false);
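
The hunk above replaces the single pooled Manhattan distance (previously capped by `MAX_EXTENSION_DISTANCE`) with a per-dimension cutoff. A self-contained sketch of the same idea, with illustrative names and values that are not taken from the crate:

```rust
// Sketch of a per-dimension extension check: a candidate neighbor is kept only
// if it is within the allowed distance on every axis, instead of comparing a
// single summed (Manhattan) distance against one global limit.
fn within_extension<const N: usize>(
    point: &[f64; N],
    query: &[f64; N],
    max_extension_distances: &[f64; N],
) -> bool {
    point
        .iter()
        .zip(query.iter())
        .zip(max_extension_distances.iter())
        .all(|((p, q), max_dist)| (p - q).abs() <= *max_dist)
}

fn main() {
    let max_dists = [0.02, 0.03]; // e.g. m/z and ion mobility limits, illustrative
    assert!(within_extension(&[100.01, 1.01], &[100.0, 1.0], &max_dists));
    assert!(!within_extension(&[100.05, 1.01], &[100.0, 1.0], &max_dists)); // m/z too far
}
```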
@@ -435,8 +441,8 @@ pub fn aggregate_clusters<
>(
tot_clusters: u64,
cluster_labels: Vec<ClusterLabel<u64>>,
elements: Vec<T>,
def_aggregator: F,
elements: &[T],
def_aggregator: &F,
log_level: utils::LogLevel,
keep_unclustered: bool,
) -> Vec<R> {
Expand Down Expand Up @@ -552,11 +558,68 @@ pub fn aggregate_clusters<
out
}

// Pretty simple function ... it uses every passed centroid, converts it to a point
// and generates a new centroid that aggregates all the points in its range.
// In contrast with the dbscan method, the elements in each cluster are not necessarily
// mutually exclusive.
fn reassign_centroid<
'a,
const N: usize,
T: HasIntensity<Z> + Send + Clone + Copy,
C: NDPointConverter<R, N>,
I: IndexedPoints<'a, N, usize> + std::marker::Sync,
G: Sync + Send + ClusterAggregator<T, R>,
R: Send,
F: Fn() -> G + Send + Sync,
Z: AsPrimitive<u64>
+ Send
+ Sync
+ Add<Output = Z>
+ PartialOrd
+ Div<Output = Z>
+ Mul<Output = Z>
+ Default
+ Sub<Output = Z>,
>(
centroids: Vec<R>,
indexed_points: &'a I,
centroid_converter: C,
elements: &Vec<T>,
def_aggregator: F,
log_level: utils::LogLevel,
expansion_factors: &[Float;N],
) -> Vec<R> {
let mut timer = utils::ContextTimer::new("reassign_centroid", true, log_level);
let mut out = Vec::with_capacity(centroids.len());

for centroid in centroids {
let query_point = centroid_converter.convert(&centroid);
let mut query_elems = centroid_converter.convert_to_bounds_query(&query_point);
query_elems.0.expand(expansion_factors);

// trace!("Querying for Centroid: {:?}", query_elems.1);
// trace!("Querying for Boundary: {:?}", query_elems.0);
let neighbors = indexed_points.query_ndrange(&query_elems.0, query_elems.1);
// trace!("Found {} neighbors", neighbors.len());
let mut aggregator = def_aggregator();
let mut num_agg = 0;
for neighbor in neighbors {
aggregator.add(&elements[*neighbor]);
num_agg += 1;
}
trace!("Aggregated {} elements", num_agg);
out.push(aggregator.aggregate());
}

timer.stop(true);
out
}
// TODO: rename prefiltered peaks argument!
// TODO implement a version that takes a sparse distance matrix.

pub fn dbscan_generic<
C: NDPointConverter<T, N>,
C2: NDPointConverter<R, N>,
R: Send,
G: Sync + Send + ClusterAggregator<T, R>,
T: HasIntensity<Z> + Send + Clone + Copy + Sync,
@@ -582,6 +645,8 @@ pub fn dbscan_generic<
extra_filter_fun: Option<&FF>,
log_level: Option<utils::LogLevel>,
keep_unclustered: bool,
max_extension_distances: &[Float;N],
back_converter: Option<C2>,
) -> Vec<R> {
let show_progress = log_level.is_some();
let log_level = match log_level {
@@ -625,17 +690,46 @@ pub fn dbscan_generic<
extra_filter_fun,
converter,
show_progress,
max_extension_distances,
);
i_timer.stop(true);

aggregate_clusters(
let centroids = aggregate_clusters(
tot_clusters,
cluster_labels,
prefiltered_peaks,
def_aggregator,
&prefiltered_peaks,
&def_aggregator,
log_level,
keep_unclustered,
)
);

match back_converter {
Some(bc) => {

reassign_centroid(
centroids,
&tree,
bc,
&prefiltered_peaks,
&def_aggregator,
log_level,
max_extension_distances,
)
}
None => {
centroids
}
}
}

// https://github.com/rust-lang/rust/issues/35121
// The never type is not stable yet....
struct BypassDenseFrameBackConverter {}

impl NDPointConverter<frames::TimsPeak, 2> for BypassDenseFrameBackConverter {
fn convert(&self, _elem: &frames::TimsPeak) -> NDPoint<2> {
panic!("This should never be called")
}
}

struct DenseFrameConverter {
@@ -658,8 +752,10 @@ type FFTimsPeak = fn(&TimsPeak, &TimsPeak) -> bool;
// <FF: Send + Sync + Fn(&TimsPeak, &TimsPeak) -> bool>
pub fn dbscan_denseframes(
mut denseframe: frames::DenseFrame,
mz_scaling: &f64,
ims_scaling: &f32,
mz_scaling: f64,
max_mz_extension: f64,
ims_scaling: f32,
max_ims_extension: f32,
min_n: usize,
min_intensity: u64,
) -> frames::DenseFrame {
@@ -673,7 +769,7 @@ pub fn dbscan_denseframes(
let keep_vector = within_distance_apply(
&denseframe.raw_peaks,
&|peak| peak.mz,
mz_scaling,
&mz_scaling,
&|i_right, i_left| (i_right - i_left) >= min_n,
);

@@ -690,8 +786,8 @@
};

let converter = DenseFrameConverter {
mz_scaling: *mz_scaling,
ims_scaling: *ims_scaling,
mz_scaling,
ims_scaling,
};
let peak_vec: Vec<TimsPeak> = dbscan_generic(
converter,
Expand All @@ -702,6 +798,8 @@ pub fn dbscan_denseframes(
None::<&FFTimsPeak>,
None,
true,
&[max_mz_extension as Float, max_ims_extension as Float],
None::<BypassDenseFrameBackConverter>,
);

frames::DenseFrame {