
Optimization of subsetByColData method #334

Conversation

leopoldguyot
Contributor

Hello, I may have found an optimization for the subsetByColData method:

Description

In a current project, I am benchmarking the performance of the QFeatures class, which inherits from the MultiAssayExperiment class. During my benchmarks, I discovered that the runtime of subsetting a QFeatures object by columns (colData) grows quadratically with the total number of columns.

The subsetting of QFeatures directly calls the subsetByColData method from MultiAssayExperiment. Upon analyzing the code of this method, I found that it performs unnecessary iterations.

Problem

For each experiment in the experimentList, the current implementation iterates over the kept column names from all the experiments. However, this is unnecessary: it is sufficient to check only the column names that are present within the current experiment.

Solution

A simple fix is to iterate over the intersection between the kept columns and the columns of the current experiment. This change reduces the complexity of the operation from O(k²) to O(k), where k is the number of experiments:

# Before: each experiment's map is matched against ALL kept column names
listMap <- lapply(listMap, function(elementMap, keepers) { # k iterations - Total = O(k²*n)
        .matchReorderSub(elementMap, keepers) # O(k*n)
    },
    keepers = rownames(newcoldata)
)

# After: each experiment's map is matched only against its own kept columns
listMap <- lapply(listMap, function(elementMap) { # k iterations - Total = O(k*n)
    .matchReorderSub(elementMap, # O(n)
                     intersect(rownames(newcoldata),
                               elementMap$primary))
})
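
To illustrate why intersecting first is safe, here is a toy example with made-up sample names (this mimics the sampleMap slice of a single experiment, not the actual .matchReorderSub internals):

# keepers: the kept colData rows across ALL experiments
keepers <- c("s1", "s2", "s3", "s4", "s5", "s6")
# elementMap: one experiment's slice of the sampleMap
elementMap <- data.frame(primary = c("s5", "s2"),
                         colname = c("exp1_s5", "exp1_s2"))

# intersect() keeps the order of its FIRST argument, so the keeper
# ordering that the reordering step relies on is preserved:
intersect(keepers, elementMap$primary)
#> [1] "s2" "s5"

Each experiment then only scans the (much shorter) intersected vector instead of the full keepers vector.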

All unit tests passed during a local check.
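
As an extra sanity check beyond the package test suite, a minimal consistency test could verify the invariants on the miniACC example data shipped with MultiAssayExperiment (a sketch; vital_status is just assumed here as a convenient column to filter on):

library(testthat)
library(MultiAssayExperiment)

test_that("subsetByColData keeps exactly the selected colData rows", {
    data(miniACC) # small example MAE shipped with the package
    keep <- miniACC$vital_status %in% 0 # logical filter, NA-safe
    sub <- subsetByColData(miniACC, keep)
    expect_identical(rownames(colData(sub)),
                     rownames(colData(miniACC))[keep])
    # every remaining sampleMap entry must point at a kept colData row
    expect_true(all(sampleMap(sub)$primary %in% rownames(colData(sub))))
})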

Runtime analysis

Below are runtime plots that illustrate the improvement with the proposed fix. The updated implementation significantly reduces runtime, particularly for larger datasets. I also compared the bracket operator (labelled bracket in the plots) with the subsetByColData method (labelled subset).

I used a QFeatures object, as it was easier for me; it is a MultiAssayExperiment object with x assays, each containing 18 columns.

[subsetTime plot: median runtime (s) vs. total number of columns]

I also compared the memory allocation between the two implementations.
[subsetMem plot: RAM used (MB) vs. total number of columns]
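
The benchmark below pulls a real dataset from scpdata; readers without access could reproduce an input of the same shape with a synthetic MultiAssayExperiment instead. Here is a sketch (the helper makeToyMAE and all names are illustrative; 18 columns per assay, matching the object described above):

library(MultiAssayExperiment)

makeToyMAE <- function(nAssays, nCols = 18, nRows = 10) {
    # one matrix per assay, with unique column names
    exps <- lapply(seq_len(nAssays), function(i) {
        matrix(rnorm(nRows * nCols), nrow = nRows,
               dimnames = list(paste0("feat", seq_len(nRows)),
                               paste0("a", i, "_s", seq_len(nCols))))
    })
    names(exps) <- paste0("assay", seq_len(nAssays))
    samples <- unlist(lapply(exps, colnames), use.names = FALSE)
    MultiAssayExperiment(
        experiments = ExperimentList(exps),
        colData = DataFrame(filterBench = rnorm(length(samples), mean = 1),
                            row.names = samples),
        sampleMap = DataFrame(assay = factor(rep(names(exps), each = nCols)),
                              primary = samples,
                              colname = samples)
    )
}

mae <- makeToyMAE(5)
subsetByColData(mae, mae$filterBench > 1)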

Here is the code used to make the runtime analysis:

library(callr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(plotly)

runPress <- function(git, nAssays, rep = 3L) {
    remotes::install_github(git, force = TRUE)
    library(bench)
    library(MultiAssayExperiment)
    library(QFeatures)
    library(scpdata)
    qfeatures <- leduc2022_pSCoPE() # qfeatures object retrieved with scpdata
    colData(qfeatures)$filterBench <- rnorm(nrow(colData(qfeatures)), mean = 1, sd = 1)
    bench::press(
        nAssays = nAssays,
        rep = seq_len(rep),
        {
            subqfeatures <- qfeatures[,, 1:nAssays] # subsetting to get different input size
            results <- bench::mark(
                bracket = subqfeatures[, subqfeatures$filterBench > 1,],
                subset = subsetByColData(subqfeatures, subqfeatures$filterBench > 1)
            )

            tibble::tibble(
                expression = as.character(results$expression),
                median_time = results$median,
                mem_alloc = results$mem_alloc
            )
        }
    )
}

new <- callr::r(
    func = runPress,
    args = list(git = "leopoldguyot/MultiAssayExperiment@subsetByColData_optimization",
                nAssays = c(5, 20, 50, 80, 120),
                rep = 3L)
)

old <- callr::r(
    func = runPress,
    args = list(git = "waldronlab/MultiAssayExperiment@325c85ca26582027bcdd643df673a06ae649a36b",
                nAssays = c(5, 20, 50, 80, 120),
                rep = 3L)
)

old <- old %>%
    mutate(version = "Old")
new <- new %>%
    mutate(version = "New")
combined <- bind_rows(old, new) %>%
    mutate(median_time = as.numeric(median_time), # in seconds
           mem_alloc = as.numeric(mem_alloc)) # in bytes

combined %>%
    mutate(
        nCol = nAssays * 18, # 18 columns per assay
        nAssays = as.factor(nAssays)) %>%
    ggplot(aes(x = nCol, y = median_time, color = version)) +
        xlab("Total number of columns") +
        ylab("Runtime (s)") +
        geom_point() +
        geom_smooth() +
        facet_wrap(~expression)

combined %>%
    mutate(
        nCol = nAssays * 18,
        nAssays = as.factor(nAssays),
        mem_alloc_MB = mem_alloc / (1024^2)) %>% # bytes to MB
    ggplot(aes(x = nCol, y = mem_alloc_MB, color = version)) +
    xlab("Total number of columns") +
    ylab("RAM used (MB)") +
    geom_point() +
    geom_smooth() +
    facet_wrap(~expression)

@lgatto

Best,
Léopold

@LiNk-NY
Collaborator

LiNk-NY commented Dec 19, 2024

Thank you! It was added at a293ab1

@LiNk-NY LiNk-NY closed this Dec 19, 2024
@leopoldguyot
Contributor Author

Perfect, thank you!

@leopoldguyot leopoldguyot deleted the subsetByColData_optimization branch December 20, 2024 09:25
@lwaldron
Member

What a pleasure to see a problem identified, benchmarked, patched, PRed, and merged! Thank you @leopoldguyot!
