Optimization of subsetByColData method #334
Closed
Hello, I may have found a possible optimization of the subsetByColData method:
Description
In a current project, I am benchmarking the performance of the QFeatures class, which inherits from the MultiAssayExperiment class. During my benchmarks, I discovered that the runtime to subset a QFeatures object by columns (colData) grows much faster than linearly (roughly quadratically) with the total number of columns.
The subsetting of QFeatures directly calls the subsetByColData method from MultiAssayExperiment. Upon analyzing the code of this method, I found that it performs unnecessary iterations.
Problem
For each experiment in the experimentList, the current implementation iterates over all the column names that are kept across all experiments. This is unnecessary: it is sufficient to check only the kept column names that are present in the current experiment.
Solution
A simple fix is to iterate over the intersection between the kept columns and the columns of the current experiment. This change reduces the complexity of the operation from O(k²) to O(k), where k is the number of experiments.
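As a rough illustration of the idea (this is not the diff from this pull request, and it does not use the MultiAssayExperiment internals), the sketch below reduces each experiment to a plain character vector of column names. The names `experiment_cols`, `kept`, `select_all_kept` and `select_intersection` are hypothetical and exist only for this example. It checks that both strategies select the same columns and counts how many names each one visits.

```r
# Toy stand-in for an experiment list: each experiment is reduced to its
# column names (hypothetical data, not taken from the pull request).
experiment_cols <- list(
    assay1 = c("a1", "a2", "a3"),
    assay2 = c("b1", "b2", "b3"),
    assay3 = c("c1", "c2", "c3")
)

# Column names kept after filtering, pooled over all experiments.
kept <- c("a1", "a3", "b2", "c1", "c2")

# Before: every experiment loops over every kept name.
select_all_kept <- function(cols, kept) {
    visited <- 0L
    selected <- character(0)
    for (nm in kept) {
        visited <- visited + 1L
        if (nm %in% cols) selected <- c(selected, nm)
    }
    list(selected = selected, visited = visited)
}

# After: every experiment only handles the kept names it actually contains.
select_intersection <- function(cols, kept) {
    candidates <- intersect(kept, cols)
    list(selected = candidates, visited = length(candidates))
}

before <- lapply(experiment_cols, select_all_kept, kept = kept)
after <- lapply(experiment_cols, select_intersection, kept = kept)

# Both strategies select the same columns for each experiment.
stopifnot(identical(
    lapply(before, `[[`, "selected"),
    lapply(after, `[[`, "selected")
))

# Total names visited: 3 experiments x 5 kept names versus 5 kept names.
total_before <- sum(vapply(before, `[[`, integer(1), "visited"))
total_after <- sum(vapply(after, `[[`, integer(1), "visited"))
```

In this toy case the first strategy visits 15 names and the second visits 5. With a fixed number of columns per experiment, the first count grows with the square of the number of experiments and the second grows linearly.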
All unit tests passed during the local check.
Runtime analysis
Below are runtime plots that illustrate the improvement with the proposed fix. The updated implementation significantly reduces runtime, particularly for larger datasets. I also compared the bracket operator (bracket) with the subsetByColData method (subset).
I used a QFeatures object because it was easier for me. It is a MultiAssayExperiment object with x assays, each containing 18 columns.
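The layout described above (x assays with 18 columns each) can be mimicked in base R to get a feel for how the two scanning strategies scale. This is only a sketch on plain character vectors, not the benchmark code used for the plots, and it does not build real QFeatures or MultiAssayExperiment objects. The helpers `make_assay_cols`, `scan_all_kept` and `scan_intersection` are hypothetical names, and the assay counts are arbitrary.

```r
# Build hypothetical column names for x assays of n_cols columns each.
make_assay_cols <- function(x, n_cols = 18L) {
    stats::setNames(
        lapply(seq_len(x), function(i) paste0("assay", i, "_", seq_len(n_cols))),
        paste0("assay", seq_len(x))
    )
}

# Pooled scan: each assay tests every kept name, one name at a time.
scan_all_kept <- function(assay_cols, kept) {
    lapply(assay_cols, function(cols) {
        Filter(function(nm) nm %in% cols, kept)
    })
}

# Intersection: each assay only retrieves the kept names it contains.
scan_intersection <- function(assay_cols, kept) {
    lapply(assay_cols, function(cols) intersect(kept, cols))
}

for (x in c(20L, 40L, 80L)) {
    assay_cols <- make_assay_cols(x)
    # Worst case for the pooled scan: every column of every assay is kept.
    kept <- unlist(assay_cols, use.names = FALSE)
    t_all <- system.time(res_all <- scan_all_kept(assay_cols, kept))[["elapsed"]]
    t_int <- system.time(res_int <- scan_intersection(assay_cols, kept))[["elapsed"]]
    # Both strategies must return the same selection.
    stopifnot(identical(res_all, res_int))
    cat(sprintf("x = %d: pooled scan %.3fs, intersection %.3fs\n", x, t_all, t_int))
}
```

Timings from this toy setup depend on the machine and are not comparable to the QFeatures measurements. It is only meant to show the shape of the comparison.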
I also compared the memory allocation between the two implementations.
![subsetMem](https://private-user-images.githubusercontent.com/104364239/397335396-30a3ce7e-06d0-449e-8105-4c19b33a78b9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk2MTE3NDcsIm5iZiI6MTczOTYxMTQ0NywicGF0aCI6Ii8xMDQzNjQyMzkvMzk3MzM1Mzk2LTMwYTNjZTdlLTA2ZDAtNDQ5ZS04MTA1LTRjMTliMzNhNzhiOS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjE1JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxNVQwOTI0MDdaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT04OTdlNDIyYTFjMDg2Nzg2NTQ0MzRiZmE1ZTA5ZGE4OTg3NDI4MjUxMzRiM2JkZTI5MzJhOWQ0NjhjMzcxMzMzJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.kDpJq4QIVOabhZvpiuSI5hSg0jLTuY3lTaX79NMUrY4)
Here is the code used to make the runtime analysis:
@lgatto
Best,
Léopold