Renamed classes and functions for more consistency, less redundancy. (#21)

This removes the "hdf5" component in the class and function names. It should be
obvious that we're dealing with HDF5 given that we're in the tatami_hdf5
namespace, so I don't see the need to repeat it. Also clarified that the sparse
readers/writers are operating on compressed sparse matrices.

The various *Options and *Parameters classes now follow the format of
"<class/function name>Options". We use different Options classes for dense and
sparse matrix constructors to future-proof for class-specific parameters. 
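
A minimal sketch of this naming pattern, using stand-in classes in a hypothetical `sketch` namespace rather than the real tatami_hdf5 headers. The field names and defaults are copied from the diff; the class body is a placeholder assumption and opens no HDF5 file.

```cpp
#include <cstddef>
#include <string>
#include <utility>

namespace sketch {

// Stand-in for the dense options; fields and defaults mirror the diff.
struct DenseMatrixOptions {
    std::size_t maximum_cache_size = 100000000;
    bool require_minimum_cache = true;
};

// Stand-in for the sparse options. It is a separate type even though it
// currently overlaps with the dense one, so each can grow its own fields.
struct CompressedSparseMatrixOptions {
    std::size_t maximum_cache_size = 100000000;
};

// Hypothetical placeholder class: it only records its arguments.
class DenseMatrix {
public:
    DenseMatrix(std::string file, std::string name, bool transpose, const DenseMatrixOptions& options) :
        file_name(std::move(file)),
        dataset_name(std::move(name)),
        transposed(transpose),
        cache_size(options.maximum_cache_size) {}

    // Overload that delegates to the full constructor with default options.
    DenseMatrix(std::string file, std::string name, bool transpose) :
        DenseMatrix(std::move(file), std::move(name), transpose, DenseMatrixOptions()) {}

    std::size_t cache() const { return cache_size; }

private:
    std::string file_name, dataset_name;
    bool transposed;
    std::size_t cache_size;
};

} // namespace sketch
```

With separate types, a caller passing `CompressedSparseMatrixOptions` to the dense constructor fails at compile time, which is the future-proofing this paragraph refers to.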

The nested enums have been moved out of the WriteSparseMatrixParameters class,
mostly for easier writing by callers (the fully namespaced name is pretty long)
but also to potentially allow re-use in a future dense matrix writer.
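
The effect on callers can be sketched as follows. Every name in this block (`ParametersBefore`, `StorageLayout`, `SparseWriterOptionsAfter`, `DenseWriterOptionsFuture`) is hypothetical and chosen for illustration only; none is taken from the library.

```cpp
namespace sketch {

// Before: the enum is nested, so callers must spell out the owning class.
struct ParametersBefore {
    enum class StorageLayout { AUTOMATIC, COLUMN, ROW };
    StorageLayout layout = StorageLayout::AUTOMATIC;
};

// After: the enum sits at namespace level and is shorter to write.
enum class StorageLayout { AUTOMATIC, COLUMN, ROW };

struct SparseWriterOptionsAfter {
    StorageLayout layout = StorageLayout::AUTOMATIC;
};

// A hypothetical future dense writer could reuse the same enum.
struct DenseWriterOptionsFuture {
    StorageLayout layout = StorageLayout::AUTOMATIC;
};

} // namespace sketch
```

A caller then writes `sketch::StorageLayout::ROW` instead of `sketch::ParametersBefore::StorageLayout::ROW`.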

Also removed the Stored class in favor of tatami::ElementType.
LTLA authored May 15, 2024
1 parent 70ef9c5 commit 8ee376b
Showing 17 changed files with 538 additions and 516 deletions.
3 changes: 2 additions & 1 deletion docs/Doxyfile
@@ -873,7 +873,8 @@ RECURSIVE = YES
# Note that relative paths are relative to the directory from which doxygen is
# run.

EXCLUDE =
EXCLUDE = ../include/tatami_hdf5/sparse_primary.hpp \
../include/tatami_hdf5/sparse_secondary.hpp

# The EXCLUDE_SYMLINKS tag can be used to select whether or not files or
# directories that are symbolic links (a Unix file system feature) are excluded
@@ -5,9 +5,7 @@

#include <string>
#include <vector>
#include <type_traits>
#include <algorithm>
#include <cmath>

#include "tatami/tatami.hpp"

@@ -17,21 +15,38 @@
#include "utils.hpp"

/**
* @file Hdf5CompressedSparseMatrix.hpp
* @file CompressedSparseMatrix.hpp
*
* @brief Defines a class for a HDF5-backed compressed sparse matrix.
*/

namespace tatami_hdf5 {

/**
* @brief Options for HDF5 extraction.
*/
struct CompressedSparseMatrixOptions {
/**
* Size of the in-memory cache in bytes.
*
* We cache all chunks required to read a row/column in `tatami::MyopicDenseExtractor::fetch()` and related methods.
* This allows us to re-use the cached chunks when adjacent rows/columns are requested, rather than re-reading them from disk.
*
* Larger caches improve access speed at the cost of memory usage.
* Small values may be ignored as `CompressedSparseMatrix` will always allocate enough to cache a single element of the target dimension.
*/
size_t maximum_cache_size = 100000000;
};

/**
* @brief Compressed sparse matrix in a HDF5 file.
*
* This class retrieves sparse data from the HDF5 file on demand rather than loading it all in at the start.
* This allows us to handle very large datasets in limited memory at the cost of speed.
*
* We manually handle the chunk caching to speed up access for consecutive rows or columns (for compressed sparse row and column matrices, respectively).
* The policy is to minimize the number of calls to the HDF5 library - and thus expensive file reads - by requesting large contiguous slices where possible, i.e., multiple columns or rows for CSC and CSR matrices, respectively.
* The policy is to minimize the number of calls to the HDF5 library - and thus expensive file reads - by requesting large contiguous slices where possible,
* i.e., multiple columns or rows for CSC and CSR matrices, respectively.
* These are held in memory in the `Extractor` while the relevant column/row is returned to the user by `row()` or `column()`.
* The size of the slice is determined by the `options` in the constructor.
*
@@ -53,7 +68,7 @@ namespace tatami_hdf5 {
* if a smaller type is known to be able to store all indices (based on their HDF5 type or other knowledge).
*/
template<typename Value_, typename Index_, typename CachedValue_ = Value_, typename CachedIndex_ = Index_>
class Hdf5CompressedSparseMatrix : public tatami::Matrix<Value_, Index_> {
class CompressedSparseMatrix : public tatami::Matrix<Value_, Index_> {
Index_ nrows, ncols;
std::string file_name;
std::string data_name, index_name;
@@ -76,10 +91,11 @@ class Hdf5CompressedSparseMatrix : public tatami::Matrix<Value_, Index_> {
* If `row = true`, this should contain column indices sorted within each row, otherwise it should contain row indices sorted within each column.
* @param ptr Name of the 1D dataset inside `file` containing the index pointers for the start and end of each row (if `row = true`) or column (otherwise).
* This should have length equal to the number of rows (if `row = true`) or columns (otherwise).
* @param row Whether the matrix is stored in compressed sparse row format.
* @param row Whether the matrix is stored on disk in compressed sparse row format.
* If false, the matrix is assumed to be stored in compressed sparse column format.
* @param options Further options.
*/
Hdf5CompressedSparseMatrix(Index_ nr, Index_ nc, std::string file, std::string vals, std::string idx, std::string ptr, bool row, const Hdf5Options& options) :
CompressedSparseMatrix(Index_ nr, Index_ nc, std::string file, std::string vals, std::string idx, std::string ptr, bool row, const CompressedSparseMatrixOptions& options) :
nrows(nr),
ncols(nc),
file_name(std::move(file)),
@@ -164,7 +180,7 @@ class Hdf5CompressedSparseMatrix : public tatami::Matrix<Value_, Index_> {
}

/**
* Overload that uses the defaults for `Hdf5Options`.
* Overload that uses the default `CompressedSparseMatrixOptions`.
* @param nr Number of rows in the matrix.
* @param nc Number of columns in the matrix.
* @param file Path to the file.
@@ -175,8 +191,8 @@ class Hdf5CompressedSparseMatrix : public tatami::Matrix<Value_, Index_> {
* This should have length equal to the number of rows (if `row = true`) or columns (otherwise).
* @param row Whether the matrix is stored in compressed sparse row format.
*/
Hdf5CompressedSparseMatrix(Index_ nr, Index_ nc, std::string file, std::string vals, std::string idx, std::string ptr, bool row) :
Hdf5CompressedSparseMatrix(nr, nc, std::move(file), std::move(vals), std::move(idx), std::move(ptr), row, Hdf5Options()) {}
CompressedSparseMatrix(Index_ nr, Index_ nc, std::string file, std::string vals, std::string idx, std::string ptr, bool row) :
CompressedSparseMatrix(nr, nc, std::move(file), std::move(vals), std::move(idx), std::move(ptr), row, CompressedSparseMatrixOptions()) {}

public:
Index_ nrow() const {
@@ -219,8 +235,8 @@ class Hdf5CompressedSparseMatrix : public tatami::Matrix<Value_, Index_> {
************ Myopic dense ************
**************************************/
private:
Hdf5CompressedSparseMatrix_internal::MatrixDetails<Index_> details() const {
return Hdf5CompressedSparseMatrix_internal::MatrixDetails<Index_>(
CompressedSparseMatrix_internal::MatrixDetails<Index_> details() const {
return CompressedSparseMatrix_internal::MatrixDetails<Index_>(
file_name,
data_name,
index_name,
@@ -236,11 +252,11 @@ class Hdf5CompressedSparseMatrix : public tatami::Matrix<Value_, Index_> {
template<bool oracle_>
std::unique_ptr<tatami::DenseExtractor<oracle_, Value_, Index_> > populate_dense(bool row, tatami::MaybeOracle<oracle_, Index_> oracle, const tatami::Options&) const {
if (row == csr) {
return std::make_unique<Hdf5CompressedSparseMatrix_internal::PrimaryFullDense<oracle_, Value_, Index_, CachedValue_, CachedIndex_> >(
return std::make_unique<CompressedSparseMatrix_internal::PrimaryFullDense<oracle_, Value_, Index_, CachedValue_, CachedIndex_> >(
details(), std::move(oracle)
);
} else {
return std::make_unique<Hdf5CompressedSparseMatrix_internal::SecondaryFullDense<oracle_, Value_, Index_, CachedValue_> >(
return std::make_unique<CompressedSparseMatrix_internal::SecondaryFullDense<oracle_, Value_, Index_, CachedValue_> >(
details(), std::move(oracle)
);
}
@@ -249,11 +265,11 @@ class Hdf5CompressedSparseMatrix : public tatami::Matrix<Value_, Index_> {
template<bool oracle_>
std::unique_ptr<tatami::DenseExtractor<oracle_, Value_, Index_> > populate_dense(bool row, tatami::MaybeOracle<oracle_, Index_> oracle, Index_ block_start, Index_ block_length, const tatami::Options&) const {
if (row == csr) {
return std::make_unique<Hdf5CompressedSparseMatrix_internal::PrimaryBlockDense<oracle_, Value_, Index_, CachedValue_, CachedIndex_> >(
return std::make_unique<CompressedSparseMatrix_internal::PrimaryBlockDense<oracle_, Value_, Index_, CachedValue_, CachedIndex_> >(
details(), std::move(oracle), block_start, block_length
);
} else {
return std::make_unique<Hdf5CompressedSparseMatrix_internal::SecondaryBlockDense<oracle_, Value_, Index_, CachedValue_> >(
return std::make_unique<CompressedSparseMatrix_internal::SecondaryBlockDense<oracle_, Value_, Index_, CachedValue_> >(
details(), std::move(oracle), block_start, block_length
);
}
@@ -262,11 +278,11 @@ class Hdf5CompressedSparseMatrix : public tatami::Matrix<Value_, Index_> {
template<bool oracle_>
std::unique_ptr<tatami::DenseExtractor<oracle_, Value_, Index_> > populate_dense(bool row, tatami::MaybeOracle<oracle_, Index_> oracle, tatami::VectorPtr<Index_> indices_ptr, const tatami::Options&) const {
if (row == csr) {
return std::make_unique<Hdf5CompressedSparseMatrix_internal::PrimaryIndexDense<oracle_, Value_, Index_, CachedValue_, CachedIndex_> >(
return std::make_unique<CompressedSparseMatrix_internal::PrimaryIndexDense<oracle_, Value_, Index_, CachedValue_, CachedIndex_> >(
details(), std::move(oracle), std::move(indices_ptr)
);
} else {
return std::make_unique<Hdf5CompressedSparseMatrix_internal::SecondaryIndexDense<oracle_, Value_, Index_, CachedValue_> >(
return std::make_unique<CompressedSparseMatrix_internal::SecondaryIndexDense<oracle_, Value_, Index_, CachedValue_> >(
details(), std::move(oracle), std::move(indices_ptr)
);
}
@@ -292,11 +308,11 @@ class Hdf5CompressedSparseMatrix : public tatami::Matrix<Value_, Index_> {
template<bool oracle_>
std::unique_ptr<tatami::SparseExtractor<oracle_, Value_, Index_> > populate_sparse(bool row, tatami::MaybeOracle<oracle_, Index_> oracle, const tatami::Options& opt) const {
if (row == csr) {
return std::make_unique<Hdf5CompressedSparseMatrix_internal::PrimaryFullSparse<oracle_, Value_, Index_, CachedValue_, CachedIndex_> >(
return std::make_unique<CompressedSparseMatrix_internal::PrimaryFullSparse<oracle_, Value_, Index_, CachedValue_, CachedIndex_> >(
details(), std::move(oracle), opt.sparse_extract_value, opt.sparse_extract_index
);
} else {
return std::make_unique<Hdf5CompressedSparseMatrix_internal::SecondaryFullSparse<oracle_, Value_, Index_, CachedValue_> >(
return std::make_unique<CompressedSparseMatrix_internal::SecondaryFullSparse<oracle_, Value_, Index_, CachedValue_> >(
details(), std::move(oracle), opt.sparse_extract_value, opt.sparse_extract_index
);
}
@@ -305,11 +321,11 @@ class Hdf5CompressedSparseMatrix : public tatami::Matrix<Value_, Index_> {
template<bool oracle_>
std::unique_ptr<tatami::SparseExtractor<oracle_, Value_, Index_> > populate_sparse(bool row, tatami::MaybeOracle<oracle_, Index_> oracle, Index_ block_start, Index_ block_length, const tatami::Options& opt) const {
if (row == csr) {
return std::make_unique<Hdf5CompressedSparseMatrix_internal::PrimaryBlockSparse<oracle_, Value_, Index_, CachedValue_, CachedIndex_> >(
return std::make_unique<CompressedSparseMatrix_internal::PrimaryBlockSparse<oracle_, Value_, Index_, CachedValue_, CachedIndex_> >(
details(), std::move(oracle), block_start, block_length, opt.sparse_extract_value, opt.sparse_extract_index
);
} else {
return std::make_unique<Hdf5CompressedSparseMatrix_internal::SecondaryBlockSparse<oracle_, Value_, Index_, CachedValue_> >(
return std::make_unique<CompressedSparseMatrix_internal::SecondaryBlockSparse<oracle_, Value_, Index_, CachedValue_> >(
details(), std::move(oracle), block_start, block_length, opt.sparse_extract_value, opt.sparse_extract_index
);
}
@@ -318,11 +334,11 @@ class Hdf5CompressedSparseMatrix : public tatami::Matrix<Value_, Index_> {
template<bool oracle_>
std::unique_ptr<tatami::SparseExtractor<oracle_, Value_, Index_> > populate_sparse(bool row, tatami::MaybeOracle<oracle_, Index_> oracle, tatami::VectorPtr<Index_> indices_ptr, const tatami::Options& opt) const {
if (row == csr) {
return std::make_unique<Hdf5CompressedSparseMatrix_internal::PrimaryIndexSparse<oracle_, Value_, Index_, CachedValue_, CachedIndex_> >(
return std::make_unique<CompressedSparseMatrix_internal::PrimaryIndexSparse<oracle_, Value_, Index_, CachedValue_, CachedIndex_> >(
details(), std::move(oracle), std::move(indices_ptr), opt.sparse_extract_value, opt.sparse_extract_index
);
} else {
return std::make_unique<Hdf5CompressedSparseMatrix_internal::SecondaryIndexSparse<oracle_, Value_, Index_, CachedValue_> >(
return std::make_unique<CompressedSparseMatrix_internal::SecondaryIndexSparse<oracle_, Value_, Index_, CachedValue_> >(
details(), std::move(oracle), std::move(indices_ptr), opt.sparse_extract_value, opt.sparse_extract_index
);
}
@@ -4,28 +4,49 @@
#include "H5Cpp.h"

#include <string>
#include <cstdint>
#include <type_traits>
#include <cmath>
#include <list>
#include <vector>

#include "serialize.hpp"
#include "utils.hpp"
#include "tatami_chunked/tatami_chunked.hpp"

/**
* @file Hdf5DenseMatrix.hpp
* @file DenseMatrix.hpp
*
* @brief Defines a class for a HDF5-backed dense matrix.
*/

namespace tatami_hdf5 {

/**
* @brief Options for `DenseMatrix` extraction.
*/
struct DenseMatrixOptions {
/**
* Size of the in-memory cache in bytes.
*
* We cache all chunks required to read a row/column in `tatami::MyopicDenseExtractor::fetch()` and related methods.
* This allows us to re-use the cached chunks when adjacent rows/columns are requested, rather than re-reading them from disk.
*
* Larger caches improve access speed at the cost of memory usage.
* Small values may be ignored if `require_minimum_cache` is `true`.
*/
size_t maximum_cache_size = 100000000;

/**
* Whether to automatically enforce a minimum size for the cache, regardless of `maximum_cache_size`.
* This minimum is chosen to ensure that all chunks overlapping one row (or a slice/subset thereof) can be retained in memory,
* so that the same chunks are not repeatedly re-read from disk when iterating over consecutive rows/columns of the matrix.
*/
bool require_minimum_cache = true;
};

/**
* @cond
*/
namespace Hdf5DenseMatrix_internal {
namespace DenseMatrix_internal {

// All HDF5-related members.
struct Components {
@@ -509,7 +530,7 @@ struct Index : public DenseBase<by_h5_row_, solo_, oracle_, Index_, CachedValue_
* if a smaller type is known to be able to store the values (based on their HDF5 type or other knowledge).
*/
template<typename Value_, typename Index_, typename CachedValue_ = Value_>
class Hdf5DenseMatrix : public tatami::Matrix<Value_, Index_> {
class DenseMatrix : public tatami::Matrix<Value_, Index_> {
std::string file_name, dataset_name;
bool transpose;

@@ -524,9 +545,11 @@ class Hdf5DenseMatrix : public tatami::Matrix<Value_, Index_> {
* @param file Path to the file.
* @param name Path to the dataset inside the file.
* @param transpose Whether the dataset is transposed in its storage order, i.e., rows in the HDF5 dataset correspond to columns of this matrix.
* @param options Further options.
* This may be true for HDF5 files generated by frameworks that use column-major matrices,
* where preserving the data layout between memory and disk is more efficient (see, e.g., the **rhdf5** Bioconductor package).
* @param options Further options for data extraction.
*/
Hdf5DenseMatrix(std::string file, std::string name, bool transpose, const Hdf5Options& options) :
DenseMatrix(std::string file, std::string name, bool transpose, const DenseMatrixOptions& options) :
file_name(std::move(file)),
dataset_name(std::move(name)),
transpose(transpose),
@@ -561,12 +584,13 @@ class Hdf5DenseMatrix : public tatami::Matrix<Value_, Index_> {
}

/**
* Overload that uses the defaults for `Hdf5Options`.
* Overload that uses the default `DenseMatrixOptions`.
* @param file Path to the file.
* @param name Path to the dataset inside the file.
* @param transpose Whether the dataset is transposed in its storage order, i.e., rows in the HDF5 dataset correspond to columns of this matrix.
* @param transpose Whether the dataset is transposed in its storage order.
*/
Hdf5DenseMatrix(std::string file, std::string name, bool transpose) : Hdf5DenseMatrix(std::move(file), std::move(name), transpose, Hdf5Options()) {}
DenseMatrix(std::string file, std::string name, bool transpose) :
DenseMatrix(std::move(file), std::move(name), transpose, DenseMatrixOptions()) {}

private:
bool prefer_rows_internal() const {
@@ -677,16 +701,16 @@ class Hdf5DenseMatrix : public tatami::Matrix<Value_, Index_> {
public:
std::unique_ptr<tatami::MyopicDenseExtractor<Value_, Index_> > dense(bool row, const tatami::Options&) const {
Index_ full_non_target = (row ? ncol_internal() : nrow_internal());
return populate<false, Hdf5DenseMatrix_internal::Full>(row, full_non_target, false, full_non_target);
return populate<false, DenseMatrix_internal::Full>(row, full_non_target, false, full_non_target);
}

std::unique_ptr<tatami::MyopicDenseExtractor<Value_, Index_> > dense(bool row, Index_ block_start, Index_ block_length, const tatami::Options&) const {
return populate<false, Hdf5DenseMatrix_internal::Block>(row, block_length, false, block_start, block_length);
return populate<false, DenseMatrix_internal::Block>(row, block_length, false, block_start, block_length);
}

std::unique_ptr<tatami::MyopicDenseExtractor<Value_, Index_> > dense(bool row, tatami::VectorPtr<Index_> indices_ptr, const tatami::Options&) const {
auto nidx = indices_ptr->size();
return populate<false, Hdf5DenseMatrix_internal::Index>(row, nidx, false, std::move(indices_ptr));
return populate<false, DenseMatrix_internal::Index>(row, nidx, false, std::move(indices_ptr));
}

/*********************
@@ -717,7 +741,7 @@ class Hdf5DenseMatrix : public tatami::Matrix<Value_, Index_> {
const tatami::Options&)
const {
Index_ full_non_target = (row ? ncol_internal() : nrow_internal());
return populate<true, Hdf5DenseMatrix_internal::Full>(row, full_non_target, std::move(oracle), full_non_target);
return populate<true, DenseMatrix_internal::Full>(row, full_non_target, std::move(oracle), full_non_target);
}

std::unique_ptr<tatami::OracularDenseExtractor<Value_, Index_> > dense(
@@ -727,7 +751,7 @@ class Hdf5DenseMatrix : public tatami::Matrix<Value_, Index_> {
Index_ block_length,
const tatami::Options&)
const {
return populate<true, Hdf5DenseMatrix_internal::Block>(row, block_length, std::move(oracle), block_start, block_length);
return populate<true, DenseMatrix_internal::Block>(row, block_length, std::move(oracle), block_start, block_length);
}

std::unique_ptr<tatami::OracularDenseExtractor<Value_, Index_> > dense(
@@ -737,7 +761,7 @@ class Hdf5DenseMatrix : public tatami::Matrix<Value_, Index_> {
const tatami::Options&)
const {
auto nidx = indices_ptr->size();
return populate<true, Hdf5DenseMatrix_internal::Index>(row, nidx, std::move(oracle), std::move(indices_ptr));
return populate<true, DenseMatrix_internal::Index>(row, nidx, std::move(oracle), std::move(indices_ptr));
}

/***********************
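
As a reading aid for the `DenseMatrixOptions` documentation in the DenseMatrix.hpp hunk, the interaction between `maximum_cache_size` and `require_minimum_cache` might be sketched as below. The helpers `minimum_cache_bytes` and `effective_cache_bytes` are hypothetical and are not taken from the library. The sketch assumes the minimum is simply the byte size of all chunks overlapping one row, and the actual implementation may compute it differently.

```cpp
#include <algorithm>
#include <cstddef>

namespace sketch {

// Stand-in for the options struct; fields and defaults mirror the diff.
struct DenseMatrixOptions {
    std::size_t maximum_cache_size = 100000000;
    bool require_minimum_cache = true;
};

// Hypothetical helper: bytes needed to hold every chunk overlapping one row.
inline std::size_t minimum_cache_bytes(std::size_t chunks_per_row,
                                       std::size_t elements_per_chunk,
                                       std::size_t bytes_per_element) {
    return chunks_per_row * elements_per_chunk * bytes_per_element;
}

// Hypothetical helper: the cache size actually used. When the minimum is
// enforced, a too-small maximum_cache_size is raised to the minimum.
inline std::size_t effective_cache_bytes(const DenseMatrixOptions& options,
                                         std::size_t minimum_bytes) {
    if (options.require_minimum_cache) {
        return std::max(options.maximum_cache_size, minimum_bytes);
    }
    return options.maximum_cache_size;
}

} // namespace sketch
```

Under these assumptions, a 1000-byte `maximum_cache_size` against a 320000-byte minimum yields 320000 bytes when `require_minimum_cache` is `true`, and 1000 bytes when it is `false`.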