Add new nvtext tokenized minhash API (#17944)
Creates a new minhash API that works on ngrams of row elements given a list column of strings.

```
std::unique_ptr<cudf::column> minhash_ngrams(
  cudf::lists_column_view const& input,
  cudf::size_type ngrams,
  uint32_t seed,
  cudf::device_span<uint32_t const> parameter_a,
  cudf::device_span<uint32_t const> parameter_b,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);
```

The input column is expected to contain rows of words (strings). Each row is hashed using a sliding window of words (ngrams), and the existing permuted minhash algorithm is then reused to produce the minhash values.
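The per-row computation described here can be sketched on the host as follows. This is a minimal illustration, not the cudf implementation: `toy_hash` is a hypothetical FNV-1a stand-in for MurmurHash3_x86_32, `minhash_ngrams_row` is a made-up name, and both joining the words of a window by concatenation and leaving rows shorter than `ngrams` at the maximum value are assumptions that the commit message does not confirm.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Stand-in 32-bit hash (FNV-1a mixed with a seed).
// ASSUMPTION: the real API uses MurmurHash3_x86_32; this keeps the sketch self-contained.
inline std::uint32_t toy_hash(std::string const& s, std::uint32_t seed)
{
  std::uint32_t h = 2166136261u ^ seed;
  for (unsigned char c : s) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

// Hypothetical host-side reference for a single row of words.
// Returns one minhash value per (a[i], b[i]) parameter pair.
inline std::vector<std::uint32_t> minhash_ngrams_row(std::vector<std::string> const& words,
                                                     std::size_t ngrams,
                                                     std::uint32_t seed,
                                                     std::vector<std::uint32_t> const& a,
                                                     std::vector<std::uint32_t> const& b)
{
  constexpr std::uint64_t mp       = (1ULL << 61) - 1;  // Mersenne prime from the doc comment
  constexpr std::uint64_t max_hash = std::numeric_limits<std::uint32_t>::max();

  std::vector<std::uint32_t> mh(a.size(), std::numeric_limits<std::uint32_t>::max());

  // ASSUMPTION: rows with fewer than `ngrams` words produce no windows here.
  for (std::size_t start = 0; ngrams > 0 && start + ngrams <= words.size(); ++start) {
    // ASSUMPTION: the words of one window are joined by plain concatenation.
    std::string window;
    for (std::size_t k = 0; k < ngrams; ++k) { window += words[start + k]; }

    std::uint64_t const hv = toy_hash(window, seed);
    for (std::size_t i = 0; i < a.size(); ++i) {
      // hv, a[i], and b[i] are all below 2^32, so this cannot overflow 64 bits.
      std::uint64_t const pv = ((hv * a[i] + b[i]) % mp) & max_hash;
      mh[i] = std::min(mh[i], static_cast<std::uint32_t>(pv));
    }
  }
  return mh;
}
```

With `a = {1}` and `b = {0}` the permutation is the identity, so the single output value is simply the smallest window hash in the row.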

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Ayush Dattagupta (https://github.com/ayushdg)
  - Matthew Murray (https://github.com/Matt711)
  - Yunsong Wang (https://github.com/PointKernel)
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Shruti Shivakumar (https://github.com/shrshi)

URL: #17944
davidwendt authored Feb 27, 2025
1 parent 08ea13a commit 4fda491
Showing 10 changed files with 911 additions and 92 deletions.
94 changes: 94 additions & 0 deletions cpp/include/nvtext/minhash.hpp
@@ -125,5 +125,99 @@ std::unique_ptr<cudf::column> minhash64(
  rmm::cuda_stream_view stream      = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/**
* @brief Returns the minhash values for each input row
*
* This function uses MurmurHash3_x86_32 for the hash algorithm.
*
* The input row is first hashed using the given `seed` over a sliding window
* of `ngrams` of strings. These hash values are then combined with the `a`
* and `b` parameter values using the following formula:
* ```
* max_hash = max of uint32
* mp = (1 << 61) - 1
* hv[i] = hash value of the ngram at i
* pv[i] = ((hv[i] * a[i] + b[i]) % mp) & max_hash
* ```
*
* This calculation is performed on each set of ngrams and the minimum value
* is computed as follows:
* ```
* mh[j,i] = min(pv[i]) for all ngrams in row j
* and where i=[0,a.size())
* ```
*
* Any null row entries result in corresponding null output rows.
*
* @throw std::invalid_argument if the ngrams < 2
* @throw std::invalid_argument if parameter_a is empty
* @throw std::invalid_argument if `parameter_b.size() != parameter_a.size()`
* @throw std::overflow_error if `parameter_a.size() * input.size()` exceeds the column size limit
*
* @param input List strings column to compute minhash
* @param ngrams The number of strings to hash within each row
* @param seed Seed value used for the hash algorithm
* @param parameter_a Values used for the permuted calculation
* @param parameter_b Values used for the permuted calculation
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Device memory resource used to allocate the returned column's device memory
* @return List column of minhash values for each string per seed
*/
std::unique_ptr<cudf::column> minhash_ngrams(
  cudf::lists_column_view const& input,
  cudf::size_type ngrams,
  uint32_t seed,
  cudf::device_span<uint32_t const> parameter_a,
  cudf::device_span<uint32_t const> parameter_b,
  rmm::cuda_stream_view stream      = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/**
* @brief Returns the minhash values for each input row
*
* This function uses MurmurHash3_x64_128 for the hash algorithm.
*
* The input row is first hashed using the given `seed` over a sliding window
* of `ngrams` of strings. These hash values are then combined with the `a`
* and `b` parameter values using the following formula:
* ```
* max_hash = max of uint64
* mp = (1 << 61) - 1
* hv[i] = hash value of the ngram at i
* pv[i] = ((hv[i] * a[i] + b[i]) % mp) & max_hash
* ```
*
* This calculation is performed on each set of ngrams and the minimum value
* is computed as follows:
* ```
* mh[j,i] = min(pv[i]) for all ngrams in row j
* and where i=[0,a.size())
* ```
*
* Any null row entries result in corresponding null output rows.
*
* @throw std::invalid_argument if the ngrams < 2
* @throw std::invalid_argument if parameter_a is empty
* @throw std::invalid_argument if `parameter_b.size() != parameter_a.size()`
* @throw std::overflow_error if `parameter_a.size() * input.size()` exceeds the column size limit
*
* @param input List strings column to compute minhash
* @param ngrams The number of strings to hash within each row
* @param seed Seed value used for the hash algorithm
* @param parameter_a Values used for the permuted calculation
* @param parameter_b Values used for the permuted calculation
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Device memory resource used to allocate the returned column's device memory
* @return List column of minhash values for each string per seed
*/
std::unique_ptr<cudf::column> minhash64_ngrams(
  cudf::lists_column_view const& input,
  cudf::size_type ngrams,
  uint64_t seed,
  cudf::device_span<uint64_t const> parameter_a,
  cudf::device_span<uint64_t const> parameter_b,
  rmm::cuda_stream_view stream      = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/** @} */ // end of group
} // namespace CUDF_EXPORT nvtext
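
The `@throw` conditions documented for both new functions can be summarized in a small standalone check. This is only a sketch: `check_minhash_ngrams_args` is a hypothetical helper, not part of nvtext, and the size limit used below (the maximum of a 32-bit signed integer, matching `cudf::size_type`) is an assumption about what "column size limit" means.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper mirroring the documented preconditions of
// minhash_ngrams / minhash64_ngrams. Not part of the nvtext API.
inline void check_minhash_ngrams_args(std::int64_t ngrams,
                                      std::size_t parameter_a_size,
                                      std::size_t parameter_b_size,
                                      std::size_t num_rows)
{
  if (ngrams < 2) { throw std::invalid_argument("ngrams must be at least 2"); }
  if (parameter_a_size == 0) { throw std::invalid_argument("parameter_a cannot be empty"); }
  if (parameter_b_size != parameter_a_size) {
    throw std::invalid_argument("parameter_b.size() must equal parameter_a.size()");
  }

  // ASSUMPTION: the column size limit is the largest value of a 32-bit signed integer.
  constexpr std::size_t limit =
    static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());

  // Division form avoids overflowing the product parameter_a_size * num_rows.
  if (num_rows != 0 && parameter_a_size > limit / num_rows) {
    throw std::overflow_error("parameter_a.size() * input.size() exceeds the column size limit");
  }
}
```

The overflow check matters because the output holds one minhash value per parameter per input row, so the child column grows as the product of the two sizes.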