Reindex subset of vertices #4726

Merged on Jan 23, 2025 (1 commit)


@ntisseyre (Contributor) commented Nov 15, 2024

Summary

This PR optimizes the reindexing process in JanusGraph by allowing a subset of vertices to be reindexed instead of scanning the entire storage.
This improves performance mainly when the subset of vertices that needs indexing is already known.

NOTE

This feature is currently supported only for CQL storage. Support for other storage backends still needs to be implemented.

KeyColumnValueStore.java

KeyIterator getKeys(final List<StaticBuffer> keys, final SliceQuery query, final StoreTransaction txh) throws BackendException {
    throw new NotImplementedException();
}
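The fallback shown above can be sketched in isolation. The following is a minimal, dependency-free illustration, not JanusGraph code: `SketchStore`, `ScanOnlyStore`, `SubsetStore`, and `SubsetKeysDemo` are hypothetical names, keys are plain strings rather than `StaticBuffer`, the use of a `default` interface method is an assumption of the sketch, and `UnsupportedOperationException` stands in for `NotImplementedException`. It shows one backend that inherits the throwing method and one that overrides it to look up only the requested keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a key-column-value store; not the JanusGraph interface.
interface SketchStore {
    // Full scan over all keys; every backend in this sketch supports it.
    List<String> getKeys();

    // Subset lookup; backends that do not override it inherit the throwing default.
    default List<String> getKeys(List<String> keys) {
        // A JDK exception is used here so the sketch has no external dependencies.
        throw new UnsupportedOperationException("subset key lookup not implemented");
    }
}

// A backend without subset support: only the full scan is available.
final class ScanOnlyStore implements SketchStore {
    private final Map<String, String> rows;

    ScanOnlyStore(Map<String, String> rows) { this.rows = new TreeMap<>(rows); }

    @Override public List<String> getKeys() { return new ArrayList<>(rows.keySet()); }
}

// A backend with subset support: it touches only the requested keys.
final class SubsetStore implements SketchStore {
    private final Map<String, String> rows;

    SubsetStore(Map<String, String> rows) { this.rows = new TreeMap<>(rows); }

    @Override public List<String> getKeys() { return new ArrayList<>(rows.keySet()); }

    @Override public List<String> getKeys(List<String> keys) {
        List<String> found = new ArrayList<>();
        for (String key : keys) {
            if (rows.containsKey(key)) {   // unknown keys are skipped
                found.add(key);
            }
        }
        return found;
    }
}

final class SubsetKeysDemo {
    // Probes a store with an empty key list to see whether the subset lookup is implemented.
    static boolean supportsSubset(SketchStore store) {
        try {
            store.getKeys(List.of());
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> rows = Map.of("v1", "a", "v2", "b", "v3", "c");
        System.out.println(supportsSubset(new ScanOnlyStore(rows)));
        System.out.println(new SubsetStore(rows).getKeys(List.of("v3", "v9", "v1")));
    }
}
```

A caller in this sketch can check `supportsSubset` first and fall back to the full scan when the backend has not implemented the subset lookup.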

Motivation

Previously, reindexing required scanning all vertices in storage, which can be highly resource-intensive and time-consuming, particularly on large datasets.
This update lets users target a specific subset of vertices, reducing the time and computational load of reindexing. This is especially beneficial when only specific vertices are relevant to a given index or data update.

Changes

  • Added the ability to specify a subset of vertices to include in the reindexing process.
  • Optimized the indexing engine to skip unnecessary vertices, focusing only on those specified in the subset.

API in JanusGraphManagement

/**
 * Updates the provided index according to the given {@link SchemaAction} for
 * the given subset of vertices.
 *
 * @param index the index to update
 * @param updateAction the action to apply to the index
 * @param vertexOnly list of vertex IDs; only these vertices are considered for the index update
 * @return a future that completes when the index action is done
 */
ScanJobFuture updateIndex(Index index, SchemaAction updateAction, List<Object> vertexOnly);
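To illustrate the shape of this call, here is a small self-contained model, written under stated assumptions rather than against the real library: `SketchManagement` is a hypothetical class, a `CompletableFuture<Integer>` stands in for `ScanJobFuture`, the index and action parameters are omitted, and the "index" is just an in-memory map from a property value to vertex IDs. The no-argument overload models the existing full scan, and the `vertexOnly` overload visits only the listed IDs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of a management API; names echo the PR but this is not JanusGraph code.
final class SketchManagement {
    // vertex ID -> value of the indexed property (stands in for stored vertices)
    private final Map<Object, String> vertices = new LinkedHashMap<>();
    // indexed property value -> vertex IDs (stands in for the index being rebuilt)
    private final Map<String, Set<Object>> index = new ConcurrentHashMap<>();

    void addVertex(Object id, String name) { vertices.put(id, name); }

    // Existing behavior in this model: reindex by visiting every stored vertex.
    CompletableFuture<Integer> updateIndex() {
        return updateIndex(new ArrayList<>(vertices.keySet()));
    }

    // Subset variant: visit only the IDs in vertexOnly; the future yields the number visited.
    CompletableFuture<Integer> updateIndex(List<Object> vertexOnly) {
        return CompletableFuture.supplyAsync(() -> {
            int visited = 0;
            for (Object id : vertexOnly) {
                String name = vertices.get(id);
                if (name == null) {
                    continue;   // IDs that do not exist are skipped
                }
                index.computeIfAbsent(name, k -> ConcurrentHashMap.newKeySet()).add(id);
                visited++;
            }
            return visited;
        });
    }

    Set<Object> lookup(String name) { return index.getOrDefault(name, Set.of()); }

    public static void main(String[] args) {
        SketchManagement mgmt = new SketchManagement();
        mgmt.addVertex(1L, "alice");
        mgmt.addVertex(2L, "bob");
        mgmt.addVertex(3L, "carol");

        // Reindex only vertices 1 and 3, then wait for the future to complete.
        int visited = mgmt.updateIndex(List.of(1L, 3L)).join();
        System.out.println("visited " + visited + " of 3 vertices");
        System.out.println("bob indexed: " + !mgmt.lookup("bob").isEmpty());
    }
}
```

In this model the caller waits on the returned future before reading the index, which mirrors the Javadoc's statement that the future completes when the index action is done.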

Benefits

  • Improved Performance: By narrowing down the scope of vertices, the reindexing process is much faster and more efficient.
  • Resource Optimization: Reduces CPU and memory usage during reindexing by avoiding a full scan.
  • Enhanced Flexibility: This feature allows users to update specific sections of the graph more easily without impacting the entire dataset.

Backward Compatibility

This feature is backward compatible and does not impact existing functionality. Calls that do not specify a subset keep the previous behavior of scanning the entire storage.

@ntisseyre ntisseyre force-pushed the reindex_subset branch 2 times, most recently from e8b5d2d to b0bde5b Compare November 15, 2024 04:22
@ntisseyre ntisseyre force-pushed the reindex_subset branch 3 times, most recently from c949e47 to 7879876 Compare November 16, 2024 13:47
@porunov porunov added this to the 1.2.0 milestone Nov 21, 2024
@porunov (Member) left a comment

Thank you @ntisseyre !
Looks great! I have just two small comments.

Signed-off-by: ntisseyre <ntisseyre@apple.com>
@porunov (Member) left a comment

LGTM. Thank you @ntisseyre !

@ntisseyre ntisseyre merged commit 53a5332 into JanusGraph:master Jan 23, 2025
172 of 174 checks passed
2 participants