Write metadata cache data to mappings _meta with refresh time update #805
…rch-project#744)
* write mock metadata cache data to mappings _meta
* Enable write to cache by default
* bugfix: _meta.latestId missing when create index
* set and unset config in test suite
* fix: use member flintSparkConf
Signed-off-by: Sean Kao <seankao@amazon.com>
Add label to backport to the nexus branch.
Force-pushed from 5f3af3b to 7a8e1f3
 * Handles refresh for refresh mode AUTO, which is used exclusively by auto refresh index with
 * internal scheduler.
 */
private def refreshIndexAuto(
Why don't we update for auto refresh?
For now we only track `lastRefreshStartTime` and `lastRefreshCompleteTime` for manual refresh and for auto refresh with the external scheduler. For a streaming job, we use `createTime` to track the streaming job start time. There's no mechanism for tracking the start/end time of each micro-batch update yet, so updating the two timestamps in the refresh could be misleading.
I'll add some comment
...ration/src/main/scala/org/opensearch/flint/spark/scheduler/util/IntervalSchedulerParser.java
Note to any reviewer, if curious: the force push only amended commit 2f58f56 and nothing else.
Why do we call it metadata `Cache`? I was not quite sure what `cache` indicates here.
I do welcome a better name... I was kind of struggling to come up with one. I'm not too convinced that MetadataCache is the best one.
Updated description with test results.
The backport to
To backport manually, run these commands in your terminal:
# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/opensearch-spark/backport-0.5-nexus 0.5-nexus
# Navigate to the new working tree
pushd ../.worktrees/opensearch-spark/backport-0.5-nexus
# Create a new branch
git switch --create backport/backport-805-to-0.5-nexus
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 a07f88f86fa384d94e535f99397e8d0d0402bba0
# Push it to GitHub
git push --set-upstream origin backport/backport-805-to-0.5-nexus
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/opensearch-spark/backport-0.5-nexus
Then, create a pull request where the
Backport blocked by: #836
…pensearch-project#805)
* [0.5-nexus] Write mock metadata cache data to mappings _meta (opensearch-project#744)
  * write mock metadata cache data to mappings _meta
  * Enable write to cache by default
  * bugfix: _meta.latestId missing when create index
  * set and unset config in test suite
  * fix: use member flintSparkConf
* default metadata cache write disabled
* remove string literal "external" in index builder
* track refreshInterval and lastRefreshTime
* add last refresh timestamps to metadata log entry
* update metadata cache test case: should pass
* move to spark package; get refresh interval
* parse refresh interval
* minor syntax fix on FlintSpark.createIndex
* strategize cache writer interface
* update refresh timestamps in FlintSpark
* add test cases
* IT test for refresh timestamp update
* add doc for spark conf
* change mock table name
* add IT test at FlintSpark level
* test with external scheduler
* refactor refreshIndex method; add test for modes
* fix typo
* fix failed test caused by refactoring
* rename method; add comment
Signed-off-by: Sean Kao <seankao@amazon.com>
(cherry picked from commit a07f88f)
…805) (#840)
* [0.5-nexus] Write mock metadata cache data to mappings _meta (#744)
  * write mock metadata cache data to mappings _meta
  * Enable write to cache by default
  * bugfix: _meta.latestId missing when create index
  * set and unset config in test suite
  * fix: use member flintSparkConf
* default metadata cache write disabled
* remove string literal "external" in index builder
* track refreshInterval and lastRefreshTime
* add last refresh timestamps to metadata log entry
* update metadata cache test case: should pass
* move to spark package; get refresh interval
* parse refresh interval
* minor syntax fix on FlintSpark.createIndex
* strategize cache writer interface
* update refresh timestamps in FlintSpark
* add test cases
* IT test for refresh timestamp update
* add doc for spark conf
* change mock table name
* add IT test at FlintSpark level
* test with external scheduler
* refactor refreshIndex method; add test for modes
* fix typo
* fix failed test caused by refactoring
* rename method; add comment
Signed-off-by: Sean Kao <seankao@amazon.com>
(cherry picked from commit a07f88f)
Description
Metadata Cache Writer
For the most part, same as
In addition to the regular metadata storage using `FlintIndexMetadataService`, we're dual-writing additional fields, defined by `FlintMetadataCache`, to the index mappings `_meta` field. This lets frontend users access some crucial index metadata quickly without invoking another backend API call.
This PR adds these fields for all indexes if the Spark config `spark.flint.metadataCacheWrite.enabled` is set to true.
- `_meta.properties.metadataCacheVersion`: "1.0"
- `_meta.properties.refreshInterval`: Integer. Refresh interval of an index, measured in seconds. This field is added only if the index refresh type is auto refresh and refresh_interval is set.
- `_meta.properties.sourceTables`: Array of Strings. For now, it's mocked data. An update is coming in a later PR.
- `_meta.properties.lastRefreshTime`: Long. Timestamp in milliseconds of the last refresh. This field is added only if the index has been refreshed at least once.
Last Refresh Time
Added two new fields, `lastRefreshStartTime` and `lastRefreshCompleteTime`, to `FlintMetadataLogEntry` and bumped the version of its JSON doc from 1.0 to 1.1 (because this adds new fields without changing existing ones).
These are accurate only for manual refresh (full, incremental) and for auto refresh with the external scheduler.
For the internal scheduler, `jobStartTime` (or `createTime` in `FlintMetadataLogEntry`) is used to track the streaming job start time.
I'm not reusing `createTime` because the fields should be updated at different times.
- `createTime` (for the internal scheduler) is updated during `refreshIndex`, `recoverIndex`, and `updateIndexManualToAuto`.
- `lastRefreshStartTime` and `lastRefreshCompleteTime` (for manual refresh and the external scheduler) are updated only in `refreshIndex`.
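To make the update rules above concrete, here is a minimal sketch of how the timestamps could be tracked per refresh mode. It is not the code in this PR: `LogEntrySketch`, `RefreshTimestampSketch`, the `now` clock parameter, and the use of `Option[Long]` for unset timestamps are all hypothetical, and only the three field names and the AUTO/FULL/INCREMENTAL mode names come from the description and the diff.

```scala
// Hypothetical, simplified model for illustration only; not the Flint classes.
object RefreshMode extends Enumeration {
  val AUTO, FULL, INCREMENTAL = Value
}

// Assumption: unset timestamps are modeled as None in this sketch.
final case class LogEntrySketch(
    createTime: Long,
    lastRefreshStartTime: Option[Long] = None,
    lastRefreshCompleteTime: Option[Long] = None)

object RefreshTimestampSketch {

  /** Runs `job` and records timestamps according to the refresh mode. */
  def refresh(entry: LogEntrySketch, mode: RefreshMode.Value, now: () => Long)(
      job: => Unit): LogEntrySketch = mode match {
    case RefreshMode.AUTO =>
      // Internal scheduler: a streaming job, so only the job start time is tracked.
      val started = entry.copy(createTime = now())
      job
      started
    case _ =>
      // Manual refresh (FULL, INCREMENTAL), assumed to also cover the external scheduler.
      val started = entry.copy(lastRefreshStartTime = Some(now()))
      job
      started.copy(lastRefreshCompleteTime = Some(now()))
  }
}
```

Passing the clock as a parameter is only a convenience for testing the sketch; there is still no per-micro-batch tracking in the AUTO branch, which matches the limitation discussed in the review thread.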
End-to-End Test
Tests performed in my test cluster:
Index (full refresh mode) created without `spark.flint.metadataCacheWrite.enabled`
Checking the `_meta.properties` field: no metadata cache fields are added.
Index (full refresh mode) created with `spark.flint.metadataCacheWrite.enabled` set to true
Checking the `_meta.properties` field:
Triggering a full refresh
Check that `lastRefreshStartTime` is updated when the index (full refresh mode) enters the refreshing state.
And `lastRefreshCompleteTime` is updated when the refresh is done.
`_meta.properties.lastRefreshTime` is added as well when `spark.flint.metadataCacheWrite.enabled` is set to true.
Test with auto refresh with external scheduler with cache write enabled
`_meta.properties.refreshInterval` is filled with a number (unit: seconds), as expected.
`lastRefreshStartTime` and `lastRefreshCompleteTime` are updated accordingly.
`lastRefreshTime` is also added to `_meta.properties`.
Test for auto refresh with internal scheduler
`jobStartTime` is updated for the streaming job, but `lastRefreshStartTime` and `lastRefreshCompleteTime` aren't updated, as expected.
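The cache-related checks in these tests follow the field rules from the description. As a rough sketch of those rules (not the `FlintMetadataCache` implementation), the snippet below builds the `_meta.properties` cache entries from a simplified index state. `IndexStateSketch`, `MetadataCacheSketch`, and their parameters are hypothetical names; the PR does not say which timestamp feeds `lastRefreshTime`, so the sketch takes it as an opaque optional value.

```scala
// Hypothetical stand-in for the index state; not an actual Flint class.
final case class IndexStateSketch(
    autoRefresh: Boolean,
    refreshIntervalSeconds: Option[Int],
    lastRefreshTimeMillis: Option[Long],
    sourceTables: Seq[String])

object MetadataCacheSketch {
  val MetadataCacheVersion = "1.0"

  /** Returns the cache entries to write under `_meta.properties`. */
  def properties(cacheWriteEnabled: Boolean, state: IndexStateSketch): Map[String, Any] =
    if (!cacheWriteEnabled) {
      // Config not set to true: no cache fields are added.
      Map.empty[String, Any]
    } else {
      val always = Map[String, Any](
        "metadataCacheVersion" -> MetadataCacheVersion,
        "sourceTables" -> state.sourceTables)
      // refreshInterval: only for auto refresh indexes with an interval set.
      val interval =
        if (state.autoRefresh) state.refreshIntervalSeconds.map(s => "refreshInterval" -> s).toList
        else Nil
      // lastRefreshTime: only once the index has been refreshed at least once.
      val lastRefresh = state.lastRefreshTimeMillis.map(t => "lastRefreshTime" -> t).toList
      always ++ interval ++ lastRefresh
    }
}
```

With cache write disabled the result is empty, which corresponds to the first test case; a full refresh index that has never been refreshed gets only `metadataCacheVersion` and `sourceTables`.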
Related Issues
`_meta` as read cache for frontend user to access #746
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.