[FEATURE] Index GitHub Events to the Metrics cluster #76
Comments
The data ingestion problem is a very common one. My general feedback: because the GitHub API is heavily rate-limited, we should first store the raw data coming from GitHub somewhere (e.g. S3), then have a process that ingests that data into the Metrics cluster in a format as close as possible to the original, then separately aggregate it for the needs of our applications / dashboards, potentially just using OpenSearch's aggregation capabilities. This solves a number of general problems.
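To make the "aggregate at query time" idea concrete, here is a minimal sketch using the OpenSearch JavaScript client. The index name `github-activity-events-raw` and the field names (`type`, `repo.name`, `created_at`) are assumptions for illustration, not an actual mapping:

```typescript
import { Client } from "@opensearch-project/opensearch";

const client = new Client({ node: "https://localhost:9200" });

// Count events by type for one repo over the last 30 days, letting the
// cluster do the aggregation instead of a separate batch job.
async function eventCountsByType(repo: string) {
  const response = await client.search({
    index: "github-activity-events-raw", // placeholder index name
    body: {
      size: 0, // only the aggregation buckets are needed
      query: {
        bool: {
          filter: [
            { term: { "repo.name": repo } },
            { range: { created_at: { gte: "now-30d/d" } } },
          ],
        },
      },
      aggs: {
        by_event_type: { terms: { field: "type" } },
      },
    },
  });
  return response.body.aggregations.by_event_type.buckets;
}
```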
Thanks for the feedback dB! Currently the proposed design is to use the automation app, which can listen for incoming events in a push-based way. After an event is heard, the app indexes the event as raw data into an index in the Metrics OpenSearch cluster. Then, as part of #75, the raw data in that index will be reindexed into another index specifically for the purposes of the Maintainer Dashboard. Finally, OSD is used to render the Maintainer Dashboard. The way the current design differs from your suggestion is that the raw data we index using the automation app is not that close to the original format. It only contains these fields:
Also, currently these are the only events that will be listened to: This makes the raw data fairly specific to the Maintainer Dashboard use case. Note: we added these limitations to the raw data ingestion because of concerns about whether our OpenSearch cluster can handle all of those events with that amount of data. Do you suggest we use the automation app to store events with all of the available data included, and, instead of using an OpenSearch cluster, store them in something like S3? This would fully separate obtaining the data from aggregating it. Additionally, a drawback of using the automation app is that if the app goes down, doing a backfill is not trivial. This may be relevant if we are building a generic store for GitHub Events.
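For concreteness, a minimal sketch of that push-based flow, assuming a Probot-style handler and the OpenSearch JavaScript client. The index name, the event subscriptions, and the document shape are placeholders standing in for the field list above, not the app's real schema:

```typescript
import { Probot } from "probot";
import { Client } from "@opensearch-project/opensearch";

const client = new Client({ node: "https://metrics-cluster:9200" });

export default (app: Probot) => {
  // Subscribe to a curated list of events (placeholders here).
  app.on(["issues.opened", "issues.closed"], async (context) => {
    await client.index({
      index: "github-activity-events", // placeholder index name
      body: {
        // A trimmed document, standing in for the actual field list.
        event: context.name,
        action: context.payload.action,
        repository: context.payload.repository.full_name,
        created_at: new Date().toISOString(),
      },
    });
  });
};
```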
After some discussion, it seems like creating a data lake for GitHub Events would be very useful in the future. The proposed design is to use the automation app to upload the events as raw data to a Metrics S3 bucket. Then, we can index portions of the raw data into the Metrics cluster to leverage its search capability for the Maintainer Dashboard use case.
Continuing the discussion from opensearch-project/automation-app#24 (comment), we can go with something like the following, based on the list of event names at https://probot.github.io/api/latest/classes/context.Context.html#name
Along with it, we can add tags. The above should allow:
For the rest, we can fetch the documents from S3 and index them into the OpenSearch cluster for more complex filtering.
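Sketching what such a layout might look like (the bucket name, key scheme, and tag names here are illustrative guesses, not the layout agreed on in the PR):

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });

// Hypothetical key layout: partition by event name/action and date, so a
// prefix listing can replay a time window later.
async function uploadEvent(eventName: string, action: string, deliveryId: string, payload: object) {
  const date = new Date().toISOString().slice(0, 10); // YYYY-MM-DD
  await s3.send(
    new PutObjectCommand({
      Bucket: "metrics-github-events", // placeholder bucket name
      Key: `${eventName}/${eventName}.${action}/${date}/${deliveryId}.json`,
      Body: JSON.stringify(payload),
      ContentType: "application/json",
      // Illustrative object tags for coarse filtering without fetching bodies.
      Tagging: `org=opensearch-project&event=${eventName}`,
    })
  );
}
```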
I like it. I wouldn't go overboard in treating S3 as a database, though; most importantly, you want the ability to quickly replay events for time windows to (re)ingest them into a cluster where you can actually aggregate, sort, etc.
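A minimal sketch of that replay path, assuming a date-partitioned key layout like the one sketched above (all names are placeholders):

```typescript
import { S3Client, ListObjectsV2Command, GetObjectCommand } from "@aws-sdk/client-s3";
import { Client } from "@opensearch-project/opensearch";

const s3 = new S3Client({ region: "us-east-1" });
const os = new Client({ node: "https://metrics-cluster:9200" });

// Replay one day's worth of a given event type back into the cluster.
async function replayDay(eventName: string, action: string, day: string) {
  const prefix = `${eventName}/${eventName}.${action}/${day}/`;
  let token: string | undefined;
  do {
    const page = await s3.send(
      new ListObjectsV2Command({
        Bucket: "metrics-github-events",
        Prefix: prefix,
        ContinuationToken: token,
      })
    );
    for (const obj of page.Contents ?? []) {
      const res = await s3.send(
        new GetObjectCommand({ Bucket: "metrics-github-events", Key: obj.Key! })
      );
      const body = await res.Body!.transformToString();
      await os.index({ index: "github-activity-events", body: JSON.parse(body) });
    }
    token = page.NextContinuationToken;
  } while (token);
}
```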
Thanks dB, this flattened structure should make that kind of time-window replay straightforward.
Agree with the S3 approach; this will allow us to have our own data lake of all GitHub events, and consumers can pick and choose how to process it. However, streaming data from the GitHub bot to S3 may not be straightforward. I believe we can stream data from the GitHub automation bot to Kinesis Data Firehose, buffer it until an appropriate size, say 100 MB, and then write it to S3. The write event to S3 can then trigger logic to process the data and index it wherever required.
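A rough sketch of that buffering hop, using the AWS SDK Firehose client. The stream name is a placeholder, and the buffering threshold (e.g. 100 MB) would be configured on the delivery stream itself rather than in code:

```typescript
import { FirehoseClient, PutRecordCommand } from "@aws-sdk/client-firehose";

const firehose = new FirehoseClient({ region: "us-east-1" });

// Forward each webhook payload to a Firehose delivery stream, which buffers
// records and flushes batches to the Metrics S3 bucket.
async function forwardEvent(payload: object) {
  await firehose.send(
    new PutRecordCommand({
      DeliveryStreamName: "github-events-stream", // placeholder stream name
      Record: { Data: Buffer.from(JSON.stringify(payload) + "\n") },
    })
  );
}
```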
Thanks @rishabh6788. @bshien has created a PR, opensearch-project/automation-app#24: for every event the app listens to, it will upload the event to the S3 bucket. We can initially start with this flow, and if it gets bombarded with too many events or we hit API limitations (where uploads fail), then yes, we can use a staging tool in between and push to S3 after a certain threshold.
Is your feature request related to a problem?
Coming from #75
In order to index data about `maintainer_engagement`, GitHub Events first need to be indexed into the Metrics cluster.
What solution would you like?
There should be an index in the Metrics cluster called `github-activity-events` that has documents representing GitHub Events created in the OpenSearch project.
Using the GitHub Automation App, listen for GitHub Events created by the `opensearch-project` organization. Index a document for each event with these fields:
Document will look like:
What alternatives have you considered?
An alternative is using the GitHub Events API to query past Events, but because it is a pull-based system, it is not trivial to add only new Events that have not already been indexed into the cluster.
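For reference, a sketch of what that pull-based polling could look like, using Octokit's org events endpoint. Using the event `id` as the OpenSearch document `_id` is one hypothetical way to make repeated polls idempotent; it does not solve backfill, since the Events API only exposes a short window of recent activity:

```typescript
import { Octokit } from "@octokit/rest";
import { Client } from "@opensearch-project/opensearch";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const os = new Client({ node: "https://metrics-cluster:9200" });

// Poll recent public events for the org and index them.
async function pollOrgEvents() {
  const { data: events } = await octokit.rest.activity.listPublicOrgEvents({
    org: "opensearch-project",
    per_page: 100,
  });
  for (const event of events) {
    // Indexing with an explicit id overwrites duplicates on each poll
    // instead of creating new documents.
    await os.index({
      index: "github-activity-events",
      id: event.id,
      body: event,
    });
  }
}
```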
Do you have any additional context?
#57