
[FEATURE] Index GitHub Events to the Metrics cluster #76

Closed
Tracked by #57
bshien opened this issue Sep 18, 2024 · 8 comments
Labels: enhancement (New feature or request)

Comments

@bshien
Collaborator

bshien commented Sep 18, 2024

Is your feature request related to a problem?

Coming from #75

In order to index data about maintainer_engagement, GitHub Events first need to be indexed into the Metrics cluster.

What solution would you like?

There should be an index in the Metrics cluster called github-activity-events that has documents representing GitHub Events created in the OpenSearch project.

Using the GitHub Automation App, listen for GitHub Events created by the opensearch-project organization, and index a document for each event with these fields:

{
  id,           // Unique identifier for the event.
  org.name,     // The name of the organization.
  repo.name,    // The name of the repository.
  type,         // The type of event.
  action,       // The action that was performed (opened, edited, closed, etc.).
  sender.login, // The username of the actor that triggered the event.
  created_at    // The date and time the event was triggered.
}

A document will look like:

{
    "id": "acfc0636-472e-440f-9693-5db93d999fe5",
    "organization": "opensearch-project",
    "repository": "opensearch-metrics",
    "type": "issues",
    "action": "opened",
    "sender": "bshien",
    "created_at": "2024-08-27T00:31:56Z"
}
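
For illustration, a minimal sketch of creating that index with an explicit mapping using the opensearch-js client. The endpoint is a placeholder, and the field types (keyword for the string fields, date for created_at) are assumptions rather than a finalized design:

import { Client } from '@opensearch-project/opensearch';

// Placeholder endpoint for the Metrics cluster.
const client = new Client({ node: 'https://metrics-cluster.example.com' });

async function createEventsIndex(): Promise<void> {
  await client.indices.create({
    index: 'github-activity-events',
    body: {
      mappings: {
        properties: {
          id:           { type: 'keyword' },
          organization: { type: 'keyword' },
          repository:   { type: 'keyword' },
          type:         { type: 'keyword' },
          action:       { type: 'keyword' },
          sender:       { type: 'keyword' },
          created_at:   { type: 'date' }, // ISO 8601, e.g. "2024-08-27T00:31:56Z"
        },
      },
    },
  });
}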

What alternatives have you considered?

An alternative is using the GitHub Events API to query past events, but because it is a pull-based system, it is not trivial to index only the new events that have not already been added to the cluster.

Do you have any additional context?

#57

@bshien bshien added enhancement New feature or request untriaged Issues that have not yet been triaged labels Sep 18, 2024
@bshien bshien self-assigned this Sep 18, 2024
@bshien bshien removed the untriaged Issues that have not yet been triaged label Sep 18, 2024
@github-actions github-actions bot added the untriaged Issues that have not yet been triaged label Sep 18, 2024
@bshien bshien moved this from 🆕 New to 🏗 In progress in Engineering Effectiveness Board Sep 18, 2024
@bshien bshien removed the untriaged Issues that have not yet been triaged label Sep 23, 2024
@dblock
Member

dblock commented Oct 7, 2024

The data ingestion problem is a very common one. My general feedback is that, because the GitHub API is heavily rate-limited, we should first store the raw data coming from GitHub somewhere (e.g. S3), then have a process that ingests that data into the metrics cluster as close as possible to its original format, and then separately aggregate it for the needs of our applications / dashboards, potentially just using OpenSearch's aggregation capabilities. This solves a number of general problems:

  1. If we need to change the data format in the cluster we can replay from the raw storage.
  2. If we need a new aggregation we can produce it from the raw data ingested.
  3. There's a clear separation of concerns between obtaining the data, aggregating it, and rendering it.
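
For illustration, a minimal sketch of the replay step in point 1, assuming raw events land in S3 under date-based prefixes. The bucket name, prefix layout, endpoint, and index name are placeholders, not the project's actual configuration:

import { S3Client, ListObjectsV2Command, GetObjectCommand } from '@aws-sdk/client-s3';
import { Client } from '@opensearch-project/opensearch';

const s3 = new S3Client({});
const metrics = new Client({ node: 'https://metrics-cluster.example.com' }); // placeholder

// Re-ingest every raw event stored under one prefix (ignores S3 pagination for brevity).
async function replayPrefix(bucket: string, prefix: string): Promise<void> {
  const listed = await s3.send(new ListObjectsV2Command({ Bucket: bucket, Prefix: prefix }));
  const body: object[] = [];
  for (const obj of listed.Contents ?? []) {
    const res = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: obj.Key! }));
    const raw = JSON.parse(await res.Body!.transformToString());
    // Pair each document with a bulk "index" action; a fixed _id keeps replays idempotent.
    body.push({ index: { _index: 'github-activity-events', _id: raw.id } }, raw);
  }
  if (body.length > 0) {
    await metrics.bulk({ body });
  }
}

// e.g. replayPrefix('opensearch-project-github-events', 'issues/2024-08-27/');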

@bshien
Collaborator Author

bshien commented Oct 7, 2024

Thanks for the feedback dB!

Currently the proposed design is to use the automation app, which can listen for incoming events in a push-based way:
https://github.com/opensearch-project/automation-app

Then, after an event is received, index it as raw data into an index in the Metrics OpenSearch cluster.

Then, as part of #75, the raw data in the index will be reindexed into another index specifically for the Maintainer Dashboard. Finally, OpenSearch Dashboards (OSD) is used to render the Maintainer Dashboard.
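
For illustration, a minimal Probot-style sketch of that listen-and-index flow. The automation app configures its operations differently (see the YAML config linked below), so this is only an approximation; the endpoint is a placeholder:

import { Probot } from 'probot';
import { Client } from '@opensearch-project/opensearch';
import { randomUUID } from 'crypto';

const metrics = new Client({ node: 'https://metrics-cluster.example.com' }); // placeholder

export default (app: Probot) => {
  // Push-based: GitHub delivers each event to the app as a webhook.
  app.on('issues', async (context) => {
    const { repository, action, sender } = context.payload;
    await metrics.index({
      index: 'github-activity-events',
      body: {
        id: randomUUID(),
        organization: repository.owner.login,
        repository: repository.name,
        type: context.name,                  // e.g. "issues"
        action,                              // e.g. "opened"
        sender: sender.login,
        created_at: new Date().toISOString(),
      },
    });
  });
};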

The way the current design differs from your suggestion is that the raw data we index using the automation app is not that close to the original format. It only contains these fields:

{
  "id": "acfc0636-472e-440f-9693-5db93d999fe5",
  "organization": "opensearch-project",
  "repository": "opensearch-metrics",
  "type": "issues",
  "action": "opened",
  "sender": "bshien",
  "created_at": "2024-08-27T00:31:56Z"
}

Also, currently these are the only events that will be listened to:
https://github.com/opensearch-project/automation-app/blob/main/configs/operations/github-activity-events-monitor.yml

This makes the raw data fairly specific to the Maintainer Dashboard use case.

Note: We added these limitations to the raw data ingestion because of concerns about whether our OpenSearch cluster can handle that many events at that data volume.

Do you suggest we use the automation app to store events with all the available data included, and instead of using an OpenSearch cluster, store them in something like S3?

This would fully separate obtaining the data from aggregating it.

Additionally, a drawback of using the automation app is that if the app goes down, doing a backfill is not trivial. This may be relevant if we are building a generic store for GitHub Events.

@bshien
Collaborator Author

bshien commented Oct 8, 2024

After some discussion, it seems that creating a data lake for GitHub Events would be very useful in the future. The proposed design is to use the automation app to upload the events as raw data to a Metrics S3 bucket. Then, we can index portions of the raw data into the Metrics cluster to leverage its search capability for the Maintainer Dashboard use case.

@prudhvigodithi
Member

prudhvigodithi commented Oct 8, 2024

Continuing the discussion from opensearch-project/automation-app#24 (comment): based on the list of events at https://probot.github.io/api/latest/classes/context.Context.html#name, we can go with something like:

s3://opensearch-project-github-events/<event_name>/<date>/repo_name-uuid

Along with it, add the tags event_type, repo_name, and event_date (see https://repost.aws/questions/QUxBzMJVu0Sd2uMeBERzXlVA/query-s3-objects-on-tags-values).

The above should allow:

  • Repo-based filtering.
  • Time-based filtering.
  • Event-based filtering.
  • Getting all the events for a repo (regardless of timeframe), by looping through all the event names and dates and filtering for files matching repo_name-*.
  • Getting all the events for a repo within a certain time range.
  • Getting all the events (regardless of repo), by looping through the first prefix, s3://github-events/<>.
  • Easy management of the S3 bucket, since in the future we can delete events based on lifecycle management conditions if required.

For anything beyond that, we can fetch the documents from S3 and index them into the OpenSearch cluster for more complex filtering.
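
For illustration, a minimal sketch of an upload under this key scheme with the AWS SDK v3. The helper name and event shape are hypothetical; the bucket name and tags follow the scheme above:

import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { randomUUID } from 'crypto';

const s3 = new S3Client({});

// Hypothetical helper: write one raw event to s3://<bucket>/<event_name>/<date>/<repo_name>-<uuid>.
async function uploadEvent(eventName: string, repoName: string, event: object): Promise<void> {
  const date = new Date().toISOString().slice(0, 10); // e.g. "2024-10-08"
  const key = `${eventName}/${date}/${repoName}-${randomUUID()}`;
  await s3.send(new PutObjectCommand({
    Bucket: 'opensearch-project-github-events',
    Key: key,
    Body: JSON.stringify(event),
    ContentType: 'application/json',
    // Object tags enable the tag-based filtering described in the repost.aws link above.
    Tagging: `event_type=${eventName}&repo_name=${repoName}&event_date=${date}`,
  }));
}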

@dblock @bshien @getsaurabh02 @peterzhuamazon WDYT?

@dblock
Member

dblock commented Oct 8, 2024

I like it. I wouldn't go overboard in treating S3 as a database, though; most importantly, you want the ability to quickly replay events for time windows to (re)ingest them into a cluster where you can actually aggregate, sort, etc.

@prudhvigodithi
Member

prudhvigodithi commented Oct 8, 2024

Thanks dB. This flattened structure, s3://opensearch-project-github-events/<event_name>/<date>/repo_name-uuid, would allow us to quickly get the right set of documents and later index them into the cluster; for all other complex queries and operations, the idea is to use the OpenSearch cluster.

@rishabh6788
Copy link

rishabh6788 commented Oct 8, 2024

Agreed with the S3 approach; this will allow us to have our own data lake of all GitHub events, and consumers can pick and choose how to process it. However, streaming data from the GitHub bot directly to S3 may not be straightforward. I believe we can stream data from the GitHub automation bot to Kinesis Data Firehose, buffer it until an appropriate size, say 100 MB, and then write it to S3.

The S3 write event can trigger logic to process the data and index it wherever required.
Kinesis Data Firehose is a powerful service that acquires, transforms, and delivers data streams, and it has direct integration with the OpenSearch Service as well.
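
For illustration, a minimal sketch of what the bot-side producer could look like with Firehose in between. The delivery stream name is a placeholder, and the buffering threshold (e.g. 100 MB) is configured on the stream itself, not in this code:

import { FirehoseClient, PutRecordCommand } from '@aws-sdk/client-firehose';

const firehose = new FirehoseClient({});

// Send one event to a Firehose delivery stream; Firehose buffers records and
// flushes them to S3 once the configured size/time threshold is reached.
async function streamEvent(event: object): Promise<void> {
  await firehose.send(new PutRecordCommand({
    DeliveryStreamName: 'github-events-stream',                  // placeholder name
    Record: { Data: Buffer.from(JSON.stringify(event) + '\n') }, // newline-delimited JSON
  }));
}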

@prudhvigodithi
Member

prudhvigodithi commented Oct 8, 2024

Thanks @rishabh6788. @bshien has created a PR, opensearch-project/automation-app#24: for every event the app listens to, it will upload to the S3 bucket. We can initially start with this flow, and if it is bombarded with too many events or hits API limitations (where uploads fail), we can add a staging tool in between and push to S3 after a certain threshold.
