Database Support #14

Open · wants to merge 6 commits into master
Conversation


@Araon Araon commented Oct 15, 2020

In context of issue #3

I've added database support to store the posted issues in a sqlite3 database.
To avoid the same issue being tweeted out twice, the url column of the table has a UNIQUE constraint (a sketch follows the TODO list).
As the modifications are too large a change, I want to make separate commits to ease the process of code review.
I intend to add the two new features in the next commit.

TODO

  • Add the issue to the database before it is posted on twitter
  • Delete the issue once it has been posted
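
A minimal sketch of what that schema could look like, assuming an issues table (the table and column names here are illustrative, not taken from the diff):

    import sqlite3

    # Hypothetical schema matching the description above. The UNIQUE
    # constraint on url is what prevents the same issue from being
    # stored (and therefore tweeted) twice.
    conn = sqlite3.connect("issues.db")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS issues (
            id    INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            url   TEXT NOT NULL UNIQUE
        )
        """
    )
    conn.commit()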

@Araon Araon marked this pull request as draft October 15, 2020 18:45
@Araon Araon marked this pull request as ready for review October 15, 2020 18:45
@TaniaMalhotra TaniaMalhotra marked this pull request as ready for review October 16, 2020 04:30
@TaniaMalhotra TaniaMalhotra marked this pull request as draft October 16, 2020 04:43
@TaniaMalhotra
Owner

TaniaMalhotra commented Oct 16, 2020

Hii @Araon
The progress looks good so far 👍
Can you elaborate on how you plan to use the del_last_data() function? I interpreted it as deleting the stored issues (a packet of 30 issues) received from the current API request once they have been tweeted. If so, how do you plan to compare the issues from the next API call against the previous call?

There is a possibility that, out of the 30 issues received in the new call, some x issues were already there in the previous call and are being received again, because there is no fixed rate at which issues are created.

@Araon
Author

Araon commented Oct 16, 2020

del_last_data() is used to delete the last data entered into the database.
Yeah, so that can be avoided with either of these two options:

  • Increase the batch size to 60 (for now only 30 issues are stored at a time; that can be increased to 60 or more to avoid repeating an issue)
  • Maintain a separate table where some kind of hashed version of the issue link + the title can be stored, so that every time a new issue is inserted into the database, it can be cross-referenced against this table (rough sketch below)

The first option is good, but there's always a chance an issue might repeat outside the 60 batch size; the second option is better, but it's a bit much for such a simple bot.
Let me know what you think of these approaches, or feel free to correct me if I have missed something
(I'm kinda new to all this open source contributing)
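
A rough sketch of option 2; the seen_issues table, the is_new_issue() helper, and SHA-256 as the hash are all illustrative choices, not code from this PR:

    import hashlib
    import sqlite3

    conn = sqlite3.connect("issues.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_issues (digest TEXT PRIMARY KEY)"
    )

    def is_new_issue(issue_url, issue_title):
        # Hash the link + title; the PRIMARY KEY on digest rejects
        # anything the bot has already seen.
        digest = hashlib.sha256((issue_url + issue_title).encode()).hexdigest()
        try:
            conn.execute("INSERT INTO seen_issues (digest) VALUES (?)", (digest,))
            conn.commit()
            return True
        except sqlite3.IntegrityError:
            return False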

@TaniaMalhotra
Owner

Yeah we can increase the batch size to 60

@Araon
Author

Araon commented Oct 17, 2020

okay great, I'll push the update asap

@Araon Araon marked this pull request as ready for review October 18, 2020 20:34
@Araon
Author

Araon commented Oct 18, 2020

I've refactored the code and removed the part that was checking for duplicate issues in memory; that check is now done in the database during insertion. The database deletes an issue once it has been tweeted out.
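
Presumably the insertion-time check leans on the UNIQUE constraint, roughly like this (a sketch; the actual insert_data() body isn't quoted in this thread):

    import logging
    import sqlite3

    logger = logging.getLogger(__name__)
    conn = sqlite3.connect("issues.db")

    def insert_data(issue_title, issue_url):
        try:
            conn.execute(
                "INSERT INTO issues (title, url) VALUES (?, ?)",
                (issue_title, issue_url),
            )
            conn.commit()
        except sqlite3.IntegrityError:
            # The UNIQUE constraint on url fired: already stored.
            logger.info("Duplicate Data, not inserted")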

@TaniaMalhotra
Owner

Hii @Araon
Your efforts are amazing and the quality of the code is at par.
But one of us seems to be missing a point here.
As per my understanding, if del_last_data() deletes an issue as soon as it's tweeted out, that issue will no longer be available for comparison with the next batch of issues.
So the following code in insert_data(issue_title, issue_url)

        pass
        logger.info("Duplicate Data, not inserted")

is actually never getting executed!
I decreased the batch size to 3 issues per call for the sake of testing. Technically, the repeated issues should not have entered the db, but they did.

[Screenshot (438): repeated issues appearing in the db]

To make the point clearer: the GitHub API call now fetches the latest 3 issues on the first call. On the next call (since the batch size is deliberately reduced to 3), the issues fetched will be more or less the same. The required behavior is that repeated issues do not enter the db and only the unique ones do.
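
The test described above could be reproduced with something like the following, reusing the hypothetical insert_data() sketched earlier and assuming del_last_data() deletes the just-tweeted rows:

    batch = [
        ("Issue A", "https://github.com/example/repo/issues/1"),
        ("Issue B", "https://github.com/example/repo/issues/2"),
        ("Issue C", "https://github.com/example/repo/issues/3"),
    ]

    # First call: all three rows insert cleanly, get tweeted, get deleted.
    for title, url in batch:
        insert_data(title, url)
    del_last_data()  # the rows that would trip the UNIQUE constraint are gone

    # Second call: the API returns largely the same issues, but with
    # nothing left in the table to conflict with, every row inserts again.
    for title, url in batch:
        insert_data(title, url)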

@TaniaMalhotra
Owner

TaniaMalhotra commented Oct 20, 2020

I am always open to new approaches, and I might be missing your point here.
We can always discuss further, and you are free to correct me wherever I misunderstand your approach.

@Araon
Author

Araon commented Oct 28, 2020

Sorry for the late reply!
The problem you describe above is something I saw coming but did not know how to resolve. The first thing that came to my mind: do we really need to delete posted issues from the database as soon as they are posted? We can do either of these:

  • Something like a time-based cleanup procedure, where the database is refreshed after a certain period of time
  • Hit the GitHub API to fetch one issue at a time whenever a new issue is posted from the database, so that we have some buffer issues in store in case the issue fetched from GitHub is a duplicate

Let me know what you think about this, or if you have any idea how we should tackle this problem.

@TaniaMalhotra
Owner

Hii @Araon
It's good to see you here again!
We can go with a time-based cleanup.
But in that case, do you plan to run a search from the start to the end of the db every time we search for duplicates before tweeting?

@Araon
Author

Araon commented Nov 3, 2020

Searching for duplicates will not be necessary, because the database will only ever contain unique issues; I've made sure of that by giving the issue_link column a UNIQUE constraint.
The way I think we can execute the time-based cleanup: we store the issues we fetch from the GitHub API for a certain amount of time, say 7 days. After 7 days we check whether any of the issues are untweeted (we do this with an extra column in the table that stores a bool indicating whether it has been tweeted). We delete the issues that have been tweeted.
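
A sketch of that cleanup, assuming a tweeted integer column used as a bool and a fetched_at timestamp column (neither name is confirmed by the PR):

    import sqlite3

    conn = sqlite3.connect("issues.db")

    def cleanup_tweeted_issues(days=7):
        # Delete rows that have already been tweeted and are older than
        # the retention window; untweeted rows stay so they can still
        # be posted.
        conn.execute(
            """
            DELETE FROM issues
            WHERE tweeted = 1
              AND fetched_at <= datetime('now', ?)
            """,
            (f"-{days} days",),
        )
        conn.commit()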

@TaniaMalhotra
Owner

TaniaMalhotra commented Nov 4, 2020

Sounds good!
You can start working on it and comment here in case you get stuck.

@Araon
Author

Araon commented Nov 9, 2020

I've added a schedule-based database deletion method. It clears the database of all issues that have been tweeted, denoted by the tweeted column. Deletion takes place every 15 days. Let me know if I've missed anything 😸
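
One way to wire that up, assuming the schedule library drives the 15-day interval (the PR may use a different mechanism) and reusing cleanup_tweeted_issues() from the earlier sketch:

    import time

    import schedule  # third-party: pip install schedule

    # Run the cleanup every 15 days; run_pending() fires the job once
    # the interval has elapsed.
    schedule.every(15).days.do(cleanup_tweeted_issues)

    while True:
        schedule.run_pending()
        time.sleep(3600)  # polling hourly is plenty for a 15-day interval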
