Custom Scratch website scraper (Voyager) #75

NotFenixio · 2023-09-04T11:51:18Z

NotFenixio
Sep 4, 2023
Maintainer

Recently, ScratchDB has been acting up, causing problems for Snazzle, which relies on it. But we can't just toss ScratchDB aside because it's our main source of info. So, here's an idea: Let's create our own ScratchDB.

To do this, we'd learn from ScratchDB's way of doing things. We can use Playwright, Selenium, and/or Atoma to grab info from Scratch, and then BeautifulSoup to clean it up and get the data we need.

Now, this idea is kind of like a trial run, like taking a poll. I want to know what y'all think about it. Would this be a good move?

EngineerRunner · 2023-09-04T12:11:57Z

EngineerRunner
Sep 4, 2023
Maintainer

i mean, i'm mostly working on Pyratch now, but this'd be really helpful for all alternative frontends, so i support this idea.

NotFenixio · 2023-09-04T12:37:56Z

NotFenixio
Sep 4, 2023
Maintainer Author

Awesome! We need a name... Any suggestions?

NotFenixio · 2023-09-04T12:50:19Z

NotFenixio
Sep 4, 2023
Maintainer Author

I'll just call it ScratchedDB for now.

redstone-dev · 2023-09-05T13:00:28Z

redstone-dev
Sep 5, 2023
Maintainer

I really like this idea. However, we'd need reliable hosting with as close to 100% uptime as we can get. I have an AWS account so we could try that, but it's really expensive, so we'd need to get our money's worth out of it.

As for the name, we could call it Voyager. (Thanks ChatGPT)

We could also try writing it in Rust for funzies.

NotFenixio · 2023-09-05T14:49:10Z

NotFenixio
Sep 5, 2023
Maintainer Author

Voyager then! I'll rename the repo in a moment.

The idea is to create a locally deployable ScratchDB, so whoever downloads Snazzle or any other alternative frontend will be hosting their own ScratchDB.

Also, I discovered that AWS has a 12-month free tier for Amazon EC2 which we can use to deploy this new thing for 1 year. (A simple Glitch project with UptimeRobot could do the thing too)

And for the Rust thing, I don't know... Let's try doing it in Python and leaving that for the future Snazzle Svelte/Rust port.

redstone-dev · 2023-09-05T16:09:25Z

redstone-dev
Sep 5, 2023
Maintainer

> And for the Rust thing, I don't know... Let's try doing it in Python and leaving that for the future Snazzle Svelte/Rust port.

Rust rewrite of Svelte??? /j </offtopic>

davidtheplatform · 2023-09-06T21:45:54Z

davidtheplatform
Sep 6, 2023
Maintainer

Random suggestions:
Use requests and Beautiful Soup, since they use way less RAM than browser automation (there are also RSS feeds, but they don’t have every post)

Have a centralized server to reduce load on Scratch, but give clients a local cache/scraper in case the server goes down

Clients can choose whether they want stale data immediately or updated data that takes longer to get

redstone-dev · 2023-09-06T22:21:12Z

redstone-dev
Sep 6, 2023
Maintainer

> Clients can choose whether they want stale data immediately or updated data that takes longer to get

I think we could combine Voyager with a system on the client that checks if the RSS data has new posts that Voyager doesn't have yet, in which case it sends this data to the central Voyager server and then displays the new data to the user.
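A minimal sketch of the client-side check described above. It assumes the forum feed is Atom XML and that Voyager tracks posts by each entry's `<id>`; both are assumptions, and `new_entries` and `SAMPLE_FEED` are hypothetical names made up for illustration. It only finds entries Voyager hasn't seen yet; sending them to the central server is left out.

```python
import xml.etree.ElementTree as ET

# Atom namespace; assumes the feed is Atom rather than RSS 2.0.
ATOM = "{http://www.w3.org/2005/Atom}"


def new_entries(feed_xml: str, known_ids: set[str]) -> list[dict]:
    """Return feed entries whose <id> is not in known_ids, in feed order."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.iter(f"{ATOM}entry"):
        entry_id = entry.findtext(f"{ATOM}id", default="").strip()
        if entry_id and entry_id not in known_ids:
            fresh.append({
                "id": entry_id,
                "title": entry.findtext(f"{ATOM}title", default="").strip(),
                "updated": entry.findtext(f"{ATOM}updated", default="").strip(),
            })
    return fresh


# Hypothetical feed content, only here so the sketch can be run offline.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example topic</title>
  <entry><id>post-1</id><title>First</title><updated>2023-09-06T22:00:00Z</updated></entry>
  <entry><id>post-2</id><title>Second</title><updated>2023-09-06T22:10:00Z</updated></entry>
</feed>"""

if __name__ == "__main__":
    # Voyager already has post-1, so only post-2 is reported as new.
    print(new_entries(SAMPLE_FEED, known_ids={"post-1"}))
```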

davidtheplatform · 2023-09-06T22:45:12Z

davidtheplatform
Sep 6, 2023
Maintainer

> Clients can choose whether they want stale data immediately or updated data that takes longer to get

> I think we could combine Voyager with a system on the client that checks if the RSS data has new posts that Voyager doesn't have yet, in which case it sends this data to the central Voyager server and then displays the new data to the user.

It's probably better to avoid sending requests to Scratch if we don't have to.
What I meant was that if the client doesn't care about having the most up-to-date data, it can tell the server that, so the server doesn't have to make a request to the Scratch servers.
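One way the "stale is fine" option could look on the server side, as a rough sketch only. `TopicCache`, `allow_stale`, and `max_age` are hypothetical names, and `fetch` stands in for whatever function actually requests the page from Scratch.

```python
import time
from typing import Callable


class TopicCache:
    """Cache fetched topics; callers may accept stale data to skip a request to Scratch."""

    def __init__(self, fetch: Callable[[int], str], max_age: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch        # hypothetical: performs the real request to Scratch
        self._max_age = max_age    # seconds before cached data counts as stale
        self._clock = clock
        self._entries: dict[int, tuple[float, str]] = {}

    def get(self, topic_id: int, allow_stale: bool = False) -> str:
        cached = self._entries.get(topic_id)
        if cached is not None:
            stored_at, data = cached
            # Serve from cache if it is fresh, or if the client said stale is fine.
            if allow_stale or self._clock() - stored_at <= self._max_age:
                return data
        # Otherwise go to Scratch and remember the result.
        data = self._fetch(topic_id)
        self._entries[topic_id] = (self._clock(), data)
        return data
```

With this shape, a client that passes `allow_stale=True` never triggers a request to Scratch once a topic has been cached; only the first request for a topic, or a request that insists on fresh data after `max_age` has passed, goes upstream.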

redstone-dev · 2023-09-06T22:50:50Z

redstone-dev
Sep 6, 2023
Maintainer

> Clients can choose whether they want stale data immediately or updated data that takes longer to get

> I think we could combine Voyager with a system on the client that checks if the RSS data has new posts that Voyager doesn't have yet, in which case it sends this data to the central Voyager server and then displays the new data to the user.

> It's probably better to avoid sending requests to Scratch if we don't have to. What I meant was that if the client doesn't care about having the most up-to-date data, it can tell the server that, so the server doesn't have to make a request to the Scratch servers.

To that end, we should also add rate-limiting (maybe only 3 requests a second?) to avoid stressing the Scratch servers. We may need to increase this number based on website traffic, though. Ideally the server should do this automatically somehow.
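A small sketch of what a fixed limit like "3 requests a second" could look like. `RateLimiter` is a hypothetical helper, not part of any library, and the automatic adjustment mentioned above is not covered; `max_per_second` would have to be changed by hand or by some other logic.

```python
import threading
import time


class RateLimiter:
    """Space outgoing requests so that at most max_per_second are sent."""

    def __init__(self, max_per_second: float = 3.0, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = 0.0  # earliest time the next request may go out

    def wait(self) -> float:
        """Block until the next request slot is free; return the seconds slept."""
        with self._lock:
            now = self._clock()
            delay = max(0.0, self._next_slot - now)
            # Reserve the following slot one interval after the one just taken.
            self._next_slot = max(now, self._next_slot) + self._interval
        if delay:
            self._sleep(delay)
        return delay


if __name__ == "__main__":
    limiter = RateLimiter(max_per_second=3.0)
    for i in range(5):
        limiter.wait()  # call this right before each request to Scratch
        print("request", i, "at", round(time.monotonic(), 2))
```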

ajskateboarder · 2023-09-07T14:56:36Z

ajskateboarder
Sep 7, 2023

We don't need any browser automation tools; the Scratch forums are easy to fetch with plain HTTP requests
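To illustrate the point, here is a rough sketch that fetches a topic page over plain HTTP and pulls out post text with no browser involved. It uses only the standard library so it runs anywhere; requests and Beautiful Soup, as suggested earlier in the thread, would make it shorter. The URL pattern and the `post_body_html` class name are assumptions about the forum markup and should be checked against a real page; `PostBodyParser`, `parse_posts`, and `fetch_topic` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class PostBodyParser(HTMLParser):
    """Collect the text of every <div> whose class list contains post_body_html (assumed name)."""

    def __init__(self):
        super().__init__()
        self.posts: list[str] = []
        self._depth = 0                 # <div> nesting depth inside the current post body
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "br" and self._depth:
            self._chunks.append(" ")    # keep words on either side of a line break apart
        if tag != "div":
            return
        if self._depth:
            self._depth += 1
        elif "post_body_html" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace into single spaces.
                self.posts.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def parse_posts(html: str) -> list[str]:
    parser = PostBodyParser()
    parser.feed(html)
    parser.close()
    return parser.posts


def fetch_topic(topic_id: int, timeout: float = 10.0) -> list[str]:
    # Assumed URL pattern for a forum topic page.
    url = f"https://scratch.mit.edu/discuss/topic/{topic_id}/"
    req = Request(url, headers={"User-Agent": "voyager-sketch/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        return parse_posts(resp.read().decode("utf-8", errors="replace"))
```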

If we are using Rust, we can use the reqwest and scraper crates for data and serve it over actix. I can work on it whenever I have free time

> To that end, we should also add rate-limiting (maybe only 3 requests a second?) to avoid stressing the Scratch servers. We may need to increase this number based on website traffic, though. Ideally the server should do this automatically somehow.

I think 3 requests/second is fine

NotFenixio · 2023-09-07T15:42:54Z

NotFenixio
Sep 7, 2023
Maintainer Author

We're building it in Python, but we need some help with specific functions that require indexing Scratch. https://github.com/users/NotFenixio/projects/3/views/1

davidtheplatform · 2023-09-08T01:05:09Z

davidtheplatform
Sep 8, 2023
Maintainer

I’m working on a scraper right now that uses SQLite
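For anyone curious what SQLite storage for scraped posts might look like, here is a small sketch. The table layout and the helper names (`open_db`, `save_post`, `posts_in_topic`) are purely hypothetical and are not taken from the scraper mentioned above.

```python
import sqlite3

# Hypothetical schema: one row per forum post, keyed by the post's ID.
SCHEMA = """
CREATE TABLE IF NOT EXISTS posts (
    post_id    INTEGER PRIMARY KEY,
    topic_id   INTEGER NOT NULL,
    author     TEXT    NOT NULL,
    body       TEXT    NOT NULL,
    scraped_at TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_posts_topic ON posts (topic_id);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_post(conn: sqlite3.Connection, post_id: int, topic_id: int,
              author: str, body: str) -> None:
    """Insert a post, or overwrite the stored copy if it was scraped before."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO posts (post_id, topic_id, author, body) "
            "VALUES (?, ?, ?, ?)",
            (post_id, topic_id, author, body),
        )


def posts_in_topic(conn: sqlite3.Connection, topic_id: int) -> list[tuple]:
    rows = conn.execute(
        "SELECT post_id, author, body FROM posts WHERE topic_id = ? ORDER BY post_id",
        (topic_id,),
    )
    return rows.fetchall()
```

Keying on `post_id` means re-scraping a topic simply refreshes edited posts instead of duplicating them.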

davidtheplatform · 2023-09-08T02:03:26Z

davidtheplatform
Sep 8, 2023
Maintainer

> Voyager then! I'll rename the repo in a moment.

> The idea is to create a locally deployable ScratchDB, so whoever downloads Snazzle or any other alternative frontend will be hosting their own ScratchDB.

> Also, I discovered that AWS has a 12-month free tier for Amazon EC2 which we can use to deploy this new thing for 1 year. (A simple Glitch project with UptimeRobot could do the thing too)

> And for the Rust thing, I don't know... Let's try doing it in Python and leaving that for the future Snazzle Svelte/Rust port.

Depending on how much load there is I will probably be able to host it

redstone-dev · 2023-09-08T23:33:28Z

redstone-dev
Sep 8, 2023
Maintainer

> I’m working on a scraper right now that uses SQLite

Since Voyager is already being made by @NotFenixio, I had an idea.

When you both get your ideas usable in Snazzle, we can vote on the better one and we’ll use that. I might create my own entry as well.

> Depending on how much load there is I will probably be able to host it

The idea is to create a more reliable service, so we should use the cloud for maximum uptime.

EngineerRunner · 2024-03-31T07:44:03Z

EngineerRunner
Mar 31, 2024
Maintainer

> What if instead of scraping we just use the ScratchAPI? It is already documented by the wiki and can be used to do everything that can be done already on Scratch. We just have to focus on getting the extra features we want to be ready.

there's no forums API. that's the entire point of ScratchDB, and now Voyager.

ajskateboarder · 2024-03-31T13:45:48Z

ajskateboarder
Mar 31, 2024

> What if instead of scraping we just use the ScratchAPI? It is already documented by the wiki and can be used to do everything that can be done already on Scratch. We just have to focus on getting the extra features we want to be ready.

> there's no forums API. that's the entire point of ScratchDB, and now Voyager.

Also, fetching and parsing forum posts on demand is likely less optimal and more taxing on Scratch's servers than if Voyager scraped the forums using a few indexers

dynamixbot · 2024-04-01T03:52:03Z

dynamixbot
Apr 1, 2024
Maintainer

> What if instead of scraping we just use the ScratchAPI? It is already documented by the wiki and can be used to do everything that can be done already on Scratch. We just have to focus on getting the extra features we want to be ready.

> there's no forums API. that's the entire point of ScratchDB, and now Voyager.

Well at least we can use it for Snazzle's projects and profiles and players and stuff.

dynamixbot · 2024-04-01T03:52:34Z

dynamixbot
Apr 1, 2024
Maintainer

> What if instead of scraping we just use the ScratchAPI? It is already documented by the wiki and can be used to do everything that can be done already on Scratch. We just have to focus on getting the extra features we want to be ready.

> there's no forums API. that's the entire point of ScratchDB, and now Voyager.

> Also, fetching and parsing forum posts on demand is likely less optimal and more taxing on Scratch's servers than if Voyager scraped the forums using a few indexers

Hey also how many bots would we need to scrape the forums?

Like equal to how many users active on the forums?

davidtheplatform · 2024-04-01T03:54:36Z

davidtheplatform
Apr 1, 2024
Maintainer

> What if instead of scraping we just use the ScratchAPI? It is already documented by the wiki and can be used to do everything that can be done already on Scratch. We just have to focus on getting the extra features we want to be ready.

> there's no forums API. that's the entire point of ScratchDB, and now Voyager.

> Also, fetching and parsing forum posts on demand is likely less optimal and more taxing on Scratch's servers than if Voyager scraped the forums using a few indexers

> Hey also how many bots would we need to scrape the forums?

> Like equal to how many users active on the forums?

How fast do you want it to be? Also the forums aren’t session based so # of bots doesn’t really mean anything

ajskateboarder · 2024-04-01T15:04:49Z

ajskateboarder
Apr 1, 2024

> Also, fetching and parsing forum posts on demand is likely less optimal and more taxing on Scratch's servers than if Voyager scraped the forums using a few indexers

> Hey also how many bots would we need to scrape the forums?
> Like equal to how many users active on the forums?

> How fast do you want it to be? Also the forums aren’t session based so # of bots doesn’t really mean anything

I don't think the # of bots refers to number of accounts being used, just the number of scraping processes running in parallel
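In that reading, "number of bots" is just a worker count. A quick sketch with a thread pool, where `scrape_topics` is a hypothetical helper and `fetch` is whatever function downloads and parses one topic:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def scrape_topics(topic_ids: Iterable[int], fetch: Callable[[int], str],
                  workers: int = 4) -> tuple[dict[int, str], dict[int, Exception]]:
    """Fetch topics with `workers` parallel threads; return (results, errors) keyed by topic ID."""
    results: dict[int, str] = {}
    errors: dict[int, Exception] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, tid): tid for tid in topic_ids}
        for future in as_completed(futures):
            tid = futures[future]
            try:
                results[tid] = future.result()
            except Exception as exc:  # one failed topic should not stop the others
                errors[tid] = exc
    return results, errors
```

No accounts or sessions are involved; `workers` only controls how many requests are in flight at once, so it should be chosen together with whatever request limit is agreed on.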

dynamixbot · 2024-04-02T10:26:03Z

dynamixbot
Apr 2, 2024
Maintainer

> Hey also how many bots would we need to scrape the forums?
> Like equal to how many users active on the forums?

> How fast do you want it to be? Also the forums aren’t session based so # of bots doesn’t really mean anything

Oh okay.

> I don't think the # of bots refers to number of accounts being used, just the number of scraping processes running in parallel

Well, I meant the opposite of that. I was thinking that instead of loading everything and downloading it to the cloud, we would only download a page when it is needed or requested. And once a page is loaded, other people don't have to go through the slow first view of a forum.

NotFenixio · 2024-04-02T11:26:13Z

NotFenixio
Apr 2, 2024
Maintainer Author

I've deleted the Voyager repository on my profile in favor of the new organization for Voyager, GetVoyager. The new Voyager version will be developed in the Voyager repository. By the way, should we have 3 separate repositories for Pioneer, Horizons, and the actual service?

dynamixbot · 2024-04-03T16:26:44Z

dynamixbot
Apr 3, 2024
Maintainer

> I've deleted the Voyager repository on my profile in favor of the new organization for Voyager, GetVoyager. The new Voyager version will be developed in the Voyager repository. By the way, should we have 3 separate repositories for Pioneer, Horizons, and the actual service?

Subdirectories would be better than opening a whole new repository.

dynamixbot · 2024-04-20T06:40:53Z

dynamixbot
Apr 20, 2024
Maintainer

need people for voyager

dynamixbot · 2024-04-25T13:43:42Z

dynamixbot
Apr 25, 2024
Maintainer

@redstone-dev need people for voyager

redstone-dev · 2024-05-07T23:10:10Z

redstone-dev
May 7, 2024
Maintainer

> @redstone-dev need people for voyager

I think we could all work on Voyager and Snazzle at the same time, though I think you and @NotFenixio should decide on that, since you're basically the heads of the project.

NotFenixio · 2024-05-08T05:59:26Z

NotFenixio
May 8, 2024
Maintainer Author

LGTM.

Mrdev88 · 2024-05-08T17:50:44Z

Mrdev88
May 8, 2024

This idea is very good; I'll support it.

dynamixbot · 2024-05-22T12:52:42Z

dynamixbot
May 22, 2024
Maintainer

converting to discussion

1 reply

redstone-dev Jul 11, 2024
Maintainer

why? it was fine as an issue
