Custom Scratch website scraper (Voyager) #75
Replies: 51 comments 1 reply
-
i mean, i'm mostly working on Pyratch now, but this'd be really helpful for all alternative frontends, so i support this idea.
-
Awesome! We need a name... Any suggestions?
-
I'll just call it ScratchedDB for now.
-
I really like this idea. However, we'd need reliable hosting with as close to 100% uptime as we can get. I have an AWS account so we could try that, but it's really expensive, so we'd need to get our money's worth out of it. As for the name, we could call it Voyager. (Thanks, ChatGPT.) We could also try writing it in Rust for funzies.
-
Voyager then! I'll rename the repo in a moment. The idea is to create a locally deployable ScratchDB, so whoever downloads Snazzle or any other alternative frontend will be hosting their own ScratchDB. Also, I discovered that AWS has a 12-month free tier for Amazon EC2, which we can use to deploy this new thing for 1 year. (A simple Glitch project with UptimeRobot could do the job too.) As for the Rust thing, I don't know... Let's try doing it in Python and leave Rust for the future Snazzle Svelte/Rust port.
-
Rust rewrite of Svelte??? /j </offtopic>
-
Random suggestions: Have a centralized server to reduce load on Scratch, but give clients a local cache/scraper in case the server goes down. Clients can choose whether they want stale data immediately or updated data that takes longer to get.
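A minimal sketch of that suggestion, assuming a hypothetical `VoyagerClient` class and two caller-supplied fetch functions (none of these names come from the thread). It tries the central server first, falls back to a local scraper if that fails, and lets the caller pick stale-but-instant data or a slower fresh fetch:

```python
import time


class VoyagerClient:
    """Hypothetical client: central server first, local scraper as fallback."""

    def __init__(self, fetch_central, fetch_local, clock=time.time):
        self._fetch_central = fetch_central  # callable(topic_id) -> data, may raise
        self._fetch_local = fetch_local      # callable(topic_id) -> data
        self._clock = clock
        self._cache = {}                     # topic_id -> (data, fetched_at)

    def get(self, topic_id, allow_stale=True):
        # Fast path: hand back whatever is cached, however old it is.
        if allow_stale and topic_id in self._cache:
            return self._cache[topic_id][0]
        try:
            data = self._fetch_central(topic_id)
        except Exception:
            # Central server unreachable (assumed to raise): scrape locally.
            data = self._fetch_local(topic_id)
        self._cache[topic_id] = (data, self._clock())
        return data
```

Calling `client.get(topic_id)` returns cached data immediately when there is any, while `client.get(topic_id, allow_stale=False)` always refetches. Catching every exception is only for brevity here; a real client would probably narrow that down.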
-
I think we could combine Voyager with a system on the client that checks if the RSS data has new posts that Voyager doesn't have yet, in which case it sends this data to the central Voyager server and then displays the new data to the user.
-
It's probably better to avoid sending requests to Scratch if we don't have to.
-
To that end, we should also add rate-limiting (maybe only 3 requests a second?) to avoid stressing the Scratch servers. We may need to increase this number based on website traffic, though. Ideally the server should do this automatically somehow.
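For illustration, a rough sketch of a fixed-rate limiter in Python. The `RateLimiter` class and its parameters are made up for this example, and it does not attempt the automatic adjustment mentioned above; it just spaces requests evenly so no more than `rate` go out per second:

```python
import time


class RateLimiter:
    """Allow at most `rate` calls per second by spacing them evenly."""

    def __init__(self, rate=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate  # seconds between two requests
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0        # earliest time the next request may go out

    def wait(self):
        """Block until the next request may be sent; return seconds waited."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the following slot one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay
```

The scraper would call `limiter.wait()` right before each request to Scratch. The clock and sleep functions are injectable only so the behaviour can be checked without real waiting.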
-
We don't need any browser automation tools; the Scratch forums are easy to fetch over plain HTTP requests. If we are using Rust, we can use the reqwest and scraper crates to get the data and serve it over actix. I can work on it whenever I have free time.
I think 3 requests/second is fine.
-
We're building it in Python, but we need some help with specific functions that require indexing Scratch. https://github.com/users/NotFenixio/projects/3/views/1
-
I’m working on a scraper right now that uses SQLite
-
Depending on how much load there is I will probably be able to host it
-
Since Voyager is already being made by @NotFenixio, I had an idea. When you both get your ideas usable in Snazzle, we can vote on the better one and we’ll use that. I might create my own entry as well.
The idea is to create a more reliable service, so we should use the cloud for maximum uptime.
-
there's no forums API. that's the entire point of ScratchDB, and now Voyager.
-
Also, fetching and parsing forum posts on demand is likely less efficient and more taxing on Scratch's servers than if Voyager scraped the forums using a few indexers.
-
Well at least we can use it for Snazzle's projects and profiles and players and stuff.
-
Hey, also, how many bots would we need to scrape the forums? Like, equal to how many users are active on the forums?
-
How fast do you want it to be? Also, the forums aren’t session-based, so the # of bots doesn’t really mean anything
-
I don't think the # of bots refers to number of accounts being used, just the number of scraping processes running in parallel
-
Oh okay.
Well, I meant the opposite of what you don't think. I was thinking that instead of loading everything and downloading it to the cloud, we would only download a page if it's needed or requested. And once a page is loaded, other people don't have to go through the slow first view of a forum.
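One way to picture that on-demand approach is a read-through cache. This is only a sketch: the `PageCache` class, the `pages` table, and the `fetch_page` callable are invented for the example, and SQLite is used as the store simply because it was mentioned earlier in the thread.

```python
import sqlite3


class PageCache:
    """Read-through cache: scrape a page only the first time it is requested."""

    def __init__(self, fetch_page, db_path=":memory:"):
        self._fetch_page = fetch_page  # hypothetical callable(url) -> HTML string
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT NOT NULL)"
        )

    def get(self, url):
        row = self._db.execute(
            "SELECT html FROM pages WHERE url = ?", (url,)
        ).fetchone()
        if row is not None:
            return row[0]             # already stored by an earlier request
        html = self._fetch_page(url)  # slow path: hits Scratch once per page
        self._db.execute(
            "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html)
        )
        self._db.commit()
        return html
```

Only the first visitor to a page pays for the scrape; everyone after that is served from the database. The sketch never expires entries, so stored pages would go stale unless something refreshes them.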
-
I've deleted the Voyager repository on my profile in favor of the new organization for Voyager, GetVoyager. The new Voyager version will be developed in the Voyager repository. By the way, should we have 3 separate repositories for Pioneer, Horizons, and the actual service?
-
Subdirectories would be better than opening a whole new repository.
-
need people for voyager
-
@redstone-dev need people for voyager
-
I think we could all work on Voyager and Snazzle at the same time, though I think you and @NotFenixio should decide on that, since you're basically the heads of the project.
-
This idea is very good, I'll support this
-
converting to discussion
-
Recently, ScratchDB has been acting up, causing problems for Snazzle, which relies on it. But we can't just toss ScratchDB aside because it's our main source of info. So, here's an idea: Let's create our own ScratchDB.
To do this, we'd learn from ScratchDB's way of doing things. We can use Playwright, Selenium, and/or Atoma to grab info from Scratch, and then BeautifulSoup to clean it up and get the data we need.
Now, this idea is kind of like a trial run, like taking a poll. I want to know what y'all think about it. Would this be a good move?