Commit 7d7fdf7

fix code markdown
1 parent 2ec1e0d commit 7d7fdf7

1 file changed, +9 -9 lines changed

README.md

+9 -9
@@ -134,17 +134,17 @@ yarn deploy:docker:pi

 ### Data source

-Initially app started with scraping original https://news.ycombinator.com thread pages and inside the [docs/old-code/scraper](docs/old-code/scraper) folder there is still complete scraper implementation. It uses `axios` for fetching html but soon I discovered that their server has very strict rate limiting and such scrapper was getting a lot of unpredictable `ETIMEDOUT` and `ECONNRESET` connection abrupt terminations. It's easy to distinguish `axios` from browser so I considered moving to Playwright until I discovered Algolia API https://hn.algolia.com/api.
+Initially app started with scraping original https://news.ycombinator.com thread pages and inside the [docs/old-code/scraper](docs/old-code/scraper) folder there is still complete scraper implementation. It uses Axios for fetching html but soon I discovered that their server has very strict rate limiting and such scrapper was getting a lot of unpredictable `ETIMEDOUT` and `ECONNRESET` connection abrupt terminations. It's easy to distinguish Axios from browser so I considered moving to Playwright until I discovered Algolia API https://hn.algolia.com/api.

-As stated in docs it has generous rate limiting plan 10000 requests/hour which is 2.77 requests/second. Although in practice I experienced more strict rate limits than this, it is still more than enough to easily collect enough data. Especially the fact that you can receive up to 1000 items per a single request. Meaning pagination is almost not needed, although it is implemented. "Who is hiring" thread have up to 700 comments, entire thread can be fetched with a single API request.
+As stated in docs it has generous rate limiting plan `10000 requests/hour` which is `2.77 requests/second`. Although in practice I experienced more strict rate limits than this, it is still more than enough to easily collect enough data. Especially the fact that you can receive up to 1000 items per a single request. Meaning pagination is almost not needed, although it is implemented. "Who is hiring" threads have up to 700 comments, entire thread can be fetched with a single API request.

-Company names are extracted with a simple `|` character Regex from a comment title, this posting convention was enforced around the year 2015, so the database is seeded starting from `'2015-06'` month. You can observe all of this in [constants/algolia.ts](constants/algolia.ts).
+Company names are extracted with a simple `|` character Regex from a comment title, this posting convention was enforced around the year 2015, so the database is seeded starting from `'2015-06'` month. You can see all of this in [constants/algolia.ts](constants/algolia.ts).

 ### Database

 Threads are modeled with `Thread` and `Comment` models, although they are named `Month` and `Company` because those words make more sense in context of app logic. `Company` table is both `Company` and `Comment` (job ad) at same time, that is why for example self join is used to extract all ads for a company. `Month` primary key is `name` string in `'YYYY-MM'` format which is sortable.

-Important implementation detail is that database connection is done as Singleton factory function and this way removed from the global scope. This allows the Next.js app to be built without requiring database connection at build time which significantly simplifies the build process. See this [Github discussion](https://github.com/vercel/next.js/discussions/35534#discussioncomment-11385544). The same Singleton factory is reused for both `Keyv` and `axios` instances [utils/singleton.ts](utils/singleton.ts).
+Important implementation detail is that database connection is done as Singleton factory function and this way removed from the global scope. This allows the Next.js app to be built without requiring database connection at build time which significantly simplifies the build process. See this [Github discussion](https://github.com/vercel/next.js/discussions/35534#discussioncomment-11385544). The same Singleton factory is reused for both Keyv and Axios instances [utils/singleton.ts](utils/singleton.ts).

 You can check schema in [modules/database/schema.ts](modules/database/schema.ts).

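The Singleton factory mentioned in the Database paragraph of the hunk above could look roughly like the following. This is only a minimal sketch of the general pattern, not the contents of `utils/singleton.ts`: the names `singleton`, `getDb`, and `connectionCount`, as well as the fake client object, are assumptions made for illustration.

```typescript
// Hypothetical sketch of a lazy singleton factory; the real helper in
// utils/singleton.ts may be implemented differently.

/** Wraps a factory so the instance is created on first call, not at import time. */
function singleton<T>(factory: () => T): () => T {
  let instance: T | undefined;
  let created = false;
  return () => {
    if (!created) {
      instance = factory();
      created = true;
    }
    return instance as T;
  };
}

// Counter used only to make the laziness observable in this sketch.
let opened = 0;
const connectionCount = (): number => opened;

// Stand-in for a real database client (assumption, not the app's client).
const getDb = singleton(() => {
  opened += 1;
  return { query: (sql: string): string => `ran: ${sql}` };
});

// Importing this module opens nothing. The factory runs only when getDb()
// is first called, which is what keeps a build step that merely imports
// the module from needing a live connection.
```

Because nothing runs in the global scope, every caller goes through `getDb()` and receives the same instance after the first call.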
@@ -158,13 +158,13 @@ There is a number of select queries inside [modules/database/select](modules/dat

 ### Caching

-Like stated in the previous section, database is dominantly read only which allowed for easy caching of query responses. This improved SSR pages loading performance from `1200ms` to `400ms`. There is just a single invalidation event, when new month is parsed, you can see it in [modules/parser/calls.ts](modules/parser/calls.ts).
+Like stated in the previous section, database is dominantly read only which allowed for easy caching of query responses. This improved SSR pages loading performance from `1200 ms` to `400 ms`. There is just a single invalidation event, when new month is parsed, you can see it in [modules/parser/calls.ts](modules/parser/calls.ts).

-`Keyv` library with `KeyvFile` is used for caching both http requests and database queries into `.json` files. Invalidating cache entry does not seem to remove it from the `.json` file to save space, this requires more research and it is added to [Todo](#todo) list.
+Keyv library with KeyvFile is used for caching both http requests and database queries into `.json` files. Invalidating cache entry does not seem to remove it from the `.json` file to save space, this requires more research and it is added to [Todo](#todo) list.

 Inside the [libs/keyv.ts](libs/keyv.ts) there is a `cacheDatabaseWrapper()` function that accepts database query function and returns cached version.

-`Keyv` instances are also removed from the global scope with Singleton factory function.
+Keyv instances are also removed from the global scope with Singleton factory function.

 ### Scheduler

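The idea behind `cacheDatabaseWrapper()` from the Caching hunk above can be sketched as a wrapper that memoizes an async query by its arguments. The sketch below is an assumption about the general shape, not the code in `libs/keyv.ts`: it uses a hypothetical in-memory `CacheStore` instead of Keyv with KeyvFile, and `cacheWrapper`, `createMemoryStore`, `selectCompaniesForMonth`, and `dbCallCount` are illustrative names.

```typescript
// Hypothetical promise-based store, standing in for a Keyv instance.
interface CacheStore {
  get(key: string): Promise<unknown>;
  set(key: string, value: unknown): Promise<void>;
  clear(): Promise<void>;
}

function createMemoryStore(): CacheStore {
  const map = new Map<string, unknown>();
  return {
    get: async (key) => map.get(key),
    set: async (key, value) => {
      map.set(key, value);
    },
    clear: async () => {
      map.clear();
    },
  };
}

/** Returns a version of `query` that serves repeated calls from the store. */
function cacheWrapper<A extends unknown[], R>(
  store: CacheStore,
  name: string,
  query: (...args: A) => Promise<R>,
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    // The cache key combines the query name with its serialized arguments.
    const key = `${name}:${JSON.stringify(args)}`;
    const hit = await store.get(key);
    if (hit !== undefined) return hit as R;
    const result = await query(...args);
    await store.set(key, result);
    return result;
  };
}

// Counter used only to show how often the underlying query runs.
let dbCalls = 0;
const dbCallCount = (): number => dbCalls;

// Stand-in for a real select query (assumption).
async function selectCompaniesForMonth(month: string): Promise<string[]> {
  dbCalls += 1;
  return [`company listed in ${month}`];
}

const store = createMemoryStore();
const cachedSelect = cacheWrapper(store, "selectCompaniesForMonth", selectCompaniesForMonth);
```

Calling `cachedSelect("2015-06")` twice runs the underlying query once. Clearing the store, which plays the role of the single invalidation event when a new month is parsed, makes the next call run the query again.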
@@ -174,7 +174,7 @@ Scheduler is also useful to seed the database for the entire history and avoid A

 Initial idea was to add cron task inside the Docker image itself. After careful research I discovered that `crond` daemon must run as `root` user or it will produce `setpgid: Operation not permitted` error, see this [Github issue](https://github.com/gliderlabs/docker-alpine/issues/381#issuecomment-621946699). This was unacceptable because Next.js app needs to run as `non-root` user for security reasons and easier managing file permissions in Docker bind mount volumes (database, cache, log files).

-This requires exposing scripts as API endpoints and running `crond` daemon in a separate Docker image, which is unpractical and greatly complicates deployment. There are 3rd party binaries to run scheduled tasks in Docker, mostly in Go https://github.com/aptible/supercronic.
+This requires exposing scripts as API endpoints and running `crond` daemon in a separate Docker image, which is unpractical and greatly complicates deployment. There are 3rd party binaries to run scheduled tasks in Docker, mostly in Go [aptible/supercronic](https://github.com/aptible/supercronic).

 Fortunately there are also few Node.js packages to run scheduled tasks, I picked [node-cron/node-cron](https://github.com/node-cron/node-cron) for simplicity and practical reasons.

@@ -193,8 +193,8 @@ Scripts itself are also exposed (and left unused) as API endpoints in [app/api/p
 ## Todo

 - Handle not found exceptions in database select queries
+- Clear Keyv cache files, not just invalidate
 - Winston rotate single file
-- Clear `Keyv` cache files, not just invalidate

 ## References
