cmd/anubis: Some ideas (support non JS users, forward auth and blocking known bots) #651
Hey, thank you for reporting your experiences and suggesting things. As you said, I'm going to have to carve this up into a few sub-issues or something. This project is still very young; it'll get figured out.

I put the idea of a non-JS test in #652. I'm personally loath to do this, but I understand why such a thing would be wanted. I used to be one of those people who blocked all client-side JS in their browser, but I ended up having to give up because my work time was mostly spent adding exceptions to the NoScript extension. I think that in the meantime, telling clients they must load first-party JavaScript is sufficient. It sucks, but that's 2025 in a nutshell.

One of the more compounding factors here is that there are reasons why admins would want to support some automated clients (such as git clients, API tooling, and build automation like Jenkins). This is why the main test for showing a challenge page is whether the User-Agent string contains "Mozilla". It's a bit galaxy-brained, but it works way better than it has any right to. I want to make that logic configurable by admins at some point; that may require pulling in a JS/Lua interpreter or something. I'm still thinking out the details, but it will get figured out.

Forward auth support is currently tracked in #647. Assuming all of these programs support the same auth-request format, it shouldn't be that hard to do. I will need to think about it a bit more in order to avoid making things too complicated.

Inspecting the TLS fingerprint of a client may be out of scope: Anubis is intended to be a middleware that works with your existing TLS terminator, not something that terminates TLS itself. The biggest intent is for it to be slotted into existing setups (this is why it functions as an HTTP reverse proxy, the closest thing the HTTP protocol has to language-agnostic middleware).

Again though, thank you a lot for your input.
I'm glad that these 8 hours of caffeine-fueled spite against abusive scrapers are doing good. There are certainly more complicated or elaborate checks that could be done, but I'm glad that this proof-of-concept-grade implementation works. It certainly works a lot better than I expected.
Thanks for reading. Do not feel pressured to implement everything, or even anything, that I said; it's your project after all. Like I said at the end of my comment, these are a lot of interesting ideas for anyone designing anti-bot protection somewhat similar to Cloudflare, but better for Internet users. Here is my answer to your comments 😃:
Personally, I got afraid of having JS always enabled after reading all the posts from fingerprint.com and trying out an open-source lookalike solution named creepjs 🤯. But I fully understand that in 2025 it's hard to keep JS disabled.
Actually, that's very clever, and I vouch for the idea! All of my suggestions were made with this in mind: implement things that are only triggered when the User-Agent contains "Mozilla".
You don't need Anubis to be the TLS terminator; the hash can be sent by your reverse proxy in an HTTP header. I suggested this because it's actually a very effective way to detect HTTP clients that spoof their user agent to look like a browser (Mozilla XXX). It's like a kid showing the ID of his big brother to the bouncer in order to pretend to be an adult, but with "I'm 15" written on his forehead. While it requires a bit of modification to the setup, it is very effective and uses very few resources, because you just have to compare the user agent and the hash against the database.
Do you have an example of this with nginx?
Like I said at the end of my first message, it's not available out of the box in NGINX: you need to either recompile NGINX or use software that supports it out of the box (like HAProxy). But FoxIO (the company behind the new JA4 standard) provides a precompiled Docker image: https://github.com/FoxIO-LLC/ja4-nginx-module?tab=readme-ov-file#docker With their image, you only need a config like this:
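The original config snippet did not survive extraction. A minimal sketch of what such an NGINX config could look like is below; the `$http_ssl_ja4` variable name is an assumption, so check the ja4-nginx-module README for the exact variable the module exposes.

```nginx
server {
    listen 443 ssl;
    # ssl_certificate / ssl_certificate_key as usual

    location / {
        # Forward the JA4 hash computed by the module to the backend.
        # NOTE: the variable name here is assumed; see the module README.
        proxy_set_header x-ja4-hash $http_ssl_ja4;
        proxy_pass http://127.0.0.1:8888;
    }
}
```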
Then, from the app running on port 8888, you can read the hash from the "x-ja4-hash" header. Or if you want to use JA3 instead (the predecessor of JA4, but still great), there is a docker-compose.yml here: https://github.com/fooinha/nginx-ssl-ja3/blob/master/docker/docker-compose.yml
Hey, I just wanted to get back to you with an interesting solution. I have been experimenting with HTTP2 fingerprinting; the paper is here: https://www.blackhat.com/docs/eu-17/materials/eu-17-Shuster-Passive-Fingerprinting-Of-HTTP2-Clients-wp.pdf When applying this paper on my webserver and looking at the OpenAI user agents, I literally found the exact HTTP2 fingerprint of the OpenAI bot. It's:
I'm not joking: these identifiers will detect most scrapers by OpenAI, whatever IP or user agent they use. Since the fingerprint appears early in the HTTP exchange, it can easily be blocked without inducing any load whatsoever. The same could technically be applied to other AI scrapers, as long as they load the content over HTTP2. I guess it's also possible to block them using JA4: on https://ja4db.com I found the JA4 fingerprints for Amazonbot, Claude, and PerplexityBot.
Hello,
First, I would like to emphasize that I have been working along the same path as Anubis on my public service (https://xcancel.com), in a project named antibot-proxy, and later contributed the ideas for the antibot in SearXNG. I have been interested in anti-bot solutions for 8 years.
1st idea: Fallback for non-JS users
Most antibot solutions fail to account for the fact that a minority of Internet users have JS disabled. Users disable JS for various reasons; I won't go into details here, but if you want to understand more, type "why disabling javascript privacy" into your search engine.
Unfortunately, this is the case with Anubis. It's sad to block them outright; at least giving them a fallback option would be a good idea.
There are multiple ways to achieve that:
Easy ones to implement
Your CAPTCHA can be a simple image, or an image generated from CSS (which is harder to extract for use in anti-CAPTCHA solutions)
Harder ones to implement
The CSS tracking stuff is just a demo; the same CSS properties can be reused in a benign way, for browser checking only rather than for fingerprinting users.
And some quirky ones
2nd idea: Forward auth
In its current state, Anubis acts as the central hub between the reverse proxy and the final app. It has to process all of the HTTP traffic transmitted between the reverse proxy and the final app.
This has some direct effects: it's less efficient than a plain reverse proxy, increases internal bandwidth usage, increases CPU usage, increases memory usage, and slows down the user experience a bit, since every HTTP request has to be processed by three applications.
Most major reverse proxies (NGINX, Caddy, Traefik) have what's called a forward auth plugin.
When the reverse proxy receives an HTTP request, it sends the HTTP headers and the path to an external application, which decides what to do: block the request, or pass it to the final backend.
This has the benefit that the entire HTTP request is not transmitted through the external app, and you can keep using all the bells and whistles of your preferred reverse proxy (blue-green deployments, canary releases, load balancing, and so on).
Anubis could be modified to be used this way instead. The verification page would live on a separate path, to which the user is redirected when Anubis has never seen them. The browser does the JS challenge, and if it passes, the user is redirected back to the original page.
The downside is that this might cause some issues with POST requests. I don't have a solution for that right now, but there are probably some.
On the upside, WebSockets should work, and there is no need to rewrite Anubis in another programming language, because NGINX will do all the heavy lifting.
3rd idea: Use TLS/HTTP2 fingerprints to block known bots that use automated libraries (Go, Python, ...) and fake their user agent to pretend to be a browser
Here is an explanation of how TLS fingerprinting works: https://engineering.salesforce.com/tls-fingerprinting-with-ja3-and-ja3s-247362855967/ and https://fingerprint.com/blog/what-is-tls-fingerprinting-transport-layer-security/
There are not that many public JA3 databases, but since the creation of JA4, a public one is available: https://ja4db.com/
You can also validate whether it's a real browser based on its HTTP2 fingerprint. There are no known databases for that, but usually the fingerprint is fixed for all Chrome browsers in a given version range, or all Firefox browsers in a given version range.
HTTP2 and JA3/JA4 fingerprinting require using a modified reverse proxy: for JA4, see "Tools that support JA4+"; for NGINX, there are JA3 and HTTP2 modules; for HAProxy, there is JA4 and JA3 support.
I know it's a long issue, but feel free to split it into smaller issues if you want to implement some of the ideas, with whatever solution you prefer.
I can provide more answers if anything I explained here was unclear.
It's a summary of everything I have noted over the years, in case I ever had the time to create an open-source alternative to something like Cloudflare, but better for Internet users.