cmd/anubis: Some ideas (support non JS users, forward auth and blocking known bots) #651

Open
unixfox opened this issue Jan 20, 2025 · 5 comments
Labels
anubis Bugs involving Anubis

Comments

@unixfox

unixfox commented Jan 20, 2025

Hello,

First, I would like to mention that I have been working along the same lines as anubis for my public service (https://xcancel.com), in a project named antibot-proxy, and later by contributing ideas for the antibot in SearXNG. I have been interested in anti-bot solutions for 8 years.

1st idea: Fallback for non JS users

Most antibot solutions fail to account for the minority of Internet users who have JS disabled. I won't go into the reasons here; if you want to understand more, type "why disabling javascript privacy" into your search engine.

Unfortunately, this is the case with anubis. It's sad to block these users outright; giving them at least a "fallback" option would be a good idea.

There are multiple ways to achieve that:

Easy ones to implement

Harder ones to implement

  • Detect whether it's a browser by using some CSS properties. Example 1, Example 2, Example 3, Example 4 and Example 5
    The CSS tracking in these examples is just a demo; the same CSS properties can be reused purely for browser checking, without fingerprinting users.

And some quirky ones

2nd idea: Forward auth

In its current state, anubis acts as the central hub between the reverse proxy and the final app. It has to process, in full, every HTTP request and response exchanged between the reverse proxy and the final app.

This has some direct effects: it is less efficient than a plain reverse proxy, it increases internal bandwidth, CPU and memory usage, and it slows down the user experience a bit, since every HTTP exchange has to be processed by 3 applications.

Most major reverse proxies (NGINX, Caddy, Traefik) offer what's called a forward auth plugin.

When the reverse proxy receives an HTTP request, it sends the HTTP headers and the path to an external application, which decides what to do: block the request or pass it on to the final backend.

This has the benefit of not having the entire HTTP request being transmitted through the external app, and you can keep using all the bells and whistles of your preferred reverse proxy (blue-green scenario, canary scenario, load balancing and so on).

anubis could be modified to be used this way instead. The verification page would be on a separate path, to which users are redirected when anubis has never seen them. The browser does the JS challenge and, if it passes, is redirected to the original page.

The downside is that this might cause some issues with POST requests, and I don't have a solution for that right now. There are probably some solutions for that.

On the upside, WebSockets should work. And there is no need to rewrite anubis in another programming language because NGINX will do all the heavy lifting.
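
To make the forward auth flow above more concrete, here is a minimal Python sketch of the decision such an endpoint might take. It is not anubis code: the cookie name, the challenge path, the `redir` parameter and the in-memory token set are all hypothetical, and the exact status code (and how it is turned into a redirect) depends on the reverse proxy in use.

```python
from urllib.parse import quote

CHALLENGE_PATH = "/.challenge"  # hypothetical path serving the JS challenge
COOKIE_NAME = "antibot-pass"    # hypothetical cookie set after a passed challenge


def parse_cookies(header: str) -> dict[str, str]:
    """Parse a Cookie request header into a name -> value mapping."""
    cookies = {}
    for part in header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep:
            cookies[name] = value
    return cookies


def forward_auth(headers: dict[str, str], original_uri: str,
                 passed_tokens: set[str]) -> tuple[int, dict[str, str]]:
    """Decide on a request using only its headers and path.

    Returns (status, response_headers): 200 lets the proxy forward the
    request to the backend; 401 denies it and carries the challenge URL.
    """
    token = parse_cookies(headers.get("Cookie", "")).get(COOKIE_NAME)
    if token in passed_tokens:
        return 200, {}
    # Remember where the client wanted to go so it can be sent back there.
    location = f"{CHALLENGE_PATH}?redir={quote(original_uri, safe='')}"
    return 401, {"Location": location}
```

Only the headers and the path reach this function, which is the point of the approach: the request and response bodies never pass through the external app.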

3rd idea: Use TLS/HTTP2 fingerprint to block known bots using automated libraries (go, python, ...) and faking their user agent to pretend to be a browser

Here is an explanation on how TLS fingerprinting works: https://engineering.salesforce.com/tls-fingerprinting-with-ja3-and-ja3s-247362855967/ and https://fingerprint.com/blog/what-is-tls-fingerprinting-transport-layer-security/

There are not that many public JA3 databases but since the creation of JA4, there is a public one available: https://ja4db.com/

You can also validate if it's a real browser based on their HTTP2 fingerprint. There are no known databases for that, but usually it's a fixed fingerprint for all Chrome browsers of a specific version range or all Firefox browsers of a specific version range.

HTTP2 and JA3/JA4 fingerprinting require using a modified reverse proxy: JA4 see Tools that support JA4+, JA3 NGINX, HTTP2 NGINX, JA4 Haproxy, JA3 haproxy


I know it's a long issue, but feel free to split it into smaller issues if you want to implement some ideas with what solution you prefer.

I can provide more answers if something I explained here is unclear.

It's a summary of everything I have noted over the years for the open source solution I would create if I still had the time: a good alternative to something like Cloudflare, but better.

@unixfox unixfox changed the title cmd/anubis: Some ideas (support non JS users) cmd/anubis: Some ideas (support non JS users, forward auth and blocking known bots) Jan 20, 2025
@Xe Xe added the anubis Bugs involving Anubis label Jan 20, 2025
@Xe
Owner

Xe commented Jan 20, 2025

Hey, thank you for reporting your experiences and suggesting things. As you said I'm going to have to carve this up into a few sub-issues or something. This project is still very young, it'll get figured out.

I put the idea of a non-JS test in #652. I'm personally loath to do this, but I understand why such a thing would be wanted. I used to be one of those people that blocked all client-side JS in their browser, but I ended up having to give up because my work time was mostly spent adding exceptions to the NoScript extension. I think that in the meantime, telling clients they must load first-party JavaScript is sufficient. It sucks, but that's 2025 in a nutshell.

One of the more compounding factors with this is that there are reasons why admins would want to support some automated clients (such as git clients, API tooling, build automation such as Jenkins, etc). This is why the main test for showing a challenge page is if the User-Agent string contains "Mozilla". It's a bit galaxy brain, but it works way better than it has any right to.
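
As a rough illustration of the check described above (a Python sketch, not Anubis's actual code; the function name is made up for this example):

```python
def needs_challenge(user_agent: str) -> bool:
    """Return True when the client claims to be a browser.

    Browsers send a User-Agent containing "Mozilla", while tools such as
    git clients or build automation typically do not, so they are let
    through without a challenge.
    """
    return "Mozilla" in user_agent
```

For example, `needs_challenge("git/2.43.0")` is false, while a typical browser string starting with `Mozilla/5.0` is challenged.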

I want to make that logic configurable by admins at some point, that may require pulling in a JS/Lua interpreter or something. I'm still thinking out the details, but it will get figured out.

Forward auth support is currently tracked in #647, assuming all of these programs support the same auth-request format, it shouldn't be that hard to do. I will need to think about it a bit more in order to avoid making things too complicated.

Inspecting the TLS fingerprint of a client may be out of scope, Anubis is intended to be a middleware that works with your existing TLS terminator, not something that terminates TLS itself. The biggest intent is for this to be slotted into existing setups (this is why it functions as an HTTP reverse proxy, the closest thing that the HTTP protocol has to language-agnostic middleware).

Again though, thank you a lot for your input. I'm glad that these 8 hours of caffeine-fueled spite against abusive scrapers are doing good. There are certainly more complicated or elaborate checks that can be done, but I'm glad that this proof of concept grade implementation works. It certainly works a lot better than I expected.

@unixfox
Author

unixfox commented Jan 20, 2025

Thanks for reading. Do not feel pressured to implement everything, or even anything, that I said; it's your project after all. Like I said at the end of my comment, these are interesting ideas for anyone who wants to design an anti-bot protection somewhat similar to Cloudflare but better for internet users.

Here is my answer to your comments 😃:

I put the idea of a non-JS test in #652

Personally, I became afraid of keeping JS always enabled after reading all the posts from fingerprint.com and trying out an open source lookalike named creepjs 🤯. But I fully understand that in 2025 it's hard to keep JS disabled.

This is why the main test for showing a challenge page is if the User-Agent string contains "Mozilla".

Actually, that's very clever of you and I support the idea! All of my suggestions were made with this in mind: implement checks that are only triggered when the User-Agent contains "Mozilla".

Inspecting the TLS fingerprint of a client may be out of scope, Anubis is intended to be a middleware that works with your existing TLS terminator, not something that terminates TLS itself.

You don't need to have anubis being the TLS terminator, the hash can be sent by your reverse proxy using an HTTP header. I suggested this idea because it's actually a very effective way to detect HTTP clients that spoof their user agent to pass as a browser (Mozilla XXX).

It's like a kid showing his big brother's ID to the bouncer in order to pretend to be an adult, while "I'm 15" is written on his forehead.

While it requires a bit of modification to the setup, it is very effective and uses very few resources, because you just have to compare the user agent against the hash in the database.
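
The comparison described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the table stands in for a real fingerprint database (such as the one at ja4db.com), and its keys are placeholders, not real JA4 hashes.

```python
# Hypothetical lookup table: TLS fingerprint -> client family.
# The keys are placeholders, NOT real JA4 hashes.
KNOWN_FINGERPRINTS = {
    "ja4-placeholder-firefox": "firefox",
    "ja4-placeholder-chrome": "chrome",
    "ja4-placeholder-python-requests": "python-requests",
}
BROWSER_FAMILIES = {"firefox", "chrome"}


def looks_spoofed(user_agent: str, ja4_hash: str) -> bool:
    """Return True when a browser User-Agent comes with a non-browser fingerprint."""
    if "Mozilla" not in user_agent:
        return False  # the client does not claim to be a browser
    family = KNOWN_FINGERPRINTS.get(ja4_hash)
    if family is None:
        return False  # unknown fingerprint: no verdict either way
    return family not in BROWSER_FAMILIES
```

Treating unknown fingerprints as "no verdict" is a design choice of this sketch; a stricter setup could send those clients to a challenge instead.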

@Xe
Owner

Xe commented Jan 20, 2025

You don't need to have anubis being the TLS terminator, the hash can be sent by your reverse proxy using an HTTP header.

Do you have an example of this with nginx?

@unixfox
Author

unixfox commented Jan 20, 2025

You don't need to have anubis being the TLS terminator, the hash can be sent by your reverse proxy using an HTTP header.

Do you have an example of this with nginx?

Like I said at the end of my first message, it's not available out of the box in NGINX; you need to either recompile NGINX or use software that supports it out of the box (like HAProxy).

But FoxIO (the company behind the new JA4 standard) provides a precompiled docker image: https://github.com/FoxIO-LLC/ja4-nginx-module?tab=readme-ov-file#docker

With their image, you only need to have a config like this:

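server {
   location / {
       # proxy_set_header adds the header to the request sent to the backend;
       # add_header would only add it to the response returned to the client.
       proxy_set_header x-ja4-hash $http_ssl_ja4;
       proxy_pass http://localhost:8888;
   }
}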

Then, the app running on port 8888 can read the hash from the "x-ja4-hash" header.
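
On the backend side, reading that header could look like the following Python WSGI sketch. It assumes the reverse proxy forwards the hash to the backend as a request header named x-ja4-hash; the app itself is hypothetical and only echoes the value back.

```python
def app(environ, start_response):
    """Minimal WSGI app that echoes the JA4 hash set by the reverse proxy."""
    # WSGI exposes the request header "x-ja4-hash" as HTTP_X_JA4_HASH.
    ja4 = environ.get("HTTP_X_JA4_HASH", "")
    body = f"ja4={ja4 or 'missing'}\n".encode()
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

To try it locally, the app can be served on port 8888 with the standard library, for example `wsgiref.simple_server.make_server("127.0.0.1", 8888, app).serve_forever()`.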

Or if you would rather use JA3 (the predecessor of JA4, but still great), there is a docker-compose.yml here: https://github.com/fooinha/nginx-ssl-ja3/blob/master/docker/docker-compose.yml

@unixfox
Author

unixfox commented Mar 14, 2025

Hey, I just wanted to get back to you with an interesting solution. I have been experimenting with HTTP2 fingerprinting; the paper is here: https://www.blackhat.com/docs/eu-17/materials/eu-17-Shuster-Passive-Fingerprinting-Of-HTTP2-Clients-wp.pdf

When applying this paper to my webserver and looking at the OpenAI user agents, I literally found the exact HTTP2 fingerprints of the OpenAI bot. They are:

  • 2:0;3:1;4:8388608;5:65536|8323073|1:0:0:16|m,s,a,p
  • 2:0;4:2097152;5:16384;6:16384|5177345|1:0:0:16|m,s,a,p
  • 2:0;1:4096;8:0;3:2147483647;4:1048576|268369921|1:0:0:16|a,p,m,s

I'm not joking, these identifiers will detect most scrapers by OpenAI, whatever the IP or the user agent used. Since the fingerprint is available early in the HTTP exchange, these clients can easily be blocked without inducing any load whatsoever.
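
For illustration, here is a small Python sketch that parses fingerprints in the format shown above and checks them against a blocklist. It assumes the four `|`-separated fields are, in order, the SETTINGS parameters, the WINDOW_UPDATE increment, the priority information and the pseudo-header order, which is how the linked paper appears to lay them out. The names are hypothetical; the three blocked values are copied from the list above.

```python
from typing import NamedTuple


class H2Fingerprint(NamedTuple):
    settings: tuple             # ((setting_id, value), ...) in the order sent
    window_update: int          # WINDOW_UPDATE increment
    priority: str               # priority field, kept as a raw string
    pseudo_header_order: tuple  # e.g. ("m", "s", "a", "p")


def parse_h2_fingerprint(text: str) -> H2Fingerprint:
    """Split a fingerprint string into its four '|'-separated fields."""
    settings_part, window_part, priority_part, headers_part = text.split("|")
    settings = tuple(
        (int(key), int(value))
        for key, value in (item.split(":") for item in settings_part.split(";"))
    )
    return H2Fingerprint(settings, int(window_part), priority_part,
                         tuple(headers_part.split(",")))


# The three fingerprints reported in this comment.
BLOCKED = {
    "2:0;3:1;4:8388608;5:65536|8323073|1:0:0:16|m,s,a,p",
    "2:0;4:2097152;5:16384;6:16384|5177345|1:0:0:16|m,s,a,p",
    "2:0;1:4096;8:0;3:2147483647;4:1048576|268369921|1:0:0:16|a,p,m,s",
}


def is_blocked(fingerprint: str) -> bool:
    """Exact-match lookup, cheap enough to run on every new connection."""
    return fingerprint in BLOCKED
```

An exact string match is enough for blocking; the parser is only useful for inspecting or comparing individual fields.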

Here are the user agents for 2:0;4:2097152;5:16384;6:16384|5177345|1:0:0:16|m,s,a,p (over 3 days):

Image

This could technically be applied to other AI scrapers, as long as they load the content using HTTP2. I guess it's also possible to block them using JA4: on https://ja4db.com I found the JA4 for Amazonbot, claude and PerplexityBot.

@Xe Xe added this to Anubis Mar 17, 2025
2 participants