
Applied rateLimit doesn't work well for file policies (e.g., security_file_permission) #1918

Closed
vladimirkus opened this issue Dec 28, 2023 · 11 comments
Labels
kind/enhancement This improves or streamlines existing functionality

Comments

@vladimirkus

What happened?

I applied rateLimit according to the documentation. Here is the full tracing policy:

spec:
  kprobes:
  - args:
    - index: 0
      type: file
    - index: 1
      type: int
    call: security_file_permission
    return: true
    returnArg:
      index: 0
      maxData: false
      returnCopy: false
      type: int
    selectors:
    - matchActions:
      - action: Post
        rateLimit: 1m
      matchArgs:
      - index: 0
        operator: Prefix
        values:
        - /etc/passwd
      - index: 1
        operator: Equal
        values:
        - "4"
    syscall: false

As far as I understood from the documentation, this should significantly reduce the number of repetitive events; however, it had no noticeable effect.

During the experiment I generated events with: while true; do cat /etc/passwd > /dev/null; done

Are there any constraints on using rate limiting, or have I misunderstood the concept?

Tetragon Version

1.0.1

Kernel Version

5.15

Kubernetes Version

1.24

Bugtool

No response

Relevant log output

No response

Anything else?

No response

@vladimirkus vladimirkus added the kind/bug Something isn't working label Dec 28, 2023
@kevsecurity
Contributor

Hello

Rate limiting is per-thread, and your test starts numerous 'cat' processes one after another, each with a different thread ID, so no rate limiting occurs. The idea is to limit a single thread's repeated events, not repeated events in general.

I turned the rateLimit value in the policy down to 3s for testing and used the following C program to generate test events:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
        int fd;
        int count = 1;
        char buf[65536];
        struct stat st;

        while (1) {
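                /* Every iteration runs on the same thread, so the per-thread
                 * rate limit applies: only one event per rateLimit window
                 * should be posted for the repeated accesses below. */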
                fd = open("/etc/passwd", O_RDONLY);
                if (fd < 0) {
                        printf("open failed: %s\n", strerror(errno));
                        exit(1);
                }
                if (fstat(fd, &st) < 0) {
                        printf("fstat failed: %s\n", strerror(errno));
                        exit(1);
                }
                if (read(fd, buf, st.st_size) <= 0) {
                        printf("read failed: %s\n", strerror(errno));
                        exit(1);
                }
                printf("Opened /etc/passwd: %7d\n", count);
                count++;
                close(fd);
                sleep(1);
        }
        return 0;
}

It wouldn't be difficult to add a switch to the rate limiting function to make it apply to all threads. Let me know if that would be helpful.

@kevsecurity
Contributor

I've updated the docs in PR #1961.

@kevsecurity
Contributor

Also, I made PR #1962, which lets you change the scope of the rate limiting from "thread" to "process" or "global". Let me know if it helps in your case. Thanks!
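For illustration, here is a minimal sketch of how the new field slots into the matchActions of the policy above (scope values per the PR: "thread", the default, "process", or "global"):

    selectors:
    - matchActions:
      - action: Post
        rateLimit: 1m
        rateLimitScope: global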

@kevsecurity
Contributor

These have been merged and will be available in the next release. Otherwise, use a ci:latest image if you want to try them out.

@vladimirkus
Author

Hello, and thank you; this should be very helpful in our case. I'm going to test it with the CI tag.

@vladimirkus
Author

I've tested it with the same configuration, just adding rateLimitScope: "global" and rateLimit: "10s".
Rate limiting itself works as expected! Though there are a few questions I would like to ask:

  • I've noticed a different CPU and RAM usage profile for the Tetragon pod. With rate limiting enabled, I saw a ~30% increase in RAM usage and CPU spikes of up to 60% of one core, which is very unusual. Is this expected?
  • The docs say that only the first 40 bytes of each argument are used in matching. Does that mean that, if I have multiple independent events under some long file path, some of them may be lost due to this limitation? Is it possible to raise this limit?

@kevsecurity
Contributor

Thanks for testing. Rate limiting itself shouldn't be expensive, but maybe other changes between 1.0.0 and ci:latest have impacted RAM and CPU. I'm less concerned about the spikes unless they add up to a problem. Do you have a reproducer that we could try?

Regarding the number of bytes used in rate limiting, this is a balance between RAM/CPU cost and effectiveness. As the size increases, so does the size of the key used to store rate-limiting info (hence RAM), and since the key is hashed, a bigger key increases the work required to look it up (hence CPU). Where rate limiting doesn't end up suppressing anything, this is a cost with little benefit; where it does, it saves CPU by avoiding the processing of multiple identical events in user space (including the task switching from kernel to user space). Hence the trade-off.
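To picture the trade-off, here is a hypothetical sketch (not Tetragon's actual layout) of what a rate-limiting map entry implies; the argument data is part of the hashed key, so widening it from 40 bytes grows every entry and every lookup:

    #include <stdint.h>

    /* Hypothetical illustration only: field names and sizes are assumptions,
     * not Tetragon's real data structures. */
    struct ratelimit_key {
            uint64_t action_id;  /* which matchAction this entry belongs to */
            uint32_t scope_id;   /* thread/process id, or a constant for global scope */
            uint8_t  data[40];   /* first 40 bytes of the event's argument data */
    };

    struct ratelimit_value {
            uint64_t last_ns;    /* last post time; post again only when
                                  * now - last_ns exceeds the rateLimit window */
    };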

You are correct that only the first 40 bytes of paths are compared, so information can be lost. I would recommend not using rate limiting on these kinds of arguments/hooks. It was originally intended for networking operations, where 40 bytes is the size of an IPv6 tuple, but it can be used anywhere arguments are unique within their first 40 bytes. We could open an issue to make this configurable, and potentially to allow using the end of the argument rather than the beginning (though I think that just moves the problem). How much of a problem do you think it is?
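To make the limitation concrete, a quick self-contained check with hypothetical paths whose shared directory prefix is exactly 40 bytes:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            /* Hypothetical paths: identical in their first 40 bytes, so a
             * 40-byte rate-limit key cannot tell them apart, and events for
             * the second file could be suppressed as duplicates of the first. */
            const char *a = "/var/lib/application/tenants/00001/logs/a.log";
            const char *b = "/var/lib/application/tenants/00001/logs/b.log";
            printf("first 40 bytes equal: %s\n",
                   strncmp(a, b, 40) == 0 ? "yes" : "no");
            return 0;
    }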

Thanks

@vladimirkus
Author

vladimirkus commented Jan 18, 2024

The test I did was the same; I ran
while true; do cat /etc/passwd > /dev/null; done
and
while true; do cat /etc/passwd > /dev/null; sleep 1; done
with and without rateLimit, using the policy from the issue description.
I understand that this test doesn't reflect real workloads' behavior, and I would like to test it on a real cluster to be sure whether there is any issue.

Regarding the length limit, I agree with you and don't really think it's a big issue. While testing I realized that this functionality is not the best fit for filesystem events, as it was initially developed for fixed-size arguments such as IP addresses; socket events are the best fit.

In our environment, however, most of the event volume is generated by security_file_permission, and we intend to lower it, but I see that it's better to consider a different approach.

@kkourt kkourt added kind/enhancement This improves or streamlines existing functionality and removed kind/bug Something isn't working labels Feb 19, 2024
@kkourt kkourt changed the title Applied rateLimit doesn't work Applied rateLimit doesn't work well for file policies (e.g., security_file_permission) Feb 19, 2024
@kkourt
Contributor

kkourt commented Feb 19, 2024

@vladimirkus Based on the discussion, I've modified the title and labels of the issue to make it an enhancement rather than a bug. We could also create a new summary if that's better.

@kevsecurity
Contributor

  • I've noticed a different CPU and RAM usage profile for the Tetragon pod. With rate limiting enabled, I saw a ~30% increase in RAM usage and CPU spikes of up to 60% of one core, which is very unusual. Is this expected?

@vladimirkus I finally found time to do some tests. On this test system (kernel 5.15, 8G RAM, x64 kernel VM), memory usage was pretty consistent between the rateLimit and no-rateLimit scenarios (~14.5% for both in the first test, ~1.3% for both in the second), and I saw CPU spikes in both scenarios for the first test (24% -> 66% with rateLimit, 40% -> 92% without) but none for the second test.

I like that the rateLimit scenario had a lower overall CPU profile (as predicted) and assume that the CPU spikes are related to Tetragon threads waking up and handling events.

If you're happy, I think we could close this issue.

@vladimirkus
Author

@kevsecurity
Hello, and sorry for the long pause. I had a chance to test it more on our production clusters using v1.1.0, and I can confirm that memory and CPU usage are pretty consistent this time, with perhaps a small overhead, and the functionality works as expected. I agree that the issue should be closed. Thank you for your help!
