
mov -, mutex_acq; compiles to .never #30

Closed
Seneral opened this issue Apr 29, 2020 · 7 comments
@Seneral

Seneral commented Apr 29, 2020

So I have had a problem with synchronized VPM access for a while, and it turns out the assembler reduces the following instruction to something unexpected (for me):
mov -, mutex_acq; or read mutex_acq; (the latter combined with an accompanying instruction)
Expected action: acquire the mutex (read from mutex_acq)
Actually generated code:
or.never nop, mutex_acq, nop;
Expected code that does acquire the mutex:
or nop, mutex_acq, nop;
Workaround (setf is required to keep the assembler from automatically applying the .never condition):
or.setf nop, mutex_acq, nop;

I presume that for other signalling reads the .never condition is correct and a valid optimisation to save ALU power, but for the mutex it does not seem to work.
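
For illustration, here is a minimal sketch of the pattern I am after, with the workaround applied (the VPM part is just a placeholder, not my actual code):

or.setf nop, mutex_acq, nop;   # acquire the mutex (workaround: .setf prevents .never)
# ... VPM write setup and the actual writes for this QPU's tile would go here ...
mov mutex, 0;                  # release the mutex once the VPM access is done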

@maazl
Owner

maazl commented Apr 30, 2020

Hm, you say that the missing assignment to the nop register prevents the mutex from working correctly? AFAIK the I/O registers do not care about the condition bits at all; it would be really helpful if they did. Usually read IOREG is sufficient to trigger the logic, and that does not even allocate an ALU.
If you can confirm that mutex_acq can be masked by a condition, preferably not only by .never, we should add this to the documentation and, of course, vc4asm should then not modify the conditions of I/O register accesses.
By the way, why did you not get deadlocks when the mutex access is effectively eliminated?
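
Something like this could check it (a sketch of my own, untested; I am assuming the elem_num register and the .ifz condition as named in vc4asm, the exact spelling may differ):

mov.setf nop, elem_num;        # set flags from the per-element index (Z only for element 0)
or.ifz  nop, mutex_acq, nop;   # conditional read: does a condition other than .never still acquire?
mov     mutex, 0;              # release again afterwards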

@Seneral
Author

Seneral commented May 2, 2020

Sorry for the late reply. Still figuring out what's wrong and what isn't. I'm not entirely sure whether the mutex works at all; if it does, it does so terribly unreliably. I have several QPUs processing a camera frame in tiles, each reading via the TMU and writing to its own dedicated VPM space.
The program itself, with TMU and everything, works flawlessly when synchronization is left out (tested by executing the tiles one after another, CPU-side).
Writing to the VPM alone works fine without any mutex at all, even when all QPUs are executing at the same time.

However, as soon as I write an address to the TMU and issue ldtmu0, it no longer works and shows varying symptoms. Simple nop;s in place of the TMU access do not trigger these.

Then I tried synchronizing VPM access with the mutex, but this doesn't seem to work reliably either.
In certain constellations it works for a few seconds (a few dozen frames) and then just stalls.
In other constellations it starts overwriting the whole memory, with no apparent reason.

However, when I lower the framerate, it manages to stay alive considerably longer. And when I protect not each individual VPM write but a whole block or line, effectively decreasing the rate at which the mutex is accessed, it also stays alive longer.

Will have to do some more tests it seems...

@maazl
Owner

maazl commented Jun 12, 2020

I did some tests with mutex access. It turned out that neither a conditional assignment nor actually reading the value has any impact: read mutex reliably acquires the mutex, and mov.never mutex, 0 reliably releases it. There is no need to use the value read from mutex at all.

So I see no reason why the mentioned optimization should not be applied.
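
In other words (a short sketch summarizing the observation, not taken from the test code itself):

mov -, mutex_acq;      # assembles to or.never nop, mutex_acq, nop; and still acquires the mutex
mov.never mutex, 0;    # still releases the mutex despite the .never condition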

maazl closed this as completed Jun 12, 2020
@Seneral
Author

Seneral commented Jun 12, 2020

Sorry for not coming back to this; I have been working on other parts of my project. It seems my problem is a different one then. However, I'm still confused how, if the mutex is acquired and released, some of my code breaks. E.g. I acquire the mutex at program start and release it at program end - shouldn't that be completely equivalent to running these programs one right after another? Yet even this simple change causes the program to start overwriting physical memory after a few executions, which makes me believe the mutex acquisition is either not reliable or does not work at all.
There's more to it: different variations of this break in slightly different ways, and I've written it down here in case you are interested. I believe this might be an issue of mutex access frequency. I've gotten it to work at very low frequencies (10Hz with 6 concurrent programs), but it reliably breaks at even 40Hz (again with 6 programs).
However, that did not make it clear to me what exactly the problem is either.

@maazl
Owner

maazl commented Jun 12, 2020

Adding TMU access has two effects:

  1. It adds TMU stalls, i.e. scheduling will switch between QPUs more often. Although the QPUs are independent, they are usually limited by the (common) memory bandwidth, so the sequence of execution may change significantly.
  2. The TMU may create a significant memory load. Did you use the full 8-slot pipeline depth of the TMU? I never got this working reliably; the TMU sometimes reads rubbish in that case. If you read target addresses or loop counts via the TMU, the result can be largely undefined. Sometimes this heavy load even causes the VideoCore to crash completely.

I am not sure whether this is really related to the TMU queue depth, but at least it is much more reliable when no more than 4 TMU load cycles are queued (see the sketch below). Increasing the TMU cache hit rate (if possible) also decreases the probability of faults, probably because of the reduced memory pressure.
With respect to this, it could make a significant difference in which order the memory items are fetched. The TMU of a QPU always fetches them exactly in order, i.e. first element 0, then element 1, ... up to element 15. If one of these accesses causes aliasing in the non-associative TMU cache, it can significantly degrade performance and increase memory pressure.
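
A minimal sketch of the queue-depth limit, just to illustrate the pattern (my own example, not code from this issue; I am assuming t0s as the TMU0 address register and ldtmu0 as the load signal, as provided by vc4.qinc):

mov t0s, ra0;   # queue TMU load 0
mov t0s, ra1;   # queue TMU load 1
mov t0s, ra2;   # queue TMU load 2
mov t0s, ra3;   # queue TMU load 3 - keep at most 4 requests outstanding
nop;  ldtmu0    # wait for and receive load 0 into r4 before queueing more
mov r0, r4;     # consume the loaded value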

@maazl
Owner

maazl commented Jun 12, 2020

By the way: I just had a look at your qpu_blit_tiled code. You only use mutex acquire, no release at all. I wonder whether this works at all. But maybe the mutex is recursive and thrend implies a release. - Just tested: the mutex is recursive, but it does not count the number of acquires; a recursive acquire is just a nop.

Furthermore, whichever QPU completes its code first will raise the host interrupt, causing the firmware to think that all code has completed, regardless of whether other QPUs are still working. Note that this will not stop the other QPUs from proceeding. And if you reuse the memory quickly, you will get serious race conditions.
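
For what it's worth, a sketch of an explicit end-of-program sequence (my own, relying on the assumptions above; releasing explicitly avoids depending on thrend implying a release):

mov mutex, 0;   # release the mutex before ending the thread
thrend          # end of thread
nop             # delay slot 1
nop             # delay slot 2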

@Seneral
Author

Seneral commented Jun 15, 2020

Thanks for taking the time to try to help me out; it's greatly appreciated.
It currently uses only one TMU slot per program and then waits; I didn't want to optimize before even getting it to work. I do release the mutex - the releases are marked with the block comment (e.g. on line 140).
On the host side I wait until all QPUs have finished; I do this by waiting until the appropriate QPU register (V3D_SRQCS) reaches the expected value, indicating that all programs have finished. After that there should be several milliseconds until the QPU is used again (up to 100ms at 10fps - no difference).

I temporarily removed the cache clearing; however, the camera frames are unfortunately still too large, so the cache is completely thrashed every frame anyway, barely reducing the load. Debugging verifies that the only cache hits are the program code (and in certain circumstances the uniforms).

During some tests I also checked the stall rate during normal operation to see if there were any anomalies. Compared to the blit_full programs, where I calculate about a 25% stall rate (assuming my debug calculations are correct), blit_tiled reduces the stall rate to about half that, 13% with three programs. Both wait for completion right after I set the address - could this be a problem?
However, my calculations also say that it generally uses less than 1% of the practical TMU bandwidth (assuming every TMU read is a cache miss, so ~20 cycles). So the TMU doesn't seem pressured at all.

So I made some changes to make debugging easier and did a lot of tests over the weekend.
Since this is a whole lot of information, I've put it down in its own issue. Feel free to take a look at it; I have no clue why the behaviour is the way it is, and I've seen other code working just fine with the information given in the VC4 documentation. I have yet to find the reason for this behaviour, so any input is greatly appreciated. Note that in Code 5 the addition of one nop; - where, according to the documentation, it should make no difference - creates different behaviour depending on which QPUs the code runs on. By now it's obviously not an issue of the compiler.
