mov -, mutex_acq; compiles to .never #30
Comments
Hm, you say that the missing assignment to the nop register prevents the mutex from working correctly? AFAIK the I/O registers do not care about the condition bits at all; it would be really helpful if they did.
Sorry for the late reply. Still figuring out what's wrong and what isn't. I'm not entirely sure the mutex works at all; if it does, then terribly unreliably. I have several QPUs processing a camera frame in tiles, each reading via the TMU and writing through its own dedicated VPM space. However, as soon as I write an address to the TMU and issue ldtmu0, it stops working and shows varying symptoms. Simple nop;s in place of the TMU access do not trigger these. I then tried synchronizing VPM access with the mutex, but that doesn't seem to work reliably either. When I lower the framerate, it manages to stay alive considerably longer. And when I protect not each individual VPM write but a whole block or line, effectively decreasing the rate at which the mutex is accessed, it also stays alive longer. Will have to do some more tests, it seems...
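For illustration, a block-level protected VPM write might look roughly like this in vc4asm syntax (just a sketch: the vw_setup value and the mutex_rel register name are assumptions for this example, not taken from the actual code):
mov -, mutex_acq;        # acquire: the read blocks until the mutex is free
mov vw_setup, 0x1a00;    # illustrative setup: horizontal 32-bit writes, stride 1, starting at row 0
mov vpm, r0;             # several VPM writes under a single acquire/release pair
mov vpm, r1;
mov vpm, r2;
mov vpm, r3;
mov mutex_rel, 0;        # release; register name assumed, check vc4.qinc for your version
This keeps the mutex traffic to one acquire and one release per block instead of one per element.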
I did some tests with mutex access. It turned out that neither the condition code on the assignment nor whether the value is actually consumed has any impact - the mutex is acquired in either case. So I see no reason why the mentioned optimization should not be applied.
Sorry for not coming back to this, I worked on other parts of my project. It seems my problem is a different one then; however, I'm still confused how, if the mutex is acquired and released, some of my code breaks. E.g. I acquire the mutex at program start and release it at program end - shouldn't that be completely equivalent to running these programs one right after another? Yet even this simple change causes the program to start overwriting physical memory after a few executions, which makes me believe the mutex acquisition is either unreliable or does not work at all.
Adding TMU has two effects:
I am not sure whether this is really related to the TMU queue depth but at least it is much more reliable when no more than 4 TMU load cycles are queued. Increasing the TMU cache hit rate (if possible) also decreases the probability of faults, probably because of the reduced memory pressure. |
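To make the queue-depth point concrete, here is a sketch in vc4asm syntax (t0s as the TMU0_S address register and the ra registers holding addresses and results are assumptions for this example):
mov t0s, ra0;            # queue TMU request 1 by writing its address to TMU0_S
mov t0s, ra1;            # request 2
mov t0s, ra2;            # request 3
mov t0s, ra3;            # request 4 - do not queue a fifth before collecting a result
nop; ldtmu0;             # result of request 1 arrives in r4
mov ra16, r4;            # consume it
mov t0s, ra4;            # only now queue the next request
nop; ldtmu0;             # result of request 2
mov ra17, r4;
Keeping at most four requests in flight matches the behaviour described above.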
By the way: I just had a look at your qpu_blit_tiled code. You only use mutex acquire, never release. I wonder whether this works at all - but maybe the mutex is recursive and thrend implies a release. - Just tested: the mutex is recursive, but it does not count the number of acquires; a recursive acquire is just a nop. Furthermore, whichever QPU completes its code first will raise the host interrupt, causing the firmware to think that all code has completed, regardless of whether other QPUs are still working. Note that this will not stop the other QPUs from proceeding, and if you reuse the memory quickly you will get serious race conditions.
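For what it's worth, the pairing I would expect at program end, sketched in vc4asm syntax (mutex_rel as the release register name is my assumption; a single release suffices because acquires are not counted):
mov -, mutex_acq;        # acquire (a second acquire by the same QPU is just a nop)
# ... critical section ...
mov mutex_rel, 0;        # one release frees the mutex
nop; thrend;             # end the thread only after releasing
nop;
nop;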
Thanks for taking the time to try to help me out, it's greatly appreciated. I temporarily removed the cache clearing, but the camera frames are unfortunately still too large, so the cache is completely thrashed every frame and the load barely drops. Debugging verifies that the only cache hits are the program code (and, in certain circumstances, the uniforms). During some tests I also checked the stall rate during normal operation to see if there were any anomalies. Compared to the blit_full programs, where I calculate about a 25% stall rate (assuming my debug calculations are correct), blit_tiled reduces the stall rate to about half that, 13% with three programs. Both wait for completion right after I set the address - could this be a problem? So I made some changes to make debugging easier and did a lot of tests over the weekend.
So I had a problem with synchronized VPM access for a while, and it turns out the assembler reduces the following instruction into something - for me - unexpected:
mov -, mutex_acq;
or
read mutex_acq; (with an accompanying instruction)
Expected action: acquire the mutex (read from mutex_acq)
Actually generated code:
or.never nop, mutex_acq, nop;
Expected code that does acquire the mutex:
or nop, mutex_acq, nop;
Workaround (setf is required so that the .never flag is not set automatically):
or.setf nop, mutex_acq, nop;
I presume that for other reads with side effects the .never condition is correct and a valid optimisation to save ALU power, but for the mutex it does not seem to work.