esp32s3: UART rx_fifo_count() produces garbage after RxError::FifoOverflowed. #3168
Thanks! There might be something here - especially about the assumed and actual behavior. If the hardware gets "out of sync" here, it should be reproducible with relatively simple synchronous code I guess, but I wasn't able to force this.
I haven't looked too much into the code (it's quite a lot), but if the code isn't doing anything special, TX and RX are more or less independent. Without any flow control, TX will happily continue sending data - on the RX side the overflow flag will be set, but otherwise it will continue to receive data. This non-async code shows it received what was sent (minus the overflowed data):

```rust
//% CHIPS: esp32c2 esp32c3 esp32c6 esp32h2 esp32s2 esp32s3
//% FEATURES: embassy esp-hal/unstable

#![no_std]
#![no_main]

use embedded_io::Write;
use esp_backtrace as _;
use esp_hal::uart::Uart;
use esp_println::println;

#[esp_hal::main]
fn main() -> ! {
    esp_println::println!("Init!");

    let peripherals = esp_hal::init(esp_hal::Config::default());

    let mut uart = Uart::new(peripherals.UART1, esp_hal::uart::Config::default())
        .unwrap()
        .with_tx(peripherals.GPIO4)
        .with_rx(peripherals.GPIO5);

    // Write something to force an RX overflow (nothing is read back yet).
    let mut buf = [0; 70];
    for i in 0..buf.len() {
        buf[i] = i as u8;
    }
    uart.write_all(&buf).ok();
    esp_hal::delay::Delay::new().delay_millis(500);

    // Write even more without reading.
    let mut buf = [0; 70];
    for i in 0..buf.len() {
        buf[i] = i as u8;
    }
    uart.write_all(&buf).ok();
    esp_hal::delay::Delay::new().delay_millis(500);

    let mut buf = [0u8; 256];
    for i in 0..buf.len() {
        buf[i] = (i % 255) as u8;
    }

    loop {
        for chunk in buf.chunks(64) {
            // Read what we can.
            for _ in 0..1000 {
                let mut rbuf = [0u8; 1024];
                let res = uart.read_buffered(&mut rbuf);
                match res {
                    Ok(res) => {
                        if res > 0 {
                            println!("Received: {:?}", &rbuf[..res]);
                        }
                    }
                    Err(e) => {
                        println!("{:?}", e);
                    }
                }
            }
            // (Try to) write a chunk of data - don't care if we can't write all of it.
            uart.write(chunk).unwrap();
        }
        esp_hal::delay::Delay::new().delay_millis(1000);
    }
}
```
Sorry for not providing a smaller example, but I was not sure that it would be that easy to simplify, as I had not been able to formulate a convincing theory of the root cause. I later realised that I had misunderstood where a sleep was, and that I had failed to collect the HW registers at all places. I have now pushed a much simplified example with commit id 9136ccb084dcd0f8b5b559c635672bcec384c91f. There one can see that there is an increased risk of rx_fifo_cnt being corrupted if the delays after calling tx.flush() and rx.read_async() are kept too low. But I would not expect that a delay should be needed at all... I am also a bit concerned that my larger example still runs into the problem even though it waits a lot more. It might be that what is needed is that it sleeps until the TX UART has actually been quiet for some specific time. Also note that I am using RxConfig::default().with_timeout(10), as we need to be triggered when a sentence is received.
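Editor's note: for readers following along, here is a minimal sketch of how such an RX idle timeout might be configured. `RxConfig::default().with_timeout(10)` is quoted from the comment above; the `Config::with_rx` builder and the surrounding setup are assumptions modelled on the reproducer earlier in this thread, not confirmed API.

```rust
#![no_std]
#![no_main]

use esp_backtrace as _;
use esp_hal::uart::{Config, RxConfig, Uart};

#[esp_hal::main]
fn main() -> ! {
    let peripherals = esp_hal::init(esp_hal::Config::default());

    // Assumed builder API: Config::with_rx() wrapping the
    // RxConfig::default().with_timeout(10) quoted in the comment above.
    // The timeout counts idle symbol times on the RX line, so a read can
    // complete shortly after a sentence ends instead of waiting for a
    // full buffer.
    let config = Config::default().with_rx(RxConfig::default().with_timeout(10));

    let mut _uart = Uart::new(peripherals.UART1, config)
        .unwrap()
        .with_tx(peripherals.GPIO4)
        .with_rx(peripherals.GPIO5);

    loop {}
}
```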
Thanks, that reproducer looks much more manageable.
Seems likely that our memory read/write impl for Xtensa in probe-rs is biting us here. We should investigate improving that. The TL;DR is that currently, when the debugger polls for new defmt messages, it blocks the system bus, which will often stall the CPU if it's trying to access the bus as well.
@mattiasgronlund Could you try out #3186 and report back?
Impressive to see the progress in esp-hal. Feels like #3186 fixed a real issue, but sadly not this one; I still see corruptions (at least when running with probe-rs and defmt). @bjoernQ I think you are partially right regarding:
But I think that is only true as long as it is a pure SW problem. I am quite sure there is an at least not-yet-documented HW problem here, which shares some commonalities with 03-errata-description/esp32/uart-fifo-cnt-indicates-data-length-incorrectly, as both share the exact same symptom:
I am in no position at all to judge if the reason is identical, but as @MabezDev noted, it could be related to probe-rs. As this very much seems like a new errata for the ESP32S3, what is the best way to approach this?
I wonder why your reproducer works perfectly fine for me. I had to re-add previous wifi code for the reproducer to actually break for me. What probe-rs version do you use, and in general what environment? Are you maybe using probe-rs through usbip?
Why do you think that's the case?
I did test it with WIFI, as our actual use case surprisingly enough includes WIFI. I have no clue what usbip is; I have connected the ESP32s3 board through a normal USB cable. It feels like an errata, as it has the exact same symptom as the existing one for the ESP32, and I have not been able to find this "feature" in the documentation specific to the ESP32s3. The (erratic?) feature being that UART_RXFIFO_CNT sometimes is not in sync with UART_APB_RX_RADDR and UART_RX_WADDR, but instead shows a lower count than the bytes actually in the buffer. esp-hal (which rightfully trusts UART_RXFIFO_CNT) then gets into a habit of buffering the last byte(s) until more bytes arrive - a feature which can have very surprising side effects if messages arrive sporadically with low frequency...
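Editor's note: a hedged sketch of how the three registers named above could be dumped for comparison. The register and field names (STATUS.RXFIFO_CNT, MEM_RX_STATUS.APB_RX_RADDR, MEM_RX_STATUS.RX_WADDR) are taken from this issue and the TRM; the method-style accessors assume a recent svd2rust-generated esp32s3 PAC.

```rust
// Hedged sketch: dump UART1's raw FIFO bookkeeping on the ESP32-S3 as
// (RXFIFO_CNT, APB_RX_RADDR, RX_WADDR). Field names follow the TRM and
// this issue; the accessor style assumes a recent svd2rust-generated
// esp32s3 PAC.
fn uart1_fifo_state() -> (u16, u16, u16) {
    let uart1 = unsafe { &*esp32s3::UART1::ptr() };
    let status = uart1.status().read();
    let mem_rx_status = uart1.mem_rx_status().read();
    (
        status.rxfifo_cnt().bits() as u16,
        mem_rx_status.apb_rx_raddr().bits() as u16,
        mem_rx_status.rx_waddr().bits() as u16,
    )
}
```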
The problem is, the ESP32S3 doesn't have a DPORT bus, so I would expect that problems related to it don't exist on the S3. What you see may be that your FIFO overflows so much that the symptoms become "flipped". But alas, I can't reproduce that. I suspect that for some reason your PC introduces a lot of latency. For the time being (i.e. for as long as we find issues to fix), I assume this is a software issue.
But what I cannot understand (which is why I suspect HW engineers are needed) is how the HW registers can be out of sync; they should be able to describe, just like in the errata, the root cause for the inconsistency in registers which SW has no write access to.

Yes, but then it is best to keep the tougher test around as a reference, since as long as it fails, more fixes are needed: mattiasgronlund/esp32s3-uart-stability@fef175a, or mattiasgronlund/esp32s3-uart-stability@e3ecb65 if you patch esp-hal to provide the counters... The test is tuned to be quite nice to esp-hal as it is right now, so when it passes, it is probably of value to try tuning it a bit more to see whether more SW fixes are needed to prevent the problem.
I guess the pointers keep walking around even when the counter itself caps out at 128. The counter is how many bytes you did not read, but the write pointer will move around regardless if you still fail to process the FIFO. It's not that they are out of sync; it's that you assume something undocumented, and that may not be what is happening. You're only looking at the pointers because of an unrelated error in a (mostly) unrelated chip, so maybe using this as the baseline for correct behaviour isn't as fruitful as you assume.

This is pretty easy to verify (unless it's 11PM) by essentially writing single bytes, flushing, then not reading them to see how the counter and write pointer change, then reading without writing to see how the counter and read pointer move, too. Throwing 128 bytes at a time at this just means we're guessing for no good reason, I think. After this, if we have something definite about the overflow behaviour of these, we can deduce how UART is supposed to work under load.
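Editor's note: a hedged sketch of that single-byte experiment, reusing the blocking `Uart` API from the reproducer earlier in the thread and the hypothetical `uart1_fifo_state()` register dump sketched a few comments above. The `Uart<'_, Blocking>` type path and the exact signatures of `write`/`flush`/`read_buffered` are assumptions about the current esp-hal API.

```rust
use embedded_io::Write as _;
use esp_hal::{delay::Delay, uart::Uart, Blocking};
use esp_println::println;

// Hedged sketch of the suggested experiment: feed the FIFO one byte at a
// time without reading, then drain it one byte at a time, printing
// RXFIFO_CNT and the read/write pointers after every step. Assumes TX is
// looped back to RX as in the reproducer, and reuses the hypothetical
// uart1_fifo_state() helper from the sketch above.
fn probe_fifo_behaviour(uart: &mut Uart<'_, Blocking>) {
    let delay = Delay::new();

    // Phase 1: write single bytes, flush, do NOT read.
    for b in 0u8..10 {
        uart.write(&[b]).ok();
        let _ = uart.flush();
        delay.delay_millis(5);
        let (cnt, raddr, waddr) = uart1_fifo_state();
        println!("after write {}: cnt={} raddr={} waddr={}", b, cnt, raddr, waddr);
    }

    // Phase 2: read without writing and watch the counter and read pointer.
    let mut rbuf = [0u8; 1];
    while let Ok(n) = uart.read_buffered(&mut rbuf) {
        if n == 0 {
            break;
        }
        let (cnt, raddr, waddr) = uart1_fifo_state();
        println!("read {:?}: cnt={} raddr={} waddr={}", &rbuf[..n], cnt, raddr, waddr);
    }
}
```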
Yes, 11:00 PM is not the right time to dig deeper into it... Let's be clear, I assume that there will be problems when the system is under load. After an overflow event, I assume (correct me if I am wrong) that a valid recovery procedure would be to use read_buffered() with a buffer larger than the HW buffer, and just ignore these bytes. After this recovery procedure I should expect that looping read_async() (with timeout configured) would provide me all bytes sent by the transmitter, shortly (100 ms?) after the last bit was transmitted. If there isn't a bug in the test code, I would say that it shows this is not what I can expect after the recovery procedure.

Do not get me wrong, I think you are doing a terrific job here, and what I have been providing is just a procedure provoking the problem. As always, to bisect a problem you need one working case and one failing case. Based on these two cases, theories regarding how to create test cases in between which either have or do not have the problem are the way forward to nailing down more exactly what the root cause is. So as of now, we have concluded that:
And I think the idea used here, to identify how the test can be split into simpler tests covering possible root causes, is the right way to go. I am not sure if I am able to contribute anything for a couple of days, but I intend to use the base I have to see if I can rework the test to, instead of "scanning for edge cases", focus on whether the problem can be provoked by other means than delaying the reader solely by it being down-prioritised, e.g. WIFI preventing the task from calling async_read() in time, or async_read() not being prioritised to complete its work in time.
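Editor's note: to make the recovery procedure described two comments above concrete, here is a hedged async sketch. `read_async()`, `read_buffered()` and `RxError::FifoOverflowed` are the names used in this thread; the exact signatures, the `Uart<'_, Async>` type path, and the availability of `read_buffered()` on the async driver are assumptions about the current esp-hal API.

```rust
use esp_hal::{
    uart::{RxError, Uart},
    Async,
};

// Hedged sketch of the recovery idea described above: when a read fails
// with FifoOverflowed, drain and discard whatever is still sitting in the
// hardware FIFO via read_buffered() into a buffer larger than the 128-byte
// FIFO, then resume normal read_async() reception (with the RX timeout
// configured) and return the next successfully read chunk.
async fn read_with_overflow_recovery(
    uart: &mut Uart<'_, Async>,
    out: &mut [u8],
) -> usize {
    loop {
        match uart.read_async(out).await {
            Ok(n) => return n,
            Err(RxError::FifoOverflowed) => {
                // Recovery step: throw away the stale FIFO contents so the
                // driver's bookkeeping can start fresh.
                let mut scratch = [0u8; 256];
                let _ = uart.read_buffered(&mut scratch);
            }
            Err(_) => {
                // Other RX errors are ignored and retried in this sketch.
            }
        }
    }
}
```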
OK, I see what you mean by desync (the read and write pointers are independent, and with an overflow they can get weird w.r.t. the FIFO count). I can drive the FIFO into a state where it reports a single byte in it, but reads back a different value than I last wrote. I'll see what I can do about that.
@mattiasgronlund Do you mind trying #3190?
Glad to hear things are working correctly now, @mattiasgronlund! Closing.

edit: oops, didn't realize the PR wasn't merged yet, but it should be merged shortly.
Bug description
A few reads after `RxError::FifoOverflowed`, reads from the UART start to produce garbage. It seems like the root cause is related to `status.rxfifo_cnt` being out of sync with `mem_rx_status.apb_rx_raddr` and `status.rx_waddr()`.

The strange thing is that it works fine to read data if it is done fast enough. But then, after waiting one second (after the TX UART has stopped sending data), the registers have become out of sync.
See `56.758957 INFO UART_0: Before reading garbage, fifo_cnt=(0, 512, 576)` in the less filtered log below, where fifo_cnt all of a sudden is out of sync with the pointers...

To Reproduce
It should be possible to reproduce with the code I created to identify the problem; it expects an ESP32s3 board with one UART TX connected to all UART RX ports.
https://github.com/mattiasgronlund/esp32s3-uart-stability
CommitId: c09e61e3f7375b9640cd4bf3b1cef4197b6d04b1
This code relies on a small patch to esp-hal to showcase the problem (Patch #1 below), and another small patch for esp-hal to showcase a proof-of-concept fix (Patch #2 below).
Expected behavior
Running the test program and filtering the defmt output for PROBLEM (grep -e PROBLEM) should create something like:
Actual behavior
In the actual behavior, one should look closer at the `FifoCnt { fifo_cnt: 0, apb_rx_raddr: 512, rx_waddr: 576 }` information, where it can be seen that fifo_cnt is not in sync with apb_rx_raddr and rx_waddr.
Added more unfiltered logs at the end where one can see more exactly when they got out of sync...
Example:
I expect that after receiving `RxError::FifoOverflowed`, and as in the example: the data later written by the writer is received unaltered by the receiver.
Patches
Patch #1
Patch #2
Environment
Less filtered logs