Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

esp32s3: UART rx_fifo_count() produces garbage after RxError::FifoOverflowed. #3168

Closed
mattiasgronlund opened this issue Feb 23, 2025 · 17 comments · Fixed by #3190
Closed

esp32s3: UART rx_fifo_count() produces garbage after RxError::FifoOverflowed. #3168

mattiasgronlund opened this issue Feb 23, 2025 · 17 comments · Fixed by #3190
Assignees
Labels
bug Something isn't working

Comments

@mattiasgronlund
Copy link

Bug description

A few reads after RxError::FifoOverflowed, reads from the UART starts to produce garbage.

It seems like the root cause is related to status.rxfifo_cnt being out of sync with
mem_rx_status.apb_rx_raddr and status.rx_waddr().

The strange thing is that it works fine to read data, if it is done fast enough? But then after waiting one second (after the TX-UART has stopped sending data), the registers has become out of sync.
See 56.758957 INFO UART_0: Before reading garbage, fifo_cnt=(0, 512, 576) in the less filtered log below, where fifo_cnt all of a sudden is out of sync with the pointers...

To Reproduce

It should be possible to reproduce with the code I created to identify the problem, is expects an ESP32s3 board with one UART TX connected to all UART RX ports.

https://github.com/mattiasgronlund/esp32s3-uart-stability
CommitId: c09e61e3f7375b9640cd4bf3b1cef4197b6d04b1

This code relies on a small patch to for esp-hal to showcase the problem (Patch #1 below), and another small patch for esp-hal to showcase a prof-of-concept fix (Patch #2 below)

Expected behavior

Running the test program and piping the defmt output for PROBLEM (grep -e PROBLEM), should creating something like:

42.937568 ERROR UART_0: PROBLEM FifoOverflowed
102.028365 ERROR UART_0: PROBLEM FifoOverflowed

Actual behavior

In the actual behavior, one should look closer to the FifoCnt { fifo_cnt: 0, apb_rx_raddr: 512, rx_waddr: 576 }. information, where it can be seen that fifo_cnt is not in sync with apb_rx_raddr and rx_waddr.

Added more unfiltered logs at the end where one can see more exactly when they got out of sync...

Example:

55.458083 ERROR UART_0: PROBLEM FifoOverflowed
57.932251 ERROR UART_0: PROBLEM [0] 56.861510..57.867114 Δ1.005604s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 512, rx_waddr: 576 }, after:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 576, rx_waddr: 512 }, len=064 [] C0<00> [C1, C2, C3, C4, C5, C6, C7, C8, C9, CA, CB, CC, CD, CE, CF, D0, D1, D2, D3, D4, D5, D6, D7, D8, D9, DA, DB, DC, DD, DE, DF, E0, E1, E2, E3, E4, E5, E6, E7, E8, E9, EA, EB, EC, ED, EE, EF, F0, F1, F2, F3, F4, F5, F6, F7, F8, F9, FA, FB, FC, FD, FE, FF]
60.450043 ERROR UART_0: PROBLEM [0] 59.337644..60.370312 Δ1.032668s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 576, rx_waddr: 512 }, after:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 528, rx_waddr: 592 }, len=080 [] 40<00> [41, 42, 43, 44, 45, 46, 47, 48, 49, 4A, 4B, 4C, 4D, 4E, 4F, D0, D1, D2, D3, D4, D5, D6, D7, D8, D9, DA, DB, DC, DD, DE, DF, E0, E1, E2, E3, E4, E5, E6, E7, E8, E9, EA, EB, EC, ED, EE, EF, F0, F1, F2, F3, F4, F5, F6, F7, F8, F9, FA, FB, FC, FD, FE, FF, 00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 0A, 0B, 0C, 0D, 0E, 0F]
62.936473 ERROR UART_0: PROBLEM [0] 61.864065..62.869529 Δ1.005464s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 528, rx_waddr: 576 }, after:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 592, rx_waddr: 512 }, len=064 [] D0<00> [D1, D2, D3, D4, D5, D6, D7, D8, D9, DA, DB, DC, DD, DE, DF, E0, E1, E2, E3, E4, E5, E6, E7, E8, E9, EA, EB, EC, ED, EE, EF, F0, F1, F2, F3, F4, F5, F6, F7, F8, F9, FA, FB, FC, FD, FE, FF, 00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 0A, 0B, 0C, 0D, 0E, 0F]

I expect that after reception RxError::FifoOverflowed and as in the example:

  • The writer has stopped writing
  • The reader has read out all buffed data using read_buffered() until no more data is returned
    The data later written by the writer is received unaltered by the receiver.

Patches

Patch #1

diff --git a/esp-hal/src/uart.rs b/esp-hal/src/uart.rs
index ff56451e..b73bb168 100644
--- a/esp-hal/src/uart.rs
+++ b/esp-hal/src/uart.rs
@@ -931,6 +931,17 @@ where
         fifo_cnt
     }
 
+    /// Debug output
+    #[allow(clippy::useless_conversion)]
+    pub fn rx_fifo_count_debug(&self) -> (u16, u16, u16) {
+        let fifo_cnt: u16 = self.regs().status().read().rxfifo_cnt().bits().into();
+
+        let status = self.regs().mem_rx_status().read();
+        let rd_addr = status.apb_rx_raddr().bits();
+        let wr_addr = status.rx_waddr().bits();
+        return (fifo_cnt, rd_addr, wr_addr)
+    }
+

     /// Disables all RX-related interrupts for this UART instance.
     ///
     /// This function clears and disables the `receive FIFO full` interrupt,

Patch #2

diff --git a/esp-hal/src/uart.rs b/esp-hal/src/uart.rs
index ff56451e..d0d1a1a4 100644
--- a/esp-hal/src/uart.rs
+++ b/esp-hal/src/uart.rs
@@ -927,10 +927,38 @@ where
             }
         }
 
-        #[cfg(not(esp32))]
+        #[cfg(esp32s3)]
+        {
+            let status = self.regs().mem_rx_status().read();
+            let rd_addr = status.apb_rx_raddr().bits();
+            let wr_addr = status.rx_waddr().bits();
+
+            if wr_addr > rd_addr {
+                wr_addr - rd_addr
+            } else if wr_addr < rd_addr {
+                (wr_addr + Info::UART_FIFO_SIZE) - rd_addr
+            } else if fifo_cnt > 0 {
+                Info::UART_FIFO_SIZE
+            } else {
+                0
+            }
+        }
+
+        #[cfg(not(any(esp32, esp32s3)))]
         fifo_cnt
     }
 
+    /// Debug output
+    #[allow(clippy::useless_conversion)]
+    pub fn rx_fifo_count_debug(&self) -> (u16, u16, u16) {
+        let fifo_cnt: u16 = self.regs().status().read().rxfifo_cnt().bits().into();
+
+        let status = self.regs().mem_rx_status().read();
+        let rd_addr = status.apb_rx_raddr().bits();
+        let wr_addr = status.rx_waddr().bits();
+        return (fifo_cnt, rd_addr, wr_addr)
+    }
+
     /// Disables all RX-related interrupts for this UART instance.
     ///
     /// This function clears and disables the `receive FIFO full` interrupt,

Environment

  • Target device: esp32s3 (revision v0.1)
  • Crate name and version: esp-hal 4d07575

Less filtered logs

55.357734 INFO  UART_0: HISTORY [0] 54.180470..55.186348 Δ1.005878s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 512, rx_waddr: 512 }, after:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 576, rx_waddr: 576 }, len=064 [00..3F]
55.458040 INFO  UART_0: HISTORY [1] 55.186355..55.191930 Δ0.005575s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 576, rx_waddr: 576 }, after:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 512, rx_waddr: 512 }, len=064 [40..7F]
55.458083 ERROR UART_0: PROBLEM FifoOverflowed
55.558242 INFO  UART_0: FUTURE  [3] 55.257036..55.257061 Δ0.000025s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 512, rx_waddr: 512 }, after:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 577, rx_waddr: 577 }, len=000 []
55.658422 INFO  UART_0: FUTURE  [4] 55.257077..55.257105 Δ0.000028s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 577, rx_waddr: 577 }, after:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 512, rx_waddr: 512 }, len=096 [C0..80], [81..9F]
55.758558 INFO  UART_0: FUTURE  [5] 55.257121..55.257144 Δ0.000023s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 512, rx_waddr: 512 }, after:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 578, rx_waddr: 578 }, len=032 [A0..BF]
55.758626 INFO  UART_0: Sleeping 1.000s, before reading out garbage...
56.758957 INFO  UART_0: Before reading garbage, fifo_cnt=(0, 512, 576)
56.860942 INFO  UART_0: Read zero bytes of garbage, 11 times, hope this is enough. len=0 []
56.860994 INFO  UART_0: After reading garbage, fifo_cnt=(0, 512, 576)

...

56.861245 INFO  UART_TX: Start sending bytes in 1.000s...
57.861441 INFO  UART_TX: Write: [0..128], [00..7F]
57.861608 INFO  UART_TX: Sleep 0.010s
57.871690 INFO  UART_TX: Write: [0..192], [00..BF]
57.889328 INFO  UART_TX: Sleep 0.010s
57.899363 INFO  UART_TX: Write: [0..256], [00..FF]
57.921600 INFO  UART_TX: Sleep 0.010s
57.931626 INFO  UART_TX: Finished sending bytes, buffered flushed.

...

57.932251 ERROR UART_0: PROBLEM [0] 56.861510..57.867114 Δ1.005604s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 512, rx_waddr: 576 }, after:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 576, rx_waddr: 512 }, len=064 [] C0<00> [C1, C2, C3, C4, C5, C6, C7, C8, C9, CA, CB, CC, CD, CE, CF, D0, D1, D2, D3, D4, D5, D6, D7, D8, D9, DA, DB, DC, DD, DE, DF, E0,
 E1, E2, E3, E4, E5, E6, E7, E8, E9, EA, EB, EC, ED, EE, EF, F0, F1, F2, F3, F4, F5, F6, F7, F8, F9, FA, FB, FC, FD, FE, FF]
57.932483 INFO  UART_TX: SIG Signal TX not ready
57.932523 INFO  UART_TX: SIG Wait for RX ready
57.932542 INFO  UART_TX: SIG Signal TX not ready
57.932555 INFO  UART_TX: SIG Wait for RX ready
58.032458 INFO  UART_0: FUTURE  [1] 57.931989..57.932014 Δ0.000025s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 576, rx_waddr: 576 }, after:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 512, rx_waddr: 512 }, len=000 []
58.133996 INFO  UART_0: FUTURE  [2] 57.932030..57.932058 Δ0.000028s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 512, rx_waddr: 512 }, after:FifoCnt { fifo_cnt: 128, apb_rx_raddr: 512, rx_waddr: 516 }, len=096 [C0..80], [81..9F]
58.234136 INFO  UART_0: FUTURE  [3] 57.932074..57.932097 Δ0.000023s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 512, rx_waddr: 512 }, after:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 577, rx_waddr: 577 }, len=032 [A0..BF]
58.234188 INFO  UART_0: Sleeping 1.000s, before reading out garbage...
59.234257 INFO  UART_0: Before reading garbage, fifo_cnt=(0, 576, 512)
59.337212 INFO  UART_0: Read zero bytes of garbage, 11 times, hope this is enough. len=0 []
59.337248 INFO  UART_0: After reading garbage, fifo_cnt=(0, 576, 512)

...

59.337482 INFO  UART_TX: Start sending bytes in 1.000s...
60.363165 INFO  UART_TX: Write: [0..128], [00..7F]
60.374645 INFO  UART_TX: Sleep 0.010s
60.388090 INFO  UART_TX: Write: [0..192], [00..BF]
60.404959 INFO  UART_TX: Sleep 0.010s
60.415192 INFO  UART_TX: Write: [0..256], [00..FF]
60.438957 INFO  UART_TX: Sleep 0.010s
60.449234 INFO  UART_TX: Finished sending bytes, buffered flushed.

...

60.450043 ERROR UART_0: PROBLEM [0] 59.337644..60.370312 Δ1.032668s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 576, rx_waddr: 512 }, after:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 528, rx_waddr: 592 }, len=080 [] 40<00> [41, 42, 43, 44, 45, 46, 47, 48, 49, 4A, 4B, 4C, 4D, 4E, 4F, D0, D1, D2, D3, D4, D5, D6, D7, D8, D9, DA, DB, DC, DD, DE, DF, E0, E1, E2, E3, E4, E5, E6, E7, E8, E9, EA, EB, EC, ED, EE, EF, F0, F1, F2, F3, F4, F5, F6, F7, F8, F9, FA, FB, FC, FD, FE, FF, 00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 0A, 0B, 0C, 0D, 0E, 0F]
60.450678 INFO  UART_TX: SIG Signal TX not ready
60.450695 INFO  UART_TX: SIG Wait for RX ready
60.450715 INFO  UART_TX: SIG Signal TX not ready
60.450728 INFO  UART_TX: SIG Wait for RX ready
60.553286 INFO  UART_0: FUTURE  [1] 60.449736..60.449780 Δ0.000044s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 576, rx_waddr: 576 }, after:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 512, rx_waddr: 512 }, len=000 []
60.653651 INFO  UART_0: FUTURE  [2] 60.449796..60.449832 Δ0.000036s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 512, rx_waddr: 512 }, after:FifoCnt { fifo_cnt: 128, apb_rx_raddr: 512, rx_waddr: 516 }, len=096 [D0..80], [81..AF]
60.755710 INFO  UART_0: FUTURE  [3] 60.449848..60.449880 Δ0.000032s before:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 512, rx_waddr: 512 }, after:FifoCnt { fifo_cnt: 0, apb_rx_raddr: 577, rx_waddr: 577 }, len=032 [B0..CF]
60.755787 INFO  UART_0: Sleeping 1.000s, before reading out garbage...
61.755912 INFO  UART_0: Before reading garbage, fifo_cnt=(0, 528, 576)
61.863452 INFO  UART_0: Read zero bytes of garbage, 11 times, hope this is enough. len=0 []
61.863534 INFO  UART_0: After reading garbage, fifo_cnt=(0, 528, 576)
@mattiasgronlund mattiasgronlund added bug Something isn't working status:needs-attention This should be prioritized labels Feb 23, 2025
@bjoernQ
Copy link
Contributor

bjoernQ commented Feb 24, 2025

Thanks! There might be something - especially about the assumed and actual behavior

If the hardware gets "out of sync" here it should be reproducible with relatively simple synchronous code I guess but I wasn't able to force this

I expect that after reception RxError::FifoOverflowed and as in the example:

  • The writer has stopped writing

I haven't looked too much into the code (it's quite a lot) but if the code isn't doing anything special TX and RX are more or less independent. Without any flow-control TX will happily continue sending data - on the RX side the overflow flag will be set but otherwise it will continue to receive data

This non-async code shows it received what was sent (minus the overflowed data)

//% CHIPS: esp32c2 esp32c3 esp32c6 esp32h2 esp32s2 esp32s3
//% FEATURES: embassy esp-hal/unstable

#![no_std]
#![no_main]

use embedded_io::Write;
use esp_backtrace as _;
use esp_hal::uart::Uart;
use esp_println::println;

#[esp_hal::main]
fn main() -> ! {
    esp_println::println!("Init!");
    let peripherals = esp_hal::init(esp_hal::Config::default());

    let mut uart = Uart::new(peripherals.UART1, esp_hal::uart::Config::default())
        .unwrap()
        .with_tx(peripherals.GPIO4)
        .with_rx(peripherals.GPIO5);



    // write s.th. to the buffer to force an rx-overflow
    let mut buf = [0;70];
    for i in 0..buf.len() { buf[i] = i as u8; }
    uart.write_all(&buf).ok();
    esp_hal::delay::Delay::new().delay_millis(500);

    // write even more w/o reading
    let mut buf = [0;70];
    for i in 0..buf.len() { buf[i] = i as u8; }
    uart.write_all(&buf).ok();
    esp_hal::delay::Delay::new().delay_millis(500);

    let mut buf = [0u8; 256];
    for i in 0..buf.len() {
        buf[i] = (i % 255) as u8;
    }

    loop {
        for chunk in buf.chunks(64) {
            // read what we can
            for _ in 0..1000 {
                let mut rbuf = [0u8; 1024];
                let res = uart.read_buffered(&mut rbuf);
                match res {
                    Ok(res) => {
                        if res > 0 {
                            println!("Received: {:?}", &rbuf[..res]);
                        }
                    }
                    Err(e) => {
                        println!("{:?}", e);
                    }
                }
            }

            // (try) write a chunk of data - don't care if we can't write all of it
            uart.write(chunk).unwrap();
        }
        esp_hal::delay::Delay::new().delay_millis(1000);
    }
}

@mattiasgronlund
Copy link
Author

Sorry for not providing a smaller example, but I was not sure that I would be that easy to simplify, as I had not been able formulate a convincing theory of the root cause. But I later realised that I had misunderstood where a sleep was, and that I had failed to collect the hw-registers at all places.

But I have now pushed a much simplified example with commit id: 9136ccb084dcd0f8b5b559c635672bcec384c91f.

There one can see that there is increased risk of rx_fifo_cnt being corrupted if the delays after calling tx.flush() and rx.read_async() is kept to low. But would not expect that a delay should be needed at all...

But I am a bit concerned that my larger example is actually running into the problem even waits a lot more. But it might be that what is needed is that is sleeps until the TX-uart has actually been quiet for some specific time.

Also note that I am using RxConfig::default().with_timeout(10), as we need to be able to be triggered when a sentence is received.

18.444127 INFO  Start test with 0.006900 seconds of sleep after tx.flush().
19.042206 INFO  Success rate 31/31
19.042216 INFO  Start test with 0.006800 seconds of sleep after tx.flush().
19.643603 ERROR Expected lenght 64 on UART_1, got: len=66
19.643619 INFO  Success rate 32/32
19.643629 INFO  Start test with 0.006700 seconds of sleep after tx.flush().
20.252229 ERROR Corrupted data in buffer: 
[0C, 0D, 0E, 0F, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1A, 1B, 1C, 1D, 1E, 1F, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 2A, 2B, 2C, 2D, 2E, 2F, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 3A, 3B, 3C, 3D, 3E, 3F, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, C0, C1] != 
[0A, 0B, 0C, 0D, 0E, 0F, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1A, 1B, 1C, 1D, 1E, 1F, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 2A, 2B, 2C, 2D, 2E, 2F, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 3A, 3B, 3C, 3D, 3E, 3F, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
20.252305 INFO  Success rate 32/33
20.252315 INFO  Start test with 0.006600 seconds of sleep after tx.flush().
20.861858 ERROR Corrupted data in buffer: 
[0C, 0D, 0E, 0F, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1A, 1B, 1C, 1D, 1E, 1F, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 2A, 2B, 2C, 2D, 2E, 2F, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 3A, 3B, 3C, 3D, 3E, 3F, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, C0, C1] != 
[0A, 0B, 0C, 0D, 0E, 0F, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1A, 1B, 1C, 1D, 1E, 1F, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 2A, 2B, 2C, 2D, 2E, 2F, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 3A, 3B, 3C, 3D, 3E, 3F, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
20.861927 INFO  Success rate 32/34

@mattiasgronlund mattiasgronlund changed the title esp32s3: UART tx_fifo_count() produces garbage after RxError::FifoOverflowed. esp32s3: UART rx_fifo_count() produces garbage after RxError::FifoOverflowed. Feb 25, 2025
@bjoernQ
Copy link
Contributor

bjoernQ commented Feb 25, 2025

Thanks that reproducer looks much more manageable

@bjoernQ
Copy link
Contributor

bjoernQ commented Feb 25, 2025

I tried your simplified code (changed pins and adapted to 1.0.0-beta.0, commented out the debug output) and I see some corruption - but apparently later

Image

I also tried it with println instead of defmt and I don't see corruption

Image

@MabezDev
Copy link
Member

Seems likely that our memory read/write impl for Xtensa in probe-rs is biting us here. We should investigate improving that. The TL;DR is that currently, when the debugger polls for new defmt messages it blocks the system bus, which often will stall the CPU if it's trying to access the bus also.

@MabezDev MabezDev added timing Issue related to timing and removed status:needs-attention This should be prioritized labels Feb 26, 2025
@bugadani bugadani self-assigned this Feb 27, 2025
@MabezDev
Copy link
Member

@mattiasgronlund Could you try out #3186 and report back?

@mattiasgronlund
Copy link
Author

Impressive to see the progress in esp-hal.

Feels like #3186 fixed a real issue, but sadly not this one, I still see corruptions (at least when running with probe-rs and defmt).

@bjoernQ I think you are partially right regarding:

If the hardware gets "out of sync" here it should be reproducible with relatively simple synchronous code I guess but I wasn't able to force this

But I think that is only true as long as it is a pure SW problem. But I am quite sure there is an at least not yet documented HW problem here, which to some share some communalities with 03-errata-description/esp32/uart-fifo-cnt-indicates-data-length-incorrectly, as both share the exact same symptom:

When software uses DPORT to read UART fifo_cnt, and such operation is interrupted, then fifo_cnt will decrement by 1 erroneously.

Then I am in no position at all to judge if the reason is identical, but as @MabezDev noted, it could be related to probe-rs that currently, when the debugger polls for new defmt messages it blocks the system bus, which often will stall the CPU if it's trying to access the bus also.

As this very much seems like a new Errata for ESP32S3, what is the best way to approach
Espressif Systems for their take on it and advice?

@bugadani
Copy link
Contributor

I wonder why your reproducer works perfectly fine for me. I had to re-add previous wifi code for the reproducer to actually break for me. What probe-rs version do you use, and in general what environment? Are you maybe using probe-rs through usbip?

As this very much seems like a new Errata for ESP32S3

Why do you think that's the case?

@mattiasgronlund
Copy link
Author

mattiasgronlund commented Feb 27, 2025

I did test it with WIFI, as our actual use case surprisingly enough include WIFI.

I have no clue about what usbip is, I have connected the ESP32s3 card through a normal USB cable.

It feels like an errata, as it has the exact same symptom as the existing one for ESP32, and I have not been able to find this "feature" in the documentation specific for ESP32s3. The (erratic?) feature being that UART_RXFIFO_CNT sometimes is not in sync with UART_APB_RX_RADDR and UART_RX_WADDR, but instead show a lower count than the bytes actually in the buffer. Which in esp-hal (which rightfully trust UART_RXFIFO_CNT) gets into a habit of buffering the last byte(s) until more bytes arrive. A feature which can have very surprising side effects if messages arrive sporadically with low frequency...

@bugadani
Copy link
Contributor

bugadani commented Feb 27, 2025

The problem is, the ESP32S3 doesn't have a DPORT bus, so I would expect that problems related to it don't exist on the S3. What you see may be that your FIFO overflows so much the symptoms become "flipped". But alas, I can't reproduce that.

I suspect for some reason your PC introduces a lot more longer interruptions than mine does. This is why adding wifi was enough for me previously to reproduce the issue and send in #3186, and this is why the probe-rs version would be useful, so that I can try that, too. I'll also try tomorrow if a slower laptop, or even a raspberry pi as the host is enough to let me see something incorrect.

For the time being (i.e. for as long as we find issues to fix), I assume this is a software issue.

@mattiasgronlund
Copy link
Author

But what I can not understand (which is why I suspect HW engineers are needed) is how can the HW registers be out of sync, they should be able to describe just like in the Errata the root cause for the inconsistency in the registers which SW has no write access to.

Yes, but then it is best to keep having the tougher test as a reference, as as long as it fails, more fixes are needed.

mattiasgronlund/esp32s3-uart-stability@fef175a

Or

mattiasgronlund/esp32s3-uart-stability@e3ecb65

If you patch esp-hal to provide the counters...

The test is tuned to be quite nice to esp-hal as it is right now, so when it passes it is probably of value to try tuning it a bit more to see if there are more SW fixes needed to prevent the problem.

@bugadani
Copy link
Contributor

I guess the pointers keep walking around even when the counter itself caps out at 128. The counter itself is how many bytes you did not read, but the write pointer will move around regardless if you still fail to process the FIFO. It's not that they are out of sync, it's that you assume something undocumented and that may not be what is happening. You're only looking at the pointers because of an unrelated error in a (mostly) unrelated chip, so maybe using this as the baseline for correct behaviour isn't as fruitful as you assume.

This is pretty easy to verify (unless it's 11PM) by essentially writing single bytes, flushing, then not reading them to see how the counter and write pointer change, then reading without writing to see how the counter and read pointer moves, too. Throwing 128 bytes at a time at this just means we're guessing for no good reason, I think.

After this, if we have something definite about the overflow behaviour of these, we can deduce how UART is supposed to work under load.

@mattiasgronlund
Copy link
Author

Yes, 11.00 PM is not the right time to dig deeper into it...

Let's be clear, I assume that I there will be problems when the system is under load.
But I also think that the recovery from that problem should be understood.

After an overflow event, I, correct me if I a wrong, assume that if a valid recovery procedure would be to use read_buffered(), with a buffer larger than the HW buffer, and just ignoring these bytes.

After this recovery procedure I should expect that looping read_async() (with timeout configured) would provide me all bytes sent by the transmitter, shortly (100ms?) after the last bit was transmitted.

If there isn't a bug in the test code, I would say that it just says that the this is not what I can expect after the recovery procedure.

Do not get me wrong, I think you are doing a terrific job here, and what I have been providing is just a procedure provoking the problem. And as always, to bisect a problem you need one working case and one failing case.

Based on these two cases, theories regarding how to create test cases in-between which either has or do not have the problem is the way forward to nailing down more exactly what the root cause is.

So as of now, we have concluded that:

And I think the idea used here to identify how the test can be split down into simpler tests testing possible root causes, is the right way to go.

I am not sure if I am able to contribute with anything for a couple of days, but I intend to using the base I have try to se if I can rework the test to instead of "scanning for edge cases" focus if the problem is possible to provoke by other means than delaying the reader solely by being down prioritised by e.g. WIFI preventing the task from calling async_read() in time, or, async_read() not being prioritised to complete its work in time.

@bugadani
Copy link
Contributor

OK I see what you mean by desync (the read and write pointers are independent, and with an overflow they can get weird w.r.t FIFO count), I can drive the FIFO into a state where it reports a single byte in it, but reads back a different value than I last wrote. I'll see what I can do about that.

@MabezDev
Copy link
Member

@mattiasgronlund Do you mind trying #3190?

@mattiasgronlund
Copy link
Author

mattiasgronlund commented Feb 28, 2025

You are just amazing!

I am testing with change #3190, change 8? 998f5b3 and it seems to run flawlessly!

I was so happy to see that you did not just think about ESP32s3, but actually thought about how this could impact the other ESP32 series.

@MabezDev
Copy link
Member

MabezDev commented Mar 3, 2025

Glad to hear things are working correctly now, @mattiasgronlund! Closing.

edit: oops, didn't realize the PR wasn't merged yet, but should be merged shortly.

@MabezDev MabezDev closed this as completed Mar 3, 2025
@MabezDev MabezDev removed the timing Issue related to timing label Mar 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants