[WIP] Testing of ConcurrentCache via logging #960

Draft
wants to merge 52 commits into memory_cache from memory_cache_veloman_version

Conversation

veloman-yunkan
Collaborator

This is my work in pursuit of #956 (comment)

I haven't started the redesign of ConcurrentCache yet. I am creating this PR so that you can see my approach to testing multithreaded flows and provide some early feedback while I keep working on the unit tests of the enhanced concurrent cache and its clean-up.

No new unit tests were added; the sole (and not particularly important)
unit test related to ConcurrentCache was merely moved to the new test file.
NamedThread::getCurrentThreadName() didn't work correctly if a new
NamedThread was created after an existing NamedThread object was joined
(the problem was that the join()-ed thread's id could be reused for the
new thread).
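
A minimal sketch of the pitfall and the fix, assuming the thread names are kept in a registry keyed by std::thread::id (the names below are illustrative, not the actual NamedThread internals):

```cpp
#include <map>
#include <mutex>
#include <string>
#include <thread>

// Illustrative registry. The standard permits a joined thread's id to be
// reused by a later thread, so a stale entry must not outlive join().
std::mutex namesMutex;
std::map<std::thread::id, std::string> names;

void registerName(std::thread::id id, const std::string& name)
{
    std::lock_guard<std::mutex> lock(namesMutex);
    names[id] = name; // operator[] overwrites a stale entry left behind
                      // by a joined thread whose id has been reused
}

void unregisterName(std::thread::id id)
{
    std::lock_guard<std::mutex> lock(namesMutex);
    names.erase(id);  // doing this on join() prevents stale lookups
}
```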
It fails since orchestration of concurrent execution via logging is not
yet implemented.
Up to this point the purpose was to implement orchestration of
concurrent execution via logging. It was done entirely in the unit test
file test/log.cpp in order to minimize the build+test cycle
(modifying the src/log.h header would trigger recompilation of much more
stuff). Now that the code is (almost) ready, it has to be moved into
the right header and source files. This is the first step.
With NamedThread now having a dedicated .cpp file, its static
data members can be hidden there in an unnamed namespace, taking
the associated standard library headers with them.
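
A short sketch of that pattern (file and variable names are illustrative):

```cpp
// namedthread.cpp (illustrative)
#include <map>
#include <mutex>
#include <string>
#include <thread>

namespace
{

// Formerly static data members of NamedThread; now private to this
// translation unit, so the header no longer needs <map>, <mutex>, etc.
std::mutex threadNamesMutex;
std::map<std::thread::id, std::string> threadNames;

} // unnamed namespace
```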
Added testing of log_debug_func_call() in a multithreaded environment.

The new test sporadically fails due to a not-yet-understood race
condition (probably introduced when orchestration was added). Here
is an example failure:

[ RUN      ] Log.nestedFunctionCallsMultithreaded
../../../../home/leon/freelancing/kiwix/builddir/SOURCE/libzim/test/log.cpp:243: Failure
Expected equality of these values:
  Logging::getInMemLogContent()
    Which is: "thread#2: Calling bar(qwerty)...\nthread#1: Calling bar(......"
  targetOutput
    Which is: "thread#2: Calling bar(qwerty)...\nthread#1: Calling bar(......"
With diff:
@@ -7,5 +7,6 @@
 thread#2:  foo(\"qwerty\", 1) {
 thread#2:   fooing qwerty
-thread#1:  foo(\"asdf\", thread#2: 1) {} (return value: 6)
+thread#1:  foo(\"asdf\", 1) {
+thread#2:  } (return value: 6)
 thread#1:   fooing asdf
 thread#1:  } (return value: 4)

[  FAILED  ] Log.nestedFunctionCallsMultithreaded (1 ms)
Note that string values are recorded in FunctionCallLogger::returnValue_
quoted with double quotes, so an empty-string return value doesn't
confuse the logic of this change.
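
A hedged sketch of why the quoting matters (everything except the returnValue_ member name is illustrative):

```cpp
#include <string>

// returnValue_ doubles as a "has the function returned yet?" flag.
// Quoting guarantees that even an empty string return value produces a
// non-empty record: "" is stored as "\"\"".
struct FunctionCallLoggerSketch
{
    std::string returnValue_;

    void recordReturnValue(const std::string& v)
    {
        returnValue_ = '"' + v + '"';
    }

    bool hasReturned() const { return !returnValue_.empty(); }
};
```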
class Logging now has a clean public API. All the machinery needed
to support the implementation of the log_debug*() macros was
moved into a separate namespace, LoggingImpl.
...from the LoggingImpl::FunctionCallLogger class to the LoggingImpl namespace.

Also moved the definitions of the non-template variants of those
functions to the .cpp file.
This should have been done in an earlier commit.
@mgautierfr
Collaborator

(I have only had a global look, not a thorough review of each commit.)

Testing that something works correctly through logging seems a bit odd to me.
Logging should be used to get information about what happens, not to test that things happen in the correct order. At best we could call this tracing. It is a small semantic subtlety, but I had to express it.

I am also uncomfortable with the orchestration. I was about to ask you how you handle the unpredictability of the output ordering caused by multithreading, until I realized that you orchestrate the threads based on the expected output. But in doing so, you "linearize" the code workflow. The log system locks threads to ensure that only one thread can log at a time. So testing changes the behavior of what is tested (notably, the locks in the concurrent cache become somewhat useless), and we are not testing what will run in production.

@veloman-yunkan
Collaborator Author

I am also uncomfortable with the orchestration. I was about to ask you how you handle the unpredictability of the output ordering caused by multithreading, until I realized that you orchestrate the threads based on the expected output. But in doing so, you "linearize" the code workflow.

That is exactly the trick. Each instantiation of a multithreaded execution is equivalent to some linearization of it (assuming that there are no race conditions, i.e. access to shared resources is properly protected). Then I can define in the tests the linearizations that the code should handle.
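
A minimal sketch of that orchestration idea, assuming the test supplies the full target log up front (this is a conceptual model with made-up names, not the actual API of this PR):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

std::mutex orchMutex;
std::condition_variable orchCv;
std::vector<std::string> targetLog; // the linearization chosen by the test
std::size_t nextLine = 0;

// Each log statement blocks until it is this thread's turn according to
// the target log, thereby forcing the desired interleaving of threads.
void orchestratedLog(const std::string& line)
{
    std::unique_lock<std::mutex> lock(orchMutex);
    orchCv.wait(lock, [&] {
        return nextLine < targetLog.size() && targetLog[nextLine] == line;
    });
    ++nextLine;
    orchCv.notify_all();
}
```

A test then fills targetLog with the expected output and asserts that all threads run to completion, which validates that the code supports that particular linearization.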

The log system locks threads to ensure that only one thread can log at a time. So testing changes the behavior of what is tested (notably, the locks in the concurrent cache become somewhat useless), and we are not testing what will run in production.

Consider the situation that you currently have in ConcurrentCache: in a comment in ConcurrentCache::getOrPut() you talk at length about a rare situation that may happen during a tiny time window. My approach makes it possible to simulate that scenario - we only need to put a log/trace statement in that window and, based on our understanding of how the scenario should unfold, generate the log manually. The test's job is then to validate that the code indeed executes along that path.
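
For illustration, a hedged sketch of what placing such a trace statement looks like (the types and function bodies here are schematic stand-ins, not the actual ConcurrentCache code):

```cpp
#include <string>

// Schematic stand-ins for the real ConcurrentCache machinery.
struct CacheSlot {};
CacheSlot getCacheSlot(int key);          // assumed helper
void log_debug(const std::string& msg);   // assumed logging primitive

CacheSlot getOrPut(int key)
{
    CacheSlot slot = getCacheSlot(key);    // the racy time window opens here
    log_debug("Obtained the cache slot");  // trace/orchestration point inside it
    return slot;
}
```

With orchestration enabled, that single trace line becomes a point where the test can deterministically park one thread while another races through the window.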

More generally, tests for concurrent flows must be written so as to cover various linearizations. In a TDD approach, such tests can help produce the right code for concurrency management.

You are right that the approach lacks a means of detecting the absence of correct synchronization between threads. Thanks for pointing this out! I was mostly focused on testing the logic governing the execution under different scenarios, assuming that all measures against race conditions are in place. The approach should still allow catching bugs that lead to deadlocks or livelocks. I will think about how it (possibly after some enhancement) can be utilized to catch potential race conditions.

@mgautierfr
Collaborator

Consider the situation that you currently have in ConcurrentCache: in a comment in ConcurrentCache::getOrPut() you talk at length about a rare situation that may happen during a tiny time window. My approach makes it possible to simulate that scenario - we only need to put a log/trace statement in that window and, based on our understanding of how the scenario should unfold, generate the log manually. The test's job is then to validate that the code indeed executes along that path.

More generally, tests for concurrent flows must be written so as to cover various linearizations. In a TDD approach, such tests can help produce the right code for concurrency management.

Put that way, I agree with you. But then I would prefer a proper name (such as test_sync_barrier) instead of log, especially as this will be a noop at runtime and the purpose is to modify the behavior when we are testing.

This has a small user-observable change in behavior. Previously,
lru_cache::setMaxCost() might leave the cache empty. Now, at least
one (most-recently used) item is preserved in the cache (even if its
cost exceeds the set limit). This is consistent with adding that same
item to an empty cache.
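
A minimal sketch of the described behavior, assuming a typical LRU layout (the names below are illustrative, not the actual lru_cache code):

```cpp
#include <cstddef>
#include <list>
#include <utility>

// Illustrative skeleton: evict LRU items while over budget, but never
// drop the last (most-recently used) item. This mirrors the fact that
// a single item is always admitted into an empty cache regardless of
// its cost.
struct LruCacheSketch
{
    std::list<std::pair<int, std::size_t>> items; // front = most recently used
    std::size_t currentCost = 0;
    std::size_t maxCost = 0;

    void setMaxCost(std::size_t newMax)
    {
        maxCost = newMax;
        while (currentCost > maxCost && items.size() > 1) {
            currentCost -= items.back().second; // cost of the LRU item
            items.pop_back();                   // evict it
        }
    }
};
```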
The changes in the unit tests are easier to understand if whitespace
changes are ignored (git diff -w).
This test, too, fails intermittently (likely due to the suspected but not
yet tracked-down race condition in the logging and/or NamedThread code).
This time, however, it fails either with a deadlock or with obvious memory
corruption like the one below:

[ RUN      ] ConcurrentCacheMultithreadedTest.accessSameExistingItem
../../../../home/leon/freelancing/kiwix/builddir/SOURCE/libzim/test/concurrentcache.cpp:429: Failure
Expected equality of these values:
  zim::Logging::getInMemLogContent()
    Which is: "thread#0: Output of interest starts from the next line\na  : ConcurrentCache::getOrPut(5) {\n  b: ConcurrentCache::getOrPut(5) {\na  :  ConcurrentCache::getCacheSlot(5) {\na  :   entered synchronized section\n  b:  ConcurrentCache::getCacheSlot(5) {\na  :   lru_cache::getOrPut(5) {\na  :    already in cache, moved to the beginning of the LRU list.\na  :   }\na  :   exiting synchronized section\na  :  }\n  b:   entered synchronized section\na  :  Obtained the cache slot\n  b:   lru_cache::getOrPut(5) {\na  : } (return value: 50)\n\0 b:    already in cache, moved to the beginning of the LRU list.\n  b:   }\n  b:   exiting synchronized section\n  b:  }\n  b:  Obtained the cache slot\n  b: } (return value: 50)\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 "
  targetOutput
    Which is: "thread#0: Output of interest starts from the next line\na  : ConcurrentCache::getOrPut(5) {\n  b: ConcurrentCache::getOrPut(5) {\na  :  ConcurrentCache::getCacheSlot(5) {\na  :   entered synchronized section\n  b:  ConcurrentCache::getCacheSlot(5) {\na  :   lru_cache::getOrPut(5) {\na  :    already in cache, moved to the beginning of the LRU list.\na  :   }\na  :   exiting synchronized section\na  :  }\n  b:   entered synchronized section\na  :  Obtained the cache slot\n  b:   lru_cache::getOrPut(5) {\na  : } (return value: 50)\n  b:    already in cache, moved to the beginning of the LRU list.\n  b:   }\n  b:   exiting synchronized section\n  b:  }\n  b:  Obtained the cache slot\n  b: } (return value: 50)\n"
With diff:
@@ -14,5 +14,5 @@
   b:   lru_cache::getOrPut(5) {
 a  : } (return value: 50)
-\0 b:    already in cache, moved to the beginning of the LRU list.
+  b:    already in cache, moved to the beginning of the LRU list.
   b:   }
   b:   exiting synchronized section
@@ -19,4 +19,3 @@
   b:  }
   b:  Obtained the cache slot
-  b: } (return value: 50)
-\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
+  b: } (return value: 50)\n

[  FAILED  ] ConcurrentCacheMultithreadedTest.accessSameExistingItem (2 ms)
@veloman-yunkan force-pushed the memory_cache_veloman_version branch from 47181b2 to 5a114d5 on March 11, 2025 18:16
Without this fix there were 107 failures out of 700 runs of the
ConcurrentCacheMultithreadedTest.accessSameExistingItem unit test
(with a timeout value of 100 ms). The fix brought the failure count
down to 0.
The diff is much simpler if whitespace changes are ignored (git diff -w).
If an item that is in the process of being added to the cache
is requested from the cache within the time window when it has
been (temporarily) evicted (as a result of the concurrent turmoil),
a concurrent evaluation of the same item is started while the
previous one is still in progress.

Now an item being added to the cache is never evicted due to concurrent
turmoil: the LRU eviction heuristic was changed to evict only items
with non-zero cost.
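
A hedged sketch of that heuristic, assuming (as the commit message implies) that an item whose value is still being computed carries cost 0 until it is materialized:

```cpp
#include <cstddef>

// Illustrative eviction predicate: an in-flight item is stored with
// cost 0 until its value has been materialized, so the eviction loop
// skips it and a concurrent evaluation of the same item can never be
// started while the first one is still in progress.
struct ItemSketch
{
    std::size_t cost = 0; // 0 while the value is still being computed
};

bool isEvictable(const ItemSketch& item)
{
    return item.cost > 0;
}
```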
@veloman-yunkan force-pushed the memory_cache_veloman_version branch from e229738 to e34bd71 on March 12, 2025 13:51
@veloman-yunkan
Collaborator Author

@mgautierfr I am done with the preliminary version of my contribution to your PR. I will have to clean it up significantly. But its second part (focusing on the testing of ConcurrentCache, starting from 6aa5aee) will be almost unaffected, so I invite you to look at it from there.

@veloman-yunkan
Collaborator Author

Put that way, I agree with you. But then I would prefer a proper name (such as test_sync_barrier) instead of log, especially as this will be a noop at runtime and the purpose is to modify the behavior when we are testing.

Synchronization is only one of the uses of logging/tracing. Currently it is done with two logging statements (before and after the mutex locking operation) that must be kept together in the target log of the simulated flow. A test_sync_barrier() macro may help avoid the pitfall of trying to achieve that effect with a single (general-purpose) logging statement; log_debug_raii_sync_statement() is what can be used for that purpose.
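
A hedged sketch of that pattern (the helper class and the first message string below are schematic; only the "entered/exiting synchronized section" traces come from the test output above):

```cpp
#include <mutex>
#include <string>

void log_debug(const std::string& msg); // stand-in for the real macro

// Schematic RAII helper: the two log statements bracketing the lock
// acquisition stay paired automatically, instead of relying on the
// test author to keep two separate general-purpose log lines adjacent
// in the target log.
class SyncSectionLogger
{
    std::unique_lock<std::mutex> lock_;

public:
    explicit SyncSectionLogger(std::mutex& m)
    {
        log_debug("acquiring lock");                // before locking (illustrative)
        lock_ = std::unique_lock<std::mutex>(m);
        log_debug("entered synchronized section");  // after locking
    }

    ~SyncSectionLogger()
    {
        log_debug("exiting synchronized section");  // on scope exit
    }
};
```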

But logging also helps to visualize behavior that is otherwise difficult to observe (for example, the absence of double work addressed by the last commit).
